BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution

arXiv cs.CL, Wednesday, November 12, 2025 at 5:00:00 AM
BARD10 is a new balanced benchmark corpus for Bangla authorship attribution. It comprises writings from ten contemporary Bangla authors and is used to analyze systematically how stop-word removal affects authorship analysis. Classical models performed best in the study: the TF-IDF + SVM baseline reached macro-F1 scores of 0.997 on the BAAD16 dataset and 0.921 on BARD10, while Bangla BERT trailed by five points. Attribution for authors in BARD10 is highly sensitive to stop-word pruning, which underscores the stylistic significance of these words in Bangla literature. Authors in BAAD16, by contrast, were more robust to such pruning, suggesting genre-dependent reliability. This research not only enhances the understanding of Bangla authorship attribution but also emphasizes the need for finely calibra…
— via World Pulse Now AI Editorial System
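
The two ideas the summary relies on, stop-word pruning and the macro-F1 score, can be illustrated with a minimal standard-library sketch. It is not the study's pipeline: the `STOP_WORDS` set, the function names, and the toy labels are hypothetical and chosen only for illustration, and the TF-IDF + SVM classifier itself is not reproduced here.

```python
# Hypothetical, tiny stop-word list for illustration only;
# this is NOT the list used in the BARD10 study.
STOP_WORDS = {"এবং", "ও", "কিন্তু", "যে", "এই"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens with every stop-word dropped, preserving order."""
    return [t for t in tokens if t not in stop_words]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy author labels, not real data.
    gold = ["a", "a", "b", "b"]
    pred = ["a", "b", "b", "b"]
    print(round(macro_f1(gold, pred), 3))
    print(remove_stop_words(["আমি", "এবং", "তুমি"]))
```

Because macro-F1 averages per-author F1 without weighting by class size, every author counts equally toward the score, which suits a balanced ten-author corpus like the one described.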
