BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution
BARD10 is a newly introduced, balanced benchmark corpus for Bangla authorship attribution. The dataset comprises writings from ten contemporary Bangla authors and is used to analyze systematically how stop-word removal affects authorship analysis. The study found that classical models, in particular the TF-IDF + SVM baseline, achieved macro-F1 scores of 0.997 on the BAAD16 dataset and 0.921 on BARD10, outperforming more modern approaches such as Bangla BERT, which trailed by five points. The findings indicate that authors in the BARD10 dataset are highly sensitive to stop-word pruning, which underscores the stylistic significance of these words in Bangla literature. By contrast, authors in the BAAD16 dataset were more robust to such pruning, suggesting that reliability is genre-dependent. This research not only enhances the understanding of Bangla authorship attribution but also emphasizes the need for finely calibra…
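For readers unfamiliar with the two ideas the summary leans on, the following is a minimal Python sketch of stop-word pruning and the macro-F1 metric. It is not the study's code: the `STOP_WORDS` set is a tiny hypothetical sample chosen for illustration, the whitespace tokenization is a simplifying assumption, and the helper names `prune_stop_words` and `macro_f1` are invented here.

```python
from collections import Counter

# Hypothetical, tiny stop-word sample for illustration only;
# this is NOT the stop-word list used in the study.
STOP_WORDS = {"এবং", "কিন্তু", "না", "এই", "ও"}


def prune_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-delimited tokens that appear in stop_words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-author F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted author was wrong
            fn[truth] += 1  # true author was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Under these assumptions, a stop-word ablation amounts to training and scoring the same classifier twice, once on the raw texts and once on texts passed through `prune_stop_words`, and then comparing the two `macro_f1` values. Because macro-F1 weights every author equally, it suits a balanced corpus of the kind described above.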
— via World Pulse Now AI Editorial System
