arXiv:2511.08085v1 Announce Type: new 
Abstract: This research presents a comprehensive investigation into Bangla authorship attribution, introducing a new balanced benchmark corpus, BARD10 (Bangla Authorship Recognition Dataset of 10 authors), and systematically analyzing the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors. Four representative classifiers are methodically assessed: SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), using uniform preprocessing on both BARD10 and the benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). On both datasets, the classical TF-IDF + SVM baseline outperformed the other models, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. The study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting a genre-dependent reliance on stop-word signatures. Error analysis revealed that high-frequency components carry authorial signatures that are diminished by transformer models. Three insights are identified: Bangla stop-words serve as essential stylistic indicators; finely calibrated ML models prove effective within short-text limitations; and BARD10 connects formal literature with contemporary web dialogue, offering a reproducible benchmark for future long-context or domain-adapted transformers.
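
The scores quoted in the abstract are macro-F1 values, meaning the F1 score is computed separately for each author and then averaged with equal weight per author. The following is a minimal, standard-library sketch of that metric, not the authors' evaluation code; the function name `macro_f1` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this author
        else:
            fp[pred] += 1       # predicted author gains a false positive
            fn[truth] += 1      # true author gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two hypothetical authors
y_true = ["author_a", "author_a", "author_b", "author_b"]
y_pred = ["author_a", "author_b", "author_b", "author_b"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every author contributes equally to the average, macro-F1 suits a balanced corpus such as BARD10 and does not let a few prolific authors dominate the score.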

The introduction of BARD10, a new benchmark corpus for Bangla authorship attribution, highlights the importance of stop-words in analyzing writing styles. This study reveals that while classical models like TF-IDF + SVM excelled in performance, Bangla BERT showed lower effectiveness, particularly in the context of stop-word removal. Understanding the role of stop-words is crucial for improving authorship attribution techniques in Bangla literature.
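
To make the stop-word ablation concrete, here is a small sketch of the two views of a text that such an experiment compares: the tokens with stop-words removed, and the relative frequency of each stop-word as a stylistic profile. The five-word `STOP_WORDS` set, the whitespace tokenizer, and the function names are hypothetical choices for illustration; they are not the stop-word list or preprocessing used in the study.

```python
from collections import Counter

# Hypothetical mini stop-word list for illustration only
# (roughly: "and", "but", "that", "this", "not").
STOP_WORDS = {"এবং", "কিন্তু", "যে", "এই", "না"}


def tokenize(text):
    """Naive whitespace tokenizer (an assumption, not the paper's tokenizer)."""
    return text.split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that remain after stop-word pruning."""
    return [tok for tok in tokens if tok not in stop_words]


def stop_word_profile(tokens, stop_words=STOP_WORDS):
    """Relative frequency of each stop-word among all tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tok for tok in tokens if tok in stop_words)
    return {word: counts[word] / total for word in sorted(stop_words)}


tokens = tokenize("এই বই ভালো কিন্তু এই গল্প না")
print(remove_stop_words(tokens))   # content words only
print(stop_word_profile(tokens))   # the signal that pruning discards
```

Training one classifier on the pruned tokens and another on the full tokens, then comparing their macro-F1, is one way to measure how much of an author's signature lives in the stop-words.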

BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution
