arXiv:2504.14707v2 Announce Type: replace 
Abstract: Standard topic models often struggle to capture culturally specific nuances in text. This study evaluates the effectiveness of contextual embeddings for identifying culturally resonant themes in an underrepresented linguistic context. We compare the performance of K-Means clustering, Latent Dirichlet Allocation (LDA), and BERTopic on a corpus of nearly 25,000 daily personal narratives written in Belgian Dutch (Flemish). While LDA achieves strong performance on automated coherence metrics, subsequent human evaluation reveals that BERTopic consistently identifies the most coherent and culturally relevant topics, highlighting the limitations of purely statistical methods on this narrative-rich data. Furthermore, the diminished performance of K-Means compared to prior work on similar Dutch corpora underscores the unique linguistic challenges posed by personal narrative analysis. Our findings demonstrate the critical role of contextual embeddings in robust topic modeling and emphasize the need for human-centered evaluation, particularly when working with low-resource languages and culturally specific domains.
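The abstract refers to "automated coherence metrics" without naming them. As a rough illustration of what such a metric measures, the sketch below computes NPMI (normalized pointwise mutual information) coherence for a topic's top words, using document-level co-occurrence counts. NPMI is assumed here only as one common choice, not as the metric the study used; the function name `npmi_coherence` and the toy Dutch tokens are hypothetical and do not come from the corpus.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    Probabilities are estimated as the fraction of documents that
    contain a word (or both words of a pair). This is an illustrative
    sketch, not the evaluation code of the study.
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words to form a pair")

    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            # The pair never co-occurs: NPMI takes its minimum value.
            scores.append(-1.0)
        elif p12 == 1:
            # Both words occur in every document: NPMI is 0/0, so this
            # sketch scores the pair as 0 (an assumption, not a standard).
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy tokenized "narratives" (made up for illustration only).
toy_docs = [
    ["fiets", "regen", "werk"],
    ["fiets", "regen"],
    ["koffie", "werk"],
    ["koffie", "collega", "werk"],
]

coherent_topic = npmi_coherence(["fiets", "regen"], toy_docs)
incoherent_topic = npmi_coherence(["fiets", "koffie"], toy_docs)
print(coherent_topic, incoherent_topic)
```

A score like this rewards words that co-occur often in the corpus, which is a statistical property rather than a judgment of cultural relevance. That gap is consistent with the abstract's point that automated scores and human evaluation can rank models differently.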

A study compared BERTopic with LDA and K-Means on nearly 25,000 daily personal narratives written in Belgian Dutch (Flemish). LDA scored well on automated coherence metrics, but human evaluation found that BERTopic identified the most coherent and culturally relevant topics. The result highlights the importance of contextual embeddings and human-centered evaluation in topic modeling, especially for underrepresented languages.

Evaluating BERTopic on Open-Ended Data: A Case Study with Belgian Dutch Daily Narratives
