SmolKalam: Ensemble Quality-Filtered Translation at Scale for High Quality Arabic Post-Training Data

arXiv (cs.CL) | Tuesday, November 25, 2025 at 5:00:00 AM
  • SmolKalam has been introduced to improve the quality of Arabic post-training data through a multi-model ensemble translation pipeline followed by quality filtering. It targets a gap in large-scale, high-quality Arabic datasets that include reasoning and tool calling, two capabilities that current language models are expected to learn during post-training.
  • The work is significant because high-quality Arabic post-training data has been difficult to obtain, a difficulty this summary attributes in part to the structural complexity of the language. Better post-training data could improve how language models perform on Arabic tasks across natural language processing applications.
  • The release reflects a broader trend in the AI community towards improving language models through better data quality and curation. Alongside efforts such as ArbESC+, which targets grammatical error correction, SmolKalam illustrates ongoing work on the challenges specific to Arabic and points to the value of collaborative approaches in building robust systems.
— via World Pulse Now AI Editorial System
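
The summary above names two steps, ensemble translation and quality filtering, without describing how either is implemented. The following is a minimal sketch of the general idea only, not the SmolKalam pipeline itself: several translators each propose a candidate, a scoring function ranks the candidates, and the best one is kept only if it clears a threshold. The names `select_translation`, `heuristic_score`, and `arabic_ratio`, the threshold value, and the toy scoring rule (share of Arabic-script characters combined with a length-ratio check) are all assumptions made for illustration; a real pipeline would most likely rely on a learned quality-estimation model instead.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Basic Arabic Unicode block (U+0600 to U+06FF).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Arabic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ARABIC_CHAR.match(c)) / len(chars)


def heuristic_score(source: str, candidate: str) -> float:
    """Hypothetical stand-in for a quality metric, returning a value in [0, 1]."""
    if not source or not candidate:
        return 0.0
    length_ratio = len(candidate) / len(source)
    # Reject candidates that are implausibly short or long relative to the source.
    length_ok = 1.0 if 0.5 <= length_ratio <= 2.0 else 0.0
    return arabic_ratio(candidate) * length_ok


def select_translation(
    source: str,
    translators: Dict[str, Callable[[str], str]],
    score: Callable[[str, str], float] = heuristic_score,
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> Optional[Tuple[str, str, float]]:
    """Return (translator name, translation, score) for the best candidate,
    or None when no candidate reaches the threshold."""
    best: Optional[Tuple[str, str, float]] = None
    for name, translate in translators.items():
        candidate = translate(source)
        candidate_score = score(source, candidate)
        if best is None or candidate_score > best[2]:
            best = (name, candidate, candidate_score)
    if best is None or best[2] < threshold:
        return None
    return best


if __name__ == "__main__":
    # Stub translators standing in for real models.
    stub_translators = {
        "model_a": lambda s: "مرحبا بالعالم",
        "model_b": lambda s: "Hello world",  # failed to translate
    }
    # model_a is kept because only its output is in Arabic script.
    print(select_translation("Hello world", stub_translators))
```

Returning `None` for samples where every candidate scores below the threshold is what makes this a filter as well as a selector: such samples would simply be dropped from the resulting dataset.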

Continue Reading
Context-Aware Whisper for Arabic ASR Under Linguistic Varieties
Positive | Artificial Intelligence
A new approach to Arabic Automatic Speech Recognition (ASR) has been introduced, leveraging context-aware prompting strategies to adapt OpenAI's Whisper model. This method addresses the challenges posed by Arabic's dialectal variations and limited labeled data, achieving significant reductions in word error rates for both Modern Standard Arabic and dialectal speech.
FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
Positive | Artificial Intelligence
A new moderation filter named FanarGuard has been introduced, designed specifically for Arabic language models. This bilingual filter assesses both safety and cultural alignment in Arabic and English, utilizing a dataset of over 468,000 prompt-response pairs evaluated by human raters. The development aims to address the shortcomings of existing moderation systems that often neglect cultural nuances.
AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs
Positive | Artificial Intelligence
The paper introduces AraFinNews, a comprehensive Arabic financial news dataset consisting of 212,500 article-headline pairs, covering nearly a decade of reporting from October 2015 to July 2025. This dataset is designed to enhance the performance of large language models (LLMs) in summarizing Arabic financial texts, addressing the challenges posed by domain specificity in abstractive summarization.