TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English

arXiv, cs.CL | Monday, November 17, 2025 at 5:00:00 AM
  • The TEDxTN project has launched the first publicly accessible speech translation dataset for Tunisian Arabic to English, comprising 108 TEDx talks and 25 hours of speech. This initiative addresses the data scarcity challenge faced by Arabic dialects and includes diverse accents from 11 regions in Tunisia.
  • The TEDxTN dataset could support natural language processing research on Tunisian dialects, giving developers and researchers in AI and linguistics a resource for a variety that has had little public data.
  • While there are no directly related articles, the TEDxTN dataset exemplifies a growing trend in creating open
— via World Pulse Now AI Editorial System

Recommended Readings
LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models
Positive | Artificial Intelligence
LaoBench is a newly introduced large-scale benchmark dataset aimed at evaluating large language models (LLMs) in the Lao language. It consists of over 17,000 curated samples that assess knowledge application, foundational education, and bilingual translation among Lao, Chinese, and English. The dataset is designed to enhance the understanding and reasoning capabilities of LLMs in low-resource languages, addressing the current challenges faced by models in mastering Lao.
Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs
Positive | Artificial Intelligence
Referring Expression Comprehension (REC) is the task of localizing objects in images from natural language descriptions. Despite the global need for multilingual applications, existing research has been primarily English-centric. This work introduces a unified multilingual dataset covering 10 languages, created by expanding 12 English benchmarks through machine translation, resulting in about 8 million expressions across 177,620 images and 336,882 annotated objects. It also proposes a new attention-anchored neural architecture to improve REC performance.