FIBER: A Multilingual Evaluation Resource for Factual Inference Bias

arXiv — cs.CL · Monday, December 15, 2025 at 5:00:00 AM
  • FIBER is a new multilingual benchmark for evaluating factual knowledge and inference bias in large language models across English, Italian, and Turkish. The dataset covers sentence-completion and question-answering tasks and measures how the prompt language affects entity selection and model performance in single- and multi-entity contexts (a minimal probing sketch follows below).
  • FIBER matters because it addresses growing concerns about the factual reliability and biases of large language models, offering a systematic way to evaluate both in a multilingual setting.
  • The work reflects a broader trend in AI research toward evaluating language models across diverse languages and contexts. Addressing bias and improving factual accuracy are prerequisites for deploying these systems in real-world applications.
— via World Pulse Now AI Editorial System
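
The announcement includes no code, but the probing setup it describes can be approximated. Below is a minimal sketch, assuming a HuggingFace causal LM: it scores candidate entity completions under parallel prompts in English, Italian, and Turkish and reports which entity each prompt language favors. The model name, prompts, and candidate entities are illustrative placeholders, not drawn from the FIBER dataset.

```python
# Hypothetical FIBER-style probe: compare the model's log-likelihood for
# candidate entity completions under parallel prompts in three languages.
# Model, prompts, and candidates are illustrative, not from the dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")
model.eval()

# Parallel sentence-completion prompts (same fact, three prompt languages).
prompts = {
    "en": "The capital of Turkey is",
    "it": "La capitale della Turchia è",
    "tr": "Türkiye'nin başkenti",
}
candidates = ["Ankara", "Istanbul"]  # correct entity vs. distractor

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of token log-probs assigned to `completion` given `prompt`."""
    full = tokenizer(prompt + " " + completion, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    # Shift: the logits at position i predict token i+1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full.input_ids[0, 1:]
    # Score only the completion tokens.
    span = range(prompt_len - 1, targets.shape[0])
    return sum(log_probs[i, targets[i]].item() for i in span)

for lang, prompt in prompts.items():
    scores = {c: completion_logprob(prompt, c) for c in candidates}
    chosen = max(scores, key=scores.get)
    print(f"[{lang}] {prompt!r} -> {chosen} ({scores})")
```

If the chosen entity flips between prompt languages for the same underlying fact, that is exactly the prompt-language effect on entity selection the benchmark is designed to surface.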


Continue Reading
MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition
Positive · Artificial Intelligence
The NeurIPS CURE-Bench Competition has highlighted the capabilities of TxAgent, an AI system designed for therapeutic decision-making in clinical medicine. Utilizing a fine-tuned Llama-3.1-8B model, TxAgent integrates various biomedical resources, including the FDA Drug API and OpenTargets, to enhance drug recommendations and treatment planning through iterative retrieval-augmented generation.
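
As a rough illustration of the iterative retrieval-augmented loop described above (not TxAgent's actual implementation), the sketch below alternates model "actions" with tool lookups until an answer is produced. The tool stubs, action format, and stop criterion are all hypothetical.

```python
# Illustrative only: a generic iterative retrieval-augmented tool loop.
# Tool stubs and the toy policy are hypothetical placeholders.
from typing import Callable

def fda_drug_lookup(query: str) -> str:
    """Hypothetical stub standing in for an FDA drug-label lookup."""
    return f"[FDA label snippet for {query!r}]"

def opentargets_lookup(query: str) -> str:
    """Hypothetical stub standing in for an OpenTargets evidence query."""
    return f"[OpenTargets evidence for {query!r}]"

TOOLS: dict[str, Callable[[str], str]] = {
    "fda": fda_drug_lookup,
    "opentargets": opentargets_lookup,
}

def agent_step(context: str) -> str:
    """Placeholder for one LLM call; a real system would query the model."""
    # Toy policy: retrieve once from each tool, then answer.
    if "[FDA" not in context:
        return "CALL fda: metformin"
    if "[OpenTargets" not in context:
        return "CALL opentargets: type 2 diabetes"
    return "ANSWER: recommend metformin, citing retrieved evidence."

def run_agent(question: str, max_turns: int = 5) -> str:
    context = question
    for _ in range(max_turns):
        action = agent_step(context)
        if action.startswith("ANSWER:"):
            return action
        # Parse "CALL <tool>: <query>" and append the retrieved evidence.
        tool_name, _, query = action.removeprefix("CALL ").partition(": ")
        context += "\n" + TOOLS[tool_name](query)
    return "ANSWER: max turns reached."

print(run_agent("First-line therapy for type 2 diabetes?"))
```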
Beyond Early-Token Bias: Model-Specific and Language-Specific Position Effects in Multilingual LLMs
Neutral · Artificial Intelligence
A recent study of Large Language Models (LLMs) shows that position bias, the tendency to weight information by where it appears in the context, varies significantly across languages and model architectures. The research analyzed five languages (English, Russian, German, Hindi, and Vietnamese) using models such as Qwen2.5-7B-Instruct and Mistral 7B, and found that some models favor late positions, contrary to the commonly assumed early-token preference.
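
One way to reproduce this kind of analysis is to slide a single key fact through a list of distractor sentences and score the model's answer at each slot. The harness below is a hypothetical sketch with made-up sentences; the scoring function is left abstract so any model-based log-prob scorer can be plugged in.

```python
# Hypothetical position-bias harness: insert one key fact at each slot among
# distractors and score the model's answer per position. Sentences are made up.
from typing import Callable

KEY_FACT = "The package was delivered to Hamburg."
DISTRACTORS = [f"Filler sentence number {i}." for i in range(9)]
QUESTION = "Where was the package delivered?"
ANSWER = "Hamburg"

def build_context(slot: int) -> str:
    """Insert the key fact at position `slot` among the distractors."""
    sents = DISTRACTORS[:slot] + [KEY_FACT] + DISTRACTORS[slot:]
    return " ".join(sents)

def position_bias_curve(score: Callable[[str, str, str], float]) -> list[float]:
    """score(context, question, answer) -> higher is better (e.g. log-prob)."""
    return [score(build_context(s), QUESTION, ANSWER)
            for s in range(len(DISTRACTORS) + 1)]

# Example with a dummy scorer; swap in a real model-based log-prob scorer.
dummy = lambda ctx, q, a: float(ctx.index(KEY_FACT))  # placeholder signal
print(position_bias_curve(dummy))
```

A flat curve would indicate no position effect; the study's finding of late-position preference in some models would show up here as scores rising toward later slots.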
