arXiv:2511.14531v1 Announce Type: new 
Abstract: With Retrieval Augmented Generation (RAG) becoming increasingly prominent in generative AI solutions, there is an emerging need to evaluate the effectiveness of RAG systems systematically. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers and their associated supporting claims, which were used to evaluate competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived by applying an Item Response Theory model to competitors' responses. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. We hope the LiveRAG benchmark will help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.

The LiveRAG benchmark is a publicly available dataset of 895 synthetic questions and answers for evaluating Retrieval Augmented Generation (RAG) systems. The dataset, derived from the SIGIR'2025 LiveRAG Challenge, includes ground-truth answers and supporting claims that were not accessible to competitors during the challenge. Each question carries difficulty and discriminability scores estimated with an Item Response Theory model applied to competitors' responses, which helps differentiate between the capabilities of RAG-based Q&A systems.
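
To make the difficulty and discriminability scores more concrete, here is a minimal sketch of how such parameters behave in one common Item Response Theory formulation, the two-parameter logistic (2PL) model. The abstract does not say which IRT variant the authors fitted, so the 2PL form, the function name `p_correct`, and all numeric values below are assumptions chosen purely for illustration; they are not taken from the benchmark.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float) -> float:
    """2PL item response function (assumed model, not confirmed by the source).

    Returns the probability that a system with the given ability answers an
    item correctly, given the item's difficulty and discrimination.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical items as (difficulty, discrimination); illustrative values only.
items = {
    "easy, discriminating": (-1.0, 1.5),
    "hard, discriminating": (1.5, 1.5),
    "medium, weakly discriminating": (0.0, 0.2),
}

# Two hypothetical systems: a weaker one and a stronger one.
weak_ability, strong_ability = -1.0, 1.0

for name, (difficulty, discrimination) in items.items():
    weak = p_correct(weak_ability, difficulty, discrimination)
    strong = p_correct(strong_ability, difficulty, discrimination)
    # A larger gap means the item separates the two systems more clearly.
    print(f"{name}: weak={weak:.2f} strong={strong:.2f} gap={strong - weak:.2f}")
```

Under this assumed model, a higher difficulty lowers the success probability for every system, while a higher discrimination widens the gap between weaker and stronger systems. That is the sense in which a question can be useful for differentiating between system capabilities.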

LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation
