ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

arXiv — cs.LG · Wednesday, November 12, 2025
ResearchRubrics is a new benchmark for evaluating Deep Research (DR) agents, the large-language-model systems increasingly used to answer open-ended research queries. Built with over 2,800 hours of human effort, it pairs realistic prompts with more than 2,500 expert-written rubrics that assess qualities such as factual grounding and reasoning soundness. Initial results show that even leading systems, including Gemini's and OpenAI's DR agents, achieve under 68% compliance with these rubrics, underscoring the need for robust, scalable assessment frameworks as AI-driven research capabilities continue to advance.
— via World Pulse Now AI Editorial System
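The reported compliance figures suggest per-criterion scoring against each rubric. The sketch below is illustrative only: it assumes a rubric decomposes into binary criteria and that compliance is the unweighted fraction satisfied, which may differ from the paper's actual grading protocol; all names and example criteria are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g. "Claims are grounded in cited sources"
    satisfied: bool    # judged by an expert or an automated grader

def compliance_rate(criteria: list[RubricCriterion]) -> float:
    """Fraction of rubric criteria a research report satisfies (assumed unweighted)."""
    if not criteria:
        return 0.0
    return sum(c.satisfied for c in criteria) / len(criteria)

# Hypothetical grading of one agent response against its rubric.
rubric = [
    RubricCriterion("Claims are grounded in cited sources", True),
    RubricCriterion("Reasoning steps are logically sound", True),
    RubricCriterion("Covers all sub-questions in the prompt", False),
]
print(f"Compliance: {compliance_rate(rubric):.0%}")  # -> Compliance: 67%
```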
