ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

arXiv — cs.LG · Wednesday, November 12, 2025 at 5:00:00 AM
ResearchRubrics is a new benchmark for evaluating Deep Research (DR) agents, which are increasingly relied upon to answer open-ended queries with large language models. Built with over 2,800 hours of human labor, it pairs realistic prompts with more than 2,500 expert-written rubrics that assess critical aspects such as factual grounding and reasoning soundness. Initial evaluations show that even top-performing systems like Gemini's DR and OpenAI's DR fall short, achieving less than 68% compliance with the established rubrics. This underscores the pressing need for robust, scalable assessment frameworks as AI-driven research capabilities continue to evolve.
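The core idea of rubric-based evaluation, scoring a report by the fraction of expert-written criteria it satisfies, can be sketched as follows. The data structures and keyword checks here are hypothetical illustrations; ResearchRubrics' actual rubric schema and judging procedure may differ.

```python
# Minimal sketch of rubric-based compliance scoring. Each criterion
# carries a description, a weight, and a `check` predicate; real
# rubric judging would use human or LLM graders, not keyword tests.

def compliance(report: str, rubric: list) -> float:
    """Return the weighted fraction of rubric criteria the report meets."""
    total = sum(c["weight"] for c in rubric)
    met = sum(c["weight"] for c in rubric if c["check"](report))
    return met / total if total else 0.0

# Toy rubric with two illustrative criteria.
rubric = [
    {"desc": "cites a primary source", "weight": 2,
     "check": lambda r: "arXiv" in r},
    {"desc": "states a quantitative result", "weight": 1,
     "check": lambda r: any(ch.isdigit() for ch in r)},
]

score = compliance("Per the arXiv preprint, agents scored 68%.", rubric)
```

A report meeting both criteria scores 1.0; an empty report scores 0.0. Weighting lets a benchmark emphasize criteria like factual grounding over stylistic ones.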
— via World Pulse Now AI Editorial System


Recommended Readings
I Let an LLM Write JavaScript Inside My AI Runtime. Here’s What Happened
Positive · Artificial Intelligence
The article describes an experiment in which an AI model was allowed to write JavaScript code inside a self-hosted runtime called Contenox. The author tests the idea that models should generate code to use tools, rather than invoking tools through direct calls, by executing the generated JavaScript within the Contenox environment with the aim of making AI workflows more efficient.
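The "generate code instead of direct tool calls" pattern can be illustrated generically. Contenox itself executes JavaScript; the Python analogue below is only a sketch of the idea, where the runtime executes model-emitted code against a whitelisted set of tool functions. The `result` variable convention and the `word_count` tool are assumptions for illustration, not Contenox's actual API.

```python
# Sketch of a code-as-tool-use runtime: model-generated source is
# executed in a namespace that exposes only approved tool functions.

def run_generated_code(source: str, tools: dict):
    """Execute model-generated code with access to `tools` only."""
    namespace = {"__builtins__": {}, **tools}  # no open(), no import
    exec(source, namespace)
    # Assumed convention: the generated code stores its answer in
    # a variable named `result`.
    return namespace.get("result")

# A toy "tool" the model may call, and code the model might emit.
tools = {"word_count": lambda text: len(text.split())}
generated = "result = word_count('deep research agents')"

result = run_generated_code(generated, tools)
```

The appeal of this pattern is composability: one generated program can chain several tool calls with ordinary control flow, where a direct-call protocol would need a round trip to the model per call. Note that `exec` with an emptied `__builtins__` is not a real security sandbox; a production runtime would isolate execution properly.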
Sector HQ Weekly Digest - November 17, 2025
Neutral · Artificial Intelligence
The Sector HQ Weekly Digest for November 17, 2025, highlights the latest developments in the AI industry, focusing on the performance of top companies. OpenAI leads with a score of 442385.7 and 343 events, followed by Anthropic and Amazon. The report also notes significant movements, with Sony jumping 277 positions in the rankings, reflecting the dynamic nature of the AI sector.
Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate
Positive · Artificial Intelligence
A recent study published on arXiv investigates whether advanced text-to-speech systems can learn social nuances, specifically the human tendency to slow speech for politeness. Researchers tested 22 synthetic voices from AI Studio and OpenAI under polite and casual conditions, finding that the polite prompts resulted in significantly slower speech across both platforms. This suggests that AI can internalize and replicate subtle psychological cues in human communication.
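The study's core comparison, speech rate under polite versus casual prompts, amounts to contrasting mean rates across the two conditions. The sketch below uses made-up numbers purely to show the shape of the analysis; the study's actual measurements and statistics are not reproduced here.

```python
# Illustrative comparison of speech rates (syllables/sec) under two
# prompt conditions. The values are fabricated for demonstration only.

from statistics import mean

polite_rates = [3.8, 4.0, 3.7, 3.9]  # polite-prompt condition
casual_rates = [4.4, 4.6, 4.3, 4.5]  # casual-prompt condition

# A positive slowdown means polite speech was slower, which is the
# direction of effect the study reports.
slowdown = mean(casual_rates) - mean(polite_rates)
```

A real analysis would also test whether the difference is statistically significant (e.g., with a t-test over the 22 voices) rather than comparing means alone.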