TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

arXiv (cs.CL) · Wednesday, November 26, 2025 at 5:00:00 AM
  • A new benchmark, TurnBench, evaluates multi-turn, multi-step reasoning in large language models (LLMs). It is built around an interactive code-breaking task in which a model must uncover hidden rules by making sequential guesses and integrating the feedback it receives over multiple rounds. The benchmark has two modes, Classic and Nightmare, which test different levels of reasoning complexity.
  • TurnBench addresses a limitation of existing benchmarks, which primarily focus on single-turn tasks. By evaluating iterative reasoning, it measures whether LLMs can adapt to new feedback and stay consistent across turns, abilities that are crucial for complex problem-solving in real-world applications.
  • TurnBench reflects a growing recognition that LLMs need more sophisticated evaluation methods. It joins ongoing work on the reasoning abilities of these models, including benchmarks such as the Premise Critique Bench and JudgeBoard, which also seek to improve the assessment of reasoning tasks and the overall reliability of AI outputs.
— via World Pulse Now AI Editorial System
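
The guess-and-feedback loop described in the summary can be illustrated with a small sketch. The summary does not spell out TurnBench's actual rules, so the code below assumes a Mastermind-style game as a stand-in: the alphabet `DIGITS`, the length `CODE_LENGTH`, and the functions `feedback` and `solve` are all hypothetical names chosen for illustration, not part of the benchmark. The solver shows what "integrating feedback over multiple rounds" means in practice: after each guess it keeps only the candidate codes that are consistent with every piece of feedback seen so far.

```python
from itertools import product

DIGITS = "123"     # hypothetical alphabet, not taken from TurnBench
CODE_LENGTH = 3    # hypothetical code length, not taken from TurnBench


def feedback(secret, guess):
    """Return (exact, partial) in the Mastermind sense.

    exact   = positions where guess and secret agree
    partial = digits shared by both codes but in different positions
    """
    exact = sum(s == g for s, g in zip(secret, guess))
    common = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return exact, common - exact


def solve(secret, max_turns=10):
    """Play up to max_turns rounds and return the (guess, feedback) history."""
    candidates = ["".join(p) for p in product(DIGITS, repeat=CODE_LENGTH)]
    history = []
    for _ in range(max_turns):
        guess = candidates[0]              # any consistent candidate will do
        result = feedback(secret, guess)
        history.append((guess, result))
        if result == (CODE_LENGTH, 0):     # every position correct
            return history
        # Multi-turn step: discard candidates that would not have
        # produced this feedback had they been the secret.
        candidates = [c for c in candidates if feedback(c, guess) == result]
    return history


if __name__ == "__main__":
    for turn, (guess, result) in enumerate(solve("231"), start=1):
        print(turn, guess, result)
```

Calling `solve("231")` returns the list of guesses with their feedback, ending in the correct code. A model that ignores earlier feedback would in effect skip the filtering step, which is the kind of inconsistency across turns that a multi-turn benchmark is meant to expose.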
