arXiv:2512.01457v1 Announce Type: new 
Abstract: Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length -- no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility, which is the linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.

A new method called ZIP-RC has been introduced to enhance the inference capabilities of large language models (LLMs) by enabling real-time prediction of reward and cost during generation. This approach addresses the limitations of existing test-time scaling methods, which often lead to increased costs and latency without providing adaptive inference capabilities.
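The sampling utility described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes samples are independent, represents each sample's predicted joint distribution as a hypothetical dictionary mapping (final reward, remaining tokens) to probability, treats latency as the longest remaining length in the set, and uses hypothetical weights `alpha` and `beta` with negative signs on the cost terms.

```python
def marginals(joint):
    """Split a joint {(reward, remaining_len): prob} into two marginals."""
    reward, length = {}, {}
    for (r, l), p in joint.items():
        reward[r] = reward.get(r, 0.0) + p
        length[l] = length.get(l, 0.0) + p
    return reward, length


def expected_max(dists):
    """E[max] of independent discrete variables, each given as {value: prob}."""
    values = sorted({v for d in dists for v in d})
    total, prev_cdf = 0.0, 0.0
    for v in values:
        # CDF of the maximum is the product of the individual CDFs.
        cdf = 1.0
        for d in dists:
            cdf *= sum(p for x, p in d.items() if x <= v)
        total += v * (cdf - prev_cdf)
        prev_cdf = cdf
    return total


def sampling_utility(joints, alpha, beta):
    """Linear combination of expected max reward, total compute, and latency.

    alpha and beta are assumed trade-off weights (not taken from the paper).
    """
    rewards, lengths = zip(*(marginals(j) for j in joints))
    exp_reward = expected_max(rewards)
    # Total compute: sum of the expected remaining lengths of all samples.
    exp_compute = sum(sum(l * p for l, p in d.items()) for d in lengths)
    # Latency: expected remaining length of the slowest sample (assumption).
    exp_latency = expected_max(lengths)
    return exp_reward - alpha * exp_compute - beta * exp_latency


# Hypothetical prediction: 50% chance of success in 10 more tokens,
# 50% chance of failure after 20 more tokens.
joint = {(1.0, 10): 0.5, (0.0, 20): 0.5}
one_sample = sampling_utility([joint], alpha=0.01, beta=0.01)
two_samples = sampling_utility([joint, joint], alpha=0.01, beta=0.01)
```

Under these assumed weights, a meta-action that keeps two samples scores higher than one that keeps a single sample; raising `alpha` or `beta` would shift the preference toward fewer samples.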

ZIP-RC: Zero-overhead Inference-time Prediction of Reward and Cost for Adaptive and Interpretable Generation

arXiv:2512.02719v1 Announce Type: cross 
Abstract: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.

A recent study has introduced a behavioral benchmark called BayesBench to evaluate the performance of large language models (LLMs) in multimodal integration tasks, inspired by psychophysics research. The study assesses nine LLMs, including GPT-5 Mini, through magnitude estimation tasks involving text and images, revealing insights into their implicit computational strategies and Bayesian behavior.
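For readers unfamiliar with the "optimal multimodal integration" baseline from psychophysics, the textbook rule for fusing two independent Gaussian cues under a flat prior is inverse-variance weighting. The sketch below shows that standard rule only; it is not the benchmark's code and does not reproduce the Bayesian Consistency Score, and the function and parameter names are hypothetical.

```python
def combine_cues(mu_a, sigma_a, mu_b, sigma_b):
    """Bayes-optimal fusion of two independent Gaussian cues (flat prior).

    Each cue is weighted by its reliability (inverse variance), so the
    noisier cue contributes less to the combined estimate.
    """
    var_a, var_b = sigma_a ** 2, sigma_b ** 2
    w_a = var_b / (var_a + var_b)          # weight on cue A
    mu = w_a * mu_a + (1.0 - w_a) * mu_b   # combined estimate
    sigma = (var_a * var_b / (var_a + var_b)) ** 0.5  # combined uncertainty
    return mu, sigma


# Hypothetical example: a precise text cue (sigma 1) and a noisy image cue (sigma 3).
mu, sigma = combine_cues(10.0, 1.0, 14.0, 3.0)
```

The combined uncertainty is never larger than that of the more reliable cue, which is the property an efficient cue-combining observer is expected to show.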
