Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

arXiv — cs.CL · Wednesday, November 5, 2025 at 5:00:00 AM
The article "Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities" addresses the difficulty of assessing models' reasoning abilities as context lengths grow. It notes that many current long-context evaluations focus predominantly on retrieval tasks, which can leave large portions of the context unused and therefore say little about whether models actually reason over everything they are given. The challenge is to design evaluations that measure a model's capacity to aggregate and reason over a long context as a whole, rather than merely locate relevant segments. The article treats models' effectiveness at leveraging full contextual information as an open question, in line with recent research calling for more comprehensive evaluation frameworks for long-context reasoning.
— via World Pulse Now AI Editorial System
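
To make the retrieval-versus-aggregation distinction concrete, the sketch below contrasts a retrieval-style probe, answerable from a single record, with an aggregation-style probe that can only be answered by reading every record in the context. The synthetic record format, probe names, and scoring are illustrative assumptions made for this summary, not the tasks or data used in the Oolong benchmark.

```python
import random

# Illustrative sketch only: the record format and probe types below are
# assumptions for this summary, not the Oolong benchmark's actual tasks.

def build_context(num_records: int = 2000, seed: int = 0):
    """Build a long synthetic context of labeled records plus ground truth."""
    rng = random.Random(seed)
    labels = [rng.choice(["positive", "negative", "neutral"]) for _ in range(num_records)]
    context = "\n".join(f"Record {i}: sentiment={label}." for i, label in enumerate(labels))
    truth = {
        # Retrieval-style probe: the answer depends on a single record.
        "retrieval": labels[num_records // 2],
        # Aggregation-style probe: the answer depends on every record.
        "aggregation": sum(label == "positive" for label in labels),
    }
    return context, truth

def exact_match(model_answers: dict, truth: dict) -> dict:
    """Score each probe by exact match against the ground truth."""
    return {name: model_answers.get(name) == answer for name, answer in truth.items()}

if __name__ == "__main__":
    context, truth = build_context()
    print(f"Context length: {len(context):,} characters")
    print("Retrieval probe: What is the sentiment of Record 1000?")
    print("Aggregation probe: How many records are labeled positive?")
    print("Ground truth:", truth)
```

The retrieval probe can be answered from one line of the context, so a model could score well while ignoring almost all of its input; the aggregation probe is only answerable if the full context is actually used, which is the gap in current evaluations that the article highlights.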
