arXiv:2511.07908v1 Announce Type: new 
Abstract: We introduce CellARC, a synthetic benchmark for abstraction and reasoning built from multicolor 1D cellular automata (CA). Each episode has five support pairs and one query serialized in 256 tokens, enabling rapid iteration with small models while exposing a controllable task space with explicit knobs for alphabet size k, radius r, rule family, Langton's lambda, query coverage, and cell entropy. We release 95k training episodes plus two 1k test splits (interpolation/extrapolation) and evaluate symbolic, recurrent, convolutional, transformer, recursive, and LLM baselines. CellARC decouples generalization from anthropomorphic priors, supports unlimited difficulty-controlled sampling, and enables reproducible studies of how quickly models infer new rules under tight budgets. Our strongest small-model baseline (a 10M-parameter vanilla transformer) outperforms recent recursive models (TRM, HRM), reaching 58.0%/32.4% per-token accuracy on the interpolation/extrapolation splits, while a large closed model (GPT-5 High) attains 62.3%/48.1% on subsets of 100 test tasks. An ensemble that chooses per episode between the Transformer and the best symbolic baseline reaches 65.4%/35.5%, highlighting neuro-symbolic complementarity. Leaderboard: https://cellarc.mireklzicar.com
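
For readers unfamiliar with the knobs named in the abstract, the following is a minimal Python sketch of a 1D cellular automaton with alphabet size k and radius r, together with Langton's lambda (the fraction of rule-table entries that map to a non-quiescent state). It is not CellARC's generator: the function names, the periodic boundary, the Wolfram-style rule numbering, and the choice of color 0 as the quiescent state are all assumptions made for illustration.

```python
import random


def rule_table_from_number(number, k, r):
    """Decode an integer into a rule table with k**(2r+1) entries.

    Entry i is the i-th base-k digit of `number` (Wolfram-style numbering,
    assumed here for illustration).
    """
    size = k ** (2 * r + 1)
    return [(number // k ** i) % k for i in range(size)]


def random_rule(k, r, lam, rng):
    """Sample a rule table whose entries are non-quiescent with probability lam.

    Color 0 is treated as the quiescent state (an assumption of this sketch).
    """
    size = k ** (2 * r + 1)
    return [rng.randrange(1, k) if rng.random() < lam else 0 for _ in range(size)]


def langton_lambda(table, quiescent=0):
    """Fraction of rule-table entries that do not map to the quiescent state."""
    return sum(1 for v in table if v != quiescent) / len(table)


def step(cells, table, k, r):
    """Apply the rule once to every cell, using periodic boundaries."""
    n = len(cells)
    out = []
    for i in range(n):
        idx = 0
        # Read the neighborhood left to right as a base-k number.
        for d in range(-r, r + 1):
            idx = idx * k + cells[(i + d) % n]
        out.append(table[idx])
    return out


if __name__ == "__main__":
    # Elementary rule 110 is the k=2, r=1 special case.
    rule110 = rule_table_from_number(110, k=2, r=1)
    print(langton_lambda(rule110))
    print(step([0, 0, 0, 1, 0], rule110, k=2, r=1))

    # A random 3-color, radius-1 rule with a target lambda of 0.5.
    rule = random_rule(k=3, r=1, lam=0.5, rng=random.Random(0))
    print(len(rule), langton_lambda(rule))
```

Raising k or r grows the rule table as k^(2r+1), which is one way a task space of this kind can be made harder in a controlled fashion; how CellARC itself samples rules and builds support/query pairs is described in the paper, not here.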

CellARC is a new synthetic benchmark for abstraction and reasoning built from multicolor 1D cellular automata. The release includes 95,000 training episodes and two 1,000-episode test splits (interpolation and extrapolation), and it supports difficulty-controlled sampling and reproducible studies of how quickly models infer new rules. A 10M-parameter vanilla transformer outperforms recent recursive models (TRM, HRM), reaching 58.0%/32.4% per-token accuracy on the interpolation/extrapolation splits, while a large closed model, GPT-5 High, attains 62.3%/48.1% on subsets of 100 test tasks.

CellARC: Measuring Intelligence with Cellular Automata
