SPHINX: A Synthetic Environment for Visual Perception and Reasoning

arXiv — cs.LG · Thursday, November 27, 2025, 5:00 AM
  • SPHINX has been introduced as a synthetic environment for visual perception and reasoning that generates puzzles assessing cognitive skills across 25 task types. Because the puzzles are generated programmatically, their answers can be verified exactly, enabling both precise evaluation and the construction of large-scale datasets for tasks such as symmetry detection and spatial reasoning.
  • SPHINX is significant because it exposes the limitations of current large vision-language models: even GPT-5 achieved only 51.1% accuracy on these tasks, indicating a substantial gap between model performance and human capabilities.
  • This advance in synthetic environments underscores ongoing challenges in AI reasoning and perception, as researchers explore methods such as reinforcement learning with verifiable rewards to improve model accuracy. The integration of visual and textual reasoning remains a critical focus, reflecting a broader trend in AI development toward stronger multimodal reasoning.
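The pairing of programmatically generated puzzles with verifiable rewards can be illustrated with a minimal sketch. The grid-based symmetry task, function names, and reward scheme below are illustrative assumptions, not SPHINX's actual implementation:

```python
import random

def make_symmetry_puzzle(n=4, rng=None):
    """Generate an n x n binary grid that is mirror-symmetric about its
    vertical axis roughly half the time. The label is recomputed from the
    grid itself, so ground truth is exactly verifiable."""
    rng = rng or random.Random()
    symmetric = rng.random() < 0.5
    grid = []
    for _ in range(n):
        left = [rng.randint(0, 1) for _ in range(n // 2)]
        right = left[::-1] if symmetric else [rng.randint(0, 1) for _ in range(n // 2)]
        grid.append(left + right)
    label = all(row == row[::-1] for row in grid)  # recompute; don't trust the flag
    return grid, label

def verifiable_reward(grid, model_answer: bool) -> float:
    """Binary reward of the kind used in RL with verifiable rewards:
    1.0 iff the model's answer matches the recomputed ground truth."""
    truth = all(row == row[::-1] for row in grid)
    return 1.0 if model_answer == truth else 0.0
```

Because the checker derives the answer from the puzzle instance rather than from stored annotations, datasets of arbitrary size can be generated and scored without human labeling.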
— via World Pulse Now AI Editorial System


Continue Reading
Study: using the SCONE-bench benchmark of 405 smart contracts, Claude Opus 4.5, Sonnet 4.5, and GPT-5 found and developed exploits collectively worth $4.6M (Anthropic)
Neutral · Artificial Intelligence
A recent study utilizing the SCONE-bench benchmark of 405 smart contracts revealed that AI models Claude Opus 4.5, Sonnet 4.5, and GPT-5 collectively identified and developed exploits valued at $4.6 million. This highlights the growing capabilities of AI in cybersecurity tasks, showcasing their potential economic impact.
SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
Positive · Artificial Intelligence
SUPERChem has been introduced as a new benchmark aimed at evaluating the chemical reasoning capabilities of Large Language Models (LLMs) through 500 expert-curated, reasoning-intensive chemistry problems. This benchmark addresses limitations in current evaluations, such as oversimplified tasks and a lack of process-level assessment, by providing multimodal and text-only formats along with expert-authored solution paths.
PARROT: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
Neutral · Artificial Intelligence
The study introduces PARROT, a framework for measuring how much large language models' accuracy degrades under social pressure, with a particular focus on sycophancy. It evaluates 22 models using a double-blind method that compares responses to neutral prompts against responses to prompts containing authoritatively asserted false answers, across a range of domains.
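In spirit, the measurement reduces to comparing accuracy on neutral prompts against accuracy on pressured ones. A hypothetical scoring sketch (the data structure and function names are assumptions, not PARROT's actual API):

```python
from dataclasses import dataclass

@dataclass
class Trial:
    correct: str    # ground-truth answer
    neutral: str    # model's answer to the neutral prompt
    pressured: str  # model's answer when an authority asserts a false answer

def accuracy(trials, field):
    """Fraction of trials where the answer in `field` matches ground truth."""
    return sum(getattr(t, field) == t.correct for t in trials) / len(trials)

def sycophancy_drop(trials):
    """Accuracy degradation under authoritative false pressure:
    a larger value means the model caves more readily."""
    return accuracy(trials, "neutral") - accuracy(trials, "pressured")
```

For example, a model that answers 3 of 4 questions correctly under neutral prompting but only 1 of 4 under pressure would show a drop of 0.5.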
Alibaba Technical Report: Qwen3-VL beats GPT-5 and Gemini 2.5 Pro on visual tasks and has 100% accuracy on "needle-in-a-haystack" tests for 30-minute videos (Jonathan Kemper/The Decoder)
Positive · Artificial Intelligence
Alibaba has released a technical report on its Qwen3-VL model, which outperforms GPT-5 and Gemini 2.5 Pro on visual tasks and achieves 100% accuracy on "needle-in-a-haystack" tests over 30-minute videos. The result highlights the model's capabilities in analyzing multimodal data, including long video as well as images.