SPHINX: A Synthetic Environment for Visual Perception and Reasoning

arXiv — cs.LG · Thursday, November 27, 2025, 5:00 AM
  • SPHINX has been introduced as a synthetic environment for visual perception and reasoning that generates puzzles assessing cognitive skills across 25 task types. Because the puzzles are generated programmatically, their answers can be verified exactly, enabling both precise evaluation and the construction of large-scale datasets for tasks such as symmetry detection and spatial reasoning.
  • SPHINX is significant because it exposes the limitations of current large vision-language models: even GPT-5 achieved only 51.1% accuracy on these tasks, indicating a substantial gap between model performance and human capabilities.
  • This advance in synthetic environments underscores ongoing challenges in AI reasoning and perception, as researchers explore methods such as reinforcement learning with verifiable rewards to improve model accuracy. The integration of visual and textual reasoning remains a critical focus, reflecting a broader trend in AI development toward stronger multimodal reasoning.
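The pairing of programmatically generated puzzles with verifiable rewards can be illustrated with a minimal sketch. The grid-based symmetry task, function names, and reward scheme below are illustrative assumptions, not SPHINX's actual implementation:

```python
import random

def make_symmetry_puzzle(n=4, rng=None):
    """Generate an n x n binary grid that is mirror-symmetric about its
    vertical axis roughly half the time. The label is recomputed from the
    grid itself, so ground truth is exactly verifiable."""
    rng = rng or random.Random()
    symmetric = rng.random() < 0.5
    grid = []
    for _ in range(n):
        left = [rng.randint(0, 1) for _ in range(n // 2)]
        right = left[::-1] if symmetric else [rng.randint(0, 1) for _ in range(n // 2)]
        grid.append(left + right)
    label = all(row == row[::-1] for row in grid)  # recompute; don't trust the flag
    return grid, label

def verifiable_reward(grid, model_answer: bool) -> float:
    """Binary reward of the kind used in RL with verifiable rewards:
    1.0 iff the model's answer matches the recomputed ground truth."""
    truth = all(row == row[::-1] for row in grid)
    return 1.0 if model_answer == truth else 0.0
```

Because the checker derives the answer from the puzzle instance rather than from stored annotations, datasets of arbitrary size can be generated and scored without human labeling.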
— via World Pulse Now AI Editorial System


Continue Reading
Study: using the SCONE-bench benchmark of 405 smart contracts, Claude Opus 4.5, Sonnet 4.5, and GPT-5 found and developed exploits collectively worth $4.6M (Anthropic)
Neutral · Artificial Intelligence
A recent study utilizing the SCONE-bench benchmark of 405 smart contracts revealed that AI models Claude Opus 4.5, Sonnet 4.5, and GPT-5 collectively identified and developed exploits valued at $4.6 million. This highlights the growing capabilities of AI in cybersecurity tasks, showcasing their potential economic impact.
SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
Positive · Artificial Intelligence
SUPERChem has been introduced as a new benchmark aimed at evaluating the chemical reasoning capabilities of Large Language Models (LLMs) through 500 expert-curated, reasoning-intensive chemistry problems. This benchmark addresses limitations in current evaluations, such as oversimplified tasks and a lack of process-level assessment, by providing multimodal and text-only formats along with expert-authored solution paths.
PARROT: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
Neutral · Artificial Intelligence
The study introduces PARROT, a framework for measuring how much large language models' accuracy degrades under social pressure, with a particular focus on sycophancy. It evaluates 22 models using a double-blind method that compares responses to neutral prompts against responses to prompts containing authoritatively asserted false answers, across a range of domains.
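In spirit, the measurement reduces to comparing accuracy on neutral prompts against accuracy on pressured ones. A hypothetical scoring sketch (the data structure and function names are assumptions, not PARROT's actual API):

```python
from dataclasses import dataclass

@dataclass
class Trial:
    correct: str    # ground-truth answer
    neutral: str    # model's answer to the neutral prompt
    pressured: str  # model's answer when an authority asserts a false answer

def accuracy(trials, field):
    """Fraction of trials where the answer in `field` matches ground truth."""
    return sum(getattr(t, field) == t.correct for t in trials) / len(trials)

def sycophancy_drop(trials):
    """Accuracy degradation under authoritative false pressure:
    a larger value means the model caves more readily."""
    return accuracy(trials, "neutral") - accuracy(trials, "pressured")
```

For example, a model that answers 3 of 4 questions correctly under neutral prompting but only 1 of 4 under pressure would show a drop of 0.5.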
Alibaba Technical Report: Qwen3-VL beats GPT-5 and Gemini 2.5 Pro on visual tasks and has 100% accuracy on "needle-in-a-haystack" tests for 30-minute videos (Jonathan Kemper/The Decoder)
Positive · Artificial Intelligence
Alibaba has released a technical report on its Qwen3-VL model, which outperforms GPT-5 and Gemini 2.5 Pro on visual tasks and achieves 100% accuracy on "needle-in-a-haystack" tests over 30-minute videos. The result highlights the model's capabilities in analyzing multimodal data, including long video as well as images.