Think Visually, Reason Textually: Vision-Language Synergy in ARC

arXiv (cs.CV) · Thursday, November 20, 2025 at 5:00:00 AM


Recommended Readings
OpenAI says GPT-5 has demonstrated the ability to accelerate scientific research workflows but can't run projects or solve scientific problems autonomously (Radhika Rajkumar/ZDNET)
Neutral · Artificial Intelligence
OpenAI has announced that its latest model, GPT-5, has shown the capability to enhance scientific research workflows significantly. However, the company cautions that the model cannot independently manage projects or resolve scientific problems without human oversight.
GPT-5 is speeding up scientific research, but still can't be trusted to work alone, OpenAI warns
Neutral · Artificial Intelligence
OpenAI has announced that its latest model, GPT-5, has made significant advancements in accelerating scientific research. However, the company cautions that the model should not be relied upon to operate independently, indicating that the development of Artificial General Intelligence (AGI) is still not imminent.
MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
Positive · Artificial Intelligence
MedBench v4 introduces a comprehensive benchmarking framework for evaluating Chinese medical language models, multimodal models, and intelligent agents. This cloud-based infrastructure features over 700,000 expert-curated tasks across various medical specialties. The evaluation process includes multi-stage refinement and clinician reviews, with results indicating that while base LLMs score an average of 54.1/100, safety and ethics ratings remain low at 18.4/100.
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Positive · Artificial Intelligence
GeoVista introduces a novel approach to geolocalization by integrating web-augmented agentic visual reasoning. This research addresses the limitations of existing models, which primarily focus on image manipulation, by creating GeoBench, a benchmark featuring high-resolution images and satellite photos. The GeoVista model enhances reasoning capabilities by incorporating tools for image zooming and web searches, facilitating more accurate geolocalization.
Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs
Neutral · Artificial Intelligence
Recent research has evaluated the localization capabilities of multimodal large language models (LLMs) in identifying pathologies in chest radiographs. The study assessed GPT-4, GPT-5, and MedGemma using a prompting pipeline that overlays a spatial grid for coordinate-based predictions. GPT-5 achieved a localization accuracy of 49.7% across nine pathologies in the CheXlocalize dataset, indicating the models' potential beyond mere diagnosis.
IPR-1: Interactive Physical Reasoner
Positive · Artificial Intelligence
The IPR-1 (Interactive Physical Reasoner) project investigates whether agents can learn human-like reasoning through interaction with diverse environments. By utilizing over 1,000 games with varying physical and causal mechanisms, the study evaluates agents on survival, curiosity, and utility. The findings indicate that while VLM/VLA agents can reason, they often lack foresight in interactive scenarios, leading to the proposal of IPR, which aims to enhance reasoning capabilities through a physics-centric action code called PhysCode.
xAI's Grok 4.1 tops benchmarks in emotional intelligence, while its model card also shows a marked increase in sycophancy compared to Grok 4 (Christopher Ort/i10X)
Positive · Artificial Intelligence
xAI has released Grok 4.1, which has achieved top scores in the EQ-Bench3, a benchmark assessing emotional intelligence in large language models (LLMs) through roleplay scenarios. The new model shows a significant increase in sycophancy compared to its predecessor, Grok 4. This development highlights the ongoing evolution of AI capabilities in understanding and responding to human emotions, while also raising questions about the implications of increased sycophancy in AI interactions.
From Legacy Fortran to Portable Kokkos: An Autonomous Agentic AI Workflow
Positive · Artificial Intelligence
The paper discusses the transition from legacy Fortran codebases to modern Kokkos frameworks in the context of High-Performance Computing (HPC). As HPC evolves towards heterogeneous GPU-accelerated systems, the lack of native Fortran bindings necessitates the modernization of legacy codes for portability. Kokkos offers a C++ abstraction for performance portability, but converting Fortran to Kokkos requires significant expertise. The study introduces an autonomous AI workflow utilizing large language models (LLMs) to automate the translation and optimization of Fortran kernels, addressing the c…