Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding

arXiv — cs.CV · Wednesday, November 19, 2025 at 5:00:00 AM
  • The introduction of Curriculum-based Relative Policy Optimization (CuRPO), a "start small, think big" training strategy for Visual Grounding.
  • The development of CuRPO is crucial as it not only improves performance in Visual Grounding but also provides a framework that can be adapted for various NLP and computer vision tasks, potentially leading to broader applications in AI.
  • This advancement reflects ongoing efforts in the AI community to refine reasoning processes and enhance model performance, particularly in complex tasks. The exploration of Chain-of-Thought (CoT) reasoning and its implications for generative models continues to be a focal point, as researchers strive to overcome limitations in current methodologies.
— via World Pulse Now AI Editorial System
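The summary above names the curriculum idea but not its mechanics. A generic curriculum schedule in the spirit of the title's "start small, think big" can be sketched as follows; the paper's actual difficulty measure is not given here, so `difficulty` is a hypothetical stand-in (e.g., referring-expression length for a grounding sample).

```python
# A generic curriculum schedule, sketched from the "start small, think big"
# idea. The difficulty scoring function is a hypothetical stand-in, not the
# paper's actual measure.

def curriculum_batches(samples, difficulty, num_stages=3):
    """Yield training pools that grow from the easiest samples to the full set."""
    ordered = sorted(samples, key=difficulty)  # start small: easiest first
    stage_size = max(1, len(ordered) // num_stages)
    for stage in range(1, num_stages + 1):
        # the final stage always covers the whole dataset (think big)
        end = len(ordered) if stage == num_stages else stage * stage_size
        yield ordered[:end]

# usage with a toy integer difficulty score
pools = list(curriculum_batches([5, 1, 4, 2, 3], difficulty=lambda x: x))
# each pool is a superset of the previous one, ending with all samples
```

Each stage's pool contains every earlier stage's samples, so the policy is never trained only on hard examples it has not been prepared for.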


Recommended Readings
LENS: Learning to Segment Anything with Unified Reinforced Reasoning
Positive · Artificial Intelligence
LENS is a new reinforcement-learning framework designed for text-prompted image segmentation, enhancing visual understanding crucial for applications in human-computer interaction and robotics. Unlike traditional supervised methods, LENS incorporates explicit chain-of-thought reasoning during testing, improving generalization to unseen prompts. By utilizing a 3-billion-parameter vision-language model, LENS achieves an average cIoU of 81.2% on benchmark datasets, surpassing existing fine-tuning methods.
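The cIoU figure above is a cumulative metric. As commonly defined for referring segmentation, it is the sum of per-sample intersections divided by the sum of per-sample unions; whether LENS's 81.2% uses exactly this protocol is an assumption, since the summary only reports the score.

```python
# cIoU (cumulative IoU) as commonly defined for referring segmentation:
# total intersection over total union across the dataset. The exact
# evaluation protocol used by LENS is an assumption here.

def cumulative_iou(pred_masks, gt_masks):
    inter = union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        # masks are modeled here as sets of foreground pixel coordinates;
        # with boolean arrays you would use logical AND/OR and sums instead
        inter += len(pred & gt)
        union += len(pred | gt)
    return inter / union if union else 0.0

# usage: two toy predictions against their ground-truth masks
score = cumulative_iou([{1, 2, 3}, {4, 5}], [{2, 3, 4}, {4, 5}])
```

Because intersections and unions are pooled before dividing, cIoU weights large objects more heavily than a mean of per-image IoUs would.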
Parameter Aware Mamba Model for Multi-task Dense Prediction
Positive · Artificial Intelligence
The Parameter Aware Mamba Model (PAMM) is introduced as a novel decoder-based framework aimed at enhancing multi-task dense prediction. Unlike existing methods that primarily rely on convolutional layers and attention mechanisms, PAMM utilizes state space models to improve task interconnectivity. It features dual state space parameter experts that establish task-specific parameter priors, effectively capturing the unique characteristics of each task. This innovative approach facilitates accurate multi-task interactions and integrates task priors through the structured state space sequence mode…
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Neutral · Artificial Intelligence
The article introduces Gen-ViRe, a new benchmark for generative visual reasoning that addresses the limitations of current models in simulating real-world dynamics. While Chain-of-Thought (CoT) prompting has advanced symbolic reasoning in large language models (LLMs), it is limited to discrete text. Gen-ViRe aims to evaluate Chain-of-Frames (CoF) reasoning, which translates thought into visual sequences, thereby assessing cognitive abilities in multi-step planning and abstract reasoning. This benchmark seeks to fill a gap in understanding model capabilities.
MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
Neutral · Artificial Intelligence
MoHoBench is a newly developed benchmark aimed at assessing the honesty of Multimodal Large Language Models (MLLMs) when confronted with unanswerable visual questions. Despite advancements in vision-language tasks, MLLMs often produce unreliable content. This study systematically evaluates the honesty of 28 popular MLLMs using a dataset of over 12,000 visual questions, revealing that many models struggle to provide honest responses. The findings highlight the need for improved trustworthiness in AI systems.
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
Positive · Artificial Intelligence
Supervised Fine-Tuning (SFT) is essential for adapting Large Language Models (LLMs) to specialized fields like medical reasoning. Current SFT methods often utilize unfiltered datasets, which can be redundant and of low quality, leading to high computational costs and poor performance. This study introduces a new data selection strategy called Difficulty-Influence Quadrant (DIQ), which aims to optimize sample selection based on both difficulty and optimization utility, enhancing the efficiency of medical reasoning applications.
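Quadrant-based selection as described can be sketched minimally: each sample gets a difficulty score and an influence (optimization-utility) score, and the two medians split the pool into four quadrants. Which quadrant DIQ actually retains is not stated in the summary; keeping the high-difficulty, high-influence quadrant is an assumption for illustration.

```python
# A sketch of quadrant-based data selection. The scoring functions and the
# choice of quadrant to keep are assumptions, not DIQ's published recipe.
from statistics import median

def diq_select(samples, difficulty, influence):
    d_med = median(difficulty(s) for s in samples)
    i_med = median(influence(s) for s in samples)
    # keep samples harder and more influential than the median on both axes
    return [s for s in samples
            if difficulty(s) > d_med and influence(s) > i_med]

# usage: samples as (difficulty, influence) pairs
kept = diq_select([(1, 1), (2, 2), (3, 3), (1, 3), (3, 1)],
                  difficulty=lambda s: s[0], influence=lambda s: s[1])
```

Median thresholds make the split scale-free, so the two scores need not share units or ranges.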
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Positive · Artificial Intelligence
MVI-Bench is introduced as a comprehensive benchmark aimed at evaluating the robustness of Large Vision-Language Models (LVLMs) against misleading visual inputs. Traditional benchmarks have primarily focused on textual inputs, neglecting the significant impact of visual misrepresentation. MVI-Bench categorizes misleading visual inputs into three hierarchical levels: Visual Concept, Visual Attribute, and Visual Relationship, and includes 1,248 annotated Visual Question Answering (VQA) instances to facilitate detailed robustness assessments.
2D Gaussians Spatial Transport for Point-supervised Density Regression
Positive · Artificial Intelligence
The paper presents Gaussian Spatial Transport (GST), a new framework that utilizes Gaussian splatting to transfer probability measures from image coordinates to annotation maps. It introduces a method for estimating pixel-annotation correspondence, which is used to create a transport plan based on Bayesian probability. A loss function is derived to integrate this transport plan into standard network optimization for computer vision tasks. Experiments in crowd counting and landmark detection demonstrate the approach's effectiveness, improving efficiency by eliminating iterative transport plan c…
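One plausible reading of "pixel-annotation correspondence based on Bayesian probability" is a responsibility computation: each point annotation is splatted as an isotropic 2D Gaussian, and a pixel's transport weight toward annotation k is the posterior p(k | pixel) under a uniform prior. GST's actual estimator may differ; this is a hedged sketch only.

```python
# Hedged sketch: pixel-to-annotation correspondence as Gaussian
# responsibilities (posterior over annotations, uniform prior). The
# isotropic Gaussian and the prior are assumptions, not GST's method.
import math

def responsibilities(pixel, annotations, sigma=1.0):
    """Posterior over point annotations for one pixel coordinate."""
    px, py = pixel
    dens = [math.exp(-((px - ax) ** 2 + (py - ay) ** 2) / (2 * sigma ** 2))
            for ax, ay in annotations]
    total = sum(dens)
    return [d / total for d in dens]  # one row of a soft transport plan

# usage: a pixel sitting on one of two annotations
weights = responsibilities((0.0, 0.0), [(0.0, 0.0), (10.0, 10.0)])
```

Stacking these rows over all pixels yields a soft transport plan without any iterative optimal-transport solve, matching the efficiency claim in the summary.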
Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap
Neutral · Artificial Intelligence
The article discusses the limitations of current evaluation frameworks for Large Language Models (LLMs), which often focus on technical metrics rather than real-world utility. It introduces a new anthropomorphic evaluation paradigm that includes a three-dimensional taxonomy: Intelligence Quotient (IQ), Emotional Quotient (EQ), and Professional Quotient (PQ). Additionally, it proposes a Value-oriented Evaluation (VQ) framework that assesses economic viability, social impact, ethical alignment, and environmental sustainability, aiming to enhance the deployment of LLMs.