Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding

arXiv — cs.CV · Wednesday, November 19, 2025 at 5:00:00 AM
  • The introduction of Curriculum-based Relative Policy Optimization (CuRPO), a "start small, think big" training strategy for Visual Grounding.
  • The development of CuRPO is crucial as it not only improves performance in Visual Grounding but also provides a framework that can be adapted for various NLP and computer vision tasks, potentially leading to broader applications in AI.
  • This advancement reflects ongoing efforts in the AI community to refine reasoning processes and enhance model performance, particularly in complex tasks. The exploration of Chain-of-Thought (CoT) reasoning and its implications for generative models continues to be a focal point, as researchers strive to overcome limitations in current methodologies.
— via World Pulse Now AI Editorial System
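The summary above names the curriculum idea but not its mechanics. A generic curriculum schedule in the spirit of the title's "start small, think big" can be sketched as follows; the paper's actual difficulty measure is not given here, so `difficulty` is a hypothetical stand-in (e.g., referring-expression length for a grounding sample).

```python
# A generic curriculum schedule, sketched from the "start small, think big"
# idea. The difficulty scoring function is a hypothetical stand-in, not the
# paper's actual measure.

def curriculum_batches(samples, difficulty, num_stages=3):
    """Yield training pools that grow from the easiest samples to the full set."""
    ordered = sorted(samples, key=difficulty)  # start small: easiest first
    stage_size = max(1, len(ordered) // num_stages)
    for stage in range(1, num_stages + 1):
        # the final stage always covers the whole dataset (think big)
        end = len(ordered) if stage == num_stages else stage * stage_size
        yield ordered[:end]

# usage with a toy integer difficulty score
pools = list(curriculum_batches([5, 1, 4, 2, 3], difficulty=lambda x: x))
# each pool is a superset of the previous one, ending with all samples
```

Each stage's pool contains every earlier stage's samples, so the policy is never trained only on hard examples it has not been prepared for.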


Recommended Readings
LENS: Learning to Segment Anything with Unified Reinforced Reasoning
Positive · Artificial Intelligence
LENS is a new reinforcement-learning framework designed for text-prompted image segmentation, enhancing visual understanding crucial for applications in human-computer interaction and robotics. Unlike traditional supervised methods, LENS incorporates explicit chain-of-thought reasoning during testing, improving generalization to unseen prompts. By utilizing a 3-billion-parameter vision-language model, LENS achieves an average cIoU of 81.2% on benchmark datasets, surpassing existing fine-tuning methods.
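The cIoU figure above is a cumulative metric. As commonly defined for referring segmentation, it is the sum of per-sample intersections divided by the sum of per-sample unions; whether LENS's 81.2% uses exactly this protocol is an assumption, since the summary only reports the score.

```python
# cIoU (cumulative IoU) as commonly defined for referring segmentation:
# total intersection over total union across the dataset. The exact
# evaluation protocol used by LENS is an assumption here.

def cumulative_iou(pred_masks, gt_masks):
    inter = union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        # masks are modeled here as sets of foreground pixel coordinates;
        # with boolean arrays you would use logical AND/OR and sums instead
        inter += len(pred & gt)
        union += len(pred | gt)
    return inter / union if union else 0.0

# usage: two toy predictions against their ground-truth masks
score = cumulative_iou([{1, 2, 3}, {4, 5}], [{2, 3, 4}, {4, 5}])
```

Because intersections and unions are pooled before dividing, cIoU weights large objects more heavily than a mean of per-image IoUs would.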
Parameter Aware Mamba Model for Multi-task Dense Prediction
Positive · Artificial Intelligence
The Parameter Aware Mamba Model (PAMM) is introduced as a novel decoder-based framework aimed at enhancing multi-task dense prediction. Unlike existing methods that primarily rely on convolutional layers and attention mechanisms, PAMM utilizes state space models to improve task interconnectivity. It features dual state space parameter experts that establish task-specific parameter priors, effectively capturing the unique characteristics of each task. This innovative approach facilitates accurate multi-task interactions and integrates task priors through the structured state space sequence mode…
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Neutral · Artificial Intelligence
The article introduces Gen-ViRe, a new benchmark for generative visual reasoning that addresses the limitations of current models in simulating real-world dynamics. While Chain-of-Thought (CoT) prompting has advanced symbolic reasoning in large language models (LLMs), it is limited to discrete text. Gen-ViRe aims to evaluate Chain-of-Frames (CoF) reasoning, which translates thought into visual sequences, thereby assessing cognitive abilities in multi-step planning and abstract reasoning. This benchmark seeks to fill a gap in understanding model capabilities.
MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
Neutral · Artificial Intelligence
MoHoBench is a newly developed benchmark aimed at assessing the honesty of Multimodal Large Language Models (MLLMs) when confronted with unanswerable visual questions. Despite advancements in vision-language tasks, MLLMs often produce unreliable content. This study systematically evaluates the honesty of 28 popular MLLMs using a dataset of over 12,000 visual questions, revealing that many models struggle to provide honest responses. The findings highlight the need for improved trustworthiness in AI systems.
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
Positive · Artificial Intelligence
Supervised Fine-Tuning (SFT) is essential for adapting Large Language Models (LLMs) to specialized fields like medical reasoning. Current SFT methods often utilize unfiltered datasets, which can be redundant and of low quality, leading to high computational costs and poor performance. This study introduces a new data selection strategy called Difficulty-Influence Quadrant (DIQ), which aims to optimize sample selection based on both difficulty and optimization utility, enhancing the efficiency of medical reasoning applications.
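Quadrant-based selection as described can be sketched minimally: each sample gets a difficulty score and an influence (optimization-utility) score, and the two medians split the pool into four quadrants. Which quadrant DIQ actually retains is not stated in the summary; keeping the high-difficulty, high-influence quadrant is an assumption for illustration.

```python
# A sketch of quadrant-based data selection. The scoring functions and the
# choice of quadrant to keep are assumptions, not DIQ's published recipe.
from statistics import median

def diq_select(samples, difficulty, influence):
    d_med = median(difficulty(s) for s in samples)
    i_med = median(influence(s) for s in samples)
    # keep samples harder and more influential than the median on both axes
    return [s for s in samples
            if difficulty(s) > d_med and influence(s) > i_med]

# usage: samples as (difficulty, influence) pairs
kept = diq_select([(1, 1), (2, 2), (3, 3), (1, 3), (3, 1)],
                  difficulty=lambda s: s[0], influence=lambda s: s[1])
```

Median thresholds make the split scale-free, so the two scores need not share units or ranges.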
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Positive · Artificial Intelligence
MVI-Bench is introduced as a comprehensive benchmark aimed at evaluating the robustness of Large Vision-Language Models (LVLMs) against misleading visual inputs. Traditional benchmarks have primarily focused on textual inputs, neglecting the significant impact of visual misrepresentation. MVI-Bench categorizes misleading visual inputs into three hierarchical levels: Visual Concept, Visual Attribute, and Visual Relationship, and includes 1,248 annotated Visual Question Answering (VQA) instances to facilitate detailed robustness assessments.
2D Gaussians Spatial Transport for Point-supervised Density Regression
Positive · Artificial Intelligence
The paper presents Gaussian Spatial Transport (GST), a new framework that utilizes Gaussian splatting to transfer probability measures from image coordinates to annotation maps. It introduces a method for estimating pixel-annotation correspondence, which is used to create a transport plan based on Bayesian probability. A loss function is derived to integrate this transport plan into standard network optimization for computer vision tasks. Experiments in crowd counting and landmark detection demonstrate the approach's effectiveness, improving efficiency by eliminating iterative transport plan c…
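One plausible reading of "pixel-annotation correspondence based on Bayesian probability" is a responsibility computation: each point annotation is splatted as an isotropic 2D Gaussian, and a pixel's transport weight toward annotation k is the posterior p(k | pixel) under a uniform prior. GST's actual estimator may differ; this is a hedged sketch only.

```python
# Hedged sketch: pixel-to-annotation correspondence as Gaussian
# responsibilities (posterior over annotations, uniform prior). The
# isotropic Gaussian and the prior are assumptions, not GST's method.
import math

def responsibilities(pixel, annotations, sigma=1.0):
    """Posterior over point annotations for one pixel coordinate."""
    px, py = pixel
    dens = [math.exp(-((px - ax) ** 2 + (py - ay) ** 2) / (2 * sigma ** 2))
            for ax, ay in annotations]
    total = sum(dens)
    return [d / total for d in dens]  # one row of a soft transport plan

# usage: a pixel sitting on one of two annotations
weights = responsibilities((0.0, 0.0), [(0.0, 0.0), (10.0, 10.0)])
```

Stacking these rows over all pixels yields a soft transport plan without any iterative optimal-transport solve, matching the efficiency claim in the summary.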
Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap
Neutral · Artificial Intelligence
The article discusses the limitations of current evaluation frameworks for Large Language Models (LLMs), which often focus on technical metrics rather than real-world utility. It introduces a new anthropomorphic evaluation paradigm that includes a three-dimensional taxonomy: Intelligence Quotient (IQ), Emotional Quotient (EQ), and Professional Quotient (PQ). Additionally, it proposes a Value-oriented Evaluation (VQ) framework that assesses economic viability, social impact, ethical alignment, and environmental sustainability, aiming to enhance the deployment of LLMs.