SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

arXiv — cs.CL, Thursday, November 20, 2025 at 5:00:00 AM
  • The introduction of Self-Referential Policy Optimization (SRPO) presents a new reinforcement learning approach for Vision-Language-Action models.
  • SRPO's ability to leverage successful trajectories from the current training batch improves training efficiency, potentially leading to better performance on robotic manipulation tasks (a minimal sketch of this idea appears below).
  • This development aligns with ongoing efforts in the field to refine reinforcement learning techniques, as seen in frameworks like AsyncVLA and Distribution
— via World Pulse Now AI Editorial System
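
To make the batch-level reuse concrete, here is a minimal sketch of how a self-referential baseline could be computed from in-batch successes. The function names, the binary success signal, and the REINFORCE-style surrogate are all illustrative assumptions; the paper defines the actual SRPO objective.

```python
import torch

def srpo_advantages(returns: torch.Tensor, success: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch (not the paper's exact objective): use the mean
    return of successful trajectories in the current batch as a
    self-referential baseline, instead of an external critic or value net.

    returns: (B,) episode returns for the current batch
    success: (B,) boolean mask of trajectories that completed the task
    """
    if success.any():
        baseline = returns[success].mean()  # baseline from the batch's own successes
    else:
        baseline = returns.mean()           # fall back to the plain batch mean
    adv = returns - baseline
    # Normalization for stable policy-gradient updates (common practice, not SRPO-specific).
    return (adv - adv.mean()) / (adv.std() + 1e-8)

def policy_loss(logprobs: torch.Tensor, returns: torch.Tensor, success: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate weighted by the self-referential advantages."""
    adv = srpo_advantages(returns, success)
    return -(logprobs * adv.detach()).mean()
```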

Continue Reading
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
PositiveArtificial Intelligence
The paper introduces Mantis, a new Vision-Language-Action (VLA) model that utilizes Disentangled Visual Foresight (DVF) to enhance visual prediction capabilities. Mantis addresses challenges in existing VLA models, such as high-dimensional visual state prediction and information bottlenecks, by decoupling visual foresight prediction from the backbone using meta queries and a diffusion Transformer head. This innovation aims to improve comprehension and reasoning in VLA systems.
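
One way to read this decoupling, purely as a hedged sketch: learnable meta queries cross-attend into the backbone's token stream, and a separate diffusion-style Transformer head denoises future visual latents, so the backbone never regresses high-dimensional future frames directly. All module names, shapes, and the two-layer denoiser below are assumptions, not Mantis's actual architecture.

```python
import torch
import torch.nn as nn

class DisentangledForesightHead(nn.Module):
    """Hypothetical sketch: meta queries extract foresight-relevant features
    from backbone tokens via cross-attention, and a diffusion-style head
    predicts future visual latents from them."""

    def __init__(self, d_model: int = 768, n_queries: int = 16, n_heads: int = 8):
        super().__init__()
        self.meta_queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in for the diffusion Transformer head: denoises future-frame
        # latents conditioned on the attended meta-query features.
        self.denoiser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )

    def forward(self, backbone_tokens: torch.Tensor, noisy_future_latents: torch.Tensor):
        B = backbone_tokens.shape[0]
        q = self.meta_queries.unsqueeze(0).expand(B, -1, -1)
        # Meta queries attend to backbone tokens; the foresight objective lives
        # entirely in this head, not in the backbone.
        ctx, _ = self.cross_attn(q, backbone_tokens, backbone_tokens)
        # Condition the denoiser by concatenating context with the noisy latents.
        x = torch.cat([ctx, noisy_future_latents], dim=1)
        return self.denoiser(x)[:, ctx.shape[1]:]  # predicted denoised future latents
```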
VisPlay: Self-Evolving Vision-Language Models from Images
PositiveArtificial Intelligence
VisPlay is a self-evolving reinforcement learning framework designed to enhance Vision-Language Models (VLMs) by enabling them to autonomously improve their reasoning capabilities using large amounts of unlabeled image data. It operates by assigning two roles to the model: an Image-Conditioned Questioner and a Multimodal Reasoner, which are trained together to balance question complexity and answer quality.
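
A minimal sketch of one such self-play round is below, assuming abstract callables for the two roles and for a reward that trades off question difficulty against answer quality; none of these interfaces come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SelfPlayStep:
    image_id: str
    question: str
    answer: str
    reward: float

def visplay_round(images: List[str],
                  questioner: Callable[[str], str],
                  reasoner: Callable[[str, str], str],
                  judge: Callable[[str, str, str], float]) -> List[SelfPlayStep]:
    """One self-evolution round: the same VLM plays an Image-Conditioned
    Questioner and a Multimodal Reasoner over unlabeled images. The `judge`
    reward is an abstract stand-in, not the paper's actual interface."""
    steps = []
    for img in images:
        q = questioner(img)   # propose a question grounded in this image
        a = reasoner(img, q)  # answer it with multimodal reasoning
        r = judge(img, q, a)  # e.g. high when the question is hard AND answered well
        steps.append(SelfPlayStep(img, q, a, r))
    # These (question, answer, reward) tuples would then drive a joint RL
    # update of both roles, with no human labels on the images.
    return steps
```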
When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
NeutralArtificial Intelligence
Vision-Language-Action models (VLAs) have shown significant advancements in embodied environments, allowing robots to perceive, reason, and act through a unified multimodal understanding. However, their adversarial robustness remains under-researched, particularly in realistic multimodal and black-box scenarios. This paper introduces VLA-Fool, a study focusing on multimodal adversarial robustness in VLAs, addressing issues like textual and visual perturbations and cross-modal misalignment.
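
For intuition about the threat model only: the sketch below shows a standard white-box FGSM perturbation on the visual input of a VLA policy. VLA-Fool itself studies black-box and multimodal attacks (including textual perturbations and cross-modal misalignment), so this is a simplified stand-in; the `model` signature returning action logits from (image, text) is an assumption.

```python
import torch

def fgsm_visual_attack(model, image: torch.Tensor, text_ids: torch.Tensor,
                       target_action: torch.Tensor, eps: float = 8 / 255) -> torch.Tensor:
    """Targeted FGSM on the image channel of a hypothetical VLA policy.
    A full multimodal attack would also perturb the text side to induce
    cross-modal misalignment."""
    image = image.clone().detach().requires_grad_(True)
    logits = model(image, text_ids)
    # Minimize loss toward a wrong target action.
    loss = torch.nn.functional.cross_entropy(logits, target_action)
    loss.backward()
    # One signed-gradient step, clipped back to a valid pixel range.
    adv = image - eps * image.grad.sign()
    return adv.clamp(0.0, 1.0).detach()
```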
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
PositiveArtificial Intelligence
VLA-Pruner is a proposed method aimed at improving the efficiency of Vision-Language-Action (VLA) models through temporal-aware dual-level visual token pruning. This approach addresses the high computational cost of processing continuous visual streams, which limits real-time deployment. By scoring tokens for both high-level semantic understanding and low-level action execution, VLA-Pruner aims to significantly improve VLA inference efficiency.
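
The sketch below illustrates the dual-level, temporally smoothed idea under stated assumptions: the two scoring sources, the 50/50 fusion rule, and the momentum smoothing are all placeholders, not the paper's exact method.

```python
from typing import Optional
import torch

def dual_level_prune(tokens: torch.Tensor,
                     semantic_scores: torch.Tensor,
                     action_scores: torch.Tensor,
                     prev_scores: Optional[torch.Tensor] = None,
                     keep_ratio: float = 0.25,
                     momentum: float = 0.9):
    """Illustrative sketch: fuse a high-level semantic importance score with a
    low-level action-relevance score, smooth over time, keep the top tokens.

    tokens: (N, D) visual tokens for the current frame
    semantic_scores, action_scores: (N,) per-token importance
    prev_scores: (N,) smoothed scores carried over from the previous frame
    """
    score = 0.5 * semantic_scores + 0.5 * action_scores
    if prev_scores is not None:
        # Temporal smoothing keeps tokens that mattered in recent frames.
        score = momentum * prev_scores + (1.0 - momentum) * score
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = score.topk(k).indices.sort().values  # preserve original token order
    return tokens[keep], score
```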
EvoVLA: Self-Evolving Vision-Language-Action Model
PositiveArtificial Intelligence
EvoVLA is a self-supervised Vision-Language-Action (VLA) framework designed to enhance long-horizon robotic manipulation. It addresses the issue of stage hallucination in current VLA models by incorporating three components: Stage-Aligned Reward (SAR), Pose-Based Object Exploration (POE), and Long-Horizon Memory. These innovations aim to improve task completion accuracy and overall performance in robotic manipulation tasks.
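
As a hedged illustration of how a Stage-Aligned Reward could counter stage hallucination, the sketch below credits only environment-verified stages and penalizes claims of progress beyond them; the weights and verification interface are assumptions, and the long-horizon memory component is omitted.

```python
from typing import List

def stage_aligned_reward(stage_verified: List[bool],
                         claimed_stage: int,
                         exploration_bonus: float = 0.0,
                         hallucination_weight: float = 0.1) -> float:
    """Illustrative Stage-Aligned Reward: reward verified progress, penalize
    hallucinated progress. `exploration_bonus` is a stand-in for a
    pose-based exploration signal (POE).

    stage_verified: per-stage completion flags confirmed by the environment
    claimed_stage: index of the stage the policy believes it has reached
    """
    # Verified progress: consecutively completed stages from the start.
    verified = 0
    for done in stage_verified:
        if not done:
            break
        verified += 1
    progress = verified / max(1, len(stage_verified))
    # Penalize claiming to be further along than the environment confirms.
    penalty = hallucination_weight * max(0, claimed_stage - verified)
    return progress + exploration_bonus - penalty
```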
RoboTidy: A 3D Gaussian Splatting Household Tidying Benchmark for Embodied Navigation and Action
PositiveArtificial Intelligence
RoboTidy is a new benchmark for language-guided household tidying, addressing the limitations of current benchmarks, which fail to model user preferences or support mobility. It features 500 photorealistic 3D Gaussian Splatting household scenes and provides extensive manipulation and navigation trajectories for training and evaluation on Vision-Language-Action and Vision-Language-Navigation tasks.