SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

arXiv — cs.CL, Thursday, November 20, 2025 at 5:00:00 AM
  • The introduction of Self-Referential Policy Optimization (SRPO) presents a new reinforcement learning approach for Vision-Language-Action models.
  • SRPO's ability to leverage successful trajectories from the current training batch improves training efficiency, potentially leading to better performance on robotic manipulation tasks (a minimal sketch of this idea appears below).
  • This development aligns with ongoing efforts in the field to refine reinforcement learning techniques, as seen in frameworks like AsyncVLA and Distribution
— via World Pulse Now AI Editorial System
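
To make the batch-level reuse concrete, here is a minimal sketch of how a self-referential baseline could be computed from in-batch successes. The function names, the binary success signal, and the REINFORCE-style surrogate are all illustrative assumptions; the paper defines the actual SRPO objective.

```python
import torch

def srpo_advantages(returns: torch.Tensor, success: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch (not the paper's exact objective): use the mean
    return of successful trajectories in the current batch as a
    self-referential baseline, instead of an external critic or value net.

    returns: (B,) episode returns for the current batch
    success: (B,) boolean mask of trajectories that completed the task
    """
    if success.any():
        baseline = returns[success].mean()  # baseline from the batch's own successes
    else:
        baseline = returns.mean()           # fall back to the plain batch mean
    adv = returns - baseline
    # Normalization for stable policy-gradient updates (common practice, not SRPO-specific).
    return (adv - adv.mean()) / (adv.std() + 1e-8)

def policy_loss(logprobs: torch.Tensor, returns: torch.Tensor, success: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate weighted by the self-referential advantages."""
    adv = srpo_advantages(returns, success)
    return -(logprobs * adv.detach()).mean()
```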

Continue Reading
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
PositiveArtificial Intelligence
The paper introduces Mantis, a new Vision-Language-Action (VLA) model that utilizes Disentangled Visual Foresight (DVF) to enhance visual prediction capabilities. Mantis addresses challenges in existing VLA models, such as high-dimensional visual state prediction and information bottlenecks, by decoupling visual foresight prediction from the backbone using meta queries and a diffusion Transformer head. This innovation aims to improve comprehension and reasoning in VLA systems.
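
One way to read this decoupling, purely as a hedged sketch: learnable meta queries cross-attend into the backbone's token stream, and a separate diffusion-style Transformer head denoises future visual latents, so the backbone never regresses high-dimensional future frames directly. All module names, shapes, and the two-layer denoiser below are assumptions, not Mantis's actual architecture.

```python
import torch
import torch.nn as nn

class DisentangledForesightHead(nn.Module):
    """Hypothetical sketch: meta queries extract foresight-relevant features
    from backbone tokens via cross-attention, and a diffusion-style head
    predicts future visual latents from them."""

    def __init__(self, d_model: int = 768, n_queries: int = 16, n_heads: int = 8):
        super().__init__()
        self.meta_queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in for the diffusion Transformer head: denoises future-frame
        # latents conditioned on the attended meta-query features.
        self.denoiser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )

    def forward(self, backbone_tokens: torch.Tensor, noisy_future_latents: torch.Tensor):
        B = backbone_tokens.shape[0]
        q = self.meta_queries.unsqueeze(0).expand(B, -1, -1)
        # Meta queries attend to backbone tokens; the foresight objective lives
        # entirely in this head, not in the backbone.
        ctx, _ = self.cross_attn(q, backbone_tokens, backbone_tokens)
        # Condition the denoiser by concatenating context with the noisy latents.
        x = torch.cat([ctx, noisy_future_latents], dim=1)
        return self.denoiser(x)[:, ctx.shape[1]:]  # predicted denoised future latents
```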
VisPlay: Self-Evolving Vision-Language Models from Images
PositiveArtificial Intelligence
VisPlay is a self-evolving reinforcement learning framework designed to enhance Vision-Language Models (VLMs) by enabling them to autonomously improve their reasoning capabilities using large amounts of unlabeled image data. It operates by assigning two roles to the model: an Image-Conditioned Questioner and a Multimodal Reasoner, which are trained together to balance question complexity and answer quality.
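
A minimal sketch of one such self-play round is below, assuming abstract callables for the two roles and for a reward that trades off question difficulty against answer quality; none of these interfaces come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SelfPlayStep:
    image_id: str
    question: str
    answer: str
    reward: float

def visplay_round(images: List[str],
                  questioner: Callable[[str], str],
                  reasoner: Callable[[str, str], str],
                  judge: Callable[[str, str, str], float]) -> List[SelfPlayStep]:
    """One self-evolution round: the same VLM plays an Image-Conditioned
    Questioner and a Multimodal Reasoner over unlabeled images. The `judge`
    reward is an abstract stand-in, not the paper's actual interface."""
    steps = []
    for img in images:
        q = questioner(img)   # propose a question grounded in this image
        a = reasoner(img, q)  # answer it with multimodal reasoning
        r = judge(img, q, a)  # e.g. high when the question is hard AND answered well
        steps.append(SelfPlayStep(img, q, a, r))
    # These (question, answer, reward) tuples would then drive a joint RL
    # update of both roles, with no human labels on the images.
    return steps
```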
When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
NeutralArtificial Intelligence
Vision-Language-Action models (VLAs) have shown significant advancements in embodied environments, allowing robots to perceive, reason, and act through a unified multimodal understanding. However, their adversarial robustness remains under-researched, particularly in realistic multimodal and black-box scenarios. This paper introduces VLA-Fool, a study focusing on multimodal adversarial robustness in VLAs, addressing issues like textual and visual perturbations and cross-modal misalignment.
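
For intuition about the threat model only: the sketch below shows a standard white-box FGSM perturbation on the visual input of a VLA policy. VLA-Fool itself studies black-box and multimodal attacks (including textual perturbations and cross-modal misalignment), so this is a simplified stand-in; the `model` signature returning action logits from (image, text) is an assumption.

```python
import torch

def fgsm_visual_attack(model, image: torch.Tensor, text_ids: torch.Tensor,
                       target_action: torch.Tensor, eps: float = 8 / 255) -> torch.Tensor:
    """Targeted FGSM on the image channel of a hypothetical VLA policy.
    A full multimodal attack would also perturb the text side to induce
    cross-modal misalignment."""
    image = image.clone().detach().requires_grad_(True)
    logits = model(image, text_ids)
    # Minimize loss toward a wrong target action.
    loss = torch.nn.functional.cross_entropy(logits, target_action)
    loss.backward()
    # One signed-gradient step, clipped back to a valid pixel range.
    adv = image - eps * image.grad.sign()
    return adv.clamp(0.0, 1.0).detach()
```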
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
PositiveArtificial Intelligence
VLA-Pruner is a proposed method aimed at improving the efficiency of Vision-Language-Action (VLA) models through temporal-aware dual-level visual token pruning. This approach addresses the high computational cost of processing continuous visual streams, which limits real-time deployment. By scoring tokens for both high-level semantic understanding and low-level action execution, VLA-Pruner aims to significantly improve VLA inference efficiency.
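
The sketch below illustrates the dual-level, temporally smoothed idea under stated assumptions: the two scoring sources, the 50/50 fusion rule, and the momentum smoothing are all placeholders, not the paper's exact method.

```python
from typing import Optional
import torch

def dual_level_prune(tokens: torch.Tensor,
                     semantic_scores: torch.Tensor,
                     action_scores: torch.Tensor,
                     prev_scores: Optional[torch.Tensor] = None,
                     keep_ratio: float = 0.25,
                     momentum: float = 0.9):
    """Illustrative sketch: fuse a high-level semantic importance score with a
    low-level action-relevance score, smooth over time, keep the top tokens.

    tokens: (N, D) visual tokens for the current frame
    semantic_scores, action_scores: (N,) per-token importance
    prev_scores: (N,) smoothed scores carried over from the previous frame
    """
    score = 0.5 * semantic_scores + 0.5 * action_scores
    if prev_scores is not None:
        # Temporal smoothing keeps tokens that mattered in recent frames.
        score = momentum * prev_scores + (1.0 - momentum) * score
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = score.topk(k).indices.sort().values  # preserve original token order
    return tokens[keep], score
```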
EvoVLA: Self-Evolving Vision-Language-Action Model
PositiveArtificial Intelligence
EvoVLA is a self-supervised Vision-Language-Action (VLA) framework designed to enhance long-horizon robotic manipulation. It addresses the issue of stage hallucination in current VLA models by incorporating three components: Stage-Aligned Reward (SAR), Pose-Based Object Exploration (POE), and Long-Horizon Memory. These innovations aim to improve task completion accuracy and overall performance in robotic manipulation tasks.
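
As a hedged illustration of how a Stage-Aligned Reward could counter stage hallucination, the sketch below credits only environment-verified stages and penalizes claims of progress beyond them; the weights and verification interface are assumptions, and the long-horizon memory component is omitted.

```python
from typing import List

def stage_aligned_reward(stage_verified: List[bool],
                         claimed_stage: int,
                         exploration_bonus: float = 0.0,
                         hallucination_weight: float = 0.1) -> float:
    """Illustrative Stage-Aligned Reward: reward verified progress, penalize
    hallucinated progress. `exploration_bonus` is a stand-in for a
    pose-based exploration signal (POE).

    stage_verified: per-stage completion flags confirmed by the environment
    claimed_stage: index of the stage the policy believes it has reached
    """
    # Verified progress: consecutively completed stages from the start.
    verified = 0
    for done in stage_verified:
        if not done:
            break
        verified += 1
    progress = verified / max(1, len(stage_verified))
    # Penalize claiming to be further along than the environment confirms.
    penalty = hallucination_weight * max(0, claimed_stage - verified)
    return progress + exploration_bonus - penalty
```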
RoboTidy: A 3D Gaussian Splatting Household Tidying Benchmark for Embodied Navigation and Action
PositiveArtificial Intelligence
RoboTidy is a new benchmark for language-guided household tidying, addressing the limitations of current benchmarks, which fail to model user preferences or support mobility. It features 500 photorealistic 3D Gaussian Splatting household scenes and provides extensive manipulation and navigation trajectories for training and evaluation on Vision-Language-Action and Vision-Language-Navigation tasks.