MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

arXiv — cs.LG · Wednesday, November 26, 2025 at 5:00:00 AM
  • The MAPS framework has been introduced to enhance Vision-Language-Action (VLA) models by preserving their pretrained representations during fine-tuning. The approach schedules proximity constraints module by module: visual encoders are held close to their pretrained weights for stability, while action-oriented language layers are allowed to adapt more freely (see the sketch after this summary).
  • This development is significant as it addresses the common issue of disrupted representations in VLA models during naive fine-tuning, which can hinder their generalization capabilities. By integrating MAPS, existing models can achieve better performance without the need for additional parameters or data.
  • The introduction of MAPS aligns with ongoing advancements in VLA frameworks, such as the integration of self-referential optimization and active visual attention, which aim to improve model efficiency and decision-making. These developments reflect a broader trend in AI research focusing on enhancing multimodal capabilities and addressing the limitations of traditional reinforcement learning methods.
— via World Pulse Now AI Editorial System
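
The summary above describes MAPS only at a high level, and the paper's exact scheduling rule is not reproduced here. As a minimal illustration, module-wise proximity can be implemented as an L2 penalty that pulls each parameter group toward its pretrained value with a per-module coefficient; the PyTorch sketch below assumes that formulation, and the module prefixes, coefficient values, and the `module_coefficients` schedule are hypothetical rather than taken from the paper.

```python
import torch

def proximity_penalty(model, pretrained_state, coefficients):
    """L2 penalty pulling each parameter toward its pretrained value,
    weighted per module group (illustrative, not the paper's exact rule)."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        coef = next((c for prefix, c in coefficients.items() if name.startswith(prefix)), 0.0)
        if coef > 0:
            penalty = penalty + coef * (param - pretrained_state[name].to(param.device)).pow(2).sum()
    return penalty

def module_coefficients(step, total_steps):
    """Hypothetical schedule: keep the visual encoder close to its pretrained
    weights throughout, while relaxing the constraint on action-oriented
    language layers as fine-tuning proceeds."""
    frac = step / max(total_steps, 1)
    return {
        "vision_encoder.": 1e-2,               # strong, roughly constant proximity
        "language_model.": 1e-3 * (1 - frac),  # relaxed over time for action adaptation
    }

# Usage inside a fine-tuning loop (task_loss computed elsewhere):
# pretrained_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
# loss = task_loss + proximity_penalty(model, pretrained_state,
#                                      module_coefficients(step, total_steps))
```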


Continue Reading
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Positive · Artificial Intelligence
CropVLM has been introduced as a novel external method designed to enhance Vision-Language Models (VLMs) by enabling them to dynamically focus on specific image regions, thereby improving their performance in tasks requiring fine-grained image understanding. This model utilizes reinforcement learning without the need for human-labeled bounding boxes, making it a cost-effective solution for boosting VLM capabilities.
Adapting Vision-Language Models for Evaluating World Models
Positive · Artificial Intelligence
A new evaluation protocol has been introduced to enhance the assessment of world models, which are generative models simulating environment dynamics based on past observations and actions. This protocol focuses on two recognition tasks: action recognition and character recognition, utilizing Vision-Language Models (VLMs) for fine-grained evaluation. The framework, named UNIVERSE, aims to address the limitations of existing metrics in evaluating generative content.
L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
Positive · Artificial Intelligence
Researchers have introduced L2V-CoT, a novel training-free approach that facilitates the transfer of Chain-of-Thought (CoT) reasoning from large language models (LLMs) to Vision-Language Models (VLMs) using Linear Artificial Tomography (LAT). This method addresses the challenges VLMs face in multi-step reasoning tasks due to limited multimodal reasoning data.
Mixture of Horizons in Action Chunking
Positive · Artificial Intelligence
A new study on Vision-Language-Action (VLA) models highlights the importance of action chunk length, termed horizon, in robotic manipulation. The research reveals a trade-off between longer horizons, which enhance global foresight, and shorter ones that improve local control but struggle with long-term tasks. To address this, a mixture of horizons (MoH) strategy is proposed, allowing for parallel processing of action chunks with varying horizons.
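
The blurb gives only the high-level idea of mixing horizons. As an illustration, one can picture fusing action chunks of different lengths that are predicted in parallel, averaging whatever predictions cover each timestep; the NumPy sketch below uses that simple fusion rule, which is hypothetical and not necessarily the paper's.

```python
import numpy as np

def fuse_action_chunks(chunks):
    """Fuse action chunks of different horizons by averaging the
    predictions that cover each timestep (illustrative fusion rule)."""
    horizon = max(len(c) for c in chunks)
    action_dim = chunks[0].shape[1]
    total = np.zeros((horizon, action_dim))
    count = np.zeros((horizon, 1))
    for chunk in chunks:
        total[: len(chunk)] += chunk
        count[: len(chunk)] += 1
    return total / count

# Hypothetical example: a 2-step chunk (local control) and an 8-step
# chunk (global foresight) predicted in parallel for a 7-DoF arm.
short = np.random.randn(2, 7)
long = np.random.randn(8, 7)
fused = fuse_action_chunks([short, long])  # shape (8, 7)
```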
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
Positive · Artificial Intelligence
ActDistill has been introduced as a general action-guided self-derived distillation framework aimed at enhancing the efficiency of Vision-Language-Action (VLA) models. This innovative approach focuses on transferring action prediction capabilities from a well-trained VLA model to a lightweight version, addressing the computational overhead and inference latency that limit robotic manipulation applications.
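
The blurb does not detail ActDistill's self-derived objective. As a generic illustration only, action-guided distillation can be pictured as matching a lightweight student's action predictions (and optionally intermediate features) to a frozen teacher's; the function below is a hypothetical sketch in that spirit, not the paper's loss.

```python
import torch.nn.functional as F

def action_distillation_loss(student_actions, teacher_actions,
                             student_feats=None, teacher_feats=None, alpha=0.5):
    """Match the student's continuous action predictions to the teacher's,
    optionally also matching intermediate features (illustrative objective)."""
    loss = F.mse_loss(student_actions, teacher_actions.detach())
    if student_feats is not None and teacher_feats is not None:
        loss = loss + alpha * F.mse_loss(student_feats, teacher_feats.detach())
    return loss
```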
Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
Positive · Artificial Intelligence
A recent study has proposed a new framework called Subspace Projection Debiasing (SPD) to address the pervasive demographic biases in Vision-Language Models (VLMs). This framework challenges the traditional post-hoc debiasing methods that focus on coordinate-wise adjustments, revealing that biases are distributed across linear subspaces rather than isolated coordinates.
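
The blurb's key claim, that bias occupies a linear subspace rather than a single coordinate, suggests debiasing by projecting embeddings onto the orthogonal complement of that subspace. The NumPy sketch below is an illustrative version of that idea, not the paper's exact procedure: it estimates a bias subspace from the principal directions of paired group differences and then projects it out.

```python
import numpy as np

def estimate_bias_subspace(group_a, group_b, k=2):
    """Estimate a k-dimensional bias subspace from the principal directions
    of paired embedding differences between two groups (illustrative)."""
    diffs = group_a - group_b                       # (n, d) paired differences
    _, _, vt = np.linalg.svd(diffs - diffs.mean(0), full_matrices=False)
    return vt[:k]                                   # (k, d) orthonormal basis

def project_out(embeddings, basis):
    """Remove the component of each embedding lying in the bias subspace."""
    return embeddings - embeddings @ basis.T @ basis

# Hypothetical usage with 512-dimensional CLIP-style embeddings:
rng = np.random.default_rng(0)
group_a, group_b = rng.normal(size=(100, 512)), rng.normal(size=(100, 512))
basis = estimate_bias_subspace(group_a, group_b, k=2)
debiased = project_out(rng.normal(size=(10, 512)), basis)
```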
Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning
Positive · Artificial Intelligence
The introduction of Perceptual-Evidence Anchored Reinforced Learning (PEARL) marks a significant advancement in multimodal reasoning, addressing the limitations of traditional Reinforcement Learning with Verifiable Rewards (RLVR) in Vision-Language Models (VLMs). PEARL enhances reasoning by anchoring it to verified visual evidence, thus mitigating issues like visual hallucinations and reward hacking.
Thinking Ahead: Foresight Intelligence in MLLMs and World Models
Positive · Artificial Intelligence
A new study has introduced Foresight Intelligence, defined as the ability to anticipate and interpret future events, crucial for applications like autonomous driving. The research presents FSU-QA, a Visual Question-Answering dataset aimed at evaluating this capability in Vision-Language Models (VLMs). Initial findings indicate that current models face challenges in reasoning about future scenarios.