AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

arXiv — cs.LG · Wednesday, November 19, 2025 at 5:00:00 AM
  • The introduction of AsyncVLA marks a significant advancement in Vision-Language-Action (VLA) models, bringing asynchronous flow matching to action generation.
  • The development of AsyncVLA is crucial for the evolution of generalist robots, as it enhances their ability to perform complex tasks more reliably. By enabling models to refine actions based on confidence ratings, AsyncVLA could lead to more effective and adaptable robotic systems, paving the way for broader applications in various fields.
— via World Pulse Now AI Editorial System
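To give a rough picture of the confidence-based refinement described above, the sketch below re-integrates a toy flow only for action tokens whose confidence falls below a threshold, keeping confident tokens fixed. Everything here is an illustrative assumption: the function names, the Euler integration, and the stand-in velocity field are placeholders, not AsyncVLA's actual algorithm.

```python
import numpy as np

def velocity_model(x: np.ndarray, t: float, context: np.ndarray) -> np.ndarray:
    """Stand-in for a learned velocity field; here it simply points toward the reference actions."""
    return context - x

def refine_low_confidence(actions: np.ndarray,
                          confidence: np.ndarray,
                          threshold: float = 0.8,
                          steps: int = 4,
                          seed: int = 0) -> np.ndarray:
    """Re-run the flow from fresh noise, but only for tokens below the confidence threshold."""
    rng = np.random.default_rng(seed)
    mask = confidence < threshold                    # low-confidence tokens to regenerate
    if not mask.any():
        return actions.copy()
    x = rng.standard_normal(actions.shape)           # start the refinement from noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_model(x, t, actions)   # one Euler step along the velocity field
    refined = actions.copy()
    refined[mask] = x[mask]                          # confident tokens are kept as-is
    return refined

if __name__ == "__main__":
    actions = np.random.randn(8, 7)                  # 8 action tokens, 7-DoF each
    confidence = np.random.rand(8)
    print(refine_low_confidence(actions, confidence).shape)
```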

Continue Reading
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Positive · Artificial Intelligence
VLA-Pruner is a proposed method for improving the efficiency of Vision-Language-Action (VLA) models through temporal-aware, dual-level visual token pruning. The approach targets the high computational cost of processing continuous visual streams, which limits real-time deployment. By accounting for both high-level semantic understanding and low-level action execution, VLA-Pruner seeks to make VLA inference significantly more efficient.
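As a minimal sketch of the dual-level idea, the code below fuses a high-level semantic score with a low-level action-relevance score and down-ranks tokens that are nearly identical to the previous frame. The scores, fusion weights, and thresholds are assumptions for illustration, not VLA-Pruner's actual procedure.

```python
from typing import Optional
import numpy as np

def prune_tokens(tokens: np.ndarray,
                 semantic_score: np.ndarray,
                 action_score: np.ndarray,
                 prev_tokens: Optional[np.ndarray] = None,
                 keep_ratio: float = 0.25,
                 temporal_sim_thresh: float = 0.98) -> np.ndarray:
    """Keep the top-k visual tokens by a fused dual-level score, dropping temporally redundant ones."""
    n = tokens.shape[0]
    score = 0.5 * semantic_score + 0.5 * action_score           # illustrative fusion weights
    if prev_tokens is not None:
        # Down-rank tokens nearly identical to the previous frame (temporal redundancy).
        sim = np.sum(tokens * prev_tokens, axis=-1) / (
            np.linalg.norm(tokens, axis=-1) * np.linalg.norm(prev_tokens, axis=-1) + 1e-8)
        score = np.where(sim > temporal_sim_thresh, -np.inf, score)
    k = max(1, int(n * keep_ratio))
    keep = np.sort(np.argsort(score)[-k:])                       # keep surviving tokens in order
    return tokens[keep]

if __name__ == "__main__":
    cur = np.random.randn(196, 768)                              # 196 visual tokens, 768-dim each
    prev = cur + 0.01 * np.random.randn(196, 768)
    kept = prune_tokens(cur, np.random.rand(196), np.random.rand(196), prev)
    print(kept.shape)                                            # (49, 768) at keep_ratio=0.25
```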
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Positive · Artificial Intelligence
Self-Referential Policy Optimization (SRPO) is a new framework for Vision-Language-Action (VLA) models that addresses limitations of traditional reinforcement learning (RL) methods. By using the model's own successful trajectories as a reference, SRPO removes the need for external demonstrations and manual reward engineering. This also allows rewards to be assigned to failed attempts, improving training efficiency and reducing demonstration bias.
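A minimal sketch of what self-referential reward assignment might look like: successful rollouts receive full reward, while failed rollouts receive partial credit based on their distance to the model's own successful trajectories. The distance metric and all names below are illustrative assumptions, not SRPO's actual formulation.

```python
import numpy as np

def self_referential_rewards(trajectories: list[np.ndarray],
                             successes: list[bool]) -> list[float]:
    """Give successes reward 1.0; score failures by similarity to the model's own success set."""
    success_trajs = [t for t, ok in zip(trajectories, successes) if ok]
    rewards = []
    for traj, ok in zip(trajectories, successes):
        if ok:
            rewards.append(1.0)
        elif success_trajs:
            # Mean per-step distance to the closest successful trajectory (truncated to equal length).
            dists = []
            for ref in success_trajs:
                n = min(len(traj), len(ref))
                dists.append(float(np.mean(np.linalg.norm(traj[:n] - ref[:n], axis=-1))))
            rewards.append(float(np.exp(-min(dists))))           # partial credit in (0, 1]
        else:
            rewards.append(0.0)                                  # no self-reference available yet
    return rewards

if __name__ == "__main__":
    trajs = [np.random.randn(50, 7) for _ in range(4)]           # 4 rollouts of 50 steps, 7-DoF actions
    print(self_referential_rewards(trajs, [True, False, False, True]))
```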