arXiv:2511.16203v1 Announce Type: new 
Abstract: Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.

أظهرت نماذج اللغة-الرؤية-العمل (VLA) تقدمًا ملحوظًا في البيئات المجسدة، مما يمكّن الروبوتات من الإدراك والتفكير والعمل من خلال فهم متعدد الوسائط موحد. ومع ذلك، لا تزال قوة هذه الأنظمة ضد الهجمات العدائية غير مستكشفة إلى حد كبير، خاصة في ظل ظروف متعددة الوسائط وبيئات مغلقة. يقدم هذا المقال VLA-Fool، وهو دراسة تركز على قوة الهجمات العدائية متعددة الوسائط في نماذج VLA، مع معالجة مشكلات مثل الاضطرابات النصية والمرئية وعدم التوافق بين الوسائط.

Los modelos de Acción-Lenguaje-Visión (VLA) han mostrado avances significativos en entornos incorporados, permitiendo a los robots percibir, razonar y actuar a través de una comprensión multimodal unificada. Sin embargo, la robustez adversarial de estos sistemas sigue siendo poco explorada, especialmente en condiciones multimodales realistas y de caja negra. Este artículo presenta VLA-Fool, un estudio que se centra en la robustez adversarial multimodal en modelos VLA, abordando problemas como perturbaciones textuales y visuales y desalineación intermodal.

Les modèles Vision-Language-Action (VLA) ont montré des avancées significatives dans des environnements incarnés, permettant aux robots de percevoir, raisonner et agir grâce à une compréhension multimodale unifiée. Cependant, leur robustesse face aux attaques adversariales reste peu explorée, en particulier dans des scénarios multimodaux réalistes et en boîte noire. Cet article présente VLA-Fool, une étude axée sur la robustesse adversariale multimodale des VLA, abordant des problèmes tels que les perturbations textuelles et visuelles et le désalignement intermodal.

Vision-Language-Action models (VLAs) have shown significant advancements in embodied environments, allowing robots to perceive, reason, and act through a unified multimodal understanding. However, their adversarial robustness remains under-researched, particularly in realistic multimodal and black-box scenarios. This paper introduces VLA-Fool, a study focusing on multimodal adversarial robustness in VLAs, addressing issues like textual and visual perturbations and cross-modal misalignment.

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

arXiv:2511.16175v1 Announce Type: new 
Abstract: Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $\pi_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.

تقدم الورقة مانتيس، وهو نموذج جديد للرؤية-اللغة-العمل (VLA) يستخدم التوقع البصري المنفصل (DVF) لتعزيز قدرات التنبؤ البصري. يعالج مانتيس التحديات في النماذج الحالية لـ VLA، مثل التنبؤ بحالات بصرية عالية الأبعاد واحتقانات المعلومات، من خلال فصل توقع البصر عن الهيكل الأساسي باستخدام استعلامات ميتا ورأس تحويل الانتشار. تهدف هذه الابتكارات إلى تحسين الفهم والتفكير في أنظمة VLA.

El artículo presenta Mantis, un nuevo modelo de Acción-Lenguaje-Visión (VLA) que utiliza la Previsión Visual Desentrelazada (DVF) para mejorar las capacidades de predicción visual. Mantis aborda los desafíos de los modelos VLA existentes, como la predicción de estados visuales de alta dimensión y los cuellos de botella de información, al desacoplar la predicción de la previsión visual del backbone mediante consultas meta y una cabeza de Transformer de difusión. Esta innovación busca mejorar la comprensión y el razonamiento en los sistemas VLA.

L'article présente Mantis, un nouveau modèle Vision-Language-Action (VLA) qui utilise la prévoyance visuelle désentrelacée (DVF) pour améliorer les capacités de prédiction visuelle. Mantis répond aux défis des modèles VLA existants, tels que la prédiction d'états visuels de haute dimension et les goulets d'étranglement d'information, en découplant la prédiction de prévoyance visuelle du backbone à l'aide de requêtes méta et d'une tête de Transformer de diffusion. Cette innovation vise à améliorer la compréhension et le raisonnement dans les systèmes VLA.

The paper introduces Mantis, a new Vision-Language-Action (VLA) model that utilizes Disentangled Visual Foresight (DVF) to enhance visual prediction capabilities. Mantis addresses challenges in existing VLA models, such as high-dimensional visual state prediction and information bottlenecks, by decoupling visual foresight prediction from the backbone using meta queries and a diffusion Transformer head. This innovation aims to improve comprehension and reasoning in VLA systems.

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Was this article worth reading? Share it