VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

arXiv — cs.CV · Friday, November 21, 2025 at 5:00:00 AM
  • VLA
  • This development is significant because it improves the inference efficiency of VLA models, potentially broadening their use in embodied AI, where real-time performance is critical; a generic token-pruning sketch follows this summary.
  • The advancement reflects a growing trend in AI research towards improving model efficiency and adaptability, as seen in related frameworks that also seek to refine the interaction between vision and language for better action execution.
— via World Pulse Now AI Editorial System
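This listing does not spell out the paper's temporal-aware, dual-level criterion, so the sketch below only illustrates the generic pattern such pruning methods build on: score each visual token's importance and keep the top-k before the decoder consumes them. The shapes, the random scores, and the keep_ratio are illustrative assumptions, not VLA-Pruner's actual scheme.

```python
# Illustrative sketch of importance-based visual token pruning (not VLA-Pruner's
# exact temporal-aware, dual-level criterion): keep only the highest-scoring tokens.
import torch

def prune_visual_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.25):
    """tokens: (N, D) visual tokens; scores: (N,) importance scores (e.g. attention mass)."""
    k = max(1, int(tokens.shape[0] * keep_ratio))
    keep_idx = torch.topk(scores, k).indices.sort().values  # preserve original token order
    return tokens[keep_idx], keep_idx

tokens = torch.randn(576, 1024)   # e.g. 24x24 patch tokens from a ViT encoder
scores = torch.rand(576)          # stand-in for a real importance signal
pruned, kept = prune_visual_tokens(tokens, scores, keep_ratio=0.25)
print(pruned.shape)               # torch.Size([144, 1024])
```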


Continue Reading
EvoVLA: Self-Evolving Vision-Language-Action Model
Positive · Artificial Intelligence
EvoVLA is a self-supervised Vision-Language-Action (VLA) framework designed to enhance long-horizon robotic manipulation. It addresses the issue of stage hallucination in current VLA models by incorporating three components: Stage-Aligned Reward (SAR), Pose-Based Object Exploration (POE), and Long-Horizon Memory. These innovations aim to improve task completion accuracy and overall performance in robotic manipulation tasks.
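The summary names EvoVLA's components but not their formulations. As a rough, hypothetical illustration of the stage-aligned reward idea only (not EvoVLA's actual SAR), the toy function below credits reward solely for stage completions that can be verified in order, which is one way to discourage stage hallucination; the stage predicates and bonus value are invented for illustration.

```python
# Toy stage-aligned reward: credit only verified, in-order stage completions.
from typing import Callable, Dict, List

def stage_aligned_reward(state: Dict, stages: List[Callable[[Dict], bool]], bonus: float = 1.0) -> float:
    reward = 0.0
    for check in stages:
        if check(state):
            reward += bonus   # credit only stages whose completion is verified
        else:
            break             # a later stage cannot count if an earlier one is incomplete
    return reward

stages = [
    lambda s: s["gripper_near_object"],
    lambda s: s["object_grasped"],
    lambda s: s["object_at_goal"],
]
print(stage_aligned_reward({"gripper_near_object": True, "object_grasped": True, "object_at_goal": False}, stages))
# 2.0
```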
VisPlay: Self-Evolving Vision-Language Models from Images
Positive · Artificial Intelligence
VisPlay is a self-evolving reinforcement learning framework designed to enhance Vision-Language Models (VLMs) by enabling them to autonomously improve their reasoning capabilities using large amounts of unlabeled image data. It operates by assigning two roles to the model: an Image-Conditioned Questioner and a Multimodal Reasoner, which are trained together to balance question complexity and answer quality.
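As a loose, runnable skeleton of that questioner/reasoner interaction (the stub classes and the toy reward below are placeholders, not VisPlay's components), one self-play step might look like this:

```python
# Skeleton of a self-evolving question/answer step: one role asks, one answers,
# and a scorer turns the pair into a reward signal for reinforcement learning.
import random

class Questioner:
    def generate(self, image):            # stand-in for the Image-Conditioned Questioner
        return f"What object dominates region {random.randint(0, 3)}?"

class Reasoner:
    def generate(self, image, question):  # stand-in for the Multimodal Reasoner
        return "a red cup"

def score(image, question, answer) -> float:
    # Toy reward balancing question difficulty and answer quality.
    return 0.5 * (len(question) > 10) + 0.5 * (len(answer) > 0)

image = object()                          # stand-in for an unlabeled image
q, r = Questioner(), Reasoner()
question = q.generate(image)
answer = r.generate(image, question)
print(question, "->", answer, "| reward:", score(image, question, answer))
```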
RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification
Positive · Artificial Intelligence
The paper presents Rationale-Bootstrapped Fine-Tuning (RB-FT) for enhancing video classification with Vision Language Models (VLMs). To address the challenge of limited domain-specific data, it proposes a two-stage self-improvement approach: the VLM first generates textual rationales for the videos and is fine-tuned on them, then undergoes conventional supervised fine-tuning on the task labels, which improves classification effectiveness.
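A minimal sketch of that two-stage pipeline, with toy stand-ins for the model, rationale generator, and training helper (none of these names come from the paper):

```python
# Two-stage idea behind RB-FT: fine-tune on self-generated rationales first,
# then on the task labels. All helpers here are dummy stand-ins.
def rb_ft(model, videos, labels, generate_rationale, fine_tune):
    # Stage 1: self-generated rationales serve as supervision targets.
    rationales = [generate_rationale(model, v) for v in videos]
    model = fine_tune(model, videos, rationales)
    # Stage 2: conventional supervised fine-tuning on class labels.
    model = fine_tune(model, videos, labels)
    return model

model = {"steps": 0}
videos, labels = ["v1", "v2"], ["cooking", "sports"]
generate_rationale = lambda m, v: f"rationale for {v}"
fine_tune = lambda m, x, y: {"steps": m["steps"] + len(x)}
print(rb_ft(model, videos, labels, generate_rationale, fine_tune))  # {'steps': 4}
```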
Efficient Architectures for High Resolution Vision-Language Models
Positive · Artificial Intelligence
Recent advancements in Vision-Language Models (VLMs) have been significant, yet challenges remain in accurately recognizing fine details in high-resolution images. The introduction of Pheye, a new architecture, addresses these challenges by efficiently processing high-resolution images while requiring fewer parameters than comparable VLMs. Pheye demonstrates strong performance, particularly in tasks that necessitate fine-grained image understanding and scene-text handling.
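The summary does not detail Pheye's design. A common ingredient in efficient high-resolution VLMs is to split the image into fixed-size tiles that a shared, fixed-resolution encoder can process; the sketch below shows only that generic step, with tile size and shapes as assumptions, not Pheye's actual architecture.

```python
# Generic high-resolution preprocessing: cut the image into fixed-size tiles
# for a shared encoder (illustrative only).
import torch

def tile_image(img: torch.Tensor, tile: int = 336):
    """img: (C, H, W) with H and W divisible by `tile`; returns (num_tiles, C, tile, tile)."""
    c, _, _ = img.shape
    tiles = img.unfold(1, tile, tile).unfold(2, tile, tile)   # (C, H//t, W//t, t, t)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile)

img = torch.randn(3, 672, 1008)
print(tile_image(img).shape)   # torch.Size([6, 3, 336, 336])
```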
Learning to Think Fast and Slow for Visual Language Models
Positive · Artificial Intelligence
The article discusses a new approach to visual language models (VLMs) that allows them to switch between fast and slow thinking modes based on task complexity. This method aims to reduce computational costs associated with lengthy reasoning chains by categorizing tasks as requiring either quick or detailed analysis. The proposed reinforcement learning technique enhances decision-making efficiency in VLMs.
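As a toy illustration of the switching idea only (the paper learns the switching policy with reinforcement learning; the word-count heuristic and token budgets below are invented):

```python
# Route a query to a cheap "fast" mode or an expensive "slow" reasoning mode
# based on a stand-in complexity estimate.
def choose_mode(question: str, threshold: int = 12) -> dict:
    complexity = len(question.split())     # placeholder for a learned complexity signal
    if complexity < threshold:
        return {"mode": "fast", "max_reasoning_tokens": 64}
    return {"mode": "slow", "max_reasoning_tokens": 1024}

print(choose_mode("What color is the car?"))
print(choose_mode("Given the chart, estimate the year-over-year growth rate and explain which assumptions drive the uncertainty."))
```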
HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models
Positive · Artificial Intelligence
HAWAII is a proposed framework aimed at enhancing the efficiency of vision-language models (VLMs) by distilling knowledge from multiple visual experts into a single vision encoder. This approach minimizes computational costs while retaining the strengths of various experts. The framework employs teacher-specific Low-Rank Adaptation (LoRA) adapters to manage knowledge transfer effectively, reducing conflicts and improving performance in visual understanding tasks.
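A minimal sketch of what teacher-specific LoRA adapters on a shared encoder layer can look like; the rank, layer sizes, and explicit teacher_id routing are illustrative assumptions, not HAWAII's exact design.

```python
# Shared frozen linear layer plus one low-rank (LoRA) branch per teacher, so
# knowledge distilled from different experts does not overwrite the same weights.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, num_teachers: int = 3):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the shared weights
            p.requires_grad_(False)
        self.down = nn.ModuleList(nn.Linear(base.in_features, rank, bias=False) for _ in range(num_teachers))
        self.up = nn.ModuleList(nn.Linear(rank, base.out_features, bias=False) for _ in range(num_teachers))

    def forward(self, x: torch.Tensor, teacher_id: int) -> torch.Tensor:
        # Only the selected teacher's adapter contributes, reducing conflicts.
        return self.base(x) + self.up[teacher_id](self.down[teacher_id](x))

layer = LoRALinear(nn.Linear(768, 768))
x = torch.randn(4, 768)
print(layer(x, teacher_id=1).shape)        # torch.Size([4, 768])
```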
TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models
Positive · Artificial Intelligence
The paper titled 'TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models' discusses a novel method for adapting pre-trained Vision-Language Models (VLMs) to specific tasks using federated learning. This approach aims to reduce communication costs and enhance efficiency by allowing local clients to interact with a central server in a single round, addressing challenges such as data heterogeneity and the underutilization of multimodal information.
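To illustrate just the single-round, training-free communication pattern (not TOFA's specific multimodal algorithm), the toy example below has each client upload per-class feature prototypes once, which the server merges without any gradient updates.

```python
# One-shot, training-free federated pattern: clients send summary statistics once,
# the server aggregates them.
import numpy as np

def client_prototypes(features: np.ndarray, labels: np.ndarray) -> dict:
    # One mean feature vector ("prototype") per class, computed locally without training.
    return {int(c): features[labels == c].mean(axis=0) for c in np.unique(labels)}

def server_merge(all_protos: list) -> dict:
    # Average the prototypes uploaded in the single communication round.
    merged = {}
    for protos in all_protos:
        for c, p in protos.items():
            merged.setdefault(c, []).append(p)
    return {c: np.mean(ps, axis=0) for c, ps in merged.items()}

rng = np.random.default_rng(0)
clients = [client_prototypes(rng.normal(size=(50, 16)), rng.integers(0, 3, size=50)) for _ in range(4)]
global_protos = server_merge(clients)
print(sorted(global_protos), global_protos[0].shape)   # [0, 1, 2] (16,)
```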
An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs
Positive · Artificial Intelligence
The paper discusses the challenges associated with Vision-Language Models (VLMs) in generating lengthy outputs with low information density, which leads to increased energy consumption and costs. It introduces a novel verbose-text induction attack (VTIA) that uses adversarial perturbations to optimize output token length, addressing the limitations of existing methods that merely delay the end of output without maximizing length.
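Conceptually, the attack pattern is a budgeted adversarial optimization of the input image toward longer outputs. The sketch below uses a toy differentiable "verbosity" surrogate and a PGD-style loop purely to show that pattern; it is not the paper's objective and involves no real VLM.

```python
# PGD-style ascent on an image perturbation to maximize a toy "verbosity" surrogate.
import torch

torch.manual_seed(0)
W = torch.randn(3 * 32 * 32)                         # toy linear surrogate of output length

def verbosity_score(img: torch.Tensor) -> torch.Tensor:
    return (img.flatten() * W).sum()

image = torch.rand(3, 32, 32)
delta = torch.zeros_like(image, requires_grad=True)
eps, step = 8 / 255, 1 / 255

for _ in range(10):
    score = verbosity_score(image + delta)
    score.backward()
    with torch.no_grad():
        delta += step * delta.grad.sign()            # ascend on the surrogate objective
        delta.clamp_(-eps, eps)                      # stay within the perturbation budget
        delta.grad.zero_()
# A real attack would also keep image + delta in the valid pixel range.
print(float(verbosity_score(image + delta)) > float(verbosity_score(image)))   # True
```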