FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

arXiv — cs.CV • Monday, November 17, 2025 at 5:00:00 AM

Positive • Artificial Intelligence

FastDriveVLA introduces a plug-and-play, reconstruction-based token pruning framework for efficient end-to-end driving …
The development of FastDriveVLA is significant as it addresses the computational challenges faced by existing Vision-Language-Action (VLA) models …
While there are no directly related articles, the emphasis on dataset size and the importance of foreground tokens in decision-making …

— via World Pulse Now AI Editorial System

Recommended Readings

arXiv — cs.LG • 16 hours ago

Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models

Positive • Artificial Intelligence

The paper discusses the challenges of test-time prompt tuning for vision-language models, highlighting the issue of prompt optimization bias that can lead to suboptimal performance. It identifies two main causes: the model's focus on entropy minimization, which may overlook prediction accuracy, and data misalignment between visual and textual modalities. To address these issues, the authors propose a new method called Doubly Debiased Test-Time Prompt Tuning, aimed at improving model performance in downstream tasks.

arXiv — cs.CV • 2 days ago

Bridging Hidden States in Vision-Language Models

Positive • Artificial Intelligence

Vision-Language Models (VLMs) are emerging models that integrate visual content with natural language. Current methods typically fuse data either early in the encoding process or late through pooled embeddings. This paper introduces a lightweight fusion module utilizing cross-only, bidirectional attention layers to align hidden states from both modalities, enhancing understanding while keeping encoders non-causal. The proposed method aims to improve the performance of VLMs by leveraging the inherent structure of visual and textual data.

arXiv — cs.CV • 2 days ago

Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs

Positive • Artificial Intelligence

The paper addresses personalization in Vision-Language Models (VLMs), which is hampered by a lack of user-provided positive samples and by poor-quality negative samples. To address these issues, the authors introduce the Concept-as-Tree (CaT) framework, which generates diverse positive and negative samples and thereby improves VLM performance on personalization tasks.

arXiv — cs.CV • 2 days ago

NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

Neutral • Artificial Intelligence

NeuS-QA is a new neuro-symbolic pipeline designed to enhance Long Video Question Answering (LVQA) by addressing the limitations of traditional vision-language models (VLMs). While VLMs perform well with single images and short videos, they struggle with LVQA due to the need for complex temporal reasoning. NeuS-QA offers a training-free, plug-and-play solution that improves interpretability by ensuring only logic-verified segments are processed by the VLM, thus enhancing the model's ability to understand long-form video content.

arXiv — cs.CV • 2 days ago

Zero-Shot Temporal Interaction Localization for Egocentric Videos

Positive • Artificial Intelligence

The paper presents EgoLoc, a zero-shot approach to temporally localizing human-object interactions in egocentric videos. Traditional methods rely heavily on annotated action and object categories, leading to domain bias and inefficiencies. EgoLoc introduces a self-adaptive sampling strategy that produces better visual prompts for vision-language model reasoning, improving temporal interaction localization.

arXiv — cs.LG • 2 days ago

Human-Corrected Labels Learning: Enhancing Labels Quality via Human Correction of VLMs Discrepancies

Positive • Artificial Intelligence

The article discusses the introduction of Human-Corrected Labels (HCLs) to improve the quality of labels generated by Vision-Language Models (VLMs). It highlights the issues of low-quality labels and the lack of error correction in VLM outputs. The proposed method involves human intervention to correct discrepancies in VLM-generated labels, leading to enhanced annotation quality and reduced labor costs, supported by extensive experimental results.

arXiv — cs.CV • 2 days ago

Explainable Deep Convolutional Multi-Type Anomaly Detection

Positive • Artificial Intelligence

The article presents a new approach to anomaly detection called MultiTypeFCDD, which aims to differentiate between various types of anomalies while being computationally efficient. Traditional methods often struggle to classify anomalies accurately and require separate models for each object category, leading to increased costs. MultiTypeFCDD addresses these issues by utilizing image-level labels to generate multi-channel heatmaps, enhancing the specificity of anomaly identification, which is crucial for operational decision-making.

arXiv — cs.CV • 2 days ago

Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling

Positive • Artificial Intelligence

The article introduces MACT, a Multi-Agent Collaboration framework designed to enhance visual document understanding and reasoning in Vision-Language Models (VLMs). It addresses the limitations of monolithic scaling with agent-wise adaptive test-time scaling, which allows for dynamic adjustments based on the functional entities involved in visual document processing. MACT comprises four specialized agents (planning, execution, judgment, and answer), aiming to reduce cognitive overload and ensure factual accuracy through a self-correction loop.
