VAT: Vision Action Transformer by Unlocking Full Representation of ViT

arXiv — cs.CV · Tuesday, December 9, 2025 at 5:00:00 AM
  • The Vision Action Transformer (VAT) has been introduced as an architecture that enhances Vision Transformers (ViTs) by utilizing the full feature hierarchy rather than just the final layer's features. VAT processes specialized action tokens alongside visual features across all transformer layers (a minimal sketch of this layer-wise design follows the summary), achieving a 98.15% success rate on the LIBERO benchmarks in simulated manipulation tasks.
  • This development is significant as it establishes VAT as a state-of-the-art model for imitation learning, surpassing previous methods like OpenVLA-OFT. By unlocking the complete representation trajectory of vision models, VAT aims to improve robotic policy and action generation, which is crucial for advancing robotic learning and manipulation capabilities.
  • The introduction of VAT aligns with ongoing advancements in Vision-Language-Action (VLA) models, which are increasingly focusing on optimizing visual processing and representation. As various frameworks like Compressor-VLA and MAPS emerge to address inefficiencies and enhance generalization in VLA models, VAT's comprehensive approach underscores the importance of leveraging full visual hierarchies to tackle challenges in robotic manipulation and improve overall model robustness.
— via World Pulse Now AI Editorial System
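
The summary above describes the core architectural idea but not its implementation. Below is a minimal PyTorch sketch of that idea, assuming a standard pre-norm ViT encoder: learned action tokens are appended to the visual tokens and attended jointly in every transformer layer, and the policy head reads the action-token states from all layers rather than only the last one. All module names, dimensions, and the pooling scheme are illustrative assumptions, not the paper's code.

```python
# Minimal sketch (not the authors' code): learned action tokens ride through
# every ViT block alongside the patch tokens, and the per-layer action states
# are pooled so the policy head sees the full representation trajectory.
import torch
import torch.nn as nn

class ActionConditionedViT(nn.Module):
    def __init__(self, dim=256, depth=6, heads=8, num_patches=196,
                 num_action_tokens=4, action_dim=7):
        super().__init__()
        self.patch_embed = nn.Linear(768, dim)          # stand-in patch projector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.action_tokens = nn.Parameter(torch.zeros(1, num_action_tokens, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.num_action_tokens = num_action_tokens
        # the action head consumes action-token states from *all* layers
        self.head = nn.Linear(dim * depth, action_dim)

    def forward(self, patches):                         # patches: (B, N, 768)
        x = self.patch_embed(patches) + self.pos_embed
        a = self.action_tokens.expand(x.size(0), -1, -1)
        tokens = torch.cat([x, a], dim=1)               # visual + action tokens
        layer_states = []
        for blk in self.blocks:
            tokens = blk(tokens)
            layer_states.append(tokens[:, -self.num_action_tokens:].mean(dim=1))
        trajectory = torch.cat(layer_states, dim=-1)    # full layer hierarchy
        return self.head(trajectory)                    # predicted action

model = ActionConditionedViT()
action = model(torch.randn(2, 196, 768))                # -> shape (2, 7)
```

Concatenating per-layer action states is just one simple way to expose the full representation trajectory to the action head; the paper may aggregate layers differently.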


Continue Reading
Microsoft Tests Copilot-Powered Tool to Modernize JavaScript/TypeScript in VS Code
Positive · Artificial Intelligence
Microsoft has previewed a new tool in VS Code Insiders that leverages GitHub Copilot to modernize JavaScript and TypeScript applications by upgrading npm dependencies and addressing breaking changes. This initiative aims to enhance the development experience for programmers using these languages.
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
Positive · Artificial Intelligence
The introduction of OMNIGUARD presents a novel approach to AI safety moderation by enhancing the detection of harmful prompts across various languages and modalities, addressing the vulnerabilities of large language models (LLMs) to misuse. This method improves classification accuracy by 11.57% over existing baselines, marking a significant advancement in AI safety protocols.
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
Positive · Artificial Intelligence
The introduction of HybridToken-VLM (HTC-VLM) presents a novel approach to hybrid token compression for vision-language models (VLMs), addressing the computational challenges posed by traditional methods that struggle with high memory and context window demands. HTC-VLM utilizes a dual-channel framework to separate fine-grained details and symbolic anchors, achieving an impressive average performance retention of 87.2% across seven benchmarks.
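
The blurb mentions a dual-channel design separating fine-grained details from symbolic anchors, but not how the channels are built. Here is a minimal sketch of one plausible reading, with all names, token counts, and the attention-pooling choice being assumptions rather than HTC-VLM's actual mechanism: a few learned anchor queries attend over the visual tokens while the rest are averaged into a short compressed channel.

```python
# Hypothetical hybrid token compression sketch (not the HTC-VLM implementation):
# a handful of "anchor" tokens keep salient detail via attention pooling, while
# the remaining information is merged into a short compressed channel.
import torch
import torch.nn as nn

class HybridTokenCompressor(nn.Module):
    def __init__(self, dim=1024, num_anchors=8, compressed_len=32, heads=8):
        super().__init__()
        self.anchor_queries = nn.Parameter(torch.randn(1, num_anchors, dim) * 0.02)
        self.anchor_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool1d(compressed_len)        # coarse channel

    def forward(self, vis_tokens):                              # (B, N, D), e.g. N=576
        B = vis_tokens.size(0)
        q = self.anchor_queries.expand(B, -1, -1)
        anchors, _ = self.anchor_attn(q, vis_tokens, vis_tokens)        # (B, 8, D)
        coarse = self.pool(vis_tokens.transpose(1, 2)).transpose(1, 2)  # (B, 32, D)
        return torch.cat([anchors, coarse], dim=1)              # (B, 40, D) -> to LLM

compressor = HybridTokenCompressor()
out = compressor(torch.randn(2, 576, 1024))   # 576 visual tokens compressed to 40
```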
Guiding WaveMamba with Frequency Maps for Image Debanding
Positive · Artificial Intelligence
A new method for image debanding has been proposed, utilizing the Wavelet State Space Model and frequency masking maps to effectively reduce banding artifacts in images, particularly in smooth areas like skies. This technique has shown promising results in suppressing banding compared to existing methods, achieving a DBI value of 0.082 on the BAND-2k dataset.
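
The frequency masking maps that guide the model are not specified in this summary. A minimal sketch, assuming a wavelet-based construction with PyWavelets: local high-frequency energy is measured from the detail sub-bands, and its complement marks the smooth regions (such as skies) where banding tends to appear.

```python
# Illustrative frequency-mask construction (an assumption, not the paper's method):
# low local high-frequency energy ~ smooth region ~ likely banding, so the mask
# highlights where a debanding model should focus.
import numpy as np
import pywt
from scipy.ndimage import uniform_filter, zoom

def frequency_mask(gray, wavelet="haar", smooth=15):
    """gray: 2-D float array in [0, 1]; returns a [0, 1] mask, 1 = smooth region."""
    _, (cH, cV, cD) = pywt.dwt2(gray, wavelet)           # wavelet detail sub-bands
    energy = np.sqrt(cH**2 + cV**2 + cD**2)              # high-frequency energy
    energy = uniform_filter(energy, size=smooth)         # local average
    energy = zoom(energy, np.array(gray.shape) / np.array(energy.shape), order=1)
    energy = energy / (energy.max() + 1e-8)
    return 1.0 - energy                                  # emphasize smooth areas

mask = frequency_mask(np.linspace(0, 1, 256 * 256).reshape(256, 256))
```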
RAVES-Calib: Robust, Accurate and Versatile Extrinsic Self Calibration Using Optimal Geometric Features
Positive · Artificial Intelligence
A new LiDAR-camera calibration toolkit named RAVES-Calib has been introduced, allowing robust and accurate extrinsic self-calibration from a single pair of a LiDAR point cloud and a camera image in targetless environments. The method improves calibration accuracy by adaptively weighting feature costs based on their distribution, and has been validated through extensive experiments across various sensors.
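
The cost function and weighting scheme are not detailed here. Below is a minimal sketch of the general pattern such a calibration follows, assuming known camera intrinsics and matched LiDAR/image features: a 6-DoF extrinsic is refined by minimizing a per-feature weighted reprojection error with SciPy. The inverse-distance weights are a placeholder for the paper's distribution-based adaptive weighting.

```python
# Generic weighted LiDAR-to-camera extrinsic refinement (a sketch, not RAVES-Calib):
# minimize the weighted reprojection error of matched LiDAR/image features over a
# 6-DoF extrinsic parameterized as (rotation vector, translation).
import numpy as np
from scipy.spatial.transform import Rotation
from scipy.optimize import least_squares

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])   # assumed intrinsics

def residuals(params, lidar_pts, img_pts, weights):
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    cam = lidar_pts @ R.T + t                  # transform into the camera frame
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]                # perspective projection
    return (weights[:, None] * (uv - img_pts)).ravel()

def calibrate(lidar_pts, img_pts, init=np.zeros(6)):
    # placeholder weighting: nearer features trusted more (the paper instead
    # derives adaptive weights from the feature distribution)
    weights = 1.0 / (1.0 + np.linalg.norm(lidar_pts, axis=1))
    sol = least_squares(residuals, init, args=(lidar_pts, img_pts, weights))
    return sol.x                               # rotation vector (3) + translation (3)

# toy usage with synthetic correspondences under an identity extrinsic
pts = np.random.uniform([-1, -1, 4], [1, 1, 8], size=(50, 3))
uv_gt = pts @ K.T
uv_gt = uv_gt[:, :2] / uv_gt[:, 2:3]
extrinsic = calibrate(pts, uv_gt)              # converges near zeros
```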
Empowering smart app development with SolidGPT: an edge-cloud hybrid AI agent framework
Positive · Artificial Intelligence
SolidGPT, an open-source edge-cloud hybrid AI agent framework, has been introduced to enhance mobile and software development workflows by integrating Large Language Models (LLMs) while balancing semantic awareness, developer productivity, and data privacy. The tool allows developers to interactively query their codebases and automate project workflows, significantly improving efficiency.
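
The routing logic behind the edge-cloud split is not described in this summary. A minimal sketch of the general pattern, with the sensitivity check, size threshold, and model stubs all being illustrative assumptions rather than SolidGPT's implementation:

```python
# Illustrative edge-cloud routing sketch (not SolidGPT's implementation):
# queries touching sensitive code stay on a local model; larger, non-sensitive
# work may go to a cloud LLM. Model backends are stubbed out.
import re

SENSITIVE = re.compile(r"(api[_-]?key|secret|password|token)", re.IGNORECASE)

def run_local_model(prompt: str) -> str:      # stub for an on-device LLM
    return f"[local model] {prompt[:60]}..."

def run_cloud_model(prompt: str) -> str:      # stub for a hosted LLM API
    return f"[cloud model] {prompt[:60]}..."

def route_query(prompt: str, code_context: str, max_local_chars: int = 4000) -> str:
    sensitive = bool(SENSITIVE.search(code_context))
    small_enough = len(prompt) + len(code_context) <= max_local_chars
    if sensitive or small_enough:             # keep private or simple work on the edge
        return run_local_model(prompt + "\n" + code_context)
    return run_cloud_model(prompt + "\n" + code_context)

print(route_query("Explain this function", "def auth(token): ..."))
```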
Open Polymer Challenge: Post-Competition Report
Positive · Artificial Intelligence
The Open Polymer Challenge (OPC) has successfully launched a community-developed benchmark for polymer informatics, releasing a dataset of 10,000 polymers and five key properties. This initiative aims to enhance machine learning applications in discovering sustainable polymer materials, addressing the current limitations posed by the lack of accessible polymer datasets.
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Neutral · Artificial Intelligence
AraLingBench has been introduced as a human-annotated benchmark aimed at evaluating the Arabic linguistic capabilities of large language models (LLMs), covering grammar, morphology, spelling, reading comprehension, and syntax through 150 expert-designed questions. The evaluation of 35 Arabic and bilingual LLMs indicates a disparity between high performance on knowledge-based benchmarks and true linguistic understanding, with many models relying on memorization rather than comprehension.
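
The summary does not describe how the 150 questions are scored. A minimal sketch of per-skill accuracy tabulation for such a benchmark, with the item layout and exact-match scoring being assumptions:

```python
# Illustrative per-category scoring for a QA benchmark (data layout assumed):
# each item has a category (grammar, morphology, spelling, reading, syntax),
# a gold answer, and a model prediction; accuracy is reported per category.
from collections import defaultdict

def score_by_category(items):
    """items: iterable of dicts with 'category', 'gold', and 'prediction' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["category"]] += 1
        correct[item["category"]] += int(item["prediction"] == item["gold"])
    return {cat: correct[cat] / total[cat] for cat in total}

demo = [
    {"category": "grammar", "gold": "B", "prediction": "B"},
    {"category": "syntax", "gold": "A", "prediction": "C"},
]
print(score_by_category(demo))   # {'grammar': 1.0, 'syntax': 0.0}
```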