Ultra-Light Test-Time Adaptation for Vision--Language Models

arXiv — cs.CV · Thursday, November 13, 2025 at 5:00:00 AM
Ultra-Light Test-Time Adaptation (UL-TTA) addresses two persistent problems that Vision-Language Models (VLMs) face under domain shift: feature drift and miscalibration. Unlike existing test-time adaptation methods that rely on backpropagation and carry heavy memory costs, UL-TTA is fully training-free and adapts only logit-level parameters. The method improves top-1 accuracy by an average of 4.7 points over zero-shot CLIP and reduces expected calibration error (ECE) by 20-30%. Its effectiveness was validated across extensive benchmarks, including PACS and DomainNet, covering 726,000 test samples. Because it avoids gradient updates entirely, UL-TTA is particularly relevant for real-time streaming and edge-computing environments, where traditional adaptation methods may not be feasible.
— via World Pulse Now AI Editorial System
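
To make the backpropagation-free, logit-level idea more concrete, below is a minimal sketch of test-time adaptation for a CLIP-style zero-shot classifier. It is not the UL-TTA implementation: the class name, the fixed temperature, the confidence threshold, and the exponential-moving-average prototype update are all illustrative assumptions; only the general principle (adapt quantities that feed the logits, never the backbone, with no gradients) comes from the article.

```python
# Minimal sketch of training-free, logit-level test-time adaptation for a
# CLIP-style zero-shot classifier. NOT the UL-TTA method: the EMA prototype
# update, threshold, and temperature here are illustrative assumptions.
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class LogitLevelAdapter:
    def __init__(self, text_prototypes, temperature=100.0,
                 ema_rate=0.01, conf_threshold=0.6):
        # text_prototypes: (num_classes, dim) CLIP text embeddings, one per class.
        self.prototypes = text_prototypes / np.linalg.norm(
            text_prototypes, axis=1, keepdims=True)
        self.temperature = temperature      # fixed logit scale (not adapted here)
        self.ema_rate = ema_rate            # strength of the prototype update
        self.conf_threshold = conf_threshold

    def predict_and_adapt(self, image_embedding):
        # image_embedding: (dim,) CLIP image embedding for one test sample.
        v = image_embedding / np.linalg.norm(image_embedding)
        logits = self.temperature * (self.prototypes @ v)  # cosine-similarity logits
        probs = softmax(logits)
        pred = int(probs.argmax())
        conf = float(probs[pred])
        # Logit-level adaptation: blend confident test embeddings into the
        # pseudo-labelled class prototype. Closed-form, no gradients, O(dim) memory.
        if conf >= self.conf_threshold:
            p = (1 - self.ema_rate) * self.prototypes[pred] + self.ema_rate * v
            self.prototypes[pred] = p / np.linalg.norm(p)
        return pred, probs
```

Because the only state is the prototype matrix, a sketch like this keeps per-sample cost and memory negligible, which is the property that makes logit-level adaptation attractive for streaming and edge settings. Note that the paper additionally targets calibration (the reported ECE reduction), which this simplified sketch does not address.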

Continue Reading
Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification
Positive · Artificial Intelligence
A new framework for cascading multi-agent anomaly detection in surveillance systems has been introduced, utilizing vision-language models and embedding-based classification to enhance real-time performance and semantic interpretability. This approach integrates various methodologies, including reconstruction-gated filtering and object-level assessments, to address the complexities of detecting anomalies in dynamic visual environments.
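
As a rough illustration of the cascading idea described above, the sketch below forwards only frames with high reconstruction error to an embedding-based classifier. The `reconstruct`, `embed_image`, and `embed_text` callables, the label set, and the threshold are hypothetical stand-ins for an autoencoder and a vision-language encoder; this is not the paper's implementation.

```python
# Hedged sketch of a two-stage cascade: a cheap reconstruction-error gate,
# then embedding-based classification of the flagged frame. The model
# callables and labels are hypothetical placeholders.
import numpy as np

ANOMALY_LABELS = ["normal activity", "fight", "abandoned object", "intrusion"]


def cascade_detect(frame, reconstruct, embed_image, embed_text,
                   gate_threshold=0.05):
    # Stage 1: reconstruction-gated filtering. Frames the autoencoder
    # reconstructs well are treated as normal and skipped.
    error = float(np.mean((frame - reconstruct(frame)) ** 2))
    if error < gate_threshold:
        return "normal activity", error

    # Stage 2: embedding-based classification. Compare the frame embedding
    # against text embeddings of candidate anomaly descriptions.
    v = embed_image(frame)
    v = v / np.linalg.norm(v)
    scores = [float(v @ (embed_text(label) / np.linalg.norm(embed_text(label))))
              for label in ANOMALY_LABELS]
    return ANOMALY_LABELS[int(np.argmax(scores))], error
```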
VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark
Neutral · Artificial Intelligence
The introduction of VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark, aims to assess the capabilities of vision-language models (VLMs) in interpreting and reasoning over visual and textual information in Vietnamese. This benchmark includes 2.5k multimodal questions across seven diverse tasks, emphasizing genuine multimodal integration rather than text-only cues.
