Ultra-Light Test-Time Adaptation for Vision-Language Models

arXiv — cs.CV · Thursday, November 13, 2025 at 5:00:00 AM
The introduction of Ultra-Light Test-Time Adaptation (UL-TTA) marks a significant advance for Vision-Language Models (VLMs), particularly in addressing feature drift and miscalibration under domain shift. Unlike existing test-time adaptation methods that rely on backpropagation and heavy memory usage, UL-TTA operates in a fully training-free manner, adapting only logit-level parameters. The approach delivers a notable performance gain: an average improvement of 4.7 points in top-1 accuracy over zero-shot CLIP and a 20-30% reduction in expected calibration error (ECE). Its effectiveness was validated on extensive benchmarks, including PACS and DomainNet, covering 726,000 test samples. UL-TTA is particularly relevant for real-time streaming and edge-computing applications, where traditional backpropagation-based methods are often infeasible.
— via World Pulse Now AI Editorial System
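The paper's exact update rules are not reproduced here, but the flavor of "logit-level, training-free" adaptation can be sketched in a few lines: keep one prototype per class on top of frozen CLIP embeddings and drift it with a closed-form running average, with no gradients involved. The EMA update, the confidence threshold, and all names below are illustrative assumptions, not UL-TTA's actual algorithm.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class LogitLevelTTA:
    """Training-free test-time adaptation on frozen CLIP outputs.

    Only logit-level state is kept: one prototype per class (initialized
    from the text embeddings) and a scalar temperature. No gradients,
    no backprop, O(num_classes * dim) memory.
    """

    def __init__(self, text_embeds, temp=0.01, momentum=0.99, conf_thresh=0.5):
        self.prototypes = l2norm(np.asarray(text_embeds, dtype=np.float64))
        self.temp = temp                # softmax temperature (logit scale)
        self.momentum = momentum        # EMA factor for prototype drift
        self.conf_thresh = conf_thresh  # only confident samples update state

    def predict(self, image_embed):
        z = l2norm(np.asarray(image_embed, dtype=np.float64))
        logits = self.prototypes @ z / self.temp
        logits -= logits.max()          # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return probs

    def adapt(self, image_embed):
        """One closed-form update per test sample; returns the prediction."""
        probs = self.predict(image_embed)
        k, conf = int(probs.argmax()), float(probs.max())
        if conf >= self.conf_thresh:
            # Drift the winning prototype toward the test feature (EMA).
            z = l2norm(np.asarray(image_embed, dtype=np.float64))
            self.prototypes[k] = l2norm(
                self.momentum * self.prototypes[k] + (1.0 - self.momentum) * z
            )
        return probs

# Toy usage: 3 classes, 8-dim embeddings standing in for CLIP outputs.
rng = np.random.default_rng(0)
tta = LogitLevelTTA(rng.normal(size=(3, 8)))
for _ in range(5):
    print(tta.adapt(rng.normal(size=8)).round(3))
```

Because the state is just a (num_classes × dim) matrix and a few scalars, an update of this kind costs one matrix-vector product per sample, which is what makes the streaming and edge scenarios above plausible.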


Recommended Readings
Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models
Positive · Artificial Intelligence
The paper discusses the challenges of test-time prompt tuning for vision-language models, highlighting the issue of prompt optimization bias that can lead to suboptimal performance. It identifies two main causes: the model's focus on entropy minimization, which may overlook prediction accuracy, and data misalignment between visual and textual modalities. To address these issues, the authors propose a new method called Doubly Debiased Test-Time Prompt Tuning, aimed at improving model performance in downstream tasks.
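To see the bias concretely, here is a minimal PyTorch-style sketch of the vanilla entropy-minimization objective that test-time prompt tuning methods typically optimize; the shapes, the toy text encoder, and all names are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

# Stand-ins for frozen CLIP pieces (assumed shapes, for illustration only).
dim, n_cls, n_ctx = 512, 10, 4
image_feat = F.normalize(torch.randn(1, dim), dim=-1)   # frozen image embedding
class_embed = torch.randn(n_cls, dim)                   # frozen class-name tokens

ctx = torch.zeros(n_ctx, dim, requires_grad=True)       # learnable prompt context
opt = torch.optim.AdamW([ctx], lr=5e-3)

for step in range(10):
    # "Encode" each class as its token plus the mean context (toy text encoder).
    text_feat = F.normalize(ctx.mean(0, keepdim=True) + class_embed, dim=-1)
    logits = 100.0 * image_feat @ text_feat.t()          # CLIP-style scaled cosine
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum()  # objective being minimized
    opt.zero_grad(); entropy.backward(); opt.step()
```

Driving this entropy down sharpens the prediction but does not by itself make it more accurate, which is precisely the optimization bias the authors set out to debias.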
Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning
Positive · Artificial Intelligence
The paper titled 'Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning' addresses the challenges of class-incremental learning (CIL) in vision-language models like CLIP. It introduces a two-stage framework called DMC, which separates the adaptation of the vision encoder from the optimization of textual soft prompts. This approach aims to mitigate classifier bias and maintain cross-modal alignment, enhancing the model's ability to learn new categories without forgetting previously acquired knowledge.
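As a rough picture of what such a two-stage split looks like in code, the toy loop below first trains a vision-side adapter against frozen text features, then freezes the vision side and tunes additive soft-prompt offsets over the enlarged label set. The stage structure, losses, and names are assumptions for illustration, not the DMC recipe.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_old, n_new = 64, 5, 3

# Frozen stand-ins for CLIP towers (illustrative shapes only).
img_encoder = torch.nn.Linear(32, dim)                              # toy image tower
text_feats = F.normalize(torch.randn(n_old + n_new, dim), dim=-1)   # class text feats

# Trainable pieces, one per stage.
vision_adapter = torch.nn.Linear(dim, dim)                          # stage-1 parameters
soft_prompts = torch.zeros(n_old + n_new, dim, requires_grad=True)  # stage-2 parameters

images = torch.randn(16, 32)
labels = torch.randint(n_old, n_old + n_new, (16,))  # labels from the new task

# Stage 1: adapt only the vision side; the text tower stays frozen so
# image features remain in CLIP's joint embedding space.
opt1 = torch.optim.AdamW(vision_adapter.parameters(), lr=1e-3)
for _ in range(20):
    feats = F.normalize(vision_adapter(img_encoder(images)), dim=-1)
    loss = F.cross_entropy(100 * feats @ text_feats.t(), labels)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: freeze the vision side, tune textual soft prompts as additive
# offsets on the class text features, countering bias toward new classes.
opt2 = torch.optim.AdamW([soft_prompts], lr=1e-3)
for _ in range(20):
    with torch.no_grad():
        feats = F.normalize(vision_adapter(img_encoder(images)), dim=-1)
    tuned_text = F.normalize(text_feats + soft_prompts, dim=-1)
    loss = F.cross_entropy(100 * feats @ tuned_text.t(), labels)
    opt2.zero_grad(); loss.backward(); opt2.step()
```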
Bridging Hidden States in Vision-Language Models
Positive · Artificial Intelligence
Vision-Language Models (VLMs) integrate visual content with natural language. Current methods typically fuse the two modalities either early in the encoding process or late through pooled embeddings. This paper introduces a lightweight fusion module that uses cross-only, bidirectional attention layers to align hidden states from both modalities, enhancing understanding while keeping the encoders non-causal. The proposed method aims to improve VLM performance by leveraging the inherent structure of visual and textual data.
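A minimal sketch of a cross-only, bidirectional fusion layer of this kind is shown below; the dimensions, residual merge, and module name are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossOnlyFusion(nn.Module):
    """Fuse vision and text hidden states with cross-only attention.

    Each modality attends only to the other (no self-attention), in both
    directions and without causal masks; residual connections keep the
    original hidden states intact. Sizes are illustrative.
    """

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.vis_reads_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_reads_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (B, Nv, D) image-token states; txt: (B, Nt, D) text-token states.
        v_out, _ = self.vis_reads_txt(self.norm_v(vis), txt, txt)  # vision queries text
        t_out, _ = self.txt_reads_vis(self.norm_t(txt), vis, vis)  # text queries vision
        return vis + v_out, txt + t_out  # residual merge, fully non-causal

fusion = CrossOnlyFusion()
vis, txt = torch.randn(2, 49, 256), torch.randn(2, 12, 256)
fv, ft = fusion(vis, txt)
print(fv.shape, ft.shape)  # torch.Size([2, 49, 256]) torch.Size([2, 12, 256])
```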
CLIPPan: Adapting CLIP as A Supervisor for Unsupervised Pansharpening
Positive · Artificial Intelligence
The article presents CLIPPan, an unsupervised pansharpening framework that uses CLIP, a vision-language model, as a supervisor. This approach addresses the challenges faced by supervised pansharpening methods, particularly the domain adaptation issues arising from the disparity between simulated low-resolution training data and real-world high-resolution scenarios. The framework is designed to improve the model's understanding of the pansharpening process and its ability to recognize various image types, ultimately setting a new state of the art in unsupervised full-resolution pansharpening.
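The core "CLIP as supervisor" idea can be sketched as follows: a frozen CLIP-like encoder scores the fusion network's output against a text embedding, and that similarity becomes a loss term alongside a fidelity term. The toy encoders, the prompt handling, and the loss weighting below are assumptions, not CLIPPan's actual protocol.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Frozen stand-ins for CLIP's towers (illustrative only; a real setup would
# load an actual CLIP checkpoint and encode prompts such as
# "a well-fused high-resolution multispectral image").
image_proj = nn.Conv2d(4, 8, 3, padding=1)      # toy "CLIP image encoder"
good_text = F.normalize(torch.randn(8), dim=0)  # embedding of a positive prompt

fusion_net = nn.Conv2d(5, 4, 3, padding=1)      # pan (1ch) + ms (4ch) -> fused (4ch)
opt = torch.optim.AdamW(fusion_net.parameters(), lr=1e-3)

pan = torch.randn(2, 1, 32, 32)                 # high-res panchromatic input
ms = torch.randn(2, 4, 32, 32)                  # upsampled multispectral input

for _ in range(10):
    fused = fusion_net(torch.cat([pan, ms], dim=1))
    img_emb = F.normalize(image_proj(fused).mean(dim=(2, 3)), dim=-1)  # (B, 8)
    clip_loss = (1 - img_emb @ good_text).mean()  # pull output toward the prompt
    fidelity = F.l1_loss(fused, ms)               # keep spectra close to the MS input
    loss = clip_loss + fidelity
    opt.zero_grad(); loss.backward(); opt.step()
```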
FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
Positive · Artificial Intelligence
FastDriveVLA is a novel framework designed for efficient end-to-end autonomous driving through a reconstruction-based visual token pruning method. This approach addresses the high computational costs associated with long visual tokens in Vision-Language-Action (VLA) models. By focusing on retaining visual tokens that contain essential foreground information, FastDriveVLA aims to enhance decision-making in driving scenarios, marking a significant advancement in the application of VLA models in autonomous systems.
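In outline, plug-and-play token pruning of this kind reduces to scoring visual tokens and forwarding only the top-k to the decoder, as in the sketch below; the linear scorer stands in for FastDriveVLA's reconstruction-trained pruner, and all shapes and names are illustrative.

```python
import torch
import torch.nn as nn

dim, n_tokens, keep = 256, 196, 64
scorer = nn.Linear(dim, 1)                   # token-importance head (stand-in)

vis_tokens = torch.randn(2, n_tokens, dim)   # (B, N, D) from the vision encoder
scores = scorer(vis_tokens).squeeze(-1)      # (B, N) importance per token
topk = scores.topk(keep, dim=1).indices      # indices of tokens to keep
idx = topk.unsqueeze(-1).expand(-1, -1, dim) # (B, keep, D) gather index
pruned = vis_tokens.gather(1, idx)           # (B, keep, D) fed to the decoder
print(pruned.shape)                          # torch.Size([2, 64, 256])
```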
NP-LoRA: Null Space Projection Unifies Subject and Style in LoRA Fusion
Positive · Artificial Intelligence
The article introduces NP-LoRA, a novel framework for Low-Rank Adaptation (LoRA) fusion that addresses the issue of interference in existing methods. Traditional weight-based merging often leads to one LoRA dominating another, resulting in degraded fidelity. NP-LoRA utilizes a projection-based approach to maintain subspace separation, thereby enhancing the quality of fusion by preventing structural interference among principal directions.
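The geometry is easy to demonstrate: take the top singular directions of one LoRA update and project the other update onto their orthogonal complement before merging, so the two cannot collide along those principal directions. The rank choice and merge rule below are illustrative assumptions, not the exact NP-LoRA formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
dW_style = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))  # style LoRA update
dW_subj = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))   # subject LoRA update

# Principal (row-space) directions of the style update.
_, _, Vt = np.linalg.svd(dW_style, full_matrices=False)
V = Vt[:r].T                             # (d, r) top singular directions
P_null = np.eye(d) - V @ V.T             # projector onto their null space

dW_subj_np = dW_subj @ P_null            # subject update, style directions removed
dW_merged = dW_style + dW_subj_np        # merge without structural interference

# The projected subject update is now orthogonal to the style subspace:
print(np.abs(dW_subj_np @ V).max())      # ~1e-14
```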
Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs
Positive · Artificial Intelligence
The paper titled 'Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs' discusses the advancements in Vision-Language Models (VLMs) aimed at enhancing personalization. It highlights the challenges posed by the lack of user-provided positive samples and the poor quality of negative samples. To address these issues, the authors introduce the Concept-as-Tree (CaT) framework, which generates diverse positive and negative samples, thus improving VLM performance in personalization tasks.
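The controllable-sampling idea can be pictured with a toy tree: children of the target concept vary attributes to yield positives, while siblings under the same parent supply hard negatives. The tree contents and prompt template below are invented for illustration; CaT's actual tree construction and sample generator are specified in the paper.

```python
# Toy concept tree: parent category -> concepts -> attribute variations.
concept_tree = {
    "pet": {
        "my dog Rex": ["on the beach", "wearing a raincoat", "at night"],
        "a generic dog": ["on the beach", "in a park"],
        "a cat": ["on a sofa"],
    }
}

target = "my dog Rex"
prompt = "a photo of {c}, {attr}"

# Positives: the target concept under varied attributes.
positives = [prompt.format(c=target, attr=a)
             for a in concept_tree["pet"][target]]
# Hard negatives: sibling concepts under the same parent node.
negatives = [prompt.format(c=c, attr=a)
             for c, attrs in concept_tree["pet"].items() if c != target
             for a in attrs]

print(positives[0])  # a photo of my dog Rex, on the beach
print(negatives[0])  # a photo of a generic dog, on the beach
```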
Human-Corrected Labels Learning: Enhancing Labels Quality via Human Correction of VLMs Discrepancies
Positive · Artificial Intelligence
The article discusses the introduction of Human-Corrected Labels (HCLs) to improve the quality of labels generated by Vision-Language Models (VLMs). It highlights the issues of low-quality labels and the lack of error correction in VLM outputs. The proposed method involves human intervention to correct discrepancies in VLM-generated labels, leading to enhanced annotation quality and reduced labor costs, supported by extensive experimental results.
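The "correct only where models disagree" workflow can be sketched in a few lines: accept a label automatically when independent VLM annotators agree, and route only the discrepancies to a human, which is where the labor savings come from. The two-annotator setup and the review hook below are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch of discrepancy-driven human correction of VLM labels.
def collect_labels(samples, vlm_a, vlm_b, ask_human):
    labeled = []
    for x in samples:
        ya, yb = vlm_a(x), vlm_b(x)
        # Agreement -> trust the VLMs; disagreement -> human arbitration.
        labeled.append((x, ya if ya == yb else ask_human(x, ya, yb)))
    return labeled

# Toy usage with stand-in annotators.
data = ["img_0", "img_1", "img_2"]
vlm_a = lambda x: "cat" if x != "img_1" else "dog"
vlm_b = lambda x: "cat"
human = lambda x, a, b: "cat"  # the human picks/corrects the final label
print(collect_labels(data, vlm_a, vlm_b, human))
```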