Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

arXiv — cs.LG · Tuesday, November 4, 2025 at 5:00:00 AM
A recent study addresses object-context shortcuts in vision-language models, which undermine the reliability of zero-shot recognition. Framing the problem as one of causal inference, the researchers ask whether a model's predictions remain consistent when the same object appears in different environments. Their approach operates directly in the representation space of CLIP, a prominent vision-language model, to analyze and calibrate these contextual effects. This offers a clearer picture of how background biases shape model outputs and a pathway toward more robust recognition systems. By focusing on debiasing zero-shot recognition, the work highlights the importance of reducing environmental dependencies in AI models and contributes to the generalizability and trustworthiness of vision-language applications.
— via World Pulse Now AI Editorial System
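
To make the idea concrete, here is a minimal sketch of a representation-level counterfactual check in CLIP's embedding space. It is an illustrative heuristic, not the paper's method: the context direction is estimated from hand-written background prompts, the projection step is a simple linear removal of that direction, and the class prompts, the context prompts, and the image filename are all assumptions. It uses the OpenAI `clip` package.

```python
# Hypothetical sketch: probe whether a CLIP zero-shot prediction depends on context
# by removing a context direction from the image embedding and re-classifying.
# This is an illustrative heuristic, not the calibration method from the paper.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assumed object classes and background descriptions for illustration.
object_prompts = ["a photo of a cow", "a photo of a camel", "a photo of a boat"]
context_prompts = ["a photo of a grassy field", "a photo of a sandy desert", "a photo of open water"]

with torch.no_grad():
    text_obj = model.encode_text(clip.tokenize(object_prompts).to(device))
    text_ctx = model.encode_text(clip.tokenize(context_prompts).to(device))
    text_obj = text_obj / text_obj.norm(dim=-1, keepdim=True)
    text_ctx = text_ctx / text_ctx.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("cow_on_beach.jpg")).unsqueeze(0).to(device)  # assumed file
    img = model.encode_image(image)
    img = img / img.norm(dim=-1, keepdim=True)

    # Factual zero-shot prediction.
    factual_probs = (100.0 * img @ text_obj.T).softmax(dim=-1)

    # Counterfactual-style intervention: project out the average context direction
    # from the image embedding, renormalize, and classify again.
    ctx_direction = text_ctx.mean(dim=0, keepdim=True)
    ctx_direction = ctx_direction / ctx_direction.norm(dim=-1, keepdim=True)
    img_cf = img - (img @ ctx_direction.T) * ctx_direction
    img_cf = img_cf / img_cf.norm(dim=-1, keepdim=True)
    counterfactual_probs = (100.0 * img_cf @ text_obj.T).softmax(dim=-1)

# A large gap between the two distributions suggests the prediction leans on context.
print("factual:", factual_probs.cpu().numpy())
print("counterfactual:", counterfactual_probs.cpu().numpy())
```

In this toy setup, a stable prediction under the intervention is read as evidence that the model recognizes the object itself rather than its typical environment.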

Continue Reading
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Positive · Artificial Intelligence
A new approach called Image Complexity-Aware Retrieval (ICAR) has been proposed to enhance vision-language models by allowing vision transformers to allocate computational resources based on image complexity. This method enables simpler images to be processed with less compute while ensuring that complex images are analyzed in full detail, maintaining cross-modal alignment for effective text matching.
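
As a rough illustration of complexity-aware adaptive computation, the sketch below uses a cheap complexity proxy (mean gradient magnitude) to decide how many encoder blocks of a vision transformer to run before pooling a feature for text matching. The proxy, the thresholds, and the function names are assumptions for illustration; this is not the ICAR implementation.

```python
# Hypothetical sketch of complexity-aware adaptive compute: a cheap edge-density
# proxy selects the encoder depth used for an image. Thresholds and pooling are
# illustrative assumptions, not the ICAR method.
import torch
import torch.nn.functional as F

def complexity_score(image: torch.Tensor) -> float:
    """Rough complexity proxy: mean Sobel gradient magnitude of a (3, H, W) image in [0, 1]."""
    gray = image.mean(dim=0, keepdim=True).unsqueeze(0)  # (1, 1, H, W)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return (gx.pow(2) + gy.pow(2)).sqrt().mean().item()

def encode_adaptive(blocks, tokens: torch.Tensor, score: float,
                    low: float = 0.05, high: float = 0.15) -> torch.Tensor:
    """Run a fraction of the transformer blocks chosen from the complexity score.

    `blocks` is a sequence of transformer encoder blocks (e.g. an nn.ModuleList),
    `tokens` is a (batch, num_tokens, dim) patch-token tensor.
    """
    if score < low:
        depth = max(1, len(blocks) // 3)        # simple image: shallow pass
    elif score < high:
        depth = max(1, (2 * len(blocks)) // 3)  # moderate image: partial pass
    else:
        depth = len(blocks)                     # complex image: full depth
    for block in blocks[:depth]:
        tokens = block(tokens)
    return tokens.mean(dim=1)                   # pooled embedding for cross-modal matching
```

Keeping the pooled output in the same embedding space regardless of depth is what would allow such a scheme to preserve cross-modal alignment with the text encoder while spending less compute on simple images.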
