A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models

arXiv (cs.CL) | Tuesday, November 4, 2025 at 5:00:00 AM
A recent study examines bias in large vision-language models, with a focus on chain-of-thought reasoning. It looks at how these models articulate their reasoning and at how both text and image biases affect their performance. Understanding these factors matters for improving the reliability and transparency of AI systems in real-world applications.
— via World Pulse Now AI Editorial System

Recommended Readings
Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
Positive | Artificial Intelligence
The paper introduces VLM3D, a novel framework that utilizes vision-language models (VLMs) to enhance text-to-3D generation. It addresses two major limitations in current models: the lack of fine-grained semantic alignment and inadequate 3D spatial understanding. VLM3D employs a dual-query critic signal to evaluate both semantic fidelity and geometric coherence, significantly improving the generation process. The framework demonstrates its effectiveness across different paradigms, marking a step forward in 3D generation technology.
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
Positive | Artificial Intelligence
The article presents a new framework called GMAT, which enhances Multiple Instance Learning (MIL) for whole slide image (WSI) classification. By integrating vision-language models (VLMs), GMAT aims to improve the generation of clinical descriptions that are more expressive and medically specific. This addresses limitations in existing methods that rely on large language models (LLMs) for generating descriptions, which often lack domain grounding and detailed medical specificity, thus improving alignment with visual features.
On the Entropy Calibration of Language Models
Neutral | Artificial Intelligence
The paper examines entropy calibration in language models, focusing on whether their entropy aligns with log loss on human text. Previous studies indicated that as text generation lengthens, entropy increases while text quality declines, highlighting a fundamental issue in autoregressive models. The authors investigate whether miscalibration can improve with scale and if calibration without tradeoffs is theoretically feasible, analyzing the scaling behavior concerning dataset size and power law exponents.
Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models
Positive | Artificial Intelligence
The paper discusses the challenges of test-time prompt tuning for vision-language models, highlighting the issue of prompt optimization bias that can lead to suboptimal performance in downstream tasks. It identifies two main causes: the model's focus on entropy minimization, which may overlook prediction accuracy, and data misalignment between visual and textual modalities. To address these issues, the authors propose a new method called Doubly Debiased Test-Time Prompt Tuning, aimed at improving model performance in zero-shot settings.
Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs
Positive | Artificial Intelligence
The paper addresses personalization of Vision-Language Models (VLMs), which is hampered by a lack of user-provided positive samples and by the poor quality of negative samples. To address these issues, the authors introduce the Concept-as-Tree (CaT) framework, which generates diverse positive and negative samples and thereby improves VLM performance in personalization tasks.
NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning
Neutral | Artificial Intelligence
NeuS-QA is a new neuro-symbolic pipeline designed to enhance Long Video Question Answering (LVQA) by addressing the limitations of traditional vision-language models (VLMs). While VLMs perform well with single images and short videos, they struggle with LVQA due to the need for complex temporal reasoning. NeuS-QA offers a training-free, plug-and-play solution that improves interpretability by ensuring only logic-verified segments are processed by the VLM, thus enhancing the model's ability to understand long-form video content.
FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
Positive | Artificial Intelligence
FastDriveVLA is a novel framework designed for efficient end-to-end autonomous driving through a reconstruction-based visual token pruning method. This approach addresses the high computational costs associated with long visual tokens in Vision-Language-Action (VLA) models. By focusing on retaining visual tokens that contain essential foreground information, FastDriveVLA aims to enhance decision-making in driving scenarios, marking a significant advancement in the application of VLA models in autonomous systems.
Transformers know more than they can tell -- Learning the Collatz sequence
Neutral | Artificial Intelligence
The study investigates the ability of transformer models to predict long steps in the Collatz sequence, a complex arithmetic function that maps odd integers to their successors. The accuracy of the models varies significantly depending on the base used for encoding, achieving up to 99.7% accuracy for bases 24 and 32, while dropping to 37% and 25% for bases 11 and 3. Despite these variations, all models exhibit a common learning pattern, accurately predicting inputs with similar residuals modulo 2^p.
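
For readers unfamiliar with the setup in the Collatz item, here is a minimal Python sketch of the ingredients the summary mentions: a map from odd integers to a later odd term, digit encoding in a chosen base, and residues modulo 2^p. The function names are hypothetical. The single odd-to-odd step shown here (compute 3n + 1, then halve until the result is odd) is an assumption for illustration; the paper's exact "long step" definition may differ.

```python
def to_base(n: int, base: int) -> list[int]:
    """Return the digits of a non-negative integer in `base`, most significant first."""
    if base < 2:
        raise ValueError("base must be at least 2")
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


def odd_collatz_step(n: int) -> int:
    """Map an odd positive integer to the next odd term of its Collatz sequence.

    Assumed definition (not taken from the paper): compute 3n + 1,
    then divide by 2 until the result is odd.
    """
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    m = 3 * n + 1
    while m % 2 == 0:
        m //= 2
    return m


def residue_mod_power_of_two(n: int, p: int) -> int:
    """Return n modulo 2**p, the quantity the summary groups inputs by."""
    return n % (2 ** p)


if __name__ == "__main__":
    n = 27
    successor = odd_collatz_step(n)
    # The same input/output pair looks different depending on the encoding base.
    for base in (3, 11, 24, 32):
        print(base, to_base(n, base), to_base(successor, base))
    print(residue_mod_power_of_two(n, 3))
```

The loop at the bottom only illustrates that the token sequence a model would see for one input/output pair depends on the base; it does not reproduce any of the reported accuracies.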