BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answering

arXiv — cs.CV · Thursday, November 20, 2025 at 5:00:00 AM
  • The introduction of BBox DocVQA marks a significant advancement in Document Visual Question Answering by providing a dataset that strengthens spatial reasoning and evidence localization. The dataset comprises 3.6K documents and 32K QA pairs, addressing the lack of fine-grained spatial grounding in existing datasets (a sketch of what such a grounded record might look like follows this article).
  • This development is crucial for improving the capabilities of Vision Language Models, which are essential for multimodal document understanding and reasoning tasks. Enhanced spatial reasoning can lead to more accurate interpretations of visual documents.
  • The challenges faced by Vision Language Models in real-world document understanding underscore the need for datasets that ground answers in explicit visual evidence.
— via World Pulse Now AI Editorial System
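The summary does not show the dataset's actual schema, but a minimal, hypothetical Python sketch of a bounding-box-grounded QA record, together with the standard IoU score commonly used to judge evidence localization, might look like this (all field names are assumptions, not the released format):

```python
from dataclasses import dataclass

@dataclass
class BBoxQASample:
    """Hypothetical record for a bounding-box-grounded DocVQA pair.

    Field names are illustrative; the released dataset may use a
    different schema."""
    document_id: str
    page: int
    question: str
    answer: str
    evidence_bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1)

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes, a common
    metric for scoring a predicted evidence region against ground truth."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0
```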


Continue Reading
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations
Positive · Artificial Intelligence
A novel framework called Topic-level Preference Rewriting (TPR) has been introduced to systematically optimize reward gaps in Vision Language Models (VLMs), addressing the challenges of hallucinations during data curation. This method focuses on selectively replacing semantic topics within VLM responses to enhance the accuracy of generated outputs.
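The summary does not specify TPR's exact procedure, but the general shape of reward-gap optimization over preference pairs can be sketched as follows. Here `reward_model`, the candidate-selection logic, and the `min_gap` threshold are all illustrative assumptions, not the paper's method:

```python
def reward_gap(reward_model, image, preferred, rewritten):
    """Reward gap between a preferred response and a variant whose
    hallucinated topic was rewritten. `reward_model` is a hypothetical
    scorer with signature (image, text) -> float."""
    return reward_model(image, preferred) - reward_model(image, rewritten)

def select_pairs(reward_model, image, preferred, candidates, min_gap=0.1):
    """Keep only rewritten negatives that open a sufficiently large reward
    gap, so the curated preference data cleanly separates the responses."""
    return [c for c in candidates
            if reward_gap(reward_model, image, preferred, c) >= min_gap]
```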
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Neutral · Artificial Intelligence
A novel vulnerability in vision-language models (VLMs) has been identified through the introduction of IAG, a method that enables multi-target backdoor attacks on VLM-based visual grounding systems. This technique utilizes dynamically generated, input-aware triggers that are text-guided, allowing for imperceptible manipulation of visual inputs while maintaining normal performance on benign samples.
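At a conceptual level, the described manipulation amounts to blending a bounded, input-conditioned perturbation into the image. The sketch below shows only that generic additive form, common across the backdoor literature; IAG's actual text-guided trigger generator is not reproduced here and `trigger` is a placeholder:

```python
import numpy as np

def apply_trigger(image: np.ndarray, trigger: np.ndarray, eps: float = 8 / 255):
    """Generic bounded additive perturbation: blend a (possibly
    input-conditioned) trigger into an image in [0, 1], capping its
    magnitude so the change stays visually imperceptible."""
    perturbed = image + eps * np.clip(trigger, -1.0, 1.0)
    return np.clip(perturbed, 0.0, 1.0)
```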
PsychiatryBench: A Multi-Task Benchmark for LLMs in Psychiatry
Positive · Artificial Intelligence
PsychiatryBench has been introduced as a comprehensive benchmark for evaluating large language models (LLMs) in the field of psychiatry, consisting of 5,188 expert-annotated items across eleven distinct question-answering tasks. This initiative aims to enhance diagnostic reasoning, treatment planning, and clinical management in psychiatric practice.
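A multi-task benchmark of this shape is typically consumed with a per-task scoring loop. The sketch below assumes a simple item schema (`task`, `question`, `answer`) and exact-match scoring, neither of which is confirmed by the summary:

```python
from collections import defaultdict

def evaluate(model, benchmark):
    """Per-task accuracy over a multi-task QA benchmark.

    `model` is any callable mapping a question string to an answer
    string; the item schema here is an assumption about the release."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in benchmark:  # e.g. 5,188 expert-annotated items
        pred = model(item["question"])
        total[item["task"]] += 1
        correct[item["task"]] += int(pred.strip() == item["answer"].strip())
    return {task: correct[task] / total[task] for task in total}
```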
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
Positive · Artificial Intelligence
The paper introduces TRANSPORTER, a model-independent approach designed to enhance video generation by transferring visual semantics from Vision Language Models (VLMs). This method addresses the challenge of understanding how VLMs derive their predictions, particularly in complex scenes with various objects and actions. TRANSPORTER generates videos that reflect changes in captions across diverse attributes and contexts.
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
Neutral · Artificial Intelligence
Recent research indicates that Vision Language Models (VLMs) often exhibit biases learned during training, particularly when tasked with specific queries about visual properties, such as counting objects in images. A new synthetic benchmark dataset and evaluation framework have been developed to assess how counting performance varies with different image and prompt characteristics.
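A synthetic counting probe of the kind described can be generated procedurally. The following minimal sketch, which renders N circles and pairs the image with a counting question, is an assumption about the general setup rather than the paper's actual generator:

```python
import random
from PIL import Image, ImageDraw

def make_counting_sample(n_max=10, size=224):
    """Render n random circles and return (image, question, answer).

    Overlapping circles are not prevented in this minimal sketch; a real
    generator would reject overlaps to keep the ground truth unambiguous."""
    n = random.randint(1, n_max)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for _ in range(n):
        x, y, r = random.randint(10, size - 10), random.randint(10, size - 10), 8
        draw.ellipse((x - r, y - r, x + r, y + r), fill="black")
    return img, "How many circles are in the image?", str(n)
```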
BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
Positive · Artificial Intelligence
A new dataset named BOP-ASK has been introduced to enhance object-interaction reasoning in Vision Language Models (VLMs). This dataset addresses the limitations of existing benchmarks that focus on high-level spatial relationships while neglecting fine-grained spatial understanding necessary for real-world applications. BOP-ASK includes over 150,000 images and 33 million questions, derived from detailed 6D object poses and annotations.
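Questions derived from object poses can be as simple as comparing coordinates. The sketch below turns two hypothetical camera-frame positions into a left/right QA pair; BOP-ASK's real generation pipeline and annotation schema are not shown in the summary:

```python
import numpy as np

def relation_question(name_a, pos_a, name_b, pos_b):
    """Derive a left/right QA pair from object positions in camera
    coordinates, where x grows rightward. In BOP-ASK the positions come
    from 6D pose annotations; this schema is an illustrative assumption."""
    question = f"Is the {name_a} to the left of the {name_b}?"
    answer = "yes" if pos_a[0] < pos_b[0] else "no"
    return question, answer

q, a = relation_question("mug", np.array([0.1, 0.0, 0.5]),
                         "bowl", np.array([0.3, 0.0, 0.6]))
```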