BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answer

arXiv — cs.CV · Thursday, November 20, 2025 at 5:00:00 AM
  • The introduction of BBox DocVQA marks a significant advance in Document Visual Question Answering by providing a dataset that enhances spatial reasoning and evidence localization. The dataset comprises 3.6K documents and 32K QA pairs, addressing a limitation of existing datasets, which lack fine-grained spatial grounding for answer evidence; a sketch of what such a grounded sample might look like follows this list.
  • This development is crucial for improving the capabilities of Vision Language Models, which are essential for multimodal document understanding and reasoning tasks. Enhanced spatial reasoning can lead to more accurate interpretations of visual documents.
  • The challenges faced by Vision Language Models in real-world document understanding underscore the need for benchmarks that tie answers to explicit visual evidence.
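
As an illustration of what bounding-box-grounded QA data enables, the sketch below pairs a question-answer record with an evidence region and scores a predicted box against it via intersection-over-union. The field names, coordinates, and 0.5 threshold are hypothetical assumptions, not the dataset's published schema or metric.

```python
# Hypothetical bbox-grounded DocVQA record and a simple evidence-localization
# check. All field names and values are illustrative, not the real schema.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) pixel boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

sample = {
    "doc_id": "invoice_0421",                 # hypothetical identifier
    "question": "What is the total amount due?",
    "answer": "$1,250.00",
    "evidence_bbox": (412, 980, 560, 1012),   # ground-truth evidence region
}

predicted_bbox = (405, 975, 558, 1015)        # a model's localized evidence
score = iou(sample["evidence_bbox"], predicted_bbox)
print(f"IoU = {score:.2f}, grounded: {score >= 0.5}")  # 0.5 is an assumed cutoff
```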
— via World Pulse Now AI Editorial System

Continue Reading
Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs
Neutral · Artificial Intelligence
A new framework named HazardForge has been introduced to enhance the evaluation of Vision Language Models (VLMs) in autonomous vehicles and mobile systems, addressing the inadequacy of existing benchmarks in simulating diverse hazardous scenarios. The framework includes MovSafeBench, a benchmark of 7,254 images with corresponding question-answer pairs spanning 13 object categories.
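
A minimal sketch of how a benchmark of this shape might be scored, assuming each item carries an image path, question, gold answer, and object category; the evaluate() helper, record fields, and exact-match scoring are illustrative assumptions, not the paper's protocol.

```python
# Per-category accuracy over (image, question, answer, category) items.
# answer_question is any callable wrapping a VLM; its signature is assumed.
from collections import defaultdict

def evaluate(items, answer_question):
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        pred = answer_question(item["image_path"], item["question"])
        total[item["category"]] += 1
        # Exact-match scoring, case-insensitive; real benchmarks may differ.
        if pred.strip().lower() == item["answer"].strip().lower():
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```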
Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling
Positive · Artificial Intelligence
A new study has introduced a subject decoupling framework for zero-shot distracted driver detection using Vision Language Models (VLMs). This approach aims to improve the accuracy of detecting driver distractions by separating appearance factors from behavioral cues, addressing a significant limitation in existing VLM-based systems.
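
The decoupling idea can be sketched as a two-stage query: first elicit appearance-neutral behavioral cues, then classify from those cues alone. The prompts, label set, and the vlm() hook below are hypothetical placeholders, not the paper's actual framework.

```python
# Two-stage prompting sketch: separate behavioral cues from appearance
# factors before zero-shot classification. vlm(image, prompt) -> str is
# a placeholder for any image-text model call.

DISTRACTIONS = ["safe driving", "texting", "talking on phone",
                "drinking", "reaching behind"]  # assumed label set

def classify_driver(image, vlm):
    # Stage 1: describe behavior while suppressing appearance factors.
    behavior = vlm(image,
                   "Describe only the driver's hand positions, gaze, and "
                   "posture. Ignore clothing, identity, and lighting.")
    # Stage 2: zero-shot classification from the decoupled description.
    prompt = ("Given these behavioral cues: " + behavior +
              "\nWhich label best matches: " + ", ".join(DISTRACTIONS) + "?")
    return vlm(image, prompt)
```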
Decentralized Autoregressive Generation
Neutral · Artificial Intelligence
A theoretical analysis of decentralization in autoregressive generation has been presented, introducing the Decentralized Discrete Flow Matching objective, which expresses the probability-generating velocity as a linear combination of expert flows. Experiments demonstrate equivalence between decentralized and centralized training for multimodal language models, comparing LLaVA and InternVL 2.5-1B.
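
The stated objective, in rough form: the generating velocity is a weighted combination of expert flows, v(x, t) = sum_k w_k * v_k(x, t). A toy numerical sketch follows, assuming continuous vector fields for readability even though the paper's setting is discrete; the experts and weights are placeholders.

```python
# Mixture-of-experts velocity field: v(x, t) = sum_k w_k * v_k(x, t).
# Toy continuous stand-in for the discrete-state objective described above.
import numpy as np

def combined_velocity(x, t, expert_velocities, weights):
    """Weighted combination of per-expert velocity fields at state x, time t."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights form a convex combination"
    return sum(w * v(x, t) for w, v in zip(weights, expert_velocities))

# Two toy experts over a 4-dimensional state.
experts = [lambda x, t: -x,                      # contraction toward origin
           lambda x, t: t * np.ones_like(x)]     # time-scaled drift
x0 = np.array([1.0, -0.5, 0.2, 0.0])
print(combined_velocity(x0, 0.3, experts, weights=[0.6, 0.4]))
```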
