BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answering

arXiv — cs.CV · Thursday, November 20, 2025 at 5:00:00 AM
  • The introduction of BBox DocVQA marks a significant advancement in Document Visual Question Answering by providing a dataset that strengthens spatial reasoning and evidence localization. The dataset comprises 3.6K documents and 32K QA pairs, addressing the lack of fine-grained spatial grounding in existing datasets (a sketch of what such a grounded record might look like follows this article).
  • This development is crucial for improving the capabilities of Vision Language Models, which are essential for multimodal document understanding and reasoning tasks. Enhanced spatial reasoning can lead to more accurate interpretations of visual documents.
  • The challenges faced by Vision Language Models in real-world document understanding underscore the need for datasets that ground answers in explicit visual evidence.
— via World Pulse Now AI Editorial System
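The summary does not show the dataset's actual schema, but a minimal, hypothetical Python sketch of a bounding-box-grounded QA record, together with the standard IoU score commonly used to judge evidence localization, might look like this (all field names are assumptions, not the released format):

```python
from dataclasses import dataclass

@dataclass
class BBoxQASample:
    """Hypothetical record for a bounding-box-grounded DocVQA pair.

    Field names are illustrative; the released dataset may use a
    different schema."""
    document_id: str
    page: int
    question: str
    answer: str
    evidence_bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1)

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes, a common
    metric for scoring a predicted evidence region against ground truth."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0
```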


Continue Reading
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations
Positive · Artificial Intelligence
A novel framework called Topic-level Preference Rewriting (TPR) has been introduced to systematically optimize reward gaps in Vision Language Models (VLMs), addressing the challenges of hallucinations during data curation. This method focuses on selectively replacing semantic topics within VLM responses to enhance the accuracy of generated outputs.
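The summary does not specify TPR's exact procedure, but the general shape of reward-gap optimization over preference pairs can be sketched as follows. Here `reward_model`, the candidate-selection logic, and the `min_gap` threshold are all illustrative assumptions, not the paper's method:

```python
def reward_gap(reward_model, image, preferred, rewritten):
    """Reward gap between a preferred response and a variant whose
    hallucinated topic was rewritten. `reward_model` is a hypothetical
    scorer with signature (image, text) -> float."""
    return reward_model(image, preferred) - reward_model(image, rewritten)

def select_pairs(reward_model, image, preferred, candidates, min_gap=0.1):
    """Keep only rewritten negatives that open a sufficiently large reward
    gap, so the curated preference data cleanly separates the responses."""
    return [c for c in candidates
            if reward_gap(reward_model, image, preferred, c) >= min_gap]
```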
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Neutral · Artificial Intelligence
A novel vulnerability in vision-language models (VLMs) has been identified through the introduction of IAG, a method that enables multi-target backdoor attacks on VLM-based visual grounding systems. This technique utilizes dynamically generated, input-aware triggers that are text-guided, allowing for imperceptible manipulation of visual inputs while maintaining normal performance on benign samples.
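At a conceptual level, the described manipulation amounts to blending a bounded, input-conditioned perturbation into the image. The sketch below shows only that generic additive form, common across the backdoor literature; IAG's actual text-guided trigger generator is not reproduced here and `trigger` is a placeholder:

```python
import numpy as np

def apply_trigger(image: np.ndarray, trigger: np.ndarray, eps: float = 8 / 255):
    """Generic bounded additive perturbation: blend a (possibly
    input-conditioned) trigger into an image in [0, 1], capping its
    magnitude so the change stays visually imperceptible."""
    perturbed = image + eps * np.clip(trigger, -1.0, 1.0)
    return np.clip(perturbed, 0.0, 1.0)
```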
PsychiatryBench: A Multi-Task Benchmark for LLMs in Psychiatry
Positive · Artificial Intelligence
PsychiatryBench has been introduced as a comprehensive benchmark for evaluating large language models (LLMs) in the field of psychiatry, consisting of 5,188 expert-annotated items across eleven distinct question-answering tasks. This initiative aims to enhance diagnostic reasoning, treatment planning, and clinical management in psychiatric practice.
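A multi-task benchmark of this shape is typically consumed with a per-task scoring loop. The sketch below assumes a simple item schema (`task`, `question`, `answer`) and exact-match scoring, neither of which is confirmed by the summary:

```python
from collections import defaultdict

def evaluate(model, benchmark):
    """Per-task accuracy over a multi-task QA benchmark.

    `model` is any callable mapping a question string to an answer
    string; the item schema here is an assumption about the release."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in benchmark:  # e.g. 5,188 expert-annotated items
        pred = model(item["question"])
        total[item["task"]] += 1
        correct[item["task"]] += int(pred.strip() == item["answer"].strip())
    return {task: correct[task] / total[task] for task in total}
```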
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
Positive · Artificial Intelligence
The paper introduces TRANSPORTER, a model-independent approach designed to enhance video generation by transferring visual semantics from Vision Language Models (VLMs). This method addresses the challenge of understanding how VLMs derive their predictions, particularly in complex scenes with various objects and actions. TRANSPORTER generates videos that reflect changes in captions across diverse attributes and contexts.
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
Neutral · Artificial Intelligence
Recent research indicates that Vision Language Models (VLMs) often exhibit biases learned during training, particularly when tasked with specific queries about visual properties, such as counting objects in images. A new synthetic benchmark dataset and evaluation framework have been developed to assess how counting performance varies with different image and prompt characteristics.
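A synthetic counting probe of the kind described can be generated procedurally. The following minimal sketch, which renders N circles and pairs the image with a counting question, is an assumption about the general setup rather than the paper's actual generator:

```python
import random
from PIL import Image, ImageDraw

def make_counting_sample(n_max=10, size=224):
    """Render n random circles and return (image, question, answer).

    Overlapping circles are not prevented in this minimal sketch; a real
    generator would reject overlaps to keep the ground truth unambiguous."""
    n = random.randint(1, n_max)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for _ in range(n):
        x, y, r = random.randint(10, size - 10), random.randint(10, size - 10), 8
        draw.ellipse((x - r, y - r, x + r, y + r), fill="black")
    return img, "How many circles are in the image?", str(n)
```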
BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
Positive · Artificial Intelligence
A new dataset named BOP-ASK has been introduced to enhance object-interaction reasoning in Vision Language Models (VLMs). This dataset addresses the limitations of existing benchmarks that focus on high-level spatial relationships while neglecting fine-grained spatial understanding necessary for real-world applications. BOP-ASK includes over 150,000 images and 33 million questions, derived from detailed 6D object poses and annotations.
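Questions derived from object poses can be as simple as comparing coordinates. The sketch below turns two hypothetical camera-frame positions into a left/right QA pair; BOP-ASK's real generation pipeline and annotation schema are not shown in the summary:

```python
import numpy as np

def relation_question(name_a, pos_a, name_b, pos_b):
    """Derive a left/right QA pair from object positions in camera
    coordinates, where x grows rightward. In BOP-ASK the positions come
    from 6D pose annotations; this schema is an illustrative assumption."""
    question = f"Is the {name_a} to the left of the {name_b}?"
    answer = "yes" if pos_a[0] < pos_b[0] else "no"
    return question, answer

q, a = relation_question("mug", np.array([0.1, 0.0, 0.5]),
                         "bowl", np.array([0.3, 0.0, 0.6]))
```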