DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • The introduction of DocPTBench marks a significant advancement in the benchmarking of end-to-end photographed document parsing and translation, addressing the limitations of existing benchmarks that primarily focus on pristine scanned documents. This new benchmark includes over 1,300 high-resolution photographed documents and eight translation scenarios, with human-verified annotations for improved accuracy.
  • This development is crucial as it highlights the performance decline of popular Multimodal Large Language Models (MLLMs) when transitioning from digital-born to photographed documents, emphasizing the need for more robust evaluation methods in real-world conditions.
  • The emergence of DocPTBench reflects a broader trend in AI research, where there is a growing recognition of the challenges posed by real-world data, including geometric distortions and photometric variations. This aligns with ongoing efforts to enhance the robustness of MLLMs across various applications, including video question answering and social interaction assessments.
— via World Pulse Now AI Editorial System
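The summary does not specify how DocPTBench scores parsing quality. As an illustrative sketch only (normalized character-level edit distance is a common document-parsing metric; it is assumed here, not taken from the benchmark), a prediction could be compared against a human-verified reference like this:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance between a parsed prediction and a reference
    transcript, normalized by the longer string's length (0.0 = exact match)."""
    m, n = len(pred), len(ref)
    # dist[i][j] = minimum edits to turn pred[:i] into ref[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
                dist[i - 1][j - 1] + cost,  # substitution / match
            )
    return dist[m][n] / max(m, n, 1)
```

A lower score is better; a model whose output drifts on photographed pages (warped lines, uneven lighting) would show a higher distance than on clean scans.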

Continue Reading
Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
Positive · Artificial Intelligence
A new algorithm named MM-Det++ has been proposed to enhance the detection of videos generated by diffusion models, addressing the growing concerns over synthetic media and information security. This algorithm integrates a Spatio-Temporal branch utilizing a Frame-Centric Vision Transformer and a Multimodal branch for improved detection capabilities.
BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models
Neutral · Artificial Intelligence
The introduction of BackdoorVLM marks a significant advancement in the evaluation of backdoor attacks on vision-language models (VLMs), addressing a critical gap in the understanding of these threats within multimodal machine learning systems. This benchmark categorizes backdoor threats into five distinct types, including targeted refusal and perceptual hijack, providing a structured approach to analyze their impact on tasks like image captioning and visual question answering.
MINDiff: Mask-Integrated Negative Attention for Controlling Overfitting in Text-to-Image Personalization
Positive · Artificial Intelligence
A new method called Mask-Integrated Negative Attention Diffusion (MINDiff) has been proposed to tackle overfitting in text-to-image personalization, particularly when learning from limited images. This approach introduces negative attention to suppress subject influence in irrelevant areas, enhancing semantic control and text alignment during inference. Users can adjust a scale parameter to balance subject fidelity and text alignment.
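The mechanism described above, suppressing a subject's influence outside its relevant region with a user-tunable scale, can be sketched as a toy (the function, its signature, and the exponential damping are illustrative assumptions, not MINDiff's actual formulation):

```python
import numpy as np

def apply_negative_attention(attn: np.ndarray, subject_mask: np.ndarray,
                             scale: float) -> np.ndarray:
    """Toy sketch of mask-gated attention suppression (NOT the MINDiff paper's
    exact method). attn: attention weights over spatial positions, shape (H, W),
    summing to 1. subject_mask: 1 inside the subject region, 0 elsewhere.
    scale: 0 leaves attn unchanged; larger values suppress the subject's
    influence in off-mask (irrelevant) areas more strongly."""
    suppression = scale * (1.0 - subject_mask)   # penalty only outside the mask
    damped = attn * np.exp(-suppression)         # exponentially damp off-mask weight
    return damped / damped.sum()                 # renormalize to a distribution
```

The scale parameter plays the role the summary describes: at 0 the subject dominates everywhere (high fidelity, weak text alignment), while larger values confine it to the masked region.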
Multi-speaker Attention Alignment for Multimodal Social Interaction
Positive · Artificial Intelligence
A new method for enhancing social interaction understanding in videos has been proposed, focusing on the alignment of verbal and non-verbal cues in multi-speaker scenarios. This approach addresses the limitations observed in existing Multimodal Large Language Models (MLLMs), which struggle with cross-modal attention consistency in such contexts.
Matching-Based Few-Shot Semantic Segmentation Models Are Interpretable by Design
Positive · Artificial Intelligence
A new study has introduced an innovative method for interpreting Few-Shot Semantic Segmentation (FSS) models, which are designed to segment novel classes with minimal labeled examples. The Affinity Explainer approach utilizes structural properties of matching-based FSS models to generate attribution maps, highlighting the contribution of support images to query segmentation predictions.
NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
Positive · Artificial Intelligence
The introduction of Neighborhood Attention Filtering (NAF) represents a significant advancement in the field of Vision Foundation Models (VFMs), allowing for zero-shot feature upsampling without the need for retraining. This innovative method utilizes Cross-Scale Neighborhood Attention and Rotary Position Embeddings to adaptively learn spatial and content weights from high-resolution images, outperforming existing VFM-specific upsamplers across various tasks.
PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese
Positive · Artificial Intelligence
The PoETa v2 benchmark has been introduced as the most extensive evaluation of Large Language Models (LLMs) for the Portuguese language, comprising over 40 tasks. This initiative aims to systematically assess more than 20 models, highlighting performance variations influenced by computational resources and language-specific adaptations. The benchmark is accessible on GitHub.
Importance-Weighted Non-IID Sampling for Flow Matching Models
Positive · Artificial Intelligence
A new framework for importance-weighted non-IID sampling has been proposed to enhance flow-matching models, which are crucial for accurately representing complex distributions. This method addresses the challenge of estimating expectations from limited samples, particularly in scenarios where rare outcomes significantly influence results.
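The core idea, estimating an expectation from samples that are not drawn from the target distribution by reweighting them, is standard importance sampling. The sketch below shows the classic self-normalized estimator (a textbook baseline, not the paper's non-IID framework):

```python
import math
import random

def snis_estimate(f, target_logpdf, proposal_logpdf, proposal_sample, n=10000):
    """Self-normalized importance sampling: estimate E_p[f(x)] using samples
    from a proposal q, weighting each by the density ratio p(x)/q(x).
    Useful when rare-but-important outcomes are easier to draw under q."""
    xs = [proposal_sample() for _ in range(n)]
    logw = [target_logpdf(x) - proposal_logpdf(x) for x in xs]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]  # subtract max for numerical stability
    return sum(wi * f(x) for wi, x in zip(w, xs)) / sum(w)

# Example: estimate E[x] = 0 under N(0, 1), while sampling from the
# shifted proposal N(1, 1). Log-pdfs omit constants, which cancel in the ratio.
random.seed(0)
est = snis_estimate(
    f=lambda x: x,
    target_logpdf=lambda x: -0.5 * x * x,
    proposal_logpdf=lambda x: -0.5 * (x - 1.0) ** 2,
    proposal_sample=lambda: random.gauss(1.0, 1.0),
)
```

The estimator's variance blows up when the proposal covers the target poorly, which is exactly the limited-sample, rare-outcome regime the summarized work targets.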