DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • The introduction of DocPTBench marks a significant advancement in the benchmarking of end-to-end photographed document parsing and translation, addressing the limitations of existing benchmarks that primarily focus on pristine scanned documents. This new benchmark includes over 1,300 high-resolution photographed documents and eight translation scenarios, with human-verified annotations for improved accuracy.
  • This development is crucial as it highlights the performance decline of popular Multimodal Large Language Models (MLLMs) when transitioning from digital-born to photographed documents, emphasizing the need for more robust evaluation methods in real-world conditions.
  • The emergence of DocPTBench reflects a broader trend in AI research, where there is a growing recognition of the challenges posed by real-world data, including geometric distortions and photometric variations. This aligns with ongoing efforts to enhance the robustness of MLLMs across various applications, including video question answering and social interaction assessments.
— via World Pulse Now AI Editorial System
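The summary does not specify how DocPTBench scores parsing quality. As an illustrative sketch only (normalized character-level edit distance is a common document-parsing metric; it is assumed here, not taken from the benchmark), a prediction could be compared against a human-verified reference like this:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance between a parsed prediction and a reference
    transcript, normalized by the longer string's length (0.0 = exact match)."""
    m, n = len(pred), len(ref)
    # dist[i][j] = minimum edits to turn pred[:i] into ref[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
                dist[i - 1][j - 1] + cost,  # substitution / match
            )
    return dist[m][n] / max(m, n, 1)
```

A lower score is better; a model whose output drifts on photographed pages (warped lines, uneven lighting) would show a higher distance than on clean scans.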

Continue Reading
Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
Positive · Artificial Intelligence
A new algorithm named MM-Det++ has been proposed to enhance the detection of videos generated by diffusion models, addressing the growing concerns over synthetic media and information security. This algorithm integrates a Spatio-Temporal branch utilizing a Frame-Centric Vision Transformer and a Multimodal branch for improved detection capabilities.
BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models
Neutral · Artificial Intelligence
The introduction of BackdoorVLM marks a significant advancement in the evaluation of backdoor attacks on vision-language models (VLMs), addressing a critical gap in the understanding of these threats within multimodal machine learning systems. This benchmark categorizes backdoor threats into five distinct types, including targeted refusal and perceptual hijack, providing a structured approach to analyze their impact on tasks like image captioning and visual question answering.
MINDiff: Mask-Integrated Negative Attention for Controlling Overfitting in Text-to-Image Personalization
Positive · Artificial Intelligence
A new method called Mask-Integrated Negative Attention Diffusion (MINDiff) has been proposed to tackle overfitting in text-to-image personalization, particularly when learning from limited images. This approach introduces negative attention to suppress subject influence in irrelevant areas, enhancing semantic control and text alignment during inference. Users can adjust a scale parameter to balance subject fidelity and text alignment.
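The mechanism described above, suppressing a subject's influence outside its relevant region with a user-tunable scale, can be sketched as a toy (the function, its signature, and the exponential damping are illustrative assumptions, not MINDiff's actual formulation):

```python
import numpy as np

def apply_negative_attention(attn: np.ndarray, subject_mask: np.ndarray,
                             scale: float) -> np.ndarray:
    """Toy sketch of mask-gated attention suppression (NOT the MINDiff paper's
    exact method). attn: attention weights over spatial positions, shape (H, W),
    summing to 1. subject_mask: 1 inside the subject region, 0 elsewhere.
    scale: 0 leaves attn unchanged; larger values suppress the subject's
    influence in off-mask (irrelevant) areas more strongly."""
    suppression = scale * (1.0 - subject_mask)   # penalty only outside the mask
    damped = attn * np.exp(-suppression)         # exponentially damp off-mask weight
    return damped / damped.sum()                 # renormalize to a distribution
```

The scale parameter plays the role the summary describes: at 0 the subject dominates everywhere (high fidelity, weak text alignment), while larger values confine it to the masked region.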
Multi-speaker Attention Alignment for Multimodal Social Interaction
Positive · Artificial Intelligence
A new method for enhancing social interaction understanding in videos has been proposed, focusing on the alignment of verbal and non-verbal cues in multi-speaker scenarios. This approach addresses the limitations observed in existing Multimodal Large Language Models (MLLMs), which struggle with cross-modal attention consistency in such contexts.
Matching-Based Few-Shot Semantic Segmentation Models Are Interpretable by Design
Positive · Artificial Intelligence
A new study has introduced an innovative method for interpreting Few-Shot Semantic Segmentation (FSS) models, which are designed to segment novel classes with minimal labeled examples. The Affinity Explainer approach utilizes structural properties of matching-based FSS models to generate attribution maps, highlighting the contribution of support images to query segmentation predictions.
NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
Positive · Artificial Intelligence
The introduction of Neighborhood Attention Filtering (NAF) represents a significant advancement in the field of Vision Foundation Models (VFMs), allowing for zero-shot feature upsampling without the need for retraining. This innovative method utilizes Cross-Scale Neighborhood Attention and Rotary Position Embeddings to adaptively learn spatial and content weights from high-resolution images, outperforming existing VFM-specific upsamplers across various tasks.
PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese
Positive · Artificial Intelligence
The PoETa v2 benchmark has been introduced as the most extensive evaluation of Large Language Models (LLMs) for the Portuguese language, comprising over 40 tasks. This initiative aims to systematically assess more than 20 models, highlighting performance variations influenced by computational resources and language-specific adaptations. The benchmark is accessible on GitHub.
Importance-Weighted Non-IID Sampling for Flow Matching Models
Positive · Artificial Intelligence
A new framework for importance-weighted non-IID sampling has been proposed to enhance flow-matching models, which are crucial for accurately representing complex distributions. This method addresses the challenge of estimating expectations from limited samples, particularly in scenarios where rare outcomes significantly influence results.
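The core idea, estimating an expectation from samples that are not drawn from the target distribution by reweighting them, is standard importance sampling. The sketch below shows the classic self-normalized estimator (a textbook baseline, not the paper's non-IID framework):

```python
import math
import random

def snis_estimate(f, target_logpdf, proposal_logpdf, proposal_sample, n=10000):
    """Self-normalized importance sampling: estimate E_p[f(x)] using samples
    from a proposal q, weighting each by the density ratio p(x)/q(x).
    Useful when rare-but-important outcomes are easier to draw under q."""
    xs = [proposal_sample() for _ in range(n)]
    logw = [target_logpdf(x) - proposal_logpdf(x) for x in xs]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]  # subtract max for numerical stability
    return sum(wi * f(x) for wi, x in zip(w, xs)) / sum(w)

# Example: estimate E[x] = 0 under N(0, 1), while sampling from the
# shifted proposal N(1, 1). Log-pdfs omit constants, which cancel in the ratio.
random.seed(0)
est = snis_estimate(
    f=lambda x: x,
    target_logpdf=lambda x: -0.5 * x * x,
    proposal_logpdf=lambda x: -0.5 * (x - 1.0) ** 2,
    proposal_sample=lambda: random.gauss(1.0, 1.0),
)
```

The estimator's variance blows up when the proposal covers the target poorly, which is exactly the limited-sample, rare-outcome regime the summarized work targets.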