DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding

arXiv — cs.CL · Monday, November 17, 2025 at 5:00:00 AM
  • DocLens has been introduced as a solution to the challenges Vision-Language Models (VLMs) face when understanding long visual documents.
  • This development signifies a notable advance in AI capabilities, with DocLens working in conjunction with Gemini.
— via World Pulse Now AI Editorial System


Recommended Readings
Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM
Positive · Artificial Intelligence
The paper introduces Risk Semantic Distillation (RSD), a novel framework aimed at enhancing end-to-end autonomous driving (AD) systems. While current AD systems perform well in complex scenarios, they struggle with generalization to unseen situations. RSD leverages Vision-Language Models (VLMs) to improve training efficiency and consistency in trajectory planning, addressing challenges posed by hybrid AD systems that utilize multiple planning approaches. This advancement is crucial for the future of autonomous driving technology.
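The blurb only outlines the idea, so the sketch below is a rough illustration of how a VLM-derived risk embedding might be distilled into a trajectory planner. The module names, the cosine-alignment term, and the loss weighting are assumptions made for illustration, not RSD's actual formulation.

```python
# Illustrative sketch of risk-semantic distillation (hypothetical interfaces,
# not the paper's implementation): a frozen VLM teacher supplies a "risk
# semantics" embedding that the student's scene features are aligned to,
# alongside a standard trajectory-imitation loss.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, scene_images, ego_state, expert_traj, alpha=0.5):
    """One training step combining trajectory imitation with risk-feature alignment."""
    with torch.no_grad():
        risk_embedding = teacher(scene_images)        # frozen VLM risk encoder (assumed callable)
    pred_traj, scene_feat = student(scene_images, ego_state)

    planning_loss = F.l1_loss(pred_traj, expert_traj)  # imitate expert trajectories
    distill_loss = 1 - F.cosine_similarity(scene_feat, risk_embedding, dim=-1).mean()
    return planning_loss + alpha * distill_loss
```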
Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
Positive · Artificial Intelligence
Agentic Video Intelligence (AVI) is a proposed framework designed to enhance video understanding by integrating complex reasoning with visual recognition. Unlike traditional Vision-Language Models (VLMs) that process videos in a single-pass manner, AVI introduces a three-phase reasoning process: Retrieve-Perceive-Review. This approach allows for both global exploration and focused local analysis. Additionally, AVI utilizes a structured video knowledge base organized through entity graphs, aiming to improve video comprehension without extensive training.
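As a minimal sketch of how such a Retrieve-Perceive-Review loop over an entity-graph knowledge base might look, consider the code below. The `entity_graph` and `vlm` interfaces, the round limit, and the stopping criterion are hypothetical; the blurb does not specify AVI's actual APIs.

```python
# Hypothetical Retrieve-Perceive-Review loop (interfaces are assumptions).

def answer_query(query, entity_graph, vlm, max_rounds=3):
    """Iteratively retrieve candidate clips, perceive them with a VLM,
    and review whether the gathered evidence suffices to answer the query."""
    evidence = []
    for _ in range(max_rounds):
        clips = entity_graph.retrieve(query, exclude=evidence)    # global exploration
        for clip in clips:
            evidence.append(vlm.describe(clip, focus=query))      # focused local analysis
        verdict = vlm.review(query, evidence)                     # review phase: is this enough?
        if verdict.sufficient:
            return verdict.answer
    return vlm.answer(query, evidence)  # best-effort answer after the round limit
```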
MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation
Positive · Artificial Intelligence
MedGEN-Bench is a newly introduced multimodal benchmark aimed at enhancing medical AI research, particularly in the context of Vision-Language Models (VLMs). It addresses significant limitations in existing medical visual benchmarks, which often rely on ambiguous queries and oversimplified diagnostic reasoning. MedGEN-Bench includes 6,422 expert-validated image-text pairs across six imaging modalities and 16 clinical tasks, structured to improve the integration of AI-generated images into clinical workflows.
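For illustration only, a hypothetical record layout for one benchmark entry is sketched below; the field names are assumptions and do not reflect MedGEN-Bench's actual release format.

```python
# Hypothetical schema for an open-ended multimodal medical benchmark entry
# (field names are illustrative assumptions, not MedGEN-Bench's format).
from dataclasses import dataclass
from typing import Optional

@dataclass
class MedGenRecord:
    case_id: str
    modality: str                    # one of the six imaging modalities, e.g. "CT"
    task: str                        # one of the 16 clinical tasks
    image_path: str                  # input image
    instruction: str                 # open-ended, contextually entangled query
    reference_text: str              # expert-validated textual answer
    reference_image: Optional[str] = None  # expected generated image, if the task requires one
```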
Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling
Positive · Artificial Intelligence
The article discusses the limitations of multimedia documents, which are often distributed in static raster formats, hindering their editability. To address this, a new framework called SliDer is introduced, utilizing Vision-Language Models (VLMs) to convert slide images into editable Scalable Vector Graphics (SVG) representations. This approach aims to preserve the semantic structure of documents, overcoming the shortcomings of traditional raster-vectorization methods that fail to maintain the distinction between image and text elements.
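As a rough sketch of the derendering idea (not SliDer's actual method), a VLM could be prompted to emit structured element descriptions that are then serialized to SVG so text stays editable. The prompt, the JSON schema, and the `vlm.generate` call below are assumptions.

```python
# Illustrative slide-image-to-SVG pipeline under assumed VLM and JSON interfaces.
import json
import xml.etree.ElementTree as ET

def slide_to_svg(image_bytes, vlm, width=1280, height=720):
    prompt = ("List every text box and image region on this slide as JSON objects "
              "with fields: type ('text'|'image'), x, y, w, h, and content.")
    elements = json.loads(vlm.generate(prompt, image=image_bytes))  # assumed VLM call

    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width=str(width), height=str(height))
    for el in elements:
        if el["type"] == "text":
            node = ET.SubElement(svg, "text", x=str(el["x"]), y=str(el["y"]))
            node.text = el["content"]            # text remains editable, unlike raster tracing
        else:
            ET.SubElement(svg, "image", x=str(el["x"]), y=str(el["y"]),
                          width=str(el["w"]), height=str(el["h"]),
                          href=el["content"])    # content holds an image reference
    return ET.tostring(svg, encoding="unicode")
```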
From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
Positive · Artificial Intelligence
The article discusses advancements in fine-tuning Vision-Language Models (VLMs) to enhance spatial reasoning. Traditional methods often suffer from biases and errors due to imbalanced data collection and annotation from real-world scenes. To overcome these issues, the authors propose a redesigned fine-tuning process that includes controlled data generation and annotation, ensuring quality and balance. This approach involves comprehensive sampling of object attributes and aims to improve the transferability of VLMs to real-world applications.
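A minimal sketch of what controlled, balanced data generation for spatial relations could look like is given below; the attribute lists, question template, and sampling scheme are illustrative assumptions rather than the authors' pipeline.

```python
# Illustrative balanced sampling of object attributes and spatial relations
# for synthetic QA data (attribute vocabularies and templates are assumptions).
import itertools
import random

SHAPES = ["cube", "sphere", "cylinder"]
COLORS = ["red", "green", "blue"]
RELATIONS = ["left of", "right of", "in front of", "behind"]

def generate_balanced_pairs(n_per_combo=2, seed=0):
    rng = random.Random(seed)
    samples = []
    # Enumerate every (object A, object B, scene relation) combination so no
    # attribute or relation is over-represented, unlike scraped real scenes.
    for (s1, c1), (s2, c2), rel in itertools.product(
            itertools.product(SHAPES, COLORS),
            itertools.product(SHAPES, COLORS),
            RELATIONS):
        if (s1, c1) == (s2, c2):
            continue
        for _ in range(n_per_combo):
            asked_rel = rng.choice(RELATIONS)   # ask about a possibly different relation
            samples.append({
                "scene_spec": {"a": f"{c1} {s1}", "b": f"{c2} {s2}", "relation": rel},
                "question": f"Is the {c1} {s1} {asked_rel} the {c2} {s2}?",
                "answer": "yes" if asked_rel == rel else "no",
            })
    rng.shuffle(samples)
    return samples
```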