DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

arXiv — cs.CV · Wednesday, November 19, 2025 at 5:00:00 AM
  • DocSLM has been introduced as an efficient Small Vision-Language Model for understanding long multimodal documents.
  • The development of DocSLM is crucial as it enables deployment in resource-constrained environments.
  • The introduction of DocSLM aligns with ongoing efforts to improve the efficiency of AI models that process visual and textual data, part of a broader push to make such models operate effectively in diverse settings where traditional models often struggle with high memory demands.
— via World Pulse Now AI Editorial System


Recommended Readings
CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
Positive · Artificial Intelligence
The article introduces CORE (Compact Object-centric REpresentations), a novel approach to visual token compression in Large Vision-Language Models (LVLMs). Traditional token compression methods often struggle with high computational and memory costs due to the quadratic increase in visual tokens with image resolution. CORE utilizes an efficient segmentation decoder to create object masks, providing a semantic framework for merging visual tokens into compact representations. Additionally, a centroid-guided sorting mechanism ensures the spatial order of tokens is maintained, enhancing the overal…
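Below is a minimal sketch of the object-centric token-merging idea described above, assuming average pooling of patch tokens within each object mask followed by a row-major sort on mask centroids. The function name, tensor shapes, and pooling choice are illustrative assumptions, not CORE's actual implementation.

```python
import torch

def merge_tokens_by_object(visual_tokens, object_masks, centroids):
    """Illustrative sketch (not the CORE implementation): pool the visual
    tokens covered by each object mask into one compact token, then order
    the merged tokens by mask centroid to preserve rough spatial order.

    visual_tokens: (H*W, D) patch tokens from the vision encoder
    object_masks:  (K, H*W) binary masks, one per object
    centroids:     (K, 2) (row, col) centroid of each mask
    """
    merged = []
    for mask in object_masks:
        idx = mask.nonzero(as_tuple=True)[0]
        # Average-pool all patch tokens that fall inside this object mask.
        merged.append(visual_tokens[idx].mean(dim=0))
    merged = torch.stack(merged)  # (K, D) compact object-centric tokens

    # Centroid-guided sorting: approximate reading order via (row, col) key.
    order = torch.argsort(centroids[:, 0] * 1e4 + centroids[:, 1])
    return merged[order]


# Toy usage: 196 patch tokens (14x14 grid, dim 768) and 3 object masks.
tokens = torch.randn(196, 768)
masks = torch.zeros(3, 196)
masks[0, :50] = 1; masks[1, 50:120] = 1; masks[2, 120:] = 1
cents = torch.tensor([[2.0, 3.0], [7.0, 1.0], [12.0, 10.0]])
compact = merge_tokens_by_object(tokens, masks, cents)
print(compact.shape)  # torch.Size([3, 768])
```

The point of the sketch is the compression ratio: K object tokens replace H*W patch tokens, so the cost of downstream attention no longer grows quadratically with image resolution.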
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Positive · Artificial Intelligence
MVI-Bench is introduced as a comprehensive benchmark aimed at evaluating the robustness of Large Vision-Language Models (LVLMs) against misleading visual inputs. Traditional benchmarks have primarily focused on textual inputs, neglecting the significant impact of visual misrepresentation. MVI-Bench categorizes misleading visual inputs into three hierarchical levels: Visual Concept, Visual Attribute, and Visual Relationship, and includes 1,248 annotated Visual Question Answering (VQA) instances to facilitate detailed robustness assessments.
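A hedged sketch of how an MVI-Bench-style robustness evaluation could be organized around the three hierarchical levels; the `MVIInstance` schema and the `predict` callable are assumptions for illustration, not the benchmark's released format.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class MVIInstance:
    # Hypothetical per-example schema for a misleading-visual-input VQA item.
    image_path: str
    question: str
    answer: str
    level: str  # "Visual Concept", "Visual Attribute", or "Visual Relationship"

def accuracy_by_level(instances, predict):
    """Score a model separately on each hierarchical level of misleading input.

    `predict(image_path, question)` is any user-supplied callable that returns
    the model's answer as a string.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex in instances:
        total[ex.level] += 1
        if predict(ex.image_path, ex.question).strip().lower() == ex.answer.lower():
            correct[ex.level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in total}
```

Reporting accuracy per level rather than a single aggregate makes it visible whether a model fails on misleading concepts, attributes, or relationships specifically.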