CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs

arXiv — cs.CV · Wednesday, November 19, 2025 at 5:00:00 AM
  • CORE has been introduced as a new paradigm for visual token compression in Large Vision-Language Models (LVLMs); a generic token-merging sketch follows below.
  • This development is significant as it establishes a new state of the art.
— via World Pulse Now AI Editorial System
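
To make the idea concrete, here is a minimal, generic token-merging sketch in the spirit of visual token compression: similar visual tokens are clustered around representative anchors and mean-pooled before reaching the language model. The function name, the similarity-based anchor selection, and the keep_ratio parameter are illustrative assumptions, not CORE's actual algorithm.

```python
import torch
import torch.nn.functional as F

def merge_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Similarity-based token merging (a generic sketch, not CORE's method).

    tokens: (N, D) visual tokens from the vision encoder.
    Picks the most representative tokens as anchors, assigns every token
    to its nearest anchor, and mean-pools each cluster, shrinking N to
    roughly keep_ratio * N.
    """
    n_keep = max(1, int(tokens.shape[0] * keep_ratio))
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.T                      # (N, N) cosine similarities
    # Tokens with the highest mean similarity to all others become anchors.
    anchors = sim.mean(dim=-1).topk(n_keep).indices
    # Assign every token to its most similar anchor, then average clusters.
    assign = sim[:, anchors].argmax(dim=-1)      # (N,) anchor index per token
    merged = torch.zeros(n_keep, tokens.shape[1], dtype=tokens.dtype)
    counts = torch.zeros(n_keep, 1, dtype=tokens.dtype)
    merged.index_add_(0, assign, tokens)
    counts.index_add_(0, assign, torch.ones(tokens.shape[0], 1, dtype=tokens.dtype))
    return merged / counts.clamp(min=1)
```

Merging similar patch tokens before they enter the language model shortens the LLM's input sequence, which is where most of an LVLM's compute and memory is spent.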


Recommended Readings
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
Positive · Artificial Intelligence
DocSLM is a Small Vision-Language Model designed for efficient long-document understanding, addressing the limitations of Large Vision-Language Models (LVLMs) that require substantial memory. It features a Hierarchical Multimodal Compressor that encodes visual, textual, and layout information into a compact sequence, reducing memory usage while maintaining semantic integrity. Additionally, a Streaming Abstention mechanism allows for scalable processing of lengthy documents by filtering low-confidence responses.
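
As a rough illustration of the streaming-abstention idea, the sketch below processes a long document chunk by chunk, compresses each chunk, and drops low-confidence answers. The encode_and_compress and generate callables, the ChunkAnswer type, and the confidence threshold are hypothetical stand-ins, not DocSLM's published interface.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class ChunkAnswer:
    text: str
    confidence: float  # e.g., mean token probability, in [0, 1]

def answer_long_document(chunks: Iterable[object],
                         encode_and_compress,   # hypothetical: chunk -> compact token sequence
                         generate,              # hypothetical: tokens -> ChunkAnswer
                         threshold: float = 0.5) -> Optional[str]:
    """Streaming abstention sketch (assumed interface, not DocSLM's API).

    Each chunk is compressed to a short sequence and answered independently;
    low-confidence answers are filtered out and the best survivor returned.
    """
    best: Optional[ChunkAnswer] = None
    for chunk in chunks:
        compact = encode_and_compress(chunk)   # hierarchical compression step
        ans = generate(compact)
        if ans.confidence < threshold:
            continue                           # abstain: evidence too weak
        if best is None or ans.confidence > best.confidence:
            best = ans
    return best.text if best else None         # None = abstain on the whole document
```

Because chunks are handled one at a time, memory stays bounded regardless of document length, which is the point of the streaming design.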
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Positive · Artificial Intelligence
MVI-Bench is introduced as a comprehensive benchmark aimed at evaluating the robustness of Large Vision-Language Models (LVLMs) against misleading visual inputs. Traditional benchmarks have primarily focused on textual inputs, neglecting the significant impact of visual misrepresentation. MVI-Bench categorizes misleading visual inputs into three hierarchical levels: Visual Concept, Visual Attribute, and Visual Relationship, and includes 1,248 annotated Visual Question Answering (VQA) instances to facilitate detailed robustness assessments.
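
A minimal evaluation-harness sketch for a benchmark organized this way might tally accuracy per hierarchical level. The VQAInstance fields and the predict callable below are assumptions for illustration, not MVI-Bench's actual data schema or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three hierarchical levels named in the paper summary.
LEVELS = ("Visual Concept", "Visual Attribute", "Visual Relationship")

@dataclass
class VQAInstance:
    image_path: str
    question: str
    answer: str
    level: str  # one of LEVELS

def robustness_by_level(instances, predict):
    """Per-level accuracy tally (a sketch; `predict` is an assumed model wrapper)."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in instances:
        total[ex.level] += 1
        if predict(ex.image_path, ex.question).strip().lower() == ex.answer.strip().lower():
            correct[ex.level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}
```

Reporting accuracy per level, rather than one aggregate score, is what lets a benchmark like this localize whether a model fails on misleading concepts, attributes, or relationships.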