Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

arXiv — cs.CV · Monday, November 24, 2025, 5:00 AM
  • A new approach called Query-aware Token Selector (QTSplus) has been introduced to enhance long-video understanding in multimodal large language models (MLLMs). The module targets a scaling problem: the number of vision tokens grows with video length, driving up attention cost and latency. QTSplus dynamically selects the visual tokens most relevant to the text query, making long-video processing more efficient (a minimal sketch of the idea appears after this list).
  • The development of QTSplus is significant because it lets MLLMs handle long-form video content, which is crucial for applications in fields such as education, entertainment, and surveillance. By passing along only the visual information that matters for a given query, QTSplus helps the models deliver accurate and contextually relevant outputs.
  • This advancement reflects a broader trend in AI toward more efficient and effective multimodal models. As demand for sophisticated video analysis grows, techniques like QTSplus and other emerging frameworks highlight the ongoing effort to refine AI capabilities, particularly where real-time processing and decision-making are required.
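
The core mechanism is easy to sketch. Below is a minimal, illustrative PyTorch version, not the authors' implementation: it assumes a pooled text-query embedding, scores every vision token against it with a scaled dot product, and keeps a fixed top-k budget in temporal order. The function name select_tokens and the fixed budget k are assumptions for illustration; the paper's module may compute scores and budgets differently.

    import torch

    def select_tokens(vision_tokens, query_embed, k):
        """Score each vision token against the pooled text query, keep top-k.

        vision_tokens: (N, d) frame/patch embeddings for the whole video
        query_embed:   (d,)  pooled embedding of the text query
        k:             number of tokens to pass on to the LLM
        """
        # Relevance of each vision token to the query (scaled dot product).
        scores = vision_tokens @ query_embed / vision_tokens.shape[-1] ** 0.5
        # Keep the k highest-scoring tokens, then restore temporal order so
        # the language model still sees events in sequence.
        topk = torch.topk(scores, k).indices.sort().values
        return vision_tokens[topk], topk

    # Example: 4096 patch tokens from a long video, keep a budget of 256.
    tokens = torch.randn(4096, 768)
    query = torch.randn(768)
    kept, idx = select_tokens(tokens, query, k=256)

Re-sorting the selected indices is a deliberate choice here: selection is driven by relevance, but the surviving tokens are handed to the LLM in their original temporal order.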
— via World Pulse Now AI Editorial System

Continue Reading
Loss-Oriented Ranking for Automated Visual Prompting in LVLMs
Positive · Artificial Intelligence
A new approach called AutoV has been introduced to enhance the performance of large vision-language models (LVLMs) by automatically selecting optimal visual prompts based on textual queries and input images. This method addresses the difficulty of manually designing effective visual prompts, which is time-consuming and often leads to sub-optimal results.
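
The loss-oriented ranking idea lends itself to a short sketch. The toy version below is an assumption-laden illustration, not AutoV itself: prompt_fns stands in for whatever candidate overlays (boxes, arrows, masks) are being compared, and model.answer_loss is a hypothetical helper returning the language-model loss on a reference answer.

    import torch

    def rank_visual_prompts(model, image, question, answer, prompt_fns):
        """Rank candidate visual prompts by the model's loss on the
        reference answer; lower loss suggests a more helpful prompt."""
        scored = []
        for fn in prompt_fns:
            prompted = fn(image)  # image with one candidate overlay drawn on
            with torch.no_grad():
                # Hypothetical helper: loss of the reference answer given
                # the prompted image and the question.
                loss = model.answer_loss(prompted, question, answer)
            scored.append((loss.item(), fn))
        scored.sort(key=lambda pair: pair[0])  # best (lowest loss) first
        return scored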
Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training
Positive · Artificial Intelligence
A new framework named ReVeL (Rewrite and Verify by LLM) has been proposed to enhance the evaluation of multiple-choice question answering (MCQA) by transforming questions into open-form formats while maintaining verifiability. This approach aims to address the limitations of traditional MCQA, which can lead to unreliable accuracy metrics due to answer guessing behaviors during reinforcement fine-tuning (RFT).
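
A toy version of the rewrite-and-verify flow might look like the following; the prompt strings and the generic llm callable (text in, text out) are illustrative assumptions, not ReVeL's actual prompts or interface.

    REWRITE_PROMPT = (
        "Rewrite this multiple-choice question as a single open-ended "
        "question whose answer can be checked exactly:\n{question}"
    )
    VERIFY_PROMPT = (
        "Question: {question}\nReference answer: {reference}\n"
        "Candidate answer: {candidate}\nAre they equivalent? Reply yes or no."
    )

    def evaluate_open_form(llm, mcq_question, options, gold_index, candidate):
        """Toy rewrite-and-verify loop: strip away the options, then have
        the LLM judge the free-form answer against the gold option's text."""
        open_q = llm(REWRITE_PROMPT.format(question=mcq_question))
        verdict = llm(VERIFY_PROMPT.format(
            question=open_q,
            reference=options[gold_index],
            candidate=candidate,
        ))
        return verdict.strip().lower().startswith("yes")

Removing the fixed option list is what closes off the guessing shortcut: a model can no longer score by elimination, yet the answer remains mechanically checkable.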
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Positive · Artificial Intelligence
EvoLMM, a self-evolving framework for large multimodal models, has been introduced to enhance reasoning capabilities without relying on human-annotated data. This framework consists of two cooperative agents: a Proposer that generates diverse questions and a Solver that answers them through a continuous self-rewarding process. This innovation aims to improve the autonomy and scalability of multimodal models.
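
The two-agent loop can be sketched in a few lines. Everything below is a placeholder interface, not EvoLMM's design: the continuous reward shown is a simple self-consistency score (the fraction of sampled answers agreeing with the majority), which stands in for whatever reward the framework actually uses.

    def self_evolve_step(proposer, solver, optimizer_step, n_samples=4):
        """One round of the Proposer/Solver loop: the Proposer emits a
        question, the Solver samples several answers, and a continuous
        reward drives the update with no human-annotated data."""
        question = proposer.generate()
        answers = [solver.answer(question) for _ in range(n_samples)]
        # Continuous reward used here: self-consistency, i.e. the fraction
        # of sampled answers agreeing with the majority answer (an
        # assumption; the paper's reward design may differ).
        majority = max(set(answers), key=answers.count)
        reward = answers.count(majority) / n_samples
        optimizer_step(solver, question, majority, reward)
        return question, majority, reward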