An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

arXiv — cs.CV · Friday, November 21, 2025 at 5:00:00 AM
  • The paper introduces the verbose-text induction attack (VTIA), in which a crafted input image drives a vision-language model (VLM) into producing excessively long text output (a rough sketch of the general idea appears after this summary).
  • VTIA matters because it shows how a VLM's output length can be deliberately manipulated through its visual input, a concern for the efficiency of VLM applications in fields such as document understanding and video intelligence.
  • This advancement reflects a broader trend in AI research toward improving the efficiency and effectiveness of VLMs, seen in frameworks designed to enhance visual reasoning and document understanding, and it signals continued work on the limitations of traditional models.
— via World Pulse Now AI Editorial System
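The summary above does not spell out how the attack is constructed. Below is a minimal, hypothetical sketch of the general idea of inducing verbose output through the image alone, assuming a Hugging Face-style VLM; the `model`/`processor` objects and the EOS-suppression objective are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of a verbose-text induction attack: perturb image pixels so the
# VLM assigns low probability to the end-of-sequence token, nudging it toward longer
# outputs. The loss and model interface are assumptions, not the paper's objective.
import torch

def verbose_induction_attack(model, processor, image, prompt,
                             steps=100, eps=8 / 255, lr=1e-2):
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    pixel_values = inputs["pixel_values"]
    delta = torch.zeros_like(pixel_values, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    eos_id = model.config.eos_token_id  # assumed to live on the model config

    for _ in range(steps):
        logits = model(input_ids=inputs["input_ids"],
                       pixel_values=pixel_values + delta).logits
        # Push probability mass away from EOS at every position, a crude proxy
        # for "keep generating".
        loss = torch.log_softmax(logits, dim=-1)[..., eos_id].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the perturbation small (L_inf ball)

    return (pixel_values + delta).detach()
```

Suppressing the end-of-sequence token is only one plausible objective; the paper's actual loss may target output length differently.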


Continue Reading
An Interpretability-Guided Framework for Responsible Synthetic Data Generation in Emotional Text
Neutral · Artificial Intelligence
The article presents an interpretability-guided framework for generating synthetic data in emotional text analysis, addressing the challenges of high costs and restrictions in accessing training data. Utilizing Shapley Additive Explanations (SHAP), the framework enhances the performance of large language models (LLMs) in emotion classification, particularly for underrepresented classes. However, it notes limitations in vocabulary richness and expression complexity in synthetic texts compared to authentic social media posts.
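As an illustration of how SHAP attributions could steer generation, here is a minimal sketch assuming the `shap` library and a Hugging Face text-classification pipeline; the model name and prompt template are examples, not the paper's setup.

```python
# Sketch: use SHAP token attributions from an emotion classifier to pick "anchor"
# words, then fold them into a prompt requesting synthetic examples.
import numpy as np
import shap
from transformers import pipeline

clf = pipeline("text-classification",
               model="j-hartmann/emotion-english-distilroberta-base",  # example model
               top_k=None)  # return scores for all emotion classes
explainer = shap.Explainer(clf)
shap_values = explainer(["I can't believe they forgot my birthday again."])

vals, toks = shap_values.values[0], shap_values.data[0]  # (tokens, classes), token strings
target = int(np.argmax(vals.sum(axis=0)))                # class receiving the most attribution
anchors = [toks[i] for i in np.argsort(vals[:, target])[-5:]]
prompt = ("Write a short social-media post expressing the same emotion, "
          f"using cues like: {', '.join(anchors)}")
```

The attributions identify tokens that drive a given emotion label, which can then be worked into the prompt used to request synthetic examples for underrepresented classes.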
QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation
Positive · Artificial Intelligence
QueryGym is a new Python toolkit designed for large language model (LLM)-based query reformulation. It aims to provide a unified framework that enhances retrieval effectiveness by allowing consistent implementation, execution, and comparison of various LLM-based methods. The toolkit includes a Python API, a retrieval-agnostic interface for integration with backends like Pyserini and PyTerrier, and a centralized prompt management system.
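QueryGym's own API is not reproduced in this summary; the sketch below only illustrates the general pattern the toolkit targets, with a hypothetical `rewrite_with_llm` helper standing in for its centrally managed prompts and Pyserini as one of the supported backends.

```python
# Illustrative only: an LLM rewrites the query, and the rewrite is sent to a
# retrieval backend. This is not QueryGym's actual interface.
from pyserini.search.lucene import LuceneSearcher

def rewrite_with_llm(query: str) -> str:
    # Placeholder: call your LLM of choice with a reformulation prompt.
    return query + " background causes treatment"

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
hits = searcher.search(rewrite_with_llm("what causes migraines"), k=10)
for h in hits:
    print(h.docid, round(h.score, 2))
```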
HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models
Positive · Artificial Intelligence
HiViS (Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models) is a proposed framework aimed at enhancing the efficiency of Vision-Language Models (VLMs). It addresses the computational challenges posed by visual tokens by allowing the drafter to obtain visual information without explicitly processing these tokens. This approach ensures that the drafter's prefill sequence length aligns with that of the textual tokens, potentially improving inference speed and quality.
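For context, here is a schematic of the speculative-decoding loop such a framework builds on, with the drafter running on text tokens only; `drafter_step` and `target_logits` are hypothetical callables, and how HiViS actually feeds visual information to the drafter is not described in this summary.

```python
# Greedy speculative decoding sketch: a drafter proposes k tokens from text context
# alone, and the full VLM (with visual features) verifies them in one forward pass.
import torch

def speculative_decode(target_logits, drafter_step, text_ids, visual_feats,
                       k=4, max_new=64):
    out = text_ids                                   # (1, T) text-only prompt ids
    while out.shape[1] - text_ids.shape[1] < max_new:
        draft = drafter_step(out, k)                 # drafter's k proposed tokens, (1, k)
        logits = target_logits(torch.cat([out, draft], dim=1), visual_feats)
        verified = logits[:, -k - 1:-1].argmax(-1)   # target's greedy choices at draft positions
        n_ok = int((verified == draft).long().cumprod(dim=1).sum())  # accepted prefix length
        # Take the target's own token at the first mismatch (or after a full match).
        next_tok = logits[:, out.shape[1] + n_ok - 1].argmax(-1, keepdim=True)
        out = torch.cat([out, draft[:, :n_ok], next_tok], dim=1)
    return out
```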
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Neutral · Artificial Intelligence
Recent advancements in language models have led to the development of Large Reasoning Models (LRMs) that articulate detailed thought processes before arriving at conclusions. Despite their enhanced performance on reasoning benchmarks, the fundamental capabilities and limitations of these models are not yet fully understood. Evaluations have primarily focused on final answer accuracy, often overlooking the reasoning processes involved.
FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks
Positive · Artificial Intelligence
The paper introduces FlipVQA-Miner, an automated pipeline designed to extract high-quality question-answer (QA) and visual question-answer (VQA) pairs from educational documents. This method combines layout-aware OCR with large language model (LLM)-based semantic parsing, addressing the challenge of transforming raw PDFs into AI-ready supervision. Experiments demonstrate that the approach yields accurate and aligned QA/VQA pairs, enhancing the utility of educational materials for training LLMs.
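A simplified sketch of the OCR-then-parse pattern described above follows; the paper's layout-aware OCR and cross-page QA matching are more involved, and `call_llm` is a hypothetical wrapper around whichever LLM endpoint is available.

```python
# Sketch: render PDF pages, OCR them, and ask an LLM to emit QA pairs as JSON.
# Plain pytesseract stands in for the paper's layout-aware OCR.
import json
import pytesseract
from pdf2image import convert_from_path

PROMPT = ("Extract exercise questions and their answers from this textbook page text. "
          "Return a JSON list of objects with 'question' and 'answer' fields.\n\n{page}")

def mine_qa_pairs(pdf_path, call_llm):
    pairs = []
    for img in convert_from_path(pdf_path):
        text = pytesseract.image_to_string(img)
        raw = call_llm(PROMPT.format(page=text))  # assumed to return valid JSON
        pairs.extend(json.loads(raw))
    return pairs
```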
T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
Positive · Artificial Intelligence
The paper introduces T2T-VICL, a collaborative pipeline designed to explore cross-task visual in-context learning (VICL) using vision-language models (VLMs). It focuses on generating and selecting text prompts that effectively describe differences between distinct low-level vision tasks. The study also presents a novel inference framework that integrates perceptual reasoning with traditional evaluation metrics, aiming to enhance the capabilities of VLMs in handling diverse visual tasks.
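One way to picture the "generate and select text prompts" step is the sketch below, which scores candidate prompts by reconstruction quality on a held-out pair; `vlm_restore`, the candidate prompts, and the PSNR-based criterion are assumptions for illustration and may not match T2T-VICL's actual selection rule.

```python
# Sketch: pick the text prompt (describing how the target task differs from the
# in-context task) that yields the best output on a validation input/target pair.
import numpy as np

def psnr(a, b):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)

def select_prompt(vlm_restore, candidates, context_pair, val_input, val_target):
    best, best_score = None, -np.inf
    for prompt in candidates:
        pred = vlm_restore(context_pair, val_input, prompt)  # hypothetical VLM call
        score = psnr(pred, val_target)
        if score > best_score:
            best, best_score = prompt, score
    return best
```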
RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification
Positive · Artificial Intelligence
The paper presents Rationale-Bootstrapped Fine-Tuning (RB-FT) for enhancing video classification using Vision Language Models (VLMs). It addresses the challenge of limited domain-specific data by proposing a two-stage self-improvement approach. Initially, VLMs generate textual rationales for videos, which are then used for fine-tuning, followed by conventional supervised fine-tuning on task labels, resulting in improved classification effectiveness.
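A high-level sketch of the two-stage recipe reads as follows; `vlm.generate`, the `finetune` helper, and the rationale prompt are stand-ins rather than the paper's exact implementation.

```python
# Stage 1: fine-tune on self-generated rationales; Stage 2: supervised fine-tuning
# on the task labels. Both helpers are hypothetical placeholders.
def rationale_bootstrapped_ft(vlm, videos, labels, finetune):
    # Stage 1: the VLM writes a rationale for each video, and we fine-tune on
    # (video, rationale) pairs so the model internalizes domain-relevant cues.
    rationales = [vlm.generate(v, "Explain which visual cues indicate the activity shown.")
                  for v in videos]
    vlm = finetune(vlm, inputs=videos, targets=rationales)

    # Stage 2: conventional supervised fine-tuning on the classification labels.
    vlm = finetune(vlm, inputs=videos, targets=labels)
    return vlm
```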
VisPlay: Self-Evolving Vision-Language Models from Images
Positive · Artificial Intelligence
VisPlay is a self-evolving reinforcement learning framework designed to enhance Vision-Language Models (VLMs) by enabling them to autonomously improve their reasoning capabilities using large amounts of unlabeled image data. It operates by assigning two roles to the model: an Image-Conditioned Questioner and a Multimodal Reasoner, which are trained together to balance question complexity and answer quality.
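Schematically, one self-evolution step might look like the sketch below; the reward terms and the `rl_update` routine are assumptions for illustration, not VisPlay's published recipe.

```python
# Sketch of the questioner/reasoner self-play loop over unlabeled images.
def self_evolve_step(model, images, rl_update, difficulty, answer_quality):
    trajectories = []
    for img in images:
        question = model.ask(img)              # Image-Conditioned Questioner role
        answer = model.answer(img, question)   # Multimodal Reasoner role
        # Reward trades off how challenging the question is against answer quality,
        # discouraging both trivial questions and unanswerable ones.
        reward = difficulty(question) * answer_quality(img, question, answer)
        trajectories.append((img, question, answer, reward))
    return rl_update(model, trajectories)      # e.g. a policy-gradient style update
```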