OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding

arXiv — cs.CV · Monday, November 24, 2025 at 5:00:00 AM
  • OmniPT is a new unified framework for pedestrian tracking that leverages Large Vision Language Models (LVLMs) to enhance object tracking and understanding through semantic processing. The framework targets the performance gap on instance-level tasks such as visual grounding and object detection, which specialized expert models have traditionally dominated (a minimal interaction sketch follows this summary).
  • OmniPT is significant because it couples pedestrian tracking with natural language understanding, enabling more interactive and context-aware tracking: a user could, for example, describe a target pedestrian in free-form text rather than supplying a bounding box. This positions the framework competitively in the evolving landscape of AI-driven object tracking.
  • OmniPT reflects a broader trend in AI research toward integrating multimodal capabilities, alongside related work on visual token compression and robustness to misleading inputs. These efforts underscore the open challenges of keeping LVLMs both accurate and efficient on complex instance-level tasks.
— via World Pulse Now AI Editorial System
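To make the interaction style concrete, here is a minimal sketch of language-prompted pedestrian tracking with an LVLM. The `model.generate` interface, the prompt format, and the `Track`/detection fields are hypothetical illustrations of the general idea, not OmniPT's published API.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: list = field(default_factory=list)  # per-frame (x1, y1, x2, y2)

def track_by_description(model, frames, description):
    """Query the LVLM once per frame for boxes and stable ids matching a
    free-form pedestrian description, then group detections by id."""
    tracks = {}
    for frame in frames:
        # One grounded-detection query per frame; a real system would batch
        # frames and carry temporal state inside the model.
        response = model.generate(
            image=frame,
            prompt=f"Return bounding boxes and stable ids for: {description}",
        )
        for det in response.detections:  # assumed fields: det.id, det.box
            tracks.setdefault(det.id, Track(det.id)).boxes.append(det.box)
    return list(tracks.values())
```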


Continue Reading
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
Positive · Artificial Intelligence
A new study introduces the Intervene-All-Paths framework, which mitigates hallucinations in Large Vision-Language Models (LVLMs) by intervening jointly on the causal pathways that produce them. The work finds that hallucinations arise from multiple sources, including image-to-input-text and text-to-text interactions, and proposes pathway-specific interventions tailored to different question-answer alignment formats.
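The pathway framing suggests a simple mental model: weaken one causal route and observe the effect on hallucination. Below is a hedged sketch, assuming a hypothetical decoder attention hook, that down-weights the image-token-to-text-token pathway; the paper's actual interventions and per-format handling are more involved.

```python
import torch

def dampen_image_to_text_attention(attn_logits, image_token_mask, alpha=0.5):
    """attn_logits: (heads, query_len, key_len) pre-softmax scores.
    image_token_mask: (key_len,) bool, True where the key is an image token.
    Adding log(alpha) to image-token logits scales their exponentiated
    scores by alpha, shrinking their post-softmax weight and thereby
    weakening the image -> text causal pathway."""
    bias = torch.where(image_token_mask,
                       torch.log(torch.tensor(alpha)),
                       torch.tensor(0.0))
    return attn_logits + bias
```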
Draft and Refine with Visual Experts
Positive · Artificial Intelligence
Building on recent advances in Large Vision-Language Models (LVLMs), the Draft and Refine (DnR) framework strengthens model reasoning by quantifying how much an answer actually relies on visual evidence, using a question-conditioned utilization metric. When utilization is low, the model's initial draft is refined with targeted feedback from visual experts, reducing ungrounded or hallucinated responses.
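As a rough illustration of the loop DnR describes, here is a minimal sketch. The helpers `lvlm.answer`, `utilization_score` (standing in for the question-conditioned utilization metric), and `visual_expert.feedback` are assumptions for illustration; the paper's actual metric and expert interface may differ.

```python
def draft_and_refine(lvlm, visual_expert, utilization_score,
                     image, question, threshold=0.5, max_rounds=3):
    """Draft an answer, score its grounding in the image, and refine with
    expert evidence until the utilization score clears the threshold."""
    answer = lvlm.answer(image, question)  # initial draft
    for _ in range(max_rounds):
        score = utilization_score(lvlm, image, question, answer)
        if score >= threshold:  # draft is sufficiently grounded
            break
        # Low utilization: fetch targeted evidence (e.g., from a detector
        # or OCR expert) and redraft conditioned on it.
        evidence = visual_expert.feedback(image, question, answer)
        answer = lvlm.answer(image, question, extra_context=evidence)
    return answer
```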