arXiv:2511.16150v1 Announce Type: new 
Abstract: Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.

The paper 'Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval' proposes improving multimodal embeddings by drawing on the reasoning ability of Multimodal Large Language Models (MLLMs). The proposed method, Reasoning Guided Embeddings (RGE), couples structured rationale generation with contrastive training and extracts representations only after the reasoning has been generated. On the MMEB benchmark, the authors report a 4.9% improvement in multimodal retrieval performance over the non-reasoning baseline.
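To make the "contrastive training" part concrete, here is a minimal sketch of an in-batch InfoNCE loss, one common contrastive objective. The abstract does not say which loss RGE uses, so the choice of InfoNCE, the function names `l2_normalize` and `info_nce_loss`, and the `temperature` value are all assumptions for illustration. The input vectors stand in for embeddings that, in RGE, would be extracted after the model has generated its rationale.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean in-batch InfoNCE loss (assumed objective, not confirmed by the paper).

    queries[i] is paired with targets[i]; every other target in the batch
    is treated as a negative for that query.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching target as the correct class.
        total += log_denom - logits[i]
    return total / len(q)


# Toy embeddings: the aligned batch pairs each query with its own target,
# the swapped batch pairs each query with the wrong one.
queries = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = info_nce_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing gives a loss near zero and the swapped pairing gives a large one, which is the signal contrastive training uses to pull matching query and target embeddings together.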

Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval

arXiv:2511.12861v3 Announce Type: replace-cross 
Abstract: With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.

With Multimodal Large Language Models (MLLMs) now performing well on perception tasks, research attention has shifted to their complex reasoning capabilities, particularly through the Chain-of-Thought (CoT) paradigm. CoT aims to make reasoning more transparent and outputs more interpretable, addressing challenges such as opaque reasoning paths and limited generalization. This systematic review of Multimodal Chain-of-Thought (MCoT) covers its background and theoretical motivations, mainstream methods (CoT paradigms, the post-training stage, and the inference stage), evaluation benchmarks and metrics, application scenarios, and open challenges.
