UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

arXiv — cs.CV · Thursday, November 20, 2025 at 5:00:00 AM
  • The introduction of UniME-V2 applies an MLLM-as-a-judge approach to universal multimodal embedding learning.
  • This matters because it improves the model's ability to distinguish subtle semantic differences between candidates, a capability that downstream multimodal applications depend on (a minimal sketch of the idea follows this summary).
  • Related work on MLLMs points to a broader trend of improving multimodal representation learning while addressing challenges such as visual hallucination and factual consistency.
— via World Pulse Now AI Editorial System
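
The summary above does not specify UniME-V2's training objective, but a common way to use an MLLM judge for embedding learning is to distill its pairwise relevance judgments into the embedding model's similarity distribution. The sketch below illustrates that idea; the function name, tensor shapes, and the `judge_scores` input are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch (PyTorch) of MLLM-as-a-judge supervision for embedding learning.
# All names are illustrative assumptions: `judge_scores` stands in for
# semantic-alignment scores that a judge MLLM assigns to each (query, candidate)
# pair, and the embedding model is trained so that its similarity distribution
# over candidates matches the judge's soft distribution rather than a single
# hard positive label.
import torch
import torch.nn.functional as F

def judge_distillation_loss(query_emb, cand_embs, judge_scores, temperature=0.05):
    """KL divergence between the student's similarity distribution over
    candidates and the judge's soft relevance distribution."""
    # Cosine similarity between the query and each candidate embedding.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), cand_embs, dim=-1) / temperature
    student_logprobs = F.log_softmax(sims, dim=-1)
    # Judge ratings (e.g., 0-1 alignment scores) normalized into a target distribution.
    target = F.softmax(judge_scores / temperature, dim=-1)
    return F.kl_div(student_logprobs, target, reduction="sum")

# Toy usage: one query, four candidates, the judge rates candidate 0 highest.
query_emb = torch.randn(256)
cand_embs = torch.randn(4, 256)
judge_scores = torch.tensor([0.9, 0.6, 0.2, 0.1])
print(judge_distillation_loss(query_emb, cand_embs, judge_scores))
```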


Continue Reading
T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model
Positive · Artificial Intelligence
The introduction of T2I-RiskyPrompt advances safety evaluation for text-to-image (T2I) models. It addresses the limitations of existing risky-prompt datasets by providing a comprehensive benchmark built on a hierarchical risk taxonomy with 6,432 annotated prompts.
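
As an illustration only, the record layout below shows one plausible way a risky-prompt benchmark entry under a hierarchical taxonomy could be structured; the class and field names are assumptions and do not reflect the actual T2I-RiskyPrompt schema.

```python
# Illustrative sketch of a benchmark entry with hierarchical risk labels.
# Field names are hypothetical, not the T2I-RiskyPrompt format.
from dataclasses import dataclass

@dataclass
class RiskyPromptEntry:
    prompt: str            # the text-to-image prompt under evaluation
    risk_category: str     # top-level category in the hierarchical taxonomy
    risk_subcategory: str  # finer-grained label beneath the top-level category
    is_risky: bool         # annotation: whether the prompt should be refused or filtered

# Toy example entry (hypothetical values).
example = RiskyPromptEntry(
    prompt="...",
    risk_category="violence",
    risk_subcategory="graphic injury",
    is_risky=True,
)
print(example)
```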
Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization
Positive · Artificial Intelligence
A new study introduces the Parallel Decoupling Framework (PDF) for multimodal embedding learning, leveraging the capabilities of Multimodal Large Language Models (MLLMs) to create multiple parallel embeddings from a single input. This approach aims to overcome the limitations of traditional embedding models, which often reduce complex inputs to singular representations.
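
To make the parallel-embedding idea concrete, the sketch below projects one shared feature through several heads and penalizes their pairwise similarity as a simple stand-in for mutual-information minimization. The class name, penalty, and dimensions are assumptions for illustration, not the PDF implementation.

```python
# Minimal sketch (PyTorch): K parallel heads map one pooled MLLM feature into K
# embeddings; a pairwise cosine-similarity penalty (a proxy for MI minimization)
# encourages the heads to capture complementary aspects of the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelEmbeddingHeads(nn.Module):
    def __init__(self, hidden_dim=1024, embed_dim=256, num_heads=4):
        super().__init__()
        # One linear projection per parallel embedding.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, embed_dim) for _ in range(num_heads)
        )

    def forward(self, mllm_feature):
        # mllm_feature: (batch, hidden_dim) pooled representation from the MLLM backbone.
        return torch.stack(
            [F.normalize(head(mllm_feature), dim=-1) for head in self.heads], dim=1
        )  # (batch, num_heads, embed_dim)

def decoupling_penalty(embeddings):
    """Mean absolute cosine similarity between distinct heads; minimizing it
    pushes the parallel embeddings apart (a crude MI-minimization proxy)."""
    sim = torch.einsum("bkd,bld->bkl", embeddings, embeddings)  # pairwise cosines
    k = embeddings.size(1)
    off_diag = sim - torch.eye(k, device=sim.device)            # drop self-similarity
    return off_diag.abs().sum() / (sim.numel() - sim.size(0) * k)

# Toy usage with a random stand-in for MLLM features.
heads = ParallelEmbeddingHeads()
feats = torch.randn(8, 1024)
embs = heads(feats)
print(embs.shape, decoupling_penalty(embs).item())
```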