arXiv:2503.23239v2 Announce Type: replace-cross 
Abstract: Although synthetic data has changed various aspects of information retrieval (IR) pipelines, the main training paradigm remains unchanged: contrastive learning with binary relevance labels, where one positive document is compared against several negatives using the InfoNCE loss. This objective treats all documents that are not explicitly annotated as relevant as equally negative, regardless of their actual degree of relevance, and so misses nuances that are useful for ranking. To overcome this limitation, in this work, we forgo real documents and annotations and use large language models to directly generate synthetic documents that answer the MS MARCO queries at several different levels of relevance. We also propose using the Wasserstein distance as a more effective loss function for training transformer-based retrievers with graduated relevance labels. Our experiments on the MS MARCO and BEIR benchmarks show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents, our method significantly improves self-supervised retrievers and is more robust to distribution shift than contrastive learning on real data. Our method also successfully integrates existing real data into the synthetic ranking context, further boosting performance. Overall, we show that generating multi-level ranking contexts is a better approach to synthetic data generation for IR than generating only the standard positive and negative documents.

A recent study highlights the transformative impact of synthetic data on information retrieval, moving beyond traditional contrastive learning methods. By enabling list-wise training that considers multiple levels of relevance, this approach promises to enhance the accuracy and efficiency of document retrieval systems.
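
To make the contrast concrete, the sketch below compares InfoNCE with one possible list-wise loss over graded labels. It is an illustration only, not the paper's formulation: the function names, the relevance levels, and the choice to compare a softmax over scores against normalized labels with a 1-D Wasserstein distance over rank positions are all assumptions made here.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def info_nce(scores, positive_index=0, temperature=1.0):
    """InfoNCE for one query: negative log-probability of the single positive.

    Every other document is treated as an equally weighted negative.
    """
    return -math.log(softmax(scores, temperature)[positive_index])


def wasserstein_1d(p, q):
    """W1 between two distributions on the same ordered, unit-spaced support.

    In one dimension this equals the summed absolute gap between the CDFs.
    """
    cdf_gap, distance = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i
        distance += abs(cdf_gap)
    return distance


def graded_list_loss(scores, relevance_levels, temperature=1.0):
    """Hypothetical list-wise loss over graded labels (an assumption, see above).

    Documents are ordered by decreasing relevance level; the normalized levels
    form the target distribution and the softmax of the scores the prediction.
    """
    order = sorted(range(len(scores)), key=lambda i: -relevance_levels[i])
    total = sum(relevance_levels[i] for i in order)
    target = [relevance_levels[i] / total for i in order]
    predicted = softmax([scores[i] for i in order], temperature)
    return wasserstein_1d(target, predicted)


# Made-up relevance levels: 3 = fully relevant ... 0 = irrelevant.
levels = [3, 2, 1, 0]
well_ordered = [4.0, 3.0, 2.0, 1.0]  # scores follow the relevance order
mis_ordered = [4.0, 1.0, 2.0, 3.0]   # same scores, lower levels shuffled
```

With these made-up inputs, `info_nce` returns the same value for both score lists, because only the top document counts as the positive, while `graded_list_loss` is smaller for `well_ordered` than for `mis_ordered`. That gap is the kind of signal a binary objective cannot see.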

Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance

arXiv:2601.06575v2 Announce Type: replace 
Abstract: Psychological research has long utilized circumplex models to structure emotions, placing similar emotions adjacently and opposing ones diagonally. Although frequently used to interpret deep learning representations, these models are rarely directly incorporated into the representation learning of language models, leaving their geometric validity unexplored. This paper proposes a method to induce circular emotion representations within language model embeddings via contrastive learning on a hypersphere. We show that while this circular alignment offers superior interpretability and robustness against dimensionality reduction, it underperforms compared to conventional designs in high-dimensional settings and fine-grained classification. Our findings elucidate the trade-offs involved in applying psychological circumplex models to deep learning architectures.

A recent study titled 'Are Emotions Arranged in a Circle?' explores the geometric analysis of emotion representations through hyperspherical contrastive learning, proposing a method to align emotions in a circular format within language model embeddings. This approach aims to enhance interpretability and robustness against dimensionality reduction, although it shows limitations in high-dimensional settings and fine-grained classification tasks.
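
The geometric idea behind a circumplex can be illustrated in a few lines. The sketch below places emotions on a unit circle and measures similarity by cosine, so neighbours score high and opposites score near -1. The emotion names and angles are placeholders chosen for illustration; they are not taken from the paper or from any psychological dataset, and the paper's training procedure is not reproduced here.

```python
import math

# Placeholder angles in degrees (hypothetical, for illustration only).
HYPOTHETICAL_ANGLES = {
    "joy": 0.0,
    "excitement": 45.0,
    "anger": 135.0,
    "sadness": 180.0,
}


def circle_embedding(emotion):
    """Map an emotion to a 2-D unit vector at its assigned angle."""
    theta = math.radians(HYPOTHETICAL_ANGLES[emotion])
    return (math.cos(theta), math.sin(theta))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def emotion_similarity(first, second):
    """Target similarity implied by the circular layout."""
    return cosine_similarity(circle_embedding(first), circle_embedding(second))
```

Under these placeholder angles, `emotion_similarity("joy", "excitement")` is higher than `emotion_similarity("joy", "anger")`, and `emotion_similarity("joy", "sadness")` is close to -1, matching the "adjacent versus diagonal" structure the abstract describes.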

Are Emotions Arranged in a Circle? Geometric Analysis of Emotion Representations via Hyperspherical Contrastive Learning

One More Thing in AI – Your Shortcut to AI Mastery
