Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency

arXiv — cs.CL · Monday, November 17, 2025, 5:00:00 AM
  • Grounded Visual Factualization (GVF) Finetuning targets visual hallucination in MLLMs: it integrates factual signals into the finetuning process so that generated text stays consistent with the visual input, improving model reliability.
  • By addressing known limitations in factual reasoning and setting a stronger benchmark for visual consistency, GVF Finetuning offers a path toward more dependable MLLMs and may shape future research and applications in this area.
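The summary above does not specify how GVF's factual signals enter the training objective, so the following is only a minimal, hypothetical sketch: a standard language-modeling loss augmented with a penalty for tokens that are weakly grounded in visual "factual anchors". The function name `gvf_loss`, the per-token `anchor_scores`, and the weighting `lam` are all illustrative assumptions, not the paper's actual formulation.

```python
import math

def gvf_loss(token_log_probs, anchor_scores, lam=0.5):
    """Hypothetical GVF-style objective (illustrative assumption, not the
    paper's method): average negative log-likelihood of the gold tokens,
    plus lam times a grounding penalty that grows as per-token visual
    anchor scores (in [0, 1]) drop toward 0."""
    n = len(token_log_probs)
    # Standard language-modeling loss: mean negative log-probability.
    lm_loss = -sum(token_log_probs) / n
    # Grounding penalty: tokens poorly supported by visual anchors cost more.
    anchor_penalty = sum(1.0 - s for s in anchor_scores) / n
    return lm_loss + lam * anchor_penalty

# Toy usage: two tokens, the second only half-grounded in the image.
loss = gvf_loss([math.log(0.5), math.log(0.25)], [1.0, 0.5])
```

A design like this lets the finetuning signal trade fluency against visual grounding through a single weight, which is one plausible reading of "integrating factual signals" in the summary.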
— via World Pulse Now AI Editorial System


Recommended Readings
MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding
Positive · Artificial Intelligence
The article presents MOON, a generative Multimodal Large Language Model (MLLM) for product representation learning in e-commerce. Traditional dual-flow architectures struggle to align the multiple images and texts associated with a product; MOON addresses this with generative modeling, though it still faces hurdles such as background noise in product images and the lack of standard evaluation benchmarks.