Positional Bias in Long-Document Ranking: Impact, Assessment, and Mitigation

arXiv — cs.CL · Thursday, November 13, 2025 at 5:00:00 AM
The recent study of long-document ranking evaluated over 20 Transformer models, including advanced rerankers such as RankGPT, against FirstP, a baseline that scores each document using only its leading passage. Despite their sophistication, none of these models outperformed FirstP by more than 5% on average across benchmark datasets such as MS MARCO, TREC DL, and Robust04. The researchers attributed this limited improvement to a positional bias inherent in the benchmarks: the most relevant passages tend to appear at the beginning of documents. This bias was present not only in long-document datasets but, surprisingly, also in short-document collections such as BEIR. To investigate further, the team introduced a new diagnostic dataset, MS MARCO FarRelevant, which intentionally places relevant information beyond the first 512 tokens. Many long-context models performed at random levels on this dataset, highlighting a significant challenge to their effectiveness. The study emphasizes the necessity for careful b…
— via World Pulse Now AI Editorial System
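
To make the diagnostic idea concrete, below is a minimal Python sketch, not the authors' actual dataset pipeline, of how a FarRelevant-style example could be built: filler text pushes the relevant passage past the first 512 tokens, so a FirstP-style truncation at that boundary never sees it. The FIRSTP_WINDOW constant, the helper names, and the whitespace tokenization are illustrative assumptions.

```python
# Illustrative sketch only: not the construction used for MS MARCO FarRelevant,
# just the underlying idea of moving relevance past the first 512 tokens.
# Whitespace splitting stands in for a real subword tokenizer.

FIRSTP_WINDOW = 512  # tokens visible to a FirstP-style baseline (assumed)


def build_far_relevant_doc(relevant_passage: str, filler_passages: list[str]) -> str:
    """Concatenate filler text until it covers FIRSTP_WINDOW tokens,
    then append the relevant passage, so relevance lies past token 512."""
    prefix_tokens: list[str] = []
    for passage in filler_passages:
        prefix_tokens.extend(passage.split())
        if len(prefix_tokens) >= FIRSTP_WINDOW:
            break
    if len(prefix_tokens) < FIRSTP_WINDOW:
        raise ValueError("not enough filler text to push relevance past the window")
    return " ".join(prefix_tokens) + " " + relevant_passage


def firstp_view(document: str) -> str:
    """What a FirstP-style baseline sees: only the first FIRSTP_WINDOW tokens."""
    return " ".join(document.split()[:FIRSTP_WINDOW])


if __name__ == "__main__":
    filler = ["lorem ipsum dolor sit amet " * 40]  # roughly 200 filler tokens per copy
    doc = build_far_relevant_doc(
        relevant_passage="the answer to the query appears only here",
        filler_passages=filler * 3,
    )
    truncated = firstp_view(doc)
    print("relevant text visible to FirstP?", "answer to the query" in truncated)  # False
```

Under these assumptions, any model that truncates at the FirstP boundary ranks such documents using filler alone, which is consistent with the random-level performance the study reports for many long-context models.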
