Are generative AI text annotations systematically biased?

arXiv — cs.CL · Wednesday, December 10, 2025 at 5:00:00 AM
  • A recent study examines bias in generative AI text annotations by replicating the manual annotations of Boukes (2024) with several Generative Large Language Models (GLLMs), including Llama3.1, Llama3.3, GPT4o, and Qwen2.5. The models achieve adequate F1 scores, yet they are systematically biased: they align more closely with one another than with the manual annotations, which changes the downstream results (a minimal sketch of this agreement check follows this summary).
  • This finding matters because it exposes a limitation of current GLLMs as stand-ins for human coders, raising concerns about the reliability of AI-generated annotations in applications such as research on political discourse and social media interactions.
  • Bias in AI systems is an increasingly pressing concern as the technology advances. Benchmarks such as FragFake, introduced to tackle the detection of AI-generated content, underscore the need for better methodologies to safeguard the integrity of AI outputs across domains.
— via World Pulse Now AI Editorial System
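
The paper's own code and data are not reproduced in this summary, but the core check it describes is straightforward to illustrate: score each model's annotations against the manual gold standard, then compare inter-model agreement with model-human agreement. The sketch below uses made-up labels and standard scikit-learn metrics; only the model names come from the summary, everything else is a placeholder.

```python
# Hypothetical illustration of the agreement comparison; labels are invented,
# not the study's data.
from itertools import combinations
from sklearn.metrics import f1_score, cohen_kappa_score

manual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]          # hand-coded gold labels
model_annotations = {
    "Llama3.1": [1, 0, 1, 0, 0, 1, 1, 0, 1, 1],
    "Llama3.3": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    "GPT4o":    [1, 0, 1, 0, 0, 1, 1, 1, 1, 0],
    "Qwen2.5":  [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
}

# Adequate F1 against the manual annotations does not rule out systematic bias.
for name, preds in model_annotations.items():
    print(name, "F1 vs manual:", round(f1_score(manual, preds), 2))

# Bias shows up as higher model-model agreement than model-human agreement.
model_pairs = [cohen_kappa_score(model_annotations[a], model_annotations[b])
               for a, b in combinations(model_annotations, 2)]
human_pairs = [cohen_kappa_score(manual, preds) for preds in model_annotations.values()]
print("mean kappa, model vs model:", round(sum(model_pairs) / len(model_pairs), 2))
print("mean kappa, model vs human:", round(sum(human_pairs) / len(human_pairs), 2))
```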

Continue Reading
UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Positive · Artificial Intelligence
The introduction of UniQL, a unified post-training quantization and low-rank compression framework, addresses the challenges of deploying large language models (LLMs) on mobile platforms, which often face limitations in memory and computational resources. This framework allows for on-device configurable pruning rates, enhancing the adaptability of edge LLMs.
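UniQL's actual algorithm is not detailed in this summary. As a rough illustration of the general idea of pairing post-training quantization with a low-rank term, the sketch below quantizes a weight matrix to int8 and keeps a truncated-SVD low-rank correction of the quantization residual; the rank, the per-tensor scale, and the function names are assumptions for illustration, not UniQL's method.

```python
# Generic sketch of quantization + low-rank compression (not UniQL's algorithm).
import numpy as np

def quantize_with_lowrank_residual(W, rank=8):
    """Approximate W as int8-quantized weights plus a low-rank correction."""
    scale = np.abs(W).max() / 127.0                  # symmetric per-tensor scale (assumed)
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    residual = W - W_q.astype(np.float32) * scale    # what quantization lost
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    L = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank]  # configurable low-rank part
    return W_q, scale, L

def dequantize(W_q, scale, L):
    return W_q.astype(np.float32) * scale + L

W = np.random.randn(256, 256).astype(np.float32)
W_q, scale, L = quantize_with_lowrank_residual(W, rank=16)
print("relative reconstruction error:",
      np.linalg.norm(W - dequantize(W_q, scale, L)) / np.linalg.norm(W))
```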
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Neutral · Artificial Intelligence
The introduction of PoSh, a new metric utilizing scene graphs, aims to enhance the evaluation of Vision-Language Models (VLMs) in generating detailed image descriptions. Traditional metrics like CIDEr and SPICE have struggled with longer texts, often failing to accurately assess compositional understanding and specific errors. PoSh seeks to provide a more interpretable and replicable scoring system, validated through the DOCENT dataset, which includes expert-written references for artwork.
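PoSh's scoring procedure is not spelled out in this summary; it guides an LLM judge with a scene graph rather than computing a fixed overlap score. Purely as a toy illustration of the kind of structured signal a scene graph provides, the sketch below compares hand-written (subject, relation, object) triples from a reference and a candidate description; the triples and function are invented for this example.

```python
# Toy scene-graph overlap between a reference and a candidate description.
# Triples are hand-written; a real system would extract them automatically.
def triple_f1(reference_triples, candidate_triples):
    ref, cand = set(reference_triples), set(candidate_triples)
    if not ref or not cand:
        return 0.0
    precision = len(ref & cand) / len(cand)
    recall = len(ref & cand) / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

reference = [("woman", "holding", "umbrella"), ("umbrella", "color", "red"),
             ("woman", "standing_on", "bridge")]
candidate = [("woman", "holding", "umbrella"), ("umbrella", "color", "blue")]
print("triple overlap F1:", round(triple_f1(reference, candidate), 2))
```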