Are generative AI text annotations systematically biased?

arXiv — cs.CL · Wednesday, December 10, 2025 at 5:00:00 AM
  • A recent study examines bias in generative AI text annotations by replicating the manual annotations of Boukes (2024) with several Generative Large Language Models (GLLMs), including Llama3.1, Llama3.3, GPT4o, and Qwen2.5. The models achieve adequate F1 scores, yet they are systematically biased: they align more closely with one another than with the manual annotations, which changes the downstream results (a minimal sketch of this agreement check follows this summary).
  • This finding matters because it exposes a limitation of current GLLMs as stand-ins for human coders, raising concerns about the reliability of AI-generated annotations in applications such as research on political discourse and social media interactions.
  • Bias in AI systems is an increasingly pressing concern as the technology advances. Benchmarks such as FragFake, introduced to tackle the detection of AI-generated content, underscore the need for better methodologies to safeguard the integrity of AI outputs across domains.
— via World Pulse Now AI Editorial System
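
The paper's own code and data are not reproduced in this summary, but the core check it describes is straightforward to illustrate: score each model's annotations against the manual gold standard, then compare inter-model agreement with model-human agreement. The sketch below uses made-up labels and standard scikit-learn metrics; only the model names come from the summary, everything else is a placeholder.

```python
# Hypothetical illustration of the agreement comparison; labels are invented,
# not the study's data.
from itertools import combinations
from sklearn.metrics import f1_score, cohen_kappa_score

manual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]          # hand-coded gold labels
model_annotations = {
    "Llama3.1": [1, 0, 1, 0, 0, 1, 1, 0, 1, 1],
    "Llama3.3": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    "GPT4o":    [1, 0, 1, 0, 0, 1, 1, 1, 1, 0],
    "Qwen2.5":  [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
}

# Adequate F1 against the manual annotations does not rule out systematic bias.
for name, preds in model_annotations.items():
    print(name, "F1 vs manual:", round(f1_score(manual, preds), 2))

# Bias shows up as higher model-model agreement than model-human agreement.
model_pairs = [cohen_kappa_score(model_annotations[a], model_annotations[b])
               for a, b in combinations(model_annotations, 2)]
human_pairs = [cohen_kappa_score(manual, preds) for preds in model_annotations.values()]
print("mean kappa, model vs model:", round(sum(model_pairs) / len(model_pairs), 2))
print("mean kappa, model vs human:", round(sum(human_pairs) / len(human_pairs), 2))
```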

Continue Reading
UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Positive · Artificial Intelligence
The introduction of UniQL, a unified post-training quantization and low-rank compression framework, addresses the challenges of deploying large language models (LLMs) on mobile platforms, which often face limitations in memory and computational resources. This framework allows for on-device configurable pruning rates, enhancing the adaptability of edge LLMs.
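UniQL's actual algorithm is not detailed in this summary. As a rough illustration of the general idea of pairing post-training quantization with a low-rank term, the sketch below quantizes a weight matrix to int8 and keeps a truncated-SVD low-rank correction of the quantization residual; the rank, the per-tensor scale, and the function names are assumptions for illustration, not UniQL's method.

```python
# Generic sketch of quantization + low-rank compression (not UniQL's algorithm).
import numpy as np

def quantize_with_lowrank_residual(W, rank=8):
    """Approximate W as int8-quantized weights plus a low-rank correction."""
    scale = np.abs(W).max() / 127.0                  # symmetric per-tensor scale (assumed)
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    residual = W - W_q.astype(np.float32) * scale    # what quantization lost
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    L = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank]  # configurable low-rank part
    return W_q, scale, L

def dequantize(W_q, scale, L):
    return W_q.astype(np.float32) * scale + L

W = np.random.randn(256, 256).astype(np.float32)
W_q, scale, L = quantize_with_lowrank_residual(W, rank=16)
print("relative reconstruction error:",
      np.linalg.norm(W - dequantize(W_q, scale, L)) / np.linalg.norm(W))
```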
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Neutral · Artificial Intelligence
The introduction of PoSh, a new metric utilizing scene graphs, aims to enhance the evaluation of Vision-Language Models (VLMs) in generating detailed image descriptions. Traditional metrics like CIDEr and SPICE have struggled with longer texts, often failing to accurately assess compositional understanding and specific errors. PoSh seeks to provide a more interpretable and replicable scoring system, validated through the DOCENT dataset, which includes expert-written references for artwork.
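PoSh's scoring procedure is not spelled out in this summary; it guides an LLM judge with a scene graph rather than computing a fixed overlap score. Purely as a toy illustration of the kind of structured signal a scene graph provides, the sketch below compares hand-written (subject, relation, object) triples from a reference and a candidate description; the triples and function are invented for this example.

```python
# Toy scene-graph overlap between a reference and a candidate description.
# Triples are hand-written; a real system would extract them automatically.
def triple_f1(reference_triples, candidate_triples):
    ref, cand = set(reference_triples), set(candidate_triples)
    if not ref or not cand:
        return 0.0
    precision = len(ref & cand) / len(cand)
    recall = len(ref & cand) / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

reference = [("woman", "holding", "umbrella"), ("umbrella", "color", "red"),
             ("woman", "standing_on", "bridge")]
candidate = [("woman", "holding", "umbrella"), ("umbrella", "color", "blue")]
print("triple overlap F1:", round(triple_f1(reference, candidate), 2))
```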