LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation

arXiv — cs.CL · Tuesday, December 9, 2025, 5:00 AM
  • Recent research indicates that large language models (LLMs) demonstrate biases in evaluation tasks, particularly favoring self-generated content. However, a study exploring retrieval-augmented generation (RAG) frameworks found no significant self-preference effect, suggesting that LLMs can evaluate factual content more impartially than previously thought.
  • This finding is crucial as it challenges the prevailing notion that LLMs are inherently biased in all evaluative contexts, potentially improving their application in fact-oriented tasks and enhancing trust in AI-generated outputs.
  • These results feed into ongoing discussions about the reliability and fairness of LLMs, alongside related work on prompt fairness and bias correction, and underscore the need for continued scrutiny and refinement of AI evaluation methods.
— via World Pulse Now AI Editorial System
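
To make the kind of experiment involved concrete, here is a minimal sketch of one common way to probe self-preference in an LLM judge: have it pick between its own answer and another model's answer to the same retrieved context, with slot order randomized to cancel position bias. The `query_judge` function and the trial format are hypothetical placeholders, not the study's actual protocol.

```python
import random

def query_judge(context, answer_a, answer_b):
    """Placeholder judge: returns 'A' or 'B'. Replace with a real LLM call
    that shows the retrieved context and both candidate answers."""
    return random.choice("AB")

def self_preference_rate(trials):
    """Fraction of pairwise comparisons where the judge picks its own answer.

    Each trial is (retrieved_context, judges_own_answer, other_models_answer).
    """
    picks_self = 0
    for context, own_answer, other_answer in trials:
        own_is_a = random.random() < 0.5      # randomize slot assignment
        a, b = (own_answer, other_answer) if own_is_a else (other_answer, own_answer)
        verdict = query_judge(context, a, b)  # 'A' or 'B'
        picks_self += (verdict == "A") == own_is_a
    return picks_self / len(trials)

# No self-preference corresponds to a rate statistically indistinguishable
# from 0.5, which is the pattern the RAG study summarized above reports.
trials = [("ctx", "own answer", "other answer")] * 1000
print(self_preference_rate(trials))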


Continue Reading
Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
Neutral · Artificial Intelligence
A comprehensive study has been conducted on the use of large language models (LLMs) for synthesizing public deliberations into neutral summaries. The research highlights the potential of LLMs to produce such summaries while raising concerns about their ability to represent minority perspectives and their sensitivity to input order. The study introduces DeliberationBank, a dataset created from contributions by 3,000 participants, aimed at evaluating LLM performance in summarization tasks.
When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation
Neutral · Artificial Intelligence
A recent empirical study on Large Language Models (LLMs) has revealed that the effectiveness of many-shot prompting for code translation may be overstated. Analyzing over 90,000 translations, researchers found that while more examples can improve static similarity metrics, functional correctness peaks with fewer examples, indicating a 'many-shot paradox'.
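
For readers unfamiliar with the setup, a many-shot prompt for code translation is simply a concatenation of worked example pairs followed by the snippet to translate; the paradox emerges when the shot count k is swept and outputs are scored by functional correctness (e.g., unit-test pass rate) rather than text similarity. The sketch below uses a hypothetical example pool and a placeholder model call, not the study's data or pipeline.

```python
# Hypothetical Java-to-Python example pool; the study's data is not shown here.
examples = [
    ("int add(int a, int b) { return a + b; }",
     "def add(a, b):\n    return a + b"),
    ("int neg(int a) { return -a; }",
     "def neg(a):\n    return -a"),
]
snippet = "int sq(int a) { return a * a; }"

def build_prompt(pairs, source_code, k):
    """Assemble a k-shot translation prompt from example pairs."""
    shots = [f"Java:\n{java}\nPython:\n{py}" for java, py in pairs[:k]]
    shots.append(f"Java:\n{source_code}\nPython:")
    return "\n\n".join(shots)

# Sweep the shot count and score each translation by running its tests;
# the 'many-shot paradox' is that pass rate can peak at small k even as
# text-similarity metrics keep improving with more shots.
for k in (0, 1, 2):
    prompt = build_prompt(examples, snippet, k)
    # translation = call_llm(prompt)          # placeholder for a real model call
    # pass_rate = run_unit_tests(translation) # hypothetical functional scorer
    print(f"k={k}: prompt is {len(prompt)} characters")
```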
QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Positive · Artificial Intelligence
QSTN has been introduced as an open-source Python framework designed to generate responses from questionnaire-style prompts, facilitating in-silico surveys and annotation tasks with large language models (LLMs). The framework allows for robust evaluation of questionnaire presentation and response generation methods, based on an extensive analysis of over 40 million survey responses.
A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
Positive · Artificial Intelligence
A recent study has introduced a systematic evaluation framework for aligning large language models (LLMs) with diverse human preferences in federated learning environments. This framework assesses the trade-off between alignment quality and fairness using various aggregation strategies for human preferences, including a novel adaptive scheme that adjusts preference weights based on historical performance.
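
As one illustrative reading of such an adaptive scheme (not the paper's exact algorithm), client preference weights can be multiplicatively boosted when a client's past feedback aligned well with held-out evaluations, then used to average per-client preference scores. All names and numbers below are hypothetical.

```python
import numpy as np

def aggregate(client_prefs, weights):
    """Weighted average of per-client scores for each candidate response."""
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ np.asarray(client_prefs)

def update_weights(weights, alignment_scores, lr=0.5):
    """Multiplicatively boost clients whose past feedback aligned well."""
    w = np.asarray(weights, dtype=float) * np.exp(lr * np.asarray(alignment_scores))
    return w / w.sum()

# Three clients score two candidate responses; the third client's feedback
# has historically aligned best, so it earns the largest weight.
client_prefs = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]
weights = update_weights([1.0, 1.0, 1.0], alignment_scores=[0.1, 0.5, 0.9])
print(weights, aggregate(client_prefs, weights))
```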
Balanced Accuracy: The Right Metric for Evaluating LLM Judges - Explained through Youden's J statistic
Neutral · Artificial Intelligence
The evaluation of large language models (LLMs) has been enhanced by introducing Balanced Accuracy as a metric, which is theoretically aligned with Youden's J statistic. This approach addresses the limitations of traditional metrics like Accuracy and Precision, which can be skewed by class imbalances and arbitrary positive class selections. By utilizing Balanced Accuracy, the selection of judges for model comparisons becomes more reliable and robust.
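
The alignment is easy to verify numerically: Balanced Accuracy is the mean of sensitivity and specificity, Youden's J is sensitivity + specificity − 1, so J = 2·BA − 1 and the two metrics rank judges identically. The confusion-matrix counts below are illustrative, not taken from the paper.

```python
# Illustrative confusion-matrix counts for a binary LLM judge.
tp, fn = 80, 20   # true positives, false negatives
tn, fp = 50, 50   # true negatives, false positives

sensitivity = tp / (tp + fn)                          # 0.80
specificity = tn / (tn + fp)                          # 0.50
balanced_accuracy = (sensitivity + specificity) / 2   # 0.65
youdens_j = sensitivity + specificity - 1             # 0.30

# Affine relationship: ranking judges by one ranks them by the other.
assert abs(youdens_j - (2 * balanced_accuracy - 1)) < 1e-12

# Why plain accuracy misleads: with 90% positives, a judge that always
# answers "positive" scores 0.90 accuracy but only 0.50 balanced accuracy.
print(balanced_accuracy, youdens_j)
```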
EEG-to-Text Translation: A Model for Deciphering Human Brain Activity
Positive · Artificial Intelligence
Researchers have introduced the R1 Translator model, which aims to enhance the decoding of EEG signals into text by combining a bidirectional LSTM encoder with a pretrained transformer-based decoder. This model addresses the limitations of existing EEG-to-text translation models, such as T5 and Brain Translator, and demonstrates superior performance in ROUGE metrics.
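
In outline, the described architecture runs a bidirectional LSTM over EEG feature frames and projects its states into the embedding width of a pretrained transformer decoder. The PyTorch sketch below covers only the encoder side, with all dimensions illustrative rather than the R1 Translator's actual configuration.

```python
import torch
import torch.nn as nn

class EEGEncoder(nn.Module):
    """Bidirectional LSTM over EEG feature frames, projected into the
    embedding width of a pretrained text decoder."""
    def __init__(self, eeg_dim=105, hidden=256, decoder_dim=768):
        super().__init__()
        self.lstm = nn.LSTM(eeg_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.proj = nn.Linear(2 * hidden, decoder_dim)  # fwd+bwd states

    def forward(self, eeg):                  # (batch, frames, eeg_dim)
        out, _ = self.lstm(eeg)              # (batch, frames, 2*hidden)
        return self.proj(out)                # (batch, frames, decoder_dim)

enc = EEGEncoder()
frames = enc(torch.randn(2, 50, 105))        # 2 recordings, 50 frames each
print(frames.shape)                          # torch.Size([2, 50, 768])
# These embeddings would then be consumed by the pretrained transformer
# decoder (e.g., via cross-attention) to generate the output text.
```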
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
Positive · Artificial Intelligence
The introduction of OMNIGUARD presents a novel approach to AI safety moderation, specifically targeting the detection of harmful prompts across various languages and modalities. The method improves harmful-prompt classification accuracy by 11.57% over existing baselines, addressing concerns about the misuse of large language models (LLMs) and their susceptibility to attacks that exploit language and modality mismatches.
Mistral launches powerful Devstral 2 coding model including open source, laptop-friendly version
Positive · Artificial Intelligence
French AI startup Mistral has launched the Devstral 2 coding model, which includes an open-source, laptop-friendly version optimized for software engineering tasks. The release follows the introduction of the Mistral 3 LLM family and targets developers who want to run capable coding models on local hardware.