OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking

arXiv — cs.CL, Thursday, November 13, 2025 at 5:00:00 AM
Open Knowledge Bench (OKBench) was introduced to address the limitations of static benchmarks for evaluating large language models (LLMs). Traditional benchmarks, often built from sources like Wikipedia, do not keep pace with the rapid evolution of knowledge, particularly in dynamic domains such as news. OKBench automates the sourcing, creation, validation, and distribution of benchmarks, enabling real-time updates and evaluations. Democratizing benchmark creation in this way also permits a more thorough assessment of retrieval-augmented methods, revealing distinct model behaviors when models face new information. The findings indicate that retrieval can effectively narrow the performance gap between smaller and larger models, underscoring the framework's potential to improve the LLM evaluation landscape.
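As a rough illustration of the kind of automation described, the sketch below mocks up an on-demand benchmark pipeline: source fresh documents, generate question-answer items, validate them against their sources, and assemble the result. The QAItem schema and every function name are hypothetical; the summary does not specify OKBench's actual interfaces.

```python
# Minimal sketch of an automated, on-demand benchmark pipeline in the spirit of
# OKBench. All names and the QAItem schema are illustrative assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class QAItem:
    question: str
    answer: str
    source: str      # provenance, used for validation
    created: date    # freshness timestamp

def source_documents() -> list[dict]:
    # Stand-in for a live news crawler; a real system would pull fresh articles here.
    return [{"url": "https://example.com/a",
             "text": "Acme Corp appointed Jane Doe as CEO."}]

def generate_items(doc: dict) -> list[QAItem]:
    # Stand-in for LLM-based question generation over a fresh document.
    return [QAItem(question="Who was appointed CEO of Acme Corp?",
                   answer="Jane Doe", source=doc["url"], created=date.today())]

def validate(item: QAItem) -> bool:
    # Stand-in for automatic checks (answer grounded in source, question well-formed).
    return bool(item.question and item.answer and item.source)

def build_benchmark() -> list[QAItem]:
    items = [it for doc in source_documents() for it in generate_items(doc)]
    return [it for it in items if validate(it)]

if __name__ == "__main__":
    for it in build_benchmark():
        print(it)
```

Because every stage is automated, the same pipeline can be rerun on demand to refresh the benchmark as new documents appear.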
— via World Pulse Now AI Editorial System


Recommended Readings
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
Positive · Artificial Intelligence
The article presents GMAT, a framework that enhances Multiple Instance Learning (MIL) for whole slide image (WSI) classification. By integrating vision-language models (VLMs), GMAT generates clinical descriptions that are more expressive and medically specific. This addresses a limitation of existing methods, which rely on large language models (LLMs) for description generation and often lack domain grounding and detailed medical specificity; the grounded descriptions align better with visual features.
Silenced Biases: The Dark Side LLMs Learned to Refuse
Negative · Artificial Intelligence
Safety-aligned large language models (LLMs) are increasingly used in sensitive applications where fairness is crucial. Evaluating their fairness is difficult because standard question-answer evaluations can misinterpret refusal responses as evidence of fairness. This paper introduces the concept of silenced biases: unfair preferences hidden within the models' latent space and masked by safety alignment. Because refusal-based evaluations cannot surface such preferences, new approaches are needed to uncover these biases effectively.
Fair In-Context Learning via Latent Concept Variables
Positive · Artificial Intelligence
The paper titled 'Fair In-Context Learning via Latent Concept Variables' explores the in-context learning (ICL) capabilities of large language models (LLMs) in handling tabular data. It highlights the potential for LLMs to inherit biases from pre-training data, which can lead to discrimination in high-stakes applications. The authors propose an optimal demonstration selection method using latent concept variables to enhance task adaptation and fairness, alongside data augmentation strategies to minimize correlations between sensitive variables and predictive outcomes.
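A minimal sketch of what latent-concept-guided demonstration selection could look like, assuming demonstrations and queries are embedded into a latent concept space and candidates are scored with a fairness penalty. The scoring rule, the penalty weight lam, and all function names are illustrative assumptions, not the paper's method.

```python
# Sketch: pick ICL demonstrations whose latent concepts align with the query
# while penalizing correlation with a sensitive attribute. Illustrative only.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_demonstrations(query_concept: np.ndarray,
                          demo_concepts: np.ndarray,
                          demo_sensitive: np.ndarray,
                          k: int = 4, lam: float = 0.5) -> list[int]:
    # Score each candidate by alignment with the query's latent concept,
    # minus a penalty for association with the sensitive attribute.
    scores = [cosine(query_concept, c) - lam * abs(s)
              for c, s in zip(demo_concepts, demo_sensitive)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy usage: 8 candidate demonstrations with 16-dimensional latent concepts.
rng = np.random.default_rng(0)
picked = select_demonstrations(rng.normal(size=16),
                               rng.normal(size=(8, 16)),
                               rng.uniform(-1, 1, size=8))
print("selected demonstration indices:", picked)
```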
Preference Orchestrator: Prompt-Aware Multi-Objective Alignment for Large Language Models
Positive · Artificial Intelligence
The article introduces the PReference Orchestrator (PRO), a framework designed to enhance the alignment of Large Language Models (LLMs) with diverse human preferences across multiple objectives. Traditional methods rely on manually set preference weights, which can hinder training efficiency and complicate user experience. PRO addresses these challenges by utilizing a lightweight preference adapter that automatically infers prompt-specific preference weights during both training and deployment, thereby improving performance and efficiency.
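The sketch below shows one plausible shape for such a lightweight adapter: a single linear layer mapping a prompt embedding to a softmax over per-objective weights. The dimensions, the objective set, and the reward-combination step are assumptions for illustration, not PRO's actual implementation.

```python
# Sketch of a prompt-aware preference adapter: infer per-objective weights
# from the prompt embedding instead of setting them manually. Illustrative only.
import torch
import torch.nn as nn

class PreferenceAdapter(nn.Module):
    def __init__(self, embed_dim: int, num_objectives: int):
        super().__init__()
        self.proj = nn.Linear(embed_dim, num_objectives)  # lightweight: one layer

    def forward(self, prompt_embedding: torch.Tensor) -> torch.Tensor:
        # Returns a simplex-constrained weight vector, one weight per objective.
        return torch.softmax(self.proj(prompt_embedding), dim=-1)

adapter = PreferenceAdapter(embed_dim=768, num_objectives=3)  # e.g. helpful/harmless/honest
prompt_emb = torch.randn(1, 768)                          # stand-in for an encoder output
weights = adapter(prompt_emb)                             # shape (1, 3), sums to 1
objective_rewards = torch.tensor([[0.9, 0.4, 0.7]])       # per-objective reward scores
scalar_reward = (weights * objective_rewards).sum(dim=-1) # weighted alignment signal
print(weights, scalar_reward)
```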
From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
Neutral · Artificial Intelligence
The article investigates the impact of task framing on the conviction of large language models (LLMs) in dialogue systems. It explores how LLMs assess tasks requiring social judgment, contrasting their performance on factual queries with conversational judgment tasks. The study reveals that reframing a task can significantly alter an LLM's judgment, particularly under conversational pressure, highlighting the complexities of LLM decision-making in social contexts.
Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering
Positive · Artificial Intelligence
Large language models (LLMs) have shown potential in medical question answering but often lack the domain-specific expertise required in emergency medical services (EMS). The study introduces EMSQA, a dataset with 24.3K questions across 10 clinical areas and 4 certification levels, along with knowledge bases containing 40K documents and 2M tokens. It also presents Expert-CoT and ExpertRAG, strategies that enhance performance by integrating clinical context, resulting in improved accuracy and exam pass rates for EMS certification.
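As a rough sketch of an ExpertRAG-style flow, the code below retrieves protocol passages for a question and prepends certification-level context before prompting. The lexical retriever, the prompt template, and the toy knowledge base are illustrative assumptions, not the paper's implementation.

```python
# Sketch: retrieval-augmented EMS question answering with expert-level context.
import re

KNOWLEDGE_BASE = [
    "Epinephrine 1:1,000 IM is indicated for anaphylaxis in adults.",
    "High-flow oxygen is indicated for suspected carbon monoxide exposure.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, k: int = 1) -> list[str]:
    # Toy lexical-overlap retrieval; a real system would use a dense retriever.
    q = tokens(question)
    scored = sorted(KNOWLEDGE_BASE, key=lambda doc: len(q & tokens(doc)), reverse=True)
    return scored[:k]

def build_prompt(question: str, cert_level: str) -> str:
    # Combine certification-level framing (Expert-CoT-style) with retrieved context.
    context = "\n".join(retrieve(question))
    return (f"You are answering as a certified {cert_level} provider.\n"
            f"Relevant protocol excerpts:\n{context}\n"
            f"Question: {question}\nReason step by step, then answer.")

print(build_prompt("What is the first-line drug for anaphylaxis?", "Paramedic"))
```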
Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models
Positive · Artificial Intelligence
The paper titled 'Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models' introduces a method to enhance the efficiency of Mixture-of-Experts (MoE) Large Language Models (LLMs). The authors propose a pre-attention expert prediction technique that improves accuracy and reduces computational overhead by utilizing activations before the attention block. This approach aims to optimize expert prefetching, achieving about a 15% improvement in accuracy over existing methods.
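A minimal sketch of the idea, under the assumption that a small linear head reads the hidden state before the attention block and predicts the router's likely top-k experts so their weights can be fetched early. The predictor architecture and the prefetch stub are illustrative, not the paper's design.

```python
# Sketch: predict MoE experts from pre-attention activations and prefetch their
# weights while the attention block is still computing. Illustrative only.
import torch
import torch.nn as nn

class PreAttentionExpertPredictor(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, num_experts)

    def predict_topk(self, pre_attn_hidden: torch.Tensor, k: int = 2) -> torch.Tensor:
        # Predict expert logits from the pre-attention activation, ahead of routing.
        logits = self.head(pre_attn_hidden)
        return logits.topk(k, dim=-1).indices

def prefetch(expert_ids: torch.Tensor,
             expert_store: dict[int, torch.Tensor]) -> list[torch.Tensor]:
    # Stand-in for an async copy of expert weights from host to device memory.
    return [expert_store[int(i)] for i in expert_ids.flatten().tolist()]

predictor = PreAttentionExpertPredictor(hidden_dim=512, num_experts=8)
hidden = torch.randn(1, 512)                 # activation before the attention block
ids = predictor.predict_topk(hidden)         # experts expected to be routed to
store = {i: torch.zeros(512, 512) for i in range(8)}
weights = prefetch(ids, store)               # fetched while attention still runs
print("prefetched experts:", ids.tolist())
```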
A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge
Neutral · Artificial Intelligence
A recent study published on arXiv examines negative bias in large language models (LLMs): their tendency to produce negative responses in binary decision tasks. The research notes that previous studies focused mainly on identifying negative attention heads that contribute to this bias. The authors introduce a new evaluation pipeline that categorizes responses according to the model's parametric knowledge, revealing that prompt format influences responses more strongly than the semantics of the content itself.
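The sketch below illustrates one way such a categorization step could work, assuming a separate probe reports whether the model's parametric knowledge supports each fact. The category names and the decision rule are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch: bucket binary ("yes"/"no") answers by parametric knowledge so that
# negative bias can be separated from simple knowledge gaps. Illustrative only.
def categorize(answer: str, gold: str, model_knows_fact: bool) -> str:
    answer, gold = answer.lower(), gold.lower()
    if answer == gold:
        return "correct"
    if answer == "no" and model_knows_fact:
        # The model holds the supporting knowledge yet still answers "no":
        # the signature of negative bias rather than a knowledge gap.
        return "negative-bias"
    return "knowledge-gap" if not model_knows_fact else "other-error"

# Toy usage: the same wrong "no" is labeled differently depending on the probe.
print(categorize("no", "yes", model_knows_fact=True))   # -> negative-bias
print(categorize("no", "yes", model_knows_fact=False))  # -> knowledge-gap
```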