Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering

arXiv — cs.CL · Wednesday, November 12, 2025 at 5:00:00 AM
A recent arXiv publication proposes Natural Language Inference (NLI) scoring as an approach to evaluating large language models (LLMs) in question answering. The method achieves an accuracy of 89.9% with the GPT-4o model while being far less resource-intensive than traditional evaluation methods. The paper also introduces DIVER-QA, a benchmark of 3,000 human-annotated samples spanning five datasets and five candidate LLMs, intended as a resource for future research on AI evaluation metrics. The study positions NLI-based evaluation as a competitive alternative, underscoring the importance of cost-effective, human-aligned metrics in the rapidly evolving field of artificial intelligence.
— via World Pulse Now AI Editorial System
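The core idea behind NLI-based QA evaluation can be sketched as follows. This is a hypothetical illustration, not the paper's actual pipeline: a real system would use an NLI model (e.g., an MNLI-finetuned classifier) to score whether the candidate answer entails the gold answer, whereas here a crude token-overlap stand-in keeps the sketch self-contained. The function names, threshold, and sample data are all assumptions for illustration.

```python
# Hypothetical sketch of NLI-based QA evaluation (illustrative only).
# Idea: treat the question plus the model's candidate answer as a premise
# and the gold answer as a hypothesis; count the answer as correct when
# the entailment score clears a threshold.

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Placeholder for a real NLI model's entailment probability.
    Here: a crude token-overlap heuristic so the sketch runs standalone."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)

def nli_correct(question: str, gold: str, candidate: str,
                threshold: float = 0.5) -> bool:
    """Judge a candidate answer via (pseudo-)entailment of the gold answer."""
    premise = f"{question} {candidate}"
    return entailment_prob(premise, gold) >= threshold

# Toy evaluation loop over (question, gold answer, candidate answer) triples.
samples = [
    ("Who wrote Hamlet?", "William Shakespeare", "Shakespeare wrote it."),
    ("Who wrote Hamlet?", "William Shakespeare", "Christopher Marlowe."),
]
accuracy = sum(nli_correct(q, g, c) for q, g, c in samples) / len(samples)
print(accuracy)  # → 0.5
```

Swapping the heuristic for a genuine NLI classifier is what makes this style of metric cheap relative to LLM-as-judge approaches: a single small classifier forward pass per answer, rather than a full generation call.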


Recommended Readings
Chinese toymaker FoloToy suspends sales of its GPT-4o-powered teddy bear, after researchers found the toy gave kids harmful responses, including sexual content (Brandon Vigliarolo/The Register)
Negative · Artificial Intelligence
Chinese toymaker FoloToy has suspended sales of its GPT-4o-powered teddy bear after researchers from PIRG discovered that the toy provided harmful responses to children, including sexual content. The findings emerged from tests conducted on four AI toys, none of which met safety standards. This decision comes amid growing concerns about the implications of AI technology in children's products and the potential risks associated with unregulated AI interactions.
Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages: A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish
Neutral · Artificial Intelligence
A recent study evaluates the performance of seven advanced large language models (LLMs) on low-resource and morphologically rich languages, specifically Cantonese, Japanese, and Turkish, across tasks such as open-domain question answering, document summarization, translation, and culturally grounded dialogue. Despite impressive results in high-resource languages, the study finds that LLM effectiveness in these less-studied languages remains underexplored.
VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models
Positive · Artificial Intelligence
VP-Bench is a newly introduced benchmark designed to evaluate the ability of multimodal large language models (MLLMs) to interpret visual prompts (VPs) in images. This benchmark addresses a significant gap in existing evaluations, as no systematic assessment of MLLMs' effectiveness in recognizing VPs has been conducted. VP-Bench utilizes a two-stage evaluation framework, involving 30,000 visualized prompts across eight shapes and 355 attribute combinations, to assess MLLMs' capabilities in VP perception and utilization.
Semantic VLM Dataset for Safe Autonomous Driving
Positive · Artificial Intelligence
The CAR-Scenes dataset is a newly released frame-level dataset designed for autonomous driving, facilitating the training and evaluation of vision-language models (VLMs) for scene-level understanding. It comprises 5,192 images sourced from Argoverse 1, Cityscapes, KITTI, and nuScenes, annotated using a comprehensive 28-key category/sub-category knowledge base. The dataset includes over 350 attributes and employs a GPT-4o-assisted vision-language pipeline for annotation, ensuring high-quality data through human verification.
LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
Neutral · Artificial Intelligence
A recent study published on arXiv investigates the use of Large Language Models (LLMs), specifically GPT-4o, for grading short-answer quizzes and project reports in an undergraduate Computational Linguistics course. The research involved approximately 50 students and 14 project teams, comparing LLM-generated scores with evaluations from teaching assistants. Results indicated a strong correlation (up to 0.98) with human graders and exact score agreement in 55% of quiz cases, highlighting both the potential and limitations of LLM-based grading systems.
Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness
Neutral · Artificial Intelligence
The paper titled 'Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness' examines the capabilities of large language models (LLMs) in biomedical natural language processing (NLP) tasks. It highlights the sensitivity of LLMs to demonstration selection and the use of retrieval-augmented LLMs (RAL) to mitigate hallucination. However, the authors note a lack of rigorous evaluation of RAL's impact across biomedical NLP tasks, which limits understanding of its capabilities in this domain.