From Confidence to Collapse in LLM Factual Robustness
Neutral · Artificial Intelligence
- A new approach to measuring factual robustness in large language models (LLMs) has been introduced. Rather than relying on output-level performance metrics alone, it examines the generation process itself, combining token-distribution entropy with sensitivity to temperature scaling to produce the Factual Robustness Score (FRS); a sketch of this idea follows the list below.
- The development of the FRS is significant as it addresses the limitations of existing evaluation methods, enhancing the reliability of LLMs in critical applications such as question answering.
- This advancement highlights ongoing discussions about the evaluation of LLMs, emphasizing the need for metrics that reflect real-world behavior rather than benchmark accuracy alone.
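
The paper's exact FRS formula is not reproduced in this summary, but the intuition is concrete enough to sketch. The Python snippet below is a minimal, hypothetical illustration, assuming a single next-token logit vector: it treats confidence as normalized entropy at the default temperature and robustness as the spread of entropy under temperature scaling. The function names `token_entropy` and `frs_proxy`, and the way the two terms are combined, are assumptions for illustration, not the authors' definition.

```python
import numpy as np

def token_entropy(logits: np.ndarray, temperature: float = 1.0) -> float:
    """Shannon entropy (in nats) of the softmax over next-token logits."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()                 # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

def frs_proxy(logits: np.ndarray, temperatures=(0.5, 1.0, 1.5)) -> float:
    """Hypothetical FRS-style score in [0, 1]: high when the model is
    confident (low entropy at T=1) AND that confidence is stable under
    temperature scaling (low entropy spread across temperatures)."""
    max_entropy = np.log(len(logits))              # entropy of a uniform dist.
    base = token_entropy(logits, 1.0)
    spread = np.std([token_entropy(logits, t) for t in temperatures])
    confidence = 1.0 - base / max_entropy          # 1.0 = fully peaked
    stability = 1.0 - spread / max_entropy         # 1.0 = temperature-invariant
    return confidence * stability

# A peaked distribution (confident, robust) vs. a flat one (uncertain):
peaked = np.array([10.0, 0.0, 0.0, 0.0])
flat = np.array([1.0, 1.0, 1.0, 1.0])
print(frs_proxy(peaked))   # close to 1
print(frs_proxy(flat))     # close to 0
```

Under these assumptions, a fact the model answers with a sharply peaked distribution that stays peaked as the temperature varies would score near 1, while a flat or temperature-fragile distribution would score near 0.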
— via World Pulse Now AI Editorial System

