UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models

arXiv — cs.CL · Wednesday, November 12, 2025 at 5:00:00 AM
The introduction of UniEdit marks a significant step forward for knowledge editing in large language models (LLMs): it provides a unified benchmark designed to overcome a key limitation of current editing datasets, which are often restricted to narrow knowledge domains. Using a Neighborhood Multi-hop Chain Sampling (NMCS) algorithm, UniEdit samples subgraphs spanning 25 common domains, so the benchmark covers both a broad range of editing demands and the diverse ripple effects that an edit can trigger in related facts. Proprietary LLMs then convert the sampled knowledge subgraphs into natural-language text, ensuring grammaticality and syntactic diversity. Extensive statistical analysis confirms the scale and comprehensiveness of the UniEdit benchmark, while experiments across multiple LLMs and editors characterize their performance. This initiative is essential for advancing…
— via World Pulse Now AI Editorial System
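
The summary describes NMCS only at a high level. Purely as an illustration, neighborhood multi-hop chain sampling over a knowledge graph might look like the sketch below; the toy graph, function names, and hop policy are all assumptions, not details from the paper.

```python
import random
from collections import defaultdict

# Toy knowledge graph: head -> list of (relation, tail) triples.
# Contents are illustrative, not drawn from the UniEdit dataset.
GRAPH = defaultdict(list)
for h, r, t in [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "EU"),
    ("EU", "headquartered_in", "Brussels"),
    ("Paris", "located_on", "Seine"),
]:
    GRAPH[h].append((r, t))

def sample_multihop_chain(graph, seed_entity, max_hops=3):
    """Walk outward from a seed entity, collecting a chain of triples.

    Each hop follows a randomly chosen outgoing edge, approximating the
    'neighborhood multi-hop chain' idea: the sampled subgraph gathers
    facts whose edits could ripple into one another.
    """
    chain, current = [], seed_entity
    for _ in range(max_hops):
        edges = graph.get(current)
        if not edges:
            break
        relation, tail = random.choice(edges)
        chain.append((current, relation, tail))
        current = tail
    return chain

if __name__ == "__main__":
    print(sample_multihop_chain(GRAPH, "Paris"))
    # e.g. [('Paris', 'capital_of', 'France'), ('France', 'member_of', 'EU'), ...]
```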


Recommended Readings
Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities
Positive · Artificial Intelligence
The study presents a benchmark for detecting long-term periodic workflows in human activities, addressing a gap in existing research. It includes 580 multimodal activity sequences and supports tasks such as unsupervised workflow detection and procedural anomaly detection. The proposed lightweight model aims to enhance understanding of complex human behaviors over extended periods.
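
The summary does not specify the detection model. As a point of reference, a common unsupervised baseline for recovering a dominant period from an activity signal is autocorrelation; the sketch below uses that baseline, with the function name and the toy signal invented for illustration.

```python
import numpy as np

def dominant_period(activity, min_lag=2):
    """Estimate the dominant period of a 1-D activity signal.

    Returns the lag of the highest autocorrelation peak beyond min_lag;
    a standard baseline, not the paper's proposed lightweight model.
    """
    x = np.asarray(activity, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    ac /= ac[0]                                        # normalize by lag-0
    return int(np.argmax(ac[min_lag:]) + min_lag)

# A toy activity trace: one "event" every 24 steps, plus noise.
rng = np.random.default_rng(0)
signal = np.zeros(240)
signal[::24] = 1.0
signal += 0.05 * rng.standard_normal(240)
print(dominant_period(signal))  # -> 24
```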
Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Neutral · Artificial Intelligence
The paper presents a new metric, Physics-Constrained Multimodal Data Evaluation (PCMDE), aimed at improving the evaluation of multimodal synthetic images. Current metrics such as BLEU and CIDEr often fail to capture semantic and structural correctness, particularly in specialized domains. PCMDE integrates large language models with reasoning and vision-language models to enhance feature extraction, validation, and physics-guided reasoning.
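
The summary lists PCMDE's ingredients but not its formula. As a loose illustration only, a physics-constrained score could blend a semantic-alignment term with a penalty for violated physical constraints; the schema, weights, and the toy "support" constraint below are all assumptions, not the paper's definition.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object in a synthetic image (toy schema)."""
    label: str
    supported: bool  # whether something sits beneath it

def physics_violations(objects):
    """Count violations of a toy constraint: objects need support."""
    return sum(1 for o in objects if not o.supported)

def pcmde_like_score(semantic_sim, objects, alpha=0.7):
    """Blend semantic similarity with a physics-consistency penalty.

    semantic_sim: caption/image alignment in [0, 1], assumed to come
    from a vision-language model and precomputed here.
    """
    consistency = 1.0 / (1.0 + physics_violations(objects))
    return alpha * semantic_sim + (1 - alpha) * consistency

scene = [Detection("cup", supported=True),
         Detection("ball", supported=False)]     # a floating ball
print(round(pcmde_like_score(0.85, scene), 3))   # 0.745
```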
Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings
Positive · Artificial Intelligence
Hierarchical Token Prepending (HTP) is a proposed method aimed at improving the information flow in decoder-based large language model (LLM) embeddings. Traditional models face limitations due to causal attention mechanisms that hinder backward information flow, particularly affecting long documents. HTP addresses this by introducing block-level summary tokens and replacing last-token pooling with mean-pooling, resulting in enhanced performance across various datasets.
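
Of the two ingredients named, the pooling swap is simple to sketch. The toy code below contrasts last-token pooling with mean-pooling and fakes block-level summary tokens as prepended block means; shapes and names are illustrative, not the paper's implementation.

```python
import numpy as np

def last_token_pool(hidden):  # hidden: (seq_len, dim)
    """Baseline: the final token's hidden state is the embedding."""
    return hidden[-1]

def mean_pool(hidden):
    """HTP-style replacement: average over all token states, so early
    tokens contribute despite causal attention limiting backward flow."""
    return hidden.mean(axis=0)

def prepend_block_summaries(blocks):
    """Toy stand-in for block-level summary tokens: prepend each block's
    mean vector to the sequence before pooling."""
    summaries = [b.mean(axis=0) for b in blocks]
    return np.vstack(summaries + blocks)

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((16, 8)) for _ in range(3)]  # 3 blocks, 16 tokens each
seq = prepend_block_summaries(blocks)
print(seq.shape)             # (51, 8): 3 summary rows + 48 token rows
print(mean_pool(seq).shape)  # (8,)
```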
Bias after Prompting: Persistent Discrimination in Large Language Models
Negative · Artificial Intelligence
A recent study challenges the assumption that biases do not transfer from pre-trained large language models (LLMs) to adapted models. The research indicates that biases can persist through prompting, a common adaptation strategy, with strong correlations observed across demographics such as gender, age, and religion. This finding raises concerns about the effectiveness of current bias mitigation methods in LLMs.
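
One way to read the reported correlations: a model's per-group bias scores before adaptation predict its scores after prompting. A minimal check of that pattern, with entirely invented scores, might look like this:

```python
import numpy as np

# Hypothetical bias scores per demographic axis (higher = more biased),
# measured on the pre-trained model and again after prompt-based adaptation.
axes       = ["gender", "age", "religion", "nationality"]
pretrained = np.array([0.42, 0.31, 0.55, 0.27])
prompted   = np.array([0.39, 0.28, 0.51, 0.30])

# Pearson correlation near 1.0 means bias persisted through prompting.
r = np.corrcoef(pretrained, prompted)[0, 1]
print(f"pre/post-prompting bias correlation: {r:.2f}")
```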
Towards Alignment-Centric Paradigm: A Survey of Instruction Tuning in Large Language Models
Positive · Artificial Intelligence
Instruction tuning is a crucial method for aligning large language models (LLMs) with human intentions and safety requirements. This survey outlines the entire process, including data collection methods, fine-tuning strategies, and evaluation protocols. It categorizes data construction into expert annotation, distillation from larger models, and self-improvement mechanisms, each with unique trade-offs. The study also addresses challenges in evaluating model performance across multilingual and multimodal contexts.
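
The three data-construction routes the survey names differ mainly in where instructions and responses come from. A schematic record for each route, with invented contents and one common fine-tuning serialization, is sketched below:

```python
# Each route produces the same (instruction, response) schema; only the
# provenance differs. All example contents here are invented.
expert_annotated = {
    "instruction": "Explain why the sky is blue.",
    "response": "Shorter wavelengths scatter more (Rayleigh scattering)...",
    "source": "human expert",
}
distilled = {
    "instruction": "Explain why the sky is blue.",
    "response": None,  # filled by querying a larger teacher model
    "source": "distillation",
}
self_improved = {
    "instruction": None,  # model generates new instructions from seed tasks,
    "response": None,     # answers them, then filters for quality
    "source": "self-improvement",
}

def to_training_text(record):
    """Flatten a record into one common supervised fine-tuning format."""
    return f"### Instruction:\n{record['instruction']}\n\n### Response:\n{record['response']}"

print(to_training_text(expert_annotated))
```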
LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs
Positive · Artificial Intelligence
LiveCLKTBench is an automated generation pipeline designed to evaluate cross-lingual knowledge transfer in large language models (LLMs). It isolates and measures knowledge transfer by identifying time-sensitive knowledge entities, filtering them based on temporal occurrence, and generating factual questions translated into multiple languages. The evaluation of several LLMs across five languages reveals that cross-lingual transfer is influenced by linguistic distance and is often asymmetric.
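
The pipeline steps in the summary map onto a few stages. The skeleton below mirrors that flow with stub functions standing in for the paper's actual components; the cutoff year, entities, and languages are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    first_seen: int  # year the fact entered the world

def is_time_sensitive(entity, cutoff_year=2023):
    """Keep entities that appeared after the model's training cutoff,
    so correct answers cannot come from pre-training in any language."""
    return entity.first_seen > cutoff_year

def make_question(entity):
    # Stand-in for factual question generation.
    return f"What is {entity.name} known for?"

def translate(question, lang):
    # Stand-in for machine translation into the target language.
    return f"[{lang}] {question}"

entities = [Entity("NewLaunchX", 2024), Entity("OldFactY", 2001)]
languages = ["de", "zh", "sw"]

benchmark = [
    translate(make_question(e), lang)
    for e in entities if is_time_sensitive(e)
    for lang in languages
]
print(benchmark)  # questions for NewLaunchX only, in three languages
```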
Failure to Mix: Large language models struggle to answer according to desired probability distributions
Negative · Artificial Intelligence
Recent research indicates that large language models (LLMs) struggle to generate outputs that align with specified probability distributions. Experiments revealed that when asked to produce binary outputs with a target probability, LLMs consistently failed to meet these expectations, often defaulting to the most probable answer. This behavior undermines the probabilistic exploration necessary for scientific idea generation and selection, raising concerns about the effectiveness of current AI training methodologies.
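
The protocol is straightforward to outline: request a binary answer at a target probability many times and compare the empirical frequency to the target. The sketch below fakes the model with a stub that collapses to the modal answer, mimicking the reported failure, next to a calibrated reference sampler; both are assumptions, not the paper's code.

```python
import random

def fake_llm_binary(target_p):
    """Stub for an LLM told 'answer A with probability target_p, else B'.
    Real models reportedly collapse to the modal answer; mimic that."""
    return "A" if target_p >= 0.5 else "B"

def calibrated_sampler(target_p):
    """What a faithful probabilistic responder would do."""
    return "A" if random.random() < target_p else "B"

def freq(sampler, target_p, n=2000):
    """Empirical P(A) over n trials."""
    return sum(sampler(target_p) == "A" for _ in range(n)) / n

for p in (0.3, 0.7):
    print(f"target={p:.1f}  llm-like={freq(fake_llm_binary, p):.2f}  "
          f"calibrated={freq(calibrated_sampler, p):.2f}")
# The llm-like column collapses to 0.00 or 1.00, which is the
# failure mode the paper reports for real LLMs.
```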
DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning
Positive · Artificial Intelligence
DataSage is a novel multi-agent framework designed to enhance insight discovery in data analytics. It addresses limitations of existing data insight agents by incorporating external knowledge retrieval, a multi-role debating mechanism, and multi-path reasoning. These features aim to improve the depth of analysis and the accuracy of insights generated, thereby assisting organizations in making informed decisions in a data-driven environment.
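
Of the three mechanisms, multi-role debating is the most self-contained to sketch. The toy round below uses canned functions where DataSage would use separate LLM agents; the roles, messages, and flow are illustrative assumptions, not the framework's implementation.

```python
# Toy multi-role debate over a shared transcript. Each "role" would be
# a separate LLM agent in a real multi-agent system; here they are
# canned functions, so the control flow is the only real content.
def analyst(transcript):
    return "analyst: hypothesis - churn rose because of the March price change"

def skeptic(transcript):
    return f"skeptic: what evidence rules out seasonality? (re: {transcript[-1]!r})"

def synthesizer(transcript):
    return "synthesizer: keep hypothesis, flag seasonality check as follow-up"

def debate(question, roles=(analyst, skeptic, synthesizer), rounds=1):
    """Run each role over the shared transcript in turn; the final turn
    is the debated insight handed back to the main analytics loop."""
    transcript = [f"user: {question}"]
    for _ in range(rounds):
        for role in roles:
            transcript.append(role(transcript))
    return transcript

for turn in debate("Why did weekly active users drop in March?"):
    print(turn)
```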