LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs

arXiv — cs.CL · Thursday, November 20, 2025 at 5:00:00 AM
  • LiveCLKTBench has been introduced to enhance the evaluation of cross-lingual knowledge transfer in multilingual large language models (LLMs).
  • This development is significant because it provides a systematic way to assess LLMs, which are increasingly deployed in diverse applications, and to verify their reliability across different languages.
  • The findings highlight ongoing challenges in LLM performance, particularly regarding linguistic distance and transfer asymmetry (illustrated in the sketch below), which resonate with broader discussions on the effectiveness and limitations of LLMs in multilingual contexts.
— via World Pulse Now AI Editorial System
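
The notions of transfer and transfer asymmetry can be made concrete with a small calculation. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical `scores[src][tgt]` table of accuracies when knowledge is injected in a source language and queried in a target language, and the numbers are purely illustrative.

```python
# Illustrative sketch of measuring cross-lingual transfer asymmetry.
# scores[src][tgt] is hypothetical: accuracy when facts are taught in
# `src` and the model is then queried in `tgt`.
scores = {
    "en": {"en": 0.92, "de": 0.81, "zh": 0.64},
    "de": {"en": 0.78, "de": 0.90, "zh": 0.58},
    "zh": {"en": 0.61, "de": 0.52, "zh": 0.88},
}

def transfer_rate(src: str, tgt: str) -> float:
    """Transfer accuracy normalized by in-language accuracy."""
    return scores[src][tgt] / scores[src][src]

def asymmetry(a: str, b: str) -> float:
    """Positive if knowledge flows more easily from a to b than back."""
    return transfer_rate(a, b) - transfer_rate(b, a)

for a, b in [("en", "de"), ("en", "zh"), ("de", "zh")]:
    print(f"{a}->{b}: {transfer_rate(a, b):.2f}, "
          f"asymmetry {asymmetry(a, b):+.2f}")
```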


Recommended Readings
HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning
Positive · Artificial Intelligence
HSKBenchmark introduces a novel benchmark for modeling and assessing Chinese second language acquisition (SLA) using large language models (LLMs). This benchmark addresses the challenges of traditional language acquisition experiments, which are often impractical and ethically complex. HSKBenchmark encompasses HSK levels 3 to 6, featuring authentic textbooks and a comprehensive evaluation system, thereby making SLA modeling with LLMs more interpretable and scalable.
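
Curriculum tuning, in essence, means presenting training material in order of difficulty. A minimal sketch of that ordering follows, assuming a hypothetical list of (HSK level, text) pairs; the fine-tuning step itself is stubbed out, and none of the names below come from the paper.

```python
# Hypothetical curriculum-tuning loop: fine-tune on HSK levels 3..6 in
# order, so easier material is seen before harder material.
corpus = [
    (4, "text at HSK level 4 ..."),
    (3, "text at HSK level 3 ..."),
    (6, "text at HSK level 6 ..."),
    (5, "text at HSK level 5 ..."),
]

def fine_tune(model, texts):
    # Stand-in for one fine-tuning stage on `texts`.
    return model

model = object()  # placeholder for a pretrained LLM
for level in range(3, 7):
    stage = [text for lvl, text in corpus if lvl == level]
    model = fine_tune(model, stage)
```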
Bias after Prompting: Persistent Discrimination in Large Language Models
Negative · Artificial Intelligence
A recent study challenges the assumption that biases do not transfer from pre-trained large language models (LLMs) to adapted models. The research indicates that biases can persist through prompting, a common adaptation strategy, with strong correlations observed across demographics such as gender, age, and religion. This finding raises concerns about the effectiveness of current bias mitigation methods in LLMs.
ProRAC: A Neuro-symbolic Method for Reasoning about Actions with LLM-based Progression
Positive · Artificial Intelligence
ProRAC (Progression-based Reasoning about Actions and Change) is a neuro-symbolic framework that utilizes large language models (LLMs) to address reasoning about actions and change (RAC) problems. The framework extracts the essential elements of a RAC problem, executes the actions progressively to determine the final state, and evaluates queries against this state. Evaluations on various RAC benchmarks indicate that ProRAC performs strongly across diverse tasks and domains.
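
The progression idea, applying each action to a symbolic state and then answering the query against the final state only, can be sketched in a few lines. This is a toy illustration, not the actual ProRAC implementation (which delegates extraction and progression to an LLM); the actions and state fields are invented.

```python
# Toy progression-style reasoning about actions and change (RAC).
# In ProRAC an LLM extracts the initial state and actions; here both
# are hard-coded to show only the progressive state-update loop.
state = {"door_open": False, "robot_room": "A"}

def apply(state: dict, action: str) -> dict:
    """Progress the state by one action (toy action semantics)."""
    new = dict(state)
    if action == "open_door":
        new["door_open"] = True
    elif action == "move_to_B" and state["door_open"]:
        new["robot_room"] = "B"
    return new

for action in ["open_door", "move_to_B"]:
    state = apply(state, action)

# Evaluate the query against the final state, not the full history.
print(state["robot_room"] == "B")  # True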
Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models
Positive · Artificial Intelligence
Large language models (LLMs) have shown impressive capabilities across various tasks, but their extensive size complicates real-world applications. Traditional pruning methods, like Wanda, require significant manual effort and expert knowledge, leading to high costs. This study introduces AutoPrune, a self-pruning method that allows LLMs to autonomously design optimal pruning algorithms, addressing the challenges of expert dependency and performance degradation due to uniform sparsity.
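
For context, the Wanda criterion mentioned above scores each weight by the product of its magnitude and the L2 norm of its input activations, then drops the lowest-scoring weights within each output row. A minimal NumPy sketch of that baseline follows; the shapes and 50% sparsity level are illustrative, and AutoPrune's self-designed pruning policies are not reproduced here.

```python
import numpy as np

# Wanda-style pruning score: |weight| * L2 norm of its input feature.
# Shapes are illustrative: W maps 8 input features to 4 outputs,
# X holds activations for 16 tokens.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # (out_features, in_features)
X = rng.normal(size=(16, 8))         # (tokens, in_features)

feature_norms = np.linalg.norm(X, axis=0)   # per input feature
score = np.abs(W) * feature_norms           # broadcast over rows

# Zero out the lowest-scoring half of each output row (50% sparsity);
# Wanda compares weights within a row rather than globally.
k = W.shape[1] // 2
idx = np.argsort(score, axis=1)[:, :k]
mask = np.ones_like(W, dtype=bool)
np.put_along_axis(mask, idx, False, axis=1)
W_pruned = W * mask
print(f"sparsity: {1 - mask.mean():.2f}")
```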
Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings
Positive · Artificial Intelligence
Hierarchical Token Prepending (HTP) is a proposed method aimed at improving the information flow in decoder-based large language model (LLM) embeddings. Traditional models face limitations due to causal attention mechanisms that hinder backward information flow, particularly affecting long documents. HTP addresses this by introducing block-level summary tokens and replacing last-token pooling with mean-pooling, resulting in enhanced performance across various datasets.
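
The pooling change is easy to illustrate: under causal attention, the last token is the only position that has attended to the whole sequence, whereas mean-pooling aggregates every position. A minimal PyTorch sketch of the two pooling strategies follows; the tensor shapes are illustrative, and HTP's block-level summary tokens are omitted.

```python
import torch

# hidden: (batch, seq_len, dim) decoder outputs; mask: 1 for real tokens.
batch, seq_len, dim = 2, 5, 8
hidden = torch.randn(batch, seq_len, dim)
mask = torch.tensor([[1, 1, 1, 1, 0],
                     [1, 1, 0, 0, 0]], dtype=torch.float)

# Last-token pooling: embedding = hidden state of the final real token.
last_idx = mask.sum(dim=1).long() - 1                # (batch,)
last_pool = hidden[torch.arange(batch), last_idx]    # (batch, dim)

# Mean-pooling: average over real tokens, letting information from
# every position reach the embedding despite causal attention.
m = mask.unsqueeze(-1)                               # (batch, seq_len, 1)
mean_pool = (hidden * m).sum(dim=1) / m.sum(dim=1)   # (batch, dim)

print(last_pool.shape, mean_pool.shape)
```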
Investigating Hallucination in Conversations for Low Resource Languages
Neutral · Artificial Intelligence
Large Language Models (LLMs) have shown exceptional ability in text generation but often produce factually incorrect statements, known as 'hallucinations'. This study investigates hallucinations in conversational data across three low-resource languages: Hindi, Farsi, and Mandarin. The analysis of various LLMs, including GPT-3.5 and GPT-4o, reveals that while Mandarin has few hallucinated responses, Hindi and Farsi exhibit significantly higher rates of inaccuracies.
MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
Positive · Artificial Intelligence
MedBench v4 introduces a comprehensive benchmarking framework for evaluating Chinese medical language models, multimodal models, and intelligent agents. This cloud-based infrastructure features over 700,000 expert-curated tasks across various medical specialties. The evaluation process includes multi-stage refinement and clinician reviews, with results indicating that while base LLMs score an average of 54.1/100, safety and ethics ratings remain low at 18.4/100.
HalluClean: A Unified Framework to Combat Hallucinations in LLMs
Positive · Artificial Intelligence
HalluClean is a new framework designed to detect and correct hallucinations in large language models (LLMs). This task-agnostic approach enhances the reliability of LLM-generated text by decomposing the process into planning, execution, and revision stages. HalluClean utilizes minimal task-routing prompts for zero-shot generalization across various domains, significantly improving factual consistency in outputs.
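
The plan, execute, revise decomposition can be sketched as a simple pipeline around a generic LLM call. Everything below is hypothetical: `call_llm` stands in for any text-completion API, and the prompts are illustrative rather than the task-routing prompts HalluClean actually uses.

```python
# Hypothetical plan -> execute -> revise pipeline in the spirit of
# HalluClean; `call_llm` is a stand-in for any text-completion API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def hallucination_check(draft: str, source: str) -> str:
    # Planning: decide which claims in the draft need verification.
    plan = call_llm(
        f"List the factual claims in this text that need checking:\n{draft}"
    )
    # Execution: verify each planned claim against the source text.
    verdicts = call_llm(
        f"Source:\n{source}\n\nClaims:\n{plan}\n\n"
        "Mark each claim supported or unsupported."
    )
    # Revision: rewrite the draft, keeping only supported claims.
    return call_llm(
        f"Rewrite this text, removing claims marked unsupported:\n"
        f"Text:\n{draft}\n\nVerdicts:\n{verdicts}"
    )
```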