Do Large Language Models Truly Understand Cross-cultural Differences?
- Recent research highlights the limitations of existing benchmarks for evaluating the cross-cultural understanding of large language models (LLMs) and proposes a new benchmark, SAGE, that combines scenario-based assessment with cultural theory. SAGE organizes cross-cultural capability into nine dimensions and covers 210 core concepts across 15 real-world scenarios.
- SAGE is significant because it aims to sharpen the evaluation of LLMs' ability to understand and reason about cross-cultural differences, a competency crucial to their effective use in multilingual tasks.
- The initiative reflects a broader trend in AI research toward aligning LLMs with human values and cultural nuances. It addresses concerns about reliability and fairness in sensitive applications, while underscoring the ongoing debate over LLMs' limitations in symbolic reasoning and contextual comprehension.
— via World Pulse Now AI Editorial System
