MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

arXiv cs.LG, Wednesday, November 12, 2025 at 5:00:00 AM
MENLO is a new framework for evaluating the native-like quality of responses generated by large language models (LLMs) across 47 languages. Its dataset of 6,423 human-annotated prompt-response pairs covers four quality dimensions, with high inter-annotator agreement. The findings indicate that LLM judges benefit from pairwise evaluation and structured rubrics but still do not match human annotators. Fine-tuning with reinforcement learning, reward shaping, and multi-task learning yields significant improvements in multilingual proficiency, yet discrepancies with human judgment persist, so further refinement is needed. The MENLO dataset and evaluation framework are being released to support ongoing research in scalable multilingual evaluation and preference alignment.
— via World Pulse Now AI Editorial System

Recommended Readings
EvoLM: In Search of Lost Language Model Training Dynamics
Positive | Artificial Intelligence
EvoLM is a new model suite designed to analyze the training dynamics of language models (LMs) across various stages, including pre-training and fine-tuning. By training over 100 LMs with 1B and 4B parameters, EvoLM provides insights into the effectiveness of design choices and their impact on both language modeling and problem-solving capabilities. Key findings emphasize the diminishing returns of excessive pre-training and the importance of continued pre-training to mitigate forgetting during domain-specific tasks.
Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm
Positive | Artificial Intelligence
The paper addresses the vulnerability of sequential recommenders to adversarial attacks. It focuses on the Profile Pollution Attack (PPA), which subtly contaminates user interactions to induce mispredictions. The authors propose a new method called CREAT, which combines bi-level optimization with reinforcement learning to make such attacks both stealthier and more effective, overcoming limitations of previous methods.
LDC: Learning to Generate Research Idea with Dynamic Control
Positive | Artificial Intelligence
Recent advancements in large language models (LLMs) highlight their potential in automating scientific research ideation. Current methods often produce ideas that do not meet expert standards of novelty, feasibility, and effectiveness. To address these issues, a new framework is proposed that combines Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL) to enhance the quality of generated research ideas through a two-stage approach.
Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning
Positive | Artificial Intelligence
The paper addresses the problem of high-variance return estimates in reinforcement learning algorithms. It highlights that a well-designed behavior policy can collect off-policy data that yields lower-variance return estimates. This finding suggests that on-policy data collection is not optimal for variance, and the authors extend the insight to online reinforcement learning, where policy evaluation and improvement occur simultaneously.
ExPairT-LLM: Exact Learning for LLM Code Selection by Pairwise Queries
Positive | Artificial Intelligence
ExPairT-LLM is introduced as an exact learning algorithm aimed at improving code selection from multiple outputs generated by large language models (LLMs). Traditional code selection algorithms often struggle to identify the correct program due to misidentification of nonequivalent programs or reliance on LLMs that may not always provide accurate outputs. ExPairT-LLM addresses these issues by utilizing pairwise membership and pairwise equivalence queries, enhancing the accuracy of program selection. Evaluations show a significant improvement in success rates over existing algorithms.
Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction
Positive | Artificial Intelligence
The article presents Thinker, a hierarchical thinking model designed to enhance the reasoning capabilities of large language models (LLMs) through multi-turn interactions. Unlike previous methods that relied on end-to-end reinforcement learning without supervision, Thinker allows for a more structured reasoning process by breaking down complex problems into manageable sub-problems. Each sub-problem is represented in both natural language and logical functions, improving the coherence and rigor of the reasoning process.
From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models
Neutral | Artificial Intelligence
Recent advancements in large language models (LLMs) have made reasoning a central benchmark for evaluating intelligence. This article critiques the uniform reasoning strategies employed by current LLMs, which often generate lengthy reasoning for simple tasks while struggling with complex ones. It introduces the concept of adaptive reasoning, in which models adjust their reasoning effort based on task difficulty and uncertainty, and outlines three key contributions to understanding this approach.