MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
MENLO is a new framework for evaluating the native-like quality of responses generated by large language models (LLMs) across 47 languages. Built on a dataset of 6,423 human-annotated prompt-response pairs, it assesses four quality dimensions with high inter-annotator agreement. The findings show that LLM judges benefit from pairwise evaluations and structured rubrics but still fall short of human annotators. The research further suggests that fine-tuning LLMs with reinforcement learning, reward shaping, and multi-task learning can substantially improve their multilingual proficiency, although discrepancies with human judgment persist, indicating that further refinement is needed. The MENLO dataset and evaluation framework are being released to support ongoing research on scalable multilingual evaluation and preference alignment.
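To make the judging setup concrete, the sketch below shows how a rubric-guided pairwise comparison prompt for an LLM judge might be assembled. The dimension names and helper function are illustrative assumptions, not MENLO's actual rubric or code.

```python
# Illustrative sketch only: a rubric-guided pairwise judging prompt.
# Dimension names and this helper are assumptions, not MENLO's rubric.

RUBRIC_DIMENSIONS = [
    "fluency",        # assumed dimension: grammatical, natural phrasing
    "idiomaticity",   # assumed dimension: native-like word choice
    "cultural_fit",   # assumed dimension: locale-appropriate references
    "terminology",    # assumed dimension: correct register and domain terms
]

def build_pairwise_judge_prompt(user_prompt: str, response_a: str,
                                response_b: str, language: str) -> str:
    """Assemble a prompt asking an LLM judge to compare two responses
    dimension by dimension against a structured rubric."""
    rubric_text = "\n".join(f"- {dim}" for dim in RUBRIC_DIMENSIONS)
    return (
        f"You are judging two {language} responses for native-like quality.\n"
        f"For each dimension below, state which response is better:\n"
        f"{rubric_text}\n\n"
        f"User prompt:\n{user_prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Answer 'A', 'B', or 'tie' per dimension, then give an overall verdict."
    )

if __name__ == "__main__":
    print(build_pairwise_judge_prompt(
        "Explain compound interest.", "Respuesta A...", "Respuesta B...", "Spanish"))
```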
— via World Pulse Now AI Editorial System
