Squashing 'fantastic bugs' hidden in AI benchmarks

Tech Xplore — AI & ML · Thursday, December 11, 2025 at 4:03:15 PM
  • A Stanford team reviewing thousands of benchmarks used in AI development has found that roughly 5% may contain significant flaws, raising serious questions about the reliability of the AI systems evaluated against them and about the integrity of AI evaluations more broadly. A sketch of the kinds of automated checks that can surface such flaws appears after this list.
  • The finding matters for Stanford and the broader AI community: flawed benchmarks can mislead researchers and developers, ultimately affecting the performance and safety of AI applications.
  • The result sharpens ongoing debates over benchmark validity and the need for rigorous testing standards. As new models, such as the Allen Institute for AI's Olmo 3, claim leading scores on existing benchmarks, accurate and reliable evaluation metrics become ever more important.
— via World Pulse Now AI Editorial System
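
The article does not describe the Stanford team's tooling, but many benchmark flaws are mechanical enough to catch automatically. The following is a minimal, hypothetical Python sketch of two such checks, duplicate prompts and gold answers leaked into the question text; the data format and heuristics are illustrative assumptions, not the team's actual method.

```python
# Hypothetical sanity checks of the kind that can flag flawed benchmark items.
# The item schema ({"prompt": ..., "answer": ...}) and the heuristics are
# assumptions for illustration, not the Stanford team's actual tooling.
from collections import Counter


def find_duplicate_prompts(items: list[dict]) -> list[str]:
    """Return prompts that appear more than once in the benchmark."""
    counts = Counter(item["prompt"].strip().lower() for item in items)
    return [prompt for prompt, n in counts.items() if n > 1]


def find_answer_leakage(items: list[dict]) -> list[dict]:
    """Return items whose gold answer appears verbatim inside the prompt."""
    return [
        item for item in items
        if item["answer"].strip().lower() in item["prompt"].lower()
    ]


if __name__ == "__main__":
    benchmark = [
        {"prompt": "What is 2 + 2? The answer is 4.", "answer": "4"},
        {"prompt": "Name the capital of France.", "answer": "Paris"},
        {"prompt": "Name the capital of France.", "answer": "Paris"},
    ]
    print("Duplicate prompts:", find_duplicate_prompts(benchmark))
    print("Leaked answers:", [i["prompt"] for i in find_answer_leakage(benchmark)])
```

Checks like these only cover surface-level defects; deeper flaws such as mislabeled ground truth or ambiguous questions still require human review.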


Continue Reading
When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection
Neutral · Artificial Intelligence
A recent study has examined the vulnerability of Large Language Model (LLM)-based scientific reviewers to indirect prompt injection, focusing on the potential to alter peer review decisions from 'Reject' to 'Accept'. This research introduces a new metric, the Weighted Adversarial Vulnerability Score (WAVS), and evaluates 15 attack strategies across 13 LLMs, including GPT-5 and DeepSeek, using a dataset of 200 scientific papers.
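
The summary names the metric but not its formula. As a rough illustration of how a weighted vulnerability score might aggregate per-strategy attack outcomes, here is a minimal Python sketch; the field names, severity weights, and aggregation rule are assumptions for illustration and may differ from the paper's actual WAVS definition.

```python
# A minimal sketch of aggregating attack outcomes into a weighted
# vulnerability score. The weighting scheme, field names, and example
# strategies below are illustrative assumptions, not the paper's WAVS.

def weighted_vulnerability_score(results: list[dict],
                                 weights: dict[str, float]) -> float:
    """Weighted average of attack success, one record per (strategy, trial).

    results: each record has a 'strategy' name and a boolean 'flipped'
             marking whether the review decision changed Reject -> Accept.
    weights: assumed per-strategy severity weights.
    """
    total_weight = 0.0
    score = 0.0
    for r in results:
        w = weights[r["strategy"]]
        score += w * (1.0 if r["flipped"] else 0.0)
        total_weight += w
    return score / total_weight if total_weight else 0.0


if __name__ == "__main__":
    # Hypothetical trial data for two made-up attack strategies.
    trials = [
        {"strategy": "hidden_instruction", "flipped": True},
        {"strategy": "hidden_instruction", "flipped": False},
        {"strategy": "citation_bait", "flipped": True},
    ]
    severity = {"hidden_instruction": 2.0, "citation_bait": 1.0}
    print(f"Weighted score: {weighted_vulnerability_score(trials, severity):.3f}")
```

Weighting by strategy severity means a single high-impact injection counts for more than several low-impact ones, which is one plausible rationale for a weighted rather than a plain success-rate metric.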
