arXiv:2505.20411v2
Abstract: LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt their behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks hampers the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination. To address these limitations, we introduce a novel, automated, and scalable pipeline that continuously extracts real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected with the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark with their results on SWE-bench Verified and show that the performance of some language models might be inflated due to contamination.

SWE-rebench introduces an automated pipeline that collects interactive software engineering tasks from GitHub repositories and uses them both to train and to evaluate software engineering agents. It addresses two problems: the scarcity of high-quality training data that mirrors real-world scenarios, in which agents must interact with development environments and adapt their behavior based on outcomes, and the contamination of static benchmarks.

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
