SWE-Bench Pro: an enterprise-grade software engineering benchmark
Abstract
We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds on the best practices of SWE-Bench while specifically targeting realistic, complex, enterprise-level problems beyond the scope of earlier evaluations.
1. Introduction
Modern code-generation systems are typically evaluated on isolated functions or short-form competitive programming tasks. These setups miss the messy reality of enterprise software, where issues span many files, depend on third-party APIs, and require reading thousands of lines of context.
SWE-Bench Pro comprises 412 issues from 23 repositories, each annotated with a human-verified patch and a regression-test suite. Tasks are filtered so that the median solution touches 4.2 files and 137 lines of code, roughly an order of magnitude more than the original SWE-Bench.
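To make the size filter concrete, the sketch below counts files touched and lines changed in a unified diff; the helper names and thresholds are illustrative assumptions, not the benchmark's released tooling.

```python
# Illustrative sketch of the size filter; the thresholds here are
# assumptions for exposition, not the benchmark's actual cutoffs.
def patch_stats(diff_text: str) -> tuple[int, int]:
    """Return (files_touched, lines_changed) for a unified diff."""
    files: set[str] = set()
    changed = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            files.add(line[len("+++ b/"):])
        elif line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            changed += 1
    return len(files), changed

def passes_size_filter(diff_text: str, min_files: int = 2, min_lines: int = 50) -> bool:
    """Keep only tasks whose ground-truth patch is large enough."""
    files, lines = patch_stats(diff_text)
    return files >= min_files and lines >= min_lines
```

Note that counting "+++ b/" headers covers added and modified files but misses pure deletions (whose new-file header is "/dev/null"); a production filter would track those as well.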
2. Benchmark construction
We sampled candidate issues from the public bug-tracker history of widely used Python and TypeScript projects. Each candidate was scored by a reproducibility heuristic (Pass@1 on the regression suite with the ground-truth patch applied) and reviewed by two annotators.
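A minimal sketch of this reproducibility check, assuming a clean git checkout and a command-line test runner such as pytest; the function and argument names are hypothetical.

```python
import subprocess

def reproduces(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply the ground-truth patch to a clean checkout, then run the
    regression suite; True means the suite passes end to end."""
    # patch_file should be an absolute path, since git resolves it
    # relative to the repository once -C changes directories.
    applied = subprocess.run(
        ["git", "-C", repo_dir, "apply", patch_file],
        capture_output=True,
    )
    if applied.returncode != 0:
        return False  # the patch no longer applies cleanly
    # A zero exit code from the test runner means every test passed.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```

For example, `reproduces("/tmp/checkout", "/abs/path/fix.patch", ["pytest", "-q"])` returns False either when the patch fails to apply or when any regression test fails, so both breakages are filtered out in one step.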
3. Results
On Pass@1, the strongest current model resolves 18.4% of SWE-Bench Pro issues, compared to 49.7% on the original SWE-Bench. Failure analysis shows that long-context retrieval and multi-file edits are the dominant bottlenecks; raw reasoning is rarely the limiting factor.
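For reference, Pass@1 is the k = 1 case of the standard unbiased pass@k estimator used in code-generation evaluation; a minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn without replacement from n samples with c correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to c / n, the fraction of correct samples.
assert pass_at_k(10, 5, 1) == 0.5
```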
