OpenAI Retires SWE-bench Verified: The Coding Benchmark Was Broken All Along

The Benchmark Everyone Used Is Now Officially Retired
Since August 2024, SWE-bench Verified has been the industry's gold standard for measuring AI coding capability. Every major model release — from OpenAI, Anthropic, Google, and others — has cited it. Now OpenAI says it's broken, and they're done reporting scores on it.
In a detailed technical post, OpenAI identified two fatal flaws that make SWE-bench Verified unsuitable for measuring real progress in autonomous software engineering:
- Broken test design — Nearly 60% of the hardest problems in the dataset have material issues in their test design or problem descriptions, making them extremely difficult or impossible to solve even for the best models or experienced human engineers.
- Training data contamination — Because SWE-bench Verified is publicly available, it has been incorporated into the training data of virtually every major AI model. Models are not solving these problems from capability — they're recalling memorized answers.
The conclusion is stark: improvements on SWE-bench Verified over the past six months have not reflected genuine advances in AI coding ability. They have reflected how much exposure the model had to the benchmark during training.
The Contamination Evidence Is Damning
OpenAI didn't just theorize about contamination — they tested it. Using GPT-5 as a red-teaming agent, they probed GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview to see if they had memorized benchmark answers.
The results were unambiguous:
- GPT-5.2 — Given only a snippet from a task description, reproduced the exact gold patch, including the precise class name, method name, and the specific early return condition introduced in the solution.
- Claude Opus 4.5 — Recalled not only the exact 4-line functional change from the PR, but also quoted verbatim the inline comment that was part of the diff.
- Gemini 3 Flash — Given only the task ID (no other information), reproduced verbatim details from the task description and gold patch, including the exact regex formula for username validation and specific line numbers.
This is not a subtle signal. These models have seen the answers. The leaderboard has been a memorization contest.
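A crude version of this kind of memorization probe can be sketched as a verbatim-overlap check between a model's completion and the task's gold patch. Everything below is hypothetical and invented for illustration (the patch text, the 0.95 threshold, the stand-in completions); OpenAI's actual red-teaming used an agent, not a string comparison, but the signal it looks for is the same.

```python
from difflib import SequenceMatcher

# Hypothetical gold patch for a benchmark task (invented for this sketch).
GOLD_PATCH = (
    "def validate_username(name):\n"
    "    if not name:\n"
    "        return False  # early return for empty input\n"
    "    return USERNAME_RE.match(name) is not None\n"
)

def memorization_score(completion: str, gold: str) -> float:
    """Similarity ratio in [0, 1]; values near 1.0 suggest verbatim recall."""
    return SequenceMatcher(None, completion, gold).ratio()

def looks_memorized(completion: str, gold: str, threshold: float = 0.95) -> bool:
    # A high ratio alone is not proof, but exact identifiers and verbatim
    # inline comments reappearing (as in OpenAI's probes) are a strong signal.
    return memorization_score(completion, gold) >= threshold

# Stand-ins for model output: one verbatim recall, one independent solution.
verbatim = GOLD_PATCH
independent = (
    "def check_user(n):\n"
    "    return bool(n) and PATTERN.fullmatch(n) is not None\n"
)

print(looks_memorized(verbatim, GOLD_PATCH))     # True
print(looks_memorized(independent, GOLD_PATCH))  # False
```

An independently derived fix can be behaviorally equivalent yet textually distant; it is the character-for-character match, down to comment text, that distinguishes recall from capability.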
Why the Benchmark Broke Down
SWE-bench Verified was created to fix problems with the original SWE-bench dataset. Expert software engineers reviewed 1,699 problems and filtered them down to 500 that were considered solvable and well-specified. But two residual problems persisted:
Too-narrow tests: Some problems require implementing a specific function name that isn't mentioned in the problem description but is imported directly by the tests. Models produce valid solutions that fail on import errors — through no fault of their own.
Too-wide tests: Other problems are sourced from PRs that fixed multiple issues, but the task description only covers one of them. Models correctly solve the described problem but fail tests that check for the other fixes.
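The "too-narrow" failure mode can be shown with a toy harness. Every name here (`strip_login`, `clean_login`, the patches) is invented for the sketch; the point is only that a behaviorally correct patch fails because the hidden test binds to a function name the task description never mentions.

```python
# Toy illustration of a "too-narrow" hidden test. The task description only
# says "remove surrounding whitespace from logins", but the hidden test
# imports a specific, unmentioned function name.

def run_hidden_test(patch_source: str) -> str:
    namespace: dict = {}
    exec(patch_source, namespace)  # apply the candidate patch
    try:
        # Hidden test binds to `strip_login` by name.
        strip_login = namespace["strip_login"]
    except KeyError:
        return "FAIL (import error)"
    return "PASS" if strip_login("  alice ") == "alice" else "FAIL (behavior)"

# The gold patch happens to use the name the hidden test expects:
gold_patch = "def strip_login(s):\n    return s.strip()\n"

# A model's behaviorally correct fix under a different, reasonable name:
model_patch = "def clean_login(s):\n    return s.strip()\n"

print(run_hidden_test(gold_patch))   # PASS
print(run_hidden_test(model_patch))  # FAIL (import error)
```

The "too-wide" case is the mirror image: the hidden tests would additionally assert fixes from parts of the source PR that the task description never describes, so even a patch that fully solves the stated problem still fails.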
OpenAI's audit of 138 problems that their best model couldn't consistently solve found that 59.4% contained material issues — not model limitations, but benchmark flaws.
What Comes Next
OpenAI now recommends SWE-bench Pro as the replacement. It is not perfect: their contamination pipeline surfaced some memorization cases there too, but those cases are significantly rarer and less egregious than in SWE-bench Verified, and no model was able to reproduce a complete verbatim gold patch from SWE-bench Pro.
Longer term, OpenAI says the industry needs privately authored benchmarks — tasks created by domain experts that never appear in public training data, with solutions evaluated holistically by trained reviewers rather than automated test suites. It's resource-intensive, but increasingly necessary.
The Bottom Line
When the ruler is wrong, every measurement is wrong. For 18 months, the AI industry has been citing a benchmark that was broken at both ends — faulty test design and widespread contamination. This doesn't mean AI coding progress isn't real. It means the industry's primary instrument for measuring that progress was giving false readings.
OpenAI deserves credit for publishing this analysis publicly and recommending that all model developers stop using SWE-bench Verified. Now the question is whether the rest of the industry follows.