SWE-Bench Verified Is Dead — Long Live SWE-Bench Pro


In late February 2026, OpenAI dropped a bombshell: it would no longer report scores on SWE-Bench Verified, the benchmark that had been the de facto standard for measuring AI coding ability for over a year. The reason — two critical flaws that undermine nearly every leaderboard number published since Verified launched.

If you’ve been following the “80% on SWE-bench” milestones [1], this changes how you should read them.

What OpenAI Found

OpenAI audited 138 problems from SWE-Bench Verified that o3 couldn’t solve consistently. Each was reviewed by six or more experienced engineers. The results were damning:

  • 35.5% had “narrow” tests [1] — they enforced implementation details (like exact function names) that weren’t in the problem description.
  • 18.8% had “wide” tests [1] — they checked for functionality not mentioned anywhere in the issue.
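  • Total: 59.4% of the audited problems had materially flawed test cases. [1] The two categories above account for 54.3 points of that figure; the remaining 5.1 points are not broken down here.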

A single example tells the story. In the pylint-dev__pylint-4551 task, the PR introduced a get_annotation function, and the test imported get_annotation directly. Any correct solution that used a different function name failed with an ImportError. The model was being graded on whether it guessed (or remembered) the naming used in the gold patch, not on whether it solved the problem.
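The failure mode is easy to reproduce in miniature. The sketch below is not the actual pylint test: the module names candidate_a and candidate_b, the helper name resolve_annotation, and the toy behavior are all made up for illustration. It only shows how a test that imports a specific name rejects a behaviorally identical solution.

```python
import sys
import types


def install_module(name, **functions):
    """Register a throwaway in-memory module so it can be imported by name."""
    module = types.ModuleType(name)
    for fn_name, fn in functions.items():
        setattr(module, fn_name, fn)
    sys.modules[name] = module


def annotation_of(node):
    """Toy behavior shared by both candidates: return the annotation or ''."""
    return node.get("annotation") or ""


# Candidate A happens to use the same helper name as the gold patch.
install_module("candidate_a", get_annotation=annotation_of)
# Candidate B behaves identically but chose a different (hypothetical) name.
install_module("candidate_b", resolve_annotation=annotation_of)


def run_narrow_test(module_name):
    """Mimic a test file that starts with `from <module> import get_annotation`."""
    namespace = {}
    try:
        exec(f"from {module_name} import get_annotation", namespace)
    except ImportError as exc:
        # The candidate is rejected before any behavior is checked.
        return f"ERROR: {type(exc).__name__}"
    passed = namespace["get_annotation"]({"annotation": "int"}) == "int"
    return "PASS" if passed else "FAIL"


print(run_narrow_test("candidate_a"))  # PASS
print(run_narrow_test("candidate_b"))  # ERROR: ImportError
```

Both candidates run the same function body. Only the one that matches the gold patch's naming ever reaches the assertion.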

Contamination Confirmed

Worse, every frontier model tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce gold patches verbatim, which points to exposure during training. Given only a task ID, Gemini 3 Flash reconstructed the full diff for django__django-11099, including exact line numbers. The benchmark had become a memory test.
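If you want a rough version of this check in your own harness, one option is to measure line-level overlap between a model's patch and the gold patch. The sketch below is an assumption about how such a probe could look, not OpenAI's method. The function names, the 0.9 threshold, and the toy diffs are all hypothetical.

```python
import difflib


def patch_overlap(candidate: str, gold: str) -> float:
    """Line-level similarity in [0, 1] between two unified diffs."""
    cand_lines = [line.rstrip() for line in candidate.strip().splitlines()]
    gold_lines = [line.rstrip() for line in gold.strip().splitlines()]
    matcher = difflib.SequenceMatcher(None, cand_lines, gold_lines, autojunk=False)
    return matcher.ratio()


def looks_memorized(candidate: str, gold: str, threshold: float = 0.9) -> bool:
    """Flag a patch for manual review; the threshold is an arbitrary assumption."""
    return patch_overlap(candidate, gold) >= threshold


# Toy diffs, invented for this example (not taken from any benchmark task).
GOLD = """\
--- a/app/util.py
+++ b/app/util.py
@@ -10,3 +10,3 @@
 def clamp(x, lo, hi):
-    return max(lo, x)
+    return max(lo, min(x, hi))
"""

INDEPENDENT = """\
--- a/app/util.py
+++ b/app/util.py
@@ -10,3 +10,5 @@
 def clamp(x, lo, hi):
-    return max(lo, x)
+    if x > hi:
+        return hi
+    return max(lo, x)
"""

print(looks_memorized(GOLD, GOLD))         # identical patch is flagged
print(looks_memorized(INDEPENDENT, GOLD))  # different fix for the same bug is not
```

Treat a high score as a signal, not proof. Very short patches can match by coincidence, and Verified's median change is only 4 lines, so the strongest evidence remains a model reproducing a diff from the task ID alone.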

Enter SWE-Bench Pro

Scale AI’s SWE-Bench Pro is now the recommended replacement, and it is substantially harder:

Dimension            | SWE-Bench Verified | SWE-Bench Pro
Tasks                | 500                | 1,865
Repositories         | 12 (Python only)   | 41 (Python, Go, TS, JS)
Avg lines changed    | 11 (median: 4)     | 107.4
Top score (May 2026) | 80.9%              | ~57%

Claude Opus 4.5 scores 80.9% on Verified and 45.9% on Pro’s standardized SEAL leaderboard [2]. Same model, 35 points lower, or a little over half the score. That gap isn’t noise: Pro’s tasks are larger, span more languages, and are far less likely to have been memorized, so the number sits closer to a measure of actual engineering ability.

What This Means for Eval Harness Design

The SWE-Bench Verified saga teaches several hard lessons for anyone building eval harnesses:

  1. Public datasets will be contaminated. If your benchmark problems are in public repos, frontier models have seen them. Password-protected datasets and canary strings are table stakes. SWE-Bench Pro uses held-out and proprietary codebases explicitly to resist this.

  2. Test cases must be implementation-agnostic. If your tests import a specific function name from the expected solution, you’re not testing correctness — you’re testing retrieval. The fix: structure tests around behavior, not implementation.

  3. Scaffolding can matter as much as the model. On SWE-Bench Pro, Claude Opus 4.5 scored 45.9% with standardized scaffolding but 55.4% with Anthropic’s custom Claude Code scaffold [2], a 9.5-point gap that comes from context management and tooling rather than from the model. Three different agent systems running the same model scored between 50.2% and 55.4%. The agent architecture is now a differentiator in its own right.
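To make lesson 2 concrete, here is a minimal sketch of grading by behavior instead of by internal names. Everything in it is illustrative: the render entry point, the node format, and the expected strings are assumptions, standing in for whatever public interface and observable behavior the issue text actually describes.

```python
from typing import Callable, Iterable, Tuple

Case = Tuple[dict, str]

# Hypothetical cases derived from the issue text: input and what a user should see.
ISSUE_CASES = [
    ({"name": "x", "annotation": "int"}, "x : int"),
    ({"name": "y", "annotation": None}, "y"),
]


def grade_by_behavior(render: Callable[[dict], str], cases: Iterable[Case]) -> bool:
    """Pass if the public entry point produces the expected output for every case."""
    return all(render(node) == expected for node, expected in cases)


# Solution 1: uses a private helper, similar in spirit to a gold patch.
def _get_annotation(node: dict) -> str:
    return node.get("annotation") or ""


def render_v1(node: dict) -> str:
    annotation = _get_annotation(node)
    return f"{node['name']} : {annotation}" if annotation else node["name"]


# Solution 2: no helper at all, different internal structure, same behavior.
def render_v2(node: dict) -> str:
    parts = [node["name"]]
    if node.get("annotation"):
        parts.append(node["annotation"])
    return " : ".join(parts)


# A genuinely wrong solution: it ignores annotations entirely.
def render_broken(node: dict) -> str:
    return node["name"]


print(grade_by_behavior(render_v1, ISSUE_CASES))      # True
print(grade_by_behavior(render_v2, ISSUE_CASES))      # True
print(grade_by_behavior(render_broken, ISSUE_CASES))  # False
```

The grader never mentions _get_annotation, so both correct solutions pass and only the wrong one fails. The entry point itself still has to be something the issue names or the repository already exposes; otherwise the test is narrow again, just one level up.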

The New Leaderboard Landscape

As of May 2026, the authoritative leaderboards are:

  • SEAL Leaderboard (Scale AI) — standardized scaffolding, isolates model capability. Top: Claude Opus 4.5 at 45.9%.
  • Agent Systems Leaderboard — custom scaffolds, measures real-world agent performance. Top: GPT-5.3-Codex CLI at 57.0%.
  • BenchLM.ai — still tracks SWE-Bench Verified for reference, with Claude Mythos Preview leading at 93.9% (though these numbers should be interpreted with caution given the contamination findings).

The Takeaway

SWE-Bench Verified’s retirement isn’t a scandal — it’s a maturation signal. The coding eval space is moving toward multi-language, long-horizon, contamination-resistant benchmarks. For anyone building or evaluating AI coding agents, SWE-Bench Pro is now the target to track.

References

[1] OpenAI. Why We No Longer Evaluate on SWE-Bench Verified.
[2] Scale AI. SEAL Leaderboard.
[3] Morph LLC. SWE-Bench Pro Agent Systems.
