CodeClash: SWE-Bench Team Drops ELO-Based Coding Eval Where AIs Fight in Games
CodeClash, a new benchmark from the SWE-bench team, ranks models across six adversarial games using opponent-weighted ELO. It targets contamination (the problem that led OpenAI to stop reporting SWE-bench Verified), adversarial measurement, and strategy. Top ELO: Claude Sonnet 4.5 (1385), GPT-5 (1366), o3 (1343); 42 points separate first from third. Per-arena: Halite o3 1577, Poker GPT-5 1599, CoreWar Claude 1641. A 119-point gap follows. The leaderboard lacks trajectories, logs, and cost data and is locked to Nov 2025. CodeClash joins the SWE-be...
The SWE-bench team just shipped a fundamentally different kind of coding eval — and it might tell us more about how models think than any issue-fixing benchmark ever could.
CodeClash doesn’t ask an AI to patch a Django bug or fix an astropy regression. It drops language models into adversarial coding games — Halite, Poker, CoreWar, Robocode, RobotRumble, BattleSnake — and ranks them by ELO.
The models code. Head to head. The winner climbs.
What Is CodeClash?
CodeClash is a goal-oriented eval, not a task-oriented one. The SWE-bench team announced it alongside their existing suite (Verified, Lite, Multilingual, Multimodal), describing it as “our new evaluation where LMs compete head to head to write the best codebase.” [1]
Each arena is a multi-agent game. A language model writes code for an agent that plays the game against other models’ agents. The agent’s competence — strategy, resource management, adaptability — is a direct reflection of the model’s coding ability. But unlike SWE-bench, there’s no ground-truth patch to match. The eval is the game itself.
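To make the head-to-head setup concrete, here is a minimal sketch of a round-robin between agents. Everything in it is an assumption for illustration: the game (iterated rock-paper-scissors) is a toy stand-in rather than one of CodeClash's arenas, and the agent functions, `play_match`, and `round_robin` are hypothetical names, not CodeClash's API.

```python
from itertools import combinations

# Toy stand-in for an arena (NOT one of CodeClash's games): iterated
# rock-paper-scissors. Each agent is a function that sees the opponent's
# previous move (None on the first round) and returns its own move.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
BEATEN_BY = {loser: winner for winner, loser in BEATS.items()}


def always_rock(opponent_prev):
    return "rock"


def mirror(opponent_prev):
    # Repeat whatever the opponent played last; open with paper.
    return opponent_prev or "paper"


def counter(opponent_prev):
    # Play the move that beats the opponent's last move; open with scissors.
    return BEATEN_BY[opponent_prev] if opponent_prev else "scissors"


def play_match(agent_a, agent_b, rounds=9):
    """Return 1 if agent_a wins the match, -1 if agent_b wins, 0 for a draw."""
    prev_a = prev_b = None
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = agent_a(prev_b), agent_b(prev_a)
        if BEATS[move_a] == move_b:
            score_a += 1
        elif BEATS[move_b] == move_a:
            score_b += 1
        prev_a, prev_b = move_a, move_b
    return (score_a > score_b) - (score_a < score_b)


def round_robin(agents):
    """Play every pair once and count match wins per agent name."""
    wins = {name: 0 for name in agents}
    for (name_a, a), (name_b, b) in combinations(agents.items(), 2):
        result = play_match(a, b)
        if result > 0:
            wins[name_a] += 1
        elif result < 0:
            wins[name_b] += 1
    return wins


# Hypothetical entrants; in the real eval these would be model-written agents.
AGENTS = {"always_rock": always_rock, "mirror": mirror, "counter": counter}
standings = round_robin(AGENTS)
print(sorted(standings.items(), key=lambda kv: -kv[1]))
```

The point of the sketch is that no agent is scored against a reference answer: each result depends only on what the opposing agent does, which is the property the article attributes to CodeClash's arenas.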
The leaderboard currently tracks 6 arenas:
| Game | Type | Top Model |
|---|---|---|
| Halite | Resource-gathering strategy (4p) | o3 (1577 ELO) |
| Poker | Imperfect-information betting | GPT-5 (1599 ELO) |
| CoreWar | Assembly-level battle programming | Claude Sonnet 4.5 (1641 ELO) |
| RobotRumble | Robot combat arena | Claude Sonnet 4.5 (1423 ELO) |
| Robocode | Tank battle coding | GPT-5 (1409 ELO) |
| BattleSnake | Snake-game tournament | Claude Sonnet 4.5 (1470 ELO) |
The ELO Leaderboard (Overall)
Unlike pass/fail benchmarks, CodeClash uses ELO, the rating system from chess. Ratings are relative: a win against a higher-rated opponent counts for more than a win against a lower-rated one, and as more matches are played the ratings become tighter estimates of relative skill. The current overall standings: [2]
| Rank | Model | ELO |
|---|---|---|
| 1 | Claude Sonnet 4.5 | 1385 ± 18 |
| 2 | GPT-5 | 1366 ± 17 |
| 3 | o3 | 1343 ± 17 |
| 4 | Claude Sonnet 4 | 1224 ± 17 |
| 5 | GPT-5 Mini | 1199 ± 16 |
| 6 | Gemini 2.5 Pro | 1124 ± 16 |
| 7 | Grok Code Fast | 1006 ± 19 |
| 8 | Qwen3 Coder | 952 ± 20 |
The spread is tight at the top: only 19 ELO points separate Sonnet 4.5 from GPT-5, and 42 separate it from o3. Below that, a 119-point gap (o3 at 1343 to Claude Sonnet 4 at 1224) separates the frontier from the second tier.
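To get a feel for what these rating differences mean, the sketch below applies the standard Elo formulas to the overall table. It assumes the ratings can be read on the conventional 400-point logistic scale and uses K = 32, a common chess default; neither is confirmed as CodeClash's actual procedure, and the function names are illustrative.

```python
# Standard Elo formulas on the conventional 400-point logistic scale.
# Assumption: the leaderboard ratings can be read on this scale; the K-factor
# below is a common chess default, not a documented CodeClash setting.


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for A against B (a draw counts as half a win)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating_a: float, rating_b: float,
                   score_a: float, k: float = 32.0) -> float:
    """Rating of A after one game; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


# Overall ratings copied from the leaderboard table above, in rank order.
OVERALL = {
    "Claude Sonnet 4.5": 1385,
    "GPT-5": 1366,
    "o3": 1343,
    "Claude Sonnet 4": 1224,
    "GPT-5 Mini": 1199,
    "Gemini 2.5 Pro": 1124,
    "Grok Code Fast": 1006,
    "Qwen3 Coder": 952,
}

names = list(OVERALL)

# Rating gap between each model and the one ranked directly below it.
adjacent_gaps = {
    f"{upper} -> {lower}": OVERALL[upper] - OVERALL[lower]
    for upper, lower in zip(names, names[1:])
}

for pair, gap in adjacent_gaps.items():
    print(f"{pair}: {gap}")

# Expected score of the top-ranked model against every other model.
top = names[0]
for other in names[1:]:
    print(f"{top} vs {other}: {expected_score(OVERALL[top], OVERALL[other]):.3f}")
```

Under these assumptions, a 19-point edge corresponds to an expected score of roughly 0.53, so the top-ranked model would be only a slight favorite over the second. `updated_rating` also shows why upsets pay more: at K = 32, a 1343-rated player beating a 1385-rated one gains about 18 points, while the reverse result gains about 14.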
Why This Matters for Eval
CodeClash addresses three significant problems with current coding benchmarks:
1. Contamination resistance. SWE-bench Verified is increasingly compromised: OpenAI stopped reporting it after detecting contamination across all frontier models [3]. A game-based eval is inherently harder to game. You can’t memorize “the solution to Halite” the way you can memorize a patch for astropy__astropy-14995, because every match is a novel interaction.
2. Adversarial measurement. Task completion evals measure whether a model can solve a problem in isolation. CodeClash measures whether a model can out-compete other models — a harder, more realistic signal. As we noted in our VS Code harness analysis, “the harness defines what the blanks are.” CodeClash redefines the blanks as competitive performance, not patch accuracy.
3. Strategy over syntax. A model that writes compilable code but plays poorly will lose. CodeClash penalizes in-game inefficiency: poor resource allocation, weak defensive strategies, failure to adapt. That is arguably closer to real software engineering than completing an isolated bug fix.
The Games as Evaluation Probes
Each arena stresses different aspects of coding ability:
- CoreWar (Claude Sonnet 4.5 leads at 1641): Requires assembly-level optimization, self-replicating code, and defensive programming against live opponents. This is the widest gap between first and second place: 292 ELO.
- Halite (o3 leads at 1577): A 4-player resource strategy game demanding long-horizon planning and dynamic re-prioritization. o3’s strong lead suggests its reasoning chain helps with temporal resource allocation.
- Poker (GPT-5 leads at 1599): Imperfect information, bluff detection, expected value calculation. GPT-5’s commanding lead here may reflect its training on game-theoretic reasoning.
- Robocode/RobotRumble (GPT-5 / Claude Sonnet 4.5): Real-time strategy in constrained environments. These are the most competitive arenas, with narrow spreads.
Cross-Link to SWE-Bench Pro
CodeClash arrives as SWE-bench Pro eclipses the original Verified benchmark. The SWE-bench team now runs five parallel leaderboards: Verified, Lite, Multilingual, Multimodal, and CodeClash. Together, they form the most comprehensive coding eval suite in the open ecosystem.
The progression feels deliberate: first measure bug-fixing (SWE-bench), then measure multi-language capability (Multilingual), then measure multimodal understanding (Multimodal), and now measure adversarial coding skill (CodeClash). Each layer tests a harder skill.
What’s Missing
CodeClash doesn’t yet expose per-instance trajectories, tool-call logs, or cost data, the operational metrics that make VS Code’s harness so analyzable [4]. The leaderboard is also locked to November 2025 data as of this writing; continuously updated competition could make it much more dynamic.
But as a first release, it’s a shot across the bow of every eval team. If your benchmark can be memorized, it will be. CodeClash’s games can’t be.
References
[1] SWE-bench Team, “CodeClash: Benchmarking Goal-Oriented Software Engineering,” 2025. https://codeclash.ai
[2] CodeClash Leaderboard, accessed May 24, 2026. https://codeclash.ai (source: scraped live data)
[3] OpenAI, “Why SWE-bench Verified no longer measures frontier coding capabilities,” Feb 23, 2026. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
[4] Julia Kasper, “The Coding Harness Behind GitHub Copilot in VS Code,” VS Code Blog, May 15, 2026. https://code.visualstudio.com/blogs/2026/05/15/agent-harnesses-github-copilot-vscode