CodeClash: SWE-Bench Team Drops ELO-Based Coding Eval Where AIs Fight in Games

The SWE-bench team just shipped a fundamentally different coding eval — and it might tell us more about how models think than any issue-fixing benchmark ever could.

CodeClash doesn’t ask an AI to patch a Django bug or fix an astropy regression. It drops language models into adversarial coding games — Halite, Poker, CoreWar, Robocode, RobotRumble, BattleSnake — and ranks them by ELO.

The models code. Head to head. The winner climbs.

What Is CodeClash?

CodeClash is a goal-oriented eval, not a task-oriented one. The SWE-bench team announced it alongside their existing suite (Verified, Lite, Multilingual, Multimodal), describing it as “our new evaluation where LMs compete head to head to write the most effective codebase.” [1]

Each arena is a multi-agent game. A language model writes code for an agent that plays the game against other models’ agents. The agent’s competence — strategy, resource management, adaptability — is a direct reflection of the model’s coding ability. But unlike SWE-bench, there’s no ground-truth patch to match. The eval is the game itself.
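To make the "eval is the game itself" idea concrete, here is a minimal sketch of a head-to-head match. Everything in it is hypothetical: the `Agent` interface, the two toy agents, and the rock-paper-scissors game stand in for a real arena and are not CodeClash's actual API or harness. The point is only that the score comes from the outcome against an opponent, not from comparison with a reference solution.

```python
from typing import Callable, List

# Hypothetical agent interface (an assumption, not CodeClash's API):
# an agent sees the opponent's past moves and returns its next move.
Agent = Callable[[List[str]], str]

# Each key beats the move it maps to.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def always_rock(opponent_history: List[str]) -> str:
    """A static strategy that never adapts."""
    return "rock"


def counter_last(opponent_history: List[str]) -> str:
    """An adaptive strategy: play whatever beats the opponent's last move."""
    if not opponent_history:
        return "paper"
    last = opponent_history[-1]
    return next(move for move, loser in BEATS.items() if loser == last)


def play_match(a: Agent, b: Agent, rounds: int = 10) -> float:
    """Return 1.0 if `a` wins the match, 0.0 if `b` wins, 0.5 for a draw."""
    hist_a: List[str] = []
    hist_b: List[str] = []
    wins_a = wins_b = 0
    for _ in range(rounds):
        # Each agent only sees the other side's history.
        move_a, move_b = a(hist_b), b(hist_a)
        if BEATS[move_a] == move_b:
            wins_a += 1
        elif BEATS[move_b] == move_a:
            wins_b += 1
        hist_a.append(move_a)
        hist_b.append(move_b)
    if wins_a == wins_b:
        return 0.5
    return 1.0 if wins_a > wins_b else 0.0


if __name__ == "__main__":
    print(play_match(counter_last, always_rock))
```

There is no expected patch anywhere in this loop. The only signal is the match result, which is what a rating system then consumes.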

The leaderboard currently tracks 6 arenas:

Game	Type	Top Model
Halite	Resource-gathering strategy (4p)	o3 (1577 ELO)
Poker	Imperfect-information betting	GPT-5 (1599 ELO)
CoreWar	Assembly-level battle programming	Claude Sonnet 4.5 (1641 ELO)
RobotRumble	Robot combat arena	Claude Sonnet 4.5 (1423 ELO)
Robocode	Tank battle coding	GPT-5 (1409 ELO)
BattleSnake	Snake-game tournament	Claude Sonnet 4.5 (1470 ELO)

The ELO Leaderboard (Overall)

Unlike pass/fail benchmarks, CodeClash uses ELO, the rating system from chess. Ratings are relative: a model’s number only means something against the other models in the pool, and it stabilizes as more matches are played. A win over a higher-rated opponent moves a rating more than a win over a lower-rated one. The current overall standings: [2]

Rank	Model	ELO
1	Claude Sonnet 4.5	1385 ± 18
2	GPT-5	1366 ± 17
3	o3	1343 ± 17
4	Claude Sonnet 4	1224 ± 17
5	GPT-5 Mini	1199 ± 16
6	Gemini 2.5 Pro	1124 ± 16
7	Grok Code Fast	1006 ± 19
8	Qwen3 Coder	952 ± 20

The spread is tight at the top: 19 ELO points separate Sonnet 4.5 from GPT-5, and 42 separate it from o3. Below that, a 119-point gap separates the frontier from the second tier.
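For a sense of what those gaps mean, here is a small sketch of the textbook Elo formulas. It assumes the conventional chess parameters (a 400-point logistic scale and K = 32); the leaderboard does not say which variant or parameters CodeClash uses, and the ± values suggest a fitted rating with uncertainty rather than simple sequential updates, so treat this as an illustration only. The function names are made up for this example.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    K = 32 is an assumed chess-style default, not a CodeClash setting.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Overall ratings copied from the table above.
    ratings = {
        "Claude Sonnet 4.5": 1385,
        "GPT-5": 1366,
        "o3": 1343,
        "Claude Sonnet 4": 1224,
    }
    pairs = [("Claude Sonnet 4.5", "GPT-5"), ("o3", "Claude Sonnet 4")]
    for a, b in pairs:
        print(f"{a} vs {b}: {expected_score(ratings[a], ratings[b]):.3f}")
```

Under these assumptions, the 19-point gap at the top corresponds to an expected score of roughly 53% for the higher-rated model, while the 119-point gap to the second tier corresponds to roughly 66%. The `update` function also shows why upsets matter more: the lower-rated side gains more from a win than the higher-rated side would.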

Why This Matters for Eval

CodeClash addresses three significant problems with current coding benchmarks:

1. Contamination resistance. SWE-bench Verified is increasingly compromised: OpenAI stopped reporting it after detecting contamination across all frontier models [3]. A game-based eval is inherently harder to game. You can’t memorize “the solution to Halite” the way you can memorize a patch for astropy__astropy-14995. Every match is a novel interaction.

2. Adversarial measurement. Task completion evals measure whether a model can solve a problem in isolation. CodeClash measures whether a model can out-compete other models — a harder, more realistic signal. As we noted in our VS Code harness analysis, “the harness defines what the blanks are.” CodeClash redefines the blanks as competitive performance, not patch accuracy.

3. Strategy over syntax. A model that writes compilable code but plays poorly will lose. CodeClash penalizes in-game inefficiency: poor resource allocation, weak defensive strategies, failure to adapt. That open-ended pressure is arguably closer to real software engineering than completing a single bug fix.

The Games as Evaluation Probes

Each arena stresses different aspects of coding ability:

CoreWar (Claude Sonnet 4.5 leads at 1641): Requires assembly-level optimization, self-replicating code, and defensive programming against live opponents. This is the widest gap between first and second place — 292 ELO.
Halite (o3 leads at 1577): A 4-player resource strategy game demanding long-horizon planning and dynamic re-prioritization. o3’s strong lead suggests its reasoning chain helps with temporal resource allocation.
Poker (GPT-5 leads at 1599): Imperfect information, bluff detection, expected value calculation. GPT-5’s commanding lead here may reflect its training on game-theoretic reasoning.
Robocode/RobotRumble (GPT-5 / Claude Sonnet 4.5): Real-time strategy in constrained environments. These are the most competitive arenas, with narrow spreads.

Cross-Link to SWE-Bench Pro

CodeClash arrives as SWE-bench Pro eclipses the original Verified benchmark. The SWE-bench team now runs five parallel leaderboards: Verified, Lite, Multilingual, Multimodal, and CodeClash. Together, they form the most comprehensive coding eval suite in the open ecosystem.

The progression feels deliberate: first measure bug-fixing (SWE-bench), then measure multi-language capability (Multilingual), then measure multimodal understanding (Multimodal), and now measure adversarial coding skill (CodeClash). Each layer tests a harder skill.

What’s Missing

CodeClash doesn’t yet expose per-instance trajectories, tool-call logs, or cost data — the operational metrics that make VS Code’s harness so analyzable [4]. The leaderboard is also locked to November 2025 data as of this writing; real-time competition could make it much more dynamic.

But as a first release, it’s a shot across the bow of every eval team. If your benchmark can be memorized, it will be. CodeClash’s games can’t be.

References

[1] SWE-bench Team, “CodeClash: Benchmarking Goal-Oriented Software Engineering,” 2025. https://codeclash.ai

[2] CodeClash Leaderboard, accessed May 24, 2026. https://codeclash.ai (source: scraped live data)

[3] OpenAI, “Why SWE-bench Verified no longer measures frontier coding capabilities,” Feb 23, 2026. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

[4] Julia Kasper, “The Coding Harness Behind GitHub Copilot in VS Code,” VS Code Blog, May 15, 2026. https://code.visualstudio.com/blogs/2026/05/15/agent-harnesses-github-copilot-vscode
