Terminal-Bench v2.1: A Benchmark Study of CLI-Based AI Agent Coding

Terminal-Bench measures something fundamentally different from SWE-Bench: not whether a model can patch a bug in a codebase, but whether it can operate a computer through a shell. The distinction matters because the two test orthogonal capabilities — and the gap between them reveals where agent architectures still fail.

In June 2026, the benchmark received its v2.1 refresh: 89 curated tasks spanning five categories, with environment and instruction fixes designed to ensure scores reflect agent capability rather than environment gaps [1][2]. This post provides an empirical analysis of the verified leaderboard, the task taxonomy, harness interaction effects, and what the results mean for inference-time scaling strategies.

Benchmark Design

Terminal-Bench v2.1 evaluates AI agents on real command-line tasks in Docker sandboxes. Each task provides a pre-configured environment with minimal scaffolding — the agent must figure out the rest. The benchmark ships via the Harbor framework, and all submissions run under standardized timeouts and resource limits [1].

The 89 tasks break into five categories:

Software Engineering — Build systems, compilation, cross-compilation (e.g., “Cross-compile Doom for MIPS”), dependency resolution
System Administration — Package management, service configuration, user management, firewall rules
Data Processing — File format conversion, ETL pipelines, log parsing, data validation
Model Training — ML environment setup, training script debugging, checkpoint management
Security — Permission audits, certificate management, basic cryptography tasks

Each task has a natural-language instruction and machine-checkable success criteria — stdout output, file existence, file content patterns, or process exit codes. The v2.1 refresh specifically fixed environment issues (missing packages, broken Docker images, incorrect instruction parsing) that caused false negatives in v2.0 [2].
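To make the four kinds of success criteria concrete, here is a minimal sketch of a checker that combines them. The `Check` dataclass and `run_check` function are hypothetical names invented for illustration. They are not the Harbor task schema, which this post does not describe.

```python
import re
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Check:
    """Hypothetical success criteria for one task (not the Harbor schema)."""
    command: list[str]                      # command whose result is inspected
    expected_exit_code: int = 0             # process exit code criterion
    stdout_pattern: Optional[str] = None    # regex that must match stdout
    required_files: list[Path] = field(default_factory=list)       # file existence
    file_patterns: dict[Path, str] = field(default_factory=dict)   # path -> regex


def run_check(check: Check, timeout_s: float = 60.0) -> bool:
    """Return True only if every configured criterion passes."""
    try:
        proc = subprocess.run(
            check.command, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False  # a hung or missing command counts as a failure
    if proc.returncode != check.expected_exit_code:
        return False
    if check.stdout_pattern and not re.search(check.stdout_pattern, proc.stdout):
        return False
    if not all(path.is_file() for path in check.required_files):
        return False
    for path, pattern in check.file_patterns.items():
        if not path.is_file():
            return False
        if not re.search(pattern, path.read_text(errors="replace")):
            return False
    return True
```

Because every criterion is a hard pass/fail, a single missing package in the image turns a correct agent run into a failure, which is the kind of false negative the v2.1 refresh targeted.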

Verified Leaderboard — June 2026

The official Terminal-Bench v2.1 leaderboard [1] tracks 13 verified entries with confidence intervals. Here is the current ranking:

Rank	Agent	Model	Score	±CI	Date
1	Codex CLI	GPT-5.5	83.4%	±2.2	2026-05-01
2	Claude Code	Claude Fable 5	83.1%	±2.0	2026-06-17
3	Terminus 2	Claude Fable 5	80.4%	±2.3	2026-06-17
4	Claude Code	Claude Opus 4.8	78.9%	±2.5	2026-05-29
5	Terminus 2	GPT-5.5	78.2%	±2.4	2026-05-01
6	Terminus 2	Claude Opus 4.8	74.6%	±2.4	2026-05-29
7	Terminus 2	Gemini 3 Pro	74.4%	±2.6	2026-05-01
8	Gemini CLI	Gemini 3.1 Pro	70.7%	±2.9	2026-05-05
9	Terminus 2	Gemini 3.1 Pro	70.3%	±2.9	2026-05-05
10	Claude Code	Claude Opus 4.7	69.7%	±2.7	2026-05-01
11	Gemini CLI	Gemini 3 Pro	66.3%	±2.7	2026-05-02
12	Terminus 2	Claude Opus 4.7	66.1%	±2.7	2026-05-01
13	Claude Code	GLM 5.1	58.7%	±2.4	2026-05-02

Anthropic has separately reported 88.0% for Claude Fable 5 from its own run [3], which would place it at #1. That result had not been submitted through the official Harbor pipeline at the time of writing, so it is vendor-reported rather than verified and does not appear in the table.

Key Observations

The top is tight. Only 3.0 points separate the top three entries (83.4%, 83.1%, 80.4%), and those entries cover just two models, GPT-5.5 and Claude Fable 5. Their confidence intervals overlap, and so does that of fourth-place Claude Opus 4.8 (78.9% ±2.5), though only marginally with the two leaders. On this evidence the leaderboard cannot cleanly separate GPT-5.5, Claude Fable 5, and Claude Opus 4.8 on Terminal-Bench v2.1. This is striking given how far apart these models are on other benchmarks [4].

The spread is wide. The gap from #1 (83.4%) to #13 (58.7%) is 24.7 points. Contrast this with SWE-Bench Verified, where frontier models cluster within 6-8 points. Terminal-Bench appears to have substantially more discriminative power at the low end — older models truly struggle with CLI tasks in ways that bug-fixing benchmarks do not capture.

Version 2.1 is harder than 2.0. The CodingFleet analysis notes that v2.1 scores are roughly 5-8 points lower than equivalent v2.0 scores, which it attributes to the environment fixes [3]. This means vendor-reported v2.0 scores from model cards should not be compared directly with verified v2.1 results.
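The overlap claim about the top of the leaderboard can be checked directly from the table. The sketch below copies the top four rows and reports, for each pair, the score gap and the width of the shared interval. The helper names are my own, and interval overlap is only a coarse heuristic, not a formal significance test.

```python
from itertools import combinations

# (agent, model, score %, CI half-width) for the top four leaderboard rows.
TOP_ENTRIES = [
    ("Codex CLI", "GPT-5.5", 83.4, 2.2),
    ("Claude Code", "Claude Fable 5", 83.1, 2.0),
    ("Terminus 2", "Claude Fable 5", 80.4, 2.3),
    ("Claude Code", "Claude Opus 4.8", 78.9, 2.5),
]


def interval(score, half_width):
    """Return (low, high) bounds for a score with a symmetric interval."""
    return (score - half_width, score + half_width)


def overlap_width(a, b):
    """Width of the overlap between two intervals; negative means a gap."""
    return min(a[1], b[1]) - max(a[0], b[0])


for left, right in combinations(TOP_ENTRIES, 2):
    shared = overlap_width(interval(*left[2:]), interval(*right[2:]))
    print(
        f"{left[0]} / {left[1]} vs {right[0]} / {right[1]}: "
        f"gap {left[2] - right[2]:.1f} pts, interval overlap {shared:+.1f} pts"
    )
```

Every pair overlaps, but the thinnest margin is about 0.2 points, between Codex CLI with GPT-5.5 and Claude Code with Claude Opus 4.8, so the fourth-place entry is close to being separable from the leader.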

The Harness Effect

One of the most instructive patterns in the leaderboard is the harness effect — the spread in scores when the same model is paired with different agent architectures.

Claude Fable 5: 83.1% (Claude Code) vs 80.4% (Terminus 2), a 2.7-point gap with the model held fixed, so the agent harness is the only thing that differs. Claude Code is Anthropic’s first-party agent; Terminus 2 is a third-party research agent. The difference presumably reflects task-specific prompting and tool-use policies that Anthropic has optimized for its own model.

GPT-5.5: 83.4% (Codex CLI) vs 78.2% (Terminus 2), a 5.2-point gap. Codex CLI is OpenAI’s specialized coding agent with deep task-specific scaffolding. The 5.2-point delta is the largest first-party advantage on the leaderboard and suggests that Codex CLI’s terminal interaction policies (retry logic, error recovery, output parsing) are substantially better tuned for CLI workflows than the generic Terminus 2 harness.

Gemini 3 Pro: 66.3% (Gemini CLI) vs 74.4% (Terminus 2), an 8.1-point gap, the largest on the leaderboard, but in the opposite direction. Gemini CLI underperforms Terminus 2 on the same model by a wide margin. The Gemini CLI is a relatively new product, and its terminal interaction patterns may not be as mature as Anthropic’s or OpenAI’s. For this model, Google’s first-party harness is a liability on the benchmark. With Gemini 3.1 Pro the two harnesses are nearly level: 70.7% (Gemini CLI) vs 70.3% (Terminus 2).

Claude Opus 4.8: 78.9% (Claude Code) vs 74.6% (Terminus 2), a 4.3-point gap favoring Claude Code.

Claude Opus 4.7: 69.7% (Claude Code) vs 66.1% (Terminus 2), a 3.6-point gap favoring Claude Code.

The takeaway: benchmark scores measure the model and the harness, not either in isolation. A model that scores 74% in one harness may score 66% in another. This is not noise — it is the most important signal the benchmark produces for engineers building agent systems.
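The harness deltas can be recomputed for every model that appears under two agents. This sketch copies the relevant scores from the leaderboard table and treats Terminus 2 as the common baseline. The function name and data layout are illustrative choices, not part of any benchmark tooling.

```python
# (agent, model) -> verified score, copied from the leaderboard table.
SCORES = {
    ("Codex CLI", "GPT-5.5"): 83.4,
    ("Terminus 2", "GPT-5.5"): 78.2,
    ("Claude Code", "Claude Fable 5"): 83.1,
    ("Terminus 2", "Claude Fable 5"): 80.4,
    ("Claude Code", "Claude Opus 4.8"): 78.9,
    ("Terminus 2", "Claude Opus 4.8"): 74.6,
    ("Claude Code", "Claude Opus 4.7"): 69.7,
    ("Terminus 2", "Claude Opus 4.7"): 66.1,
    ("Gemini CLI", "Gemini 3.1 Pro"): 70.7,
    ("Terminus 2", "Gemini 3.1 Pro"): 70.3,
    ("Gemini CLI", "Gemini 3 Pro"): 66.3,
    ("Terminus 2", "Gemini 3 Pro"): 74.4,
}
BASELINE = "Terminus 2"


def harness_deltas(scores, baseline=BASELINE):
    """Return {model: (agent, agent_score - baseline_score)} in points."""
    deltas = {}
    for (agent, model), score in scores.items():
        if agent == baseline or (baseline, model) not in scores:
            continue  # skip the baseline itself and models without a pair
        deltas[model] = (agent, round(score - scores[(baseline, model)], 1))
    return deltas


ranked = sorted(harness_deltas(SCORES).items(), key=lambda kv: kv[1][1], reverse=True)
for model, (agent, delta) in ranked:
    print(f"{model:16s} {agent:12s} {delta:+.1f} pts vs {BASELINE}")
```

Sorted this way, the first-party harness helps for five of the six paired models (from +5.2 down to +0.4) and hurts for one (Gemini 3 Pro at -8.1), so the sign of the harness effect is not guaranteed.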

Inference-Time Scaling on Terminal-Bench

The most relevant research result for Terminal-Bench is the April 2026 paper “Scaling Test-Time Compute for Agentic Coding” by Kim et al. [5], which established structured test-time scaling methods for long-horizon coding agents.

The authors tested their methods on both SWE-Bench Verified and Terminal-Bench v2.0. Key results for Terminal-Bench:

Terminus 1 (baseline): 46.9% on Terminal-Bench v2.0
Terminus 1 + Recursive Tournament Voting (RTV): 53.2% — a 6.3-point gain from parallel aggregation of structured summaries
Terminus 1 + Parallel-Distill-Refine (PDR): 59.1% — a 12.2-point gain from sequential refinement conditioned on prior trajectories

The 12.2-point improvement from PDR is the largest absolute gain from any test-time scaling method reported on this benchmark. It comes close to the 13.4-point gap between Claude Fable 5 and Claude Opus 4.7 under Claude Code (83.1% vs 69.7%), suggesting that inference-time techniques can produce capability improvements approaching those of a full model generation upgrade. The comparison is indicative only, since the PDR gain was measured on v2.0 and the model gap on v2.1.
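The arithmetic behind these comparisons is easy to verify. The sketch below uses only numbers quoted in this post. Note that it sets a v2.0 gain beside a v2.1 model gap, which is a rough comparison because the two versions are not directly comparable.

```python
# Terminal-Bench v2.0 results quoted from Kim et al. [5].
BASELINE_V20 = 46.9
METHODS_V20 = {"RTV": 53.2, "PDR": 59.1}

# Verified v2.1 scores under Claude Code, from the leaderboard table.
FABLE_5 = 83.1
OPUS_4_7 = 69.7

# Absolute gain of each method over the Terminus 1 baseline, in points.
gains = {name: round(score - BASELINE_V20, 1) for name, score in METHODS_V20.items()}
generation_gap = round(FABLE_5 - OPUS_4_7, 1)

for name, gain in gains.items():
    relative = gain / BASELINE_V20  # gain as a fraction of the baseline score
    print(f"{name}: +{gain} pts absolute, {relative:.1%} relative to baseline")
print(f"Fable 5 vs Opus 4.7 (Claude Code, v2.1): {generation_gap} pts")
print(f"PDR gain as a share of that gap: {gains['PDR'] / generation_gap:.0%}")
```

The PDR gain covers roughly nine tenths of the model gap, which is large but still short of it.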

The paper’s core insight is that long-horizon agent trajectories violate the assumptions of standard test-time scaling (best-of-n, majority voting) because each attempt produces an extended trace of actions, observations, errors, and partial progress. The key is not generating more attempts but representing prior experience in a compact, selectable form — structured summaries that preserve hypotheses, progress, and failure modes while discarding low-signal trace details [5].
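To illustrate the idea of selecting among compact summaries rather than raw traces, here is a generic tournament-style selection sketch. It is not the paper's RTV implementation: the `AttemptSummary` fields, the `tournament_select` function, and the `most_progress` judge are all hypothetical. In a real system the judge would presumably be a model call, not a length heuristic.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class AttemptSummary:
    """Compact record of one agent attempt (field names are illustrative)."""
    attempt_id: int
    hypotheses: tuple[str, ...]
    progress: tuple[str, ...]
    failure_modes: tuple[str, ...]


Judge = Callable[[Sequence[AttemptSummary]], AttemptSummary]


def tournament_select(
    summaries: Sequence[AttemptSummary], judge: Judge, group_size: int = 2
) -> AttemptSummary:
    """Split candidates into small groups, keep each winner, repeat until one remains."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    current = list(summaries)
    while len(current) > 1:
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        # A group of one advances without a comparison.
        current = [judge(g) if len(g) > 1 else g[0] for g in groups]
    return current[0]


def most_progress(group: Sequence[AttemptSummary]) -> AttemptSummary:
    """Stand-in judge: prefer more completed steps, then fewer failure modes."""
    return max(group, key=lambda s: (len(s.progress), -len(s.failure_modes)))


attempts = [
    AttemptSummary(0, ("missing toolchain",), ("installed gcc",), ("linker error",)),
    AttemptSummary(1, ("wrong sysroot",), ("installed gcc", "configured sysroot"), ()),
    AttemptSummary(2, ("bad flags",), (), ("configure failed", "timeout")),
]
winner = tournament_select(attempts, most_progress)
print(f"selected attempt {winner.attempt_id}")
```

The point of the design is that each comparison sees only a handful of short summaries, so the judge's input stays small no matter how long the underlying trajectories were.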

This connects directly to a broader result from Wu et al. (2024-2025) on inference scaling laws: smaller models paired with sophisticated inference algorithms can outperform larger models using simple decoding [6]. The implication for Terminal-Bench is that the current leaderboard may substantially understate what the listed models could achieve with optimized test-time compute allocation.

Category-Level Performance

The official leaderboard reports only aggregate pass@1 scores, but task-level analysis reveals significant performance variation by category [1][2]:

Software Engineering tasks (cross-compilation, build systems) show the widest model spread — the best models score above 75%, while weaker models drop below 40%. These tasks require multi-step reasoning with tight correctness constraints (compiler errors are deterministic).

System Administration tasks (package management, configuration) cluster more tightly, with most models scoring between 60% and 75%. The shell offers more forgiveness: a misconfigured service can be reconfigured; a failed build requires starting over.

Security tasks (certificate management, permission auditing) are the hardest category, with the top model scoring only 62%. These tasks require precise domain knowledge (OpenSSL flags, SELinux policies) that few models have internalized from training data.

Data Processing tasks (ETL, format conversion) are the easiest, with most frontier models scoring above 70%. These tasks are procedural and well-represented in training data.

Methodology Notes for Reproducibility

For engineers who want to reproduce or extend these results:

Run via Harbor: harbor run -d terminal-bench/terminal-bench-2-1 -a "agent" -m "model" -k 5 [1]
Custom agent: harbor run -d terminal-bench/terminal-bench-2-1 --agent-import-path "path.to.agent:SomeAgent" -k 5
Evaluation metric: pass@1 with 95% confidence intervals computed from 5 independent runs
Timeout: standardized per task, ranging from 60 to 600 seconds depending on task complexity
Resources: 4 CPU cores, 8GB RAM, no GPU — all tasks are CLI-only

The -k 5 flag is important: it runs 5 independent trials per task, producing statistically meaningful confidence intervals. Single-shot evaluations will have substantially wider error bars.
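A rough sense of how much the trial count matters comes from a simple binomial model. This sketch assumes every trial is an independent pass/fail draw with the same pass rate. That is a simplification of my own and not necessarily how the leaderboard computes its intervals, so only the ratio between the two widths should be read as meaningful.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def ci_half_width(pass_rate: float, n_tasks: int, k: int) -> float:
    """Half-width in percentage points, treating all n_tasks * k trials as independent."""
    n_trials = n_tasks * k
    standard_error = math.sqrt(pass_rate * (1.0 - pass_rate) / n_trials)
    return 100.0 * Z_95 * standard_error


# 89 tasks as in v2.1; the 80% pass rate is an arbitrary illustrative value.
for k in (1, 5):
    print(f"k={k}: ±{ci_half_width(0.80, n_tasks=89, k=k):.1f} pts")
```

Under this model the half-width shrinks with the square root of the trial count, so five trials per task narrow the interval by a factor of about 2.2 compared with a single shot.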

Submissions may not modify timeouts or resource allocations — the Harbor framework enforces this at the container level [1]. This prevents the arms-race problem that has plagued other agent benchmarks where participants game scores by giving models unlimited compute.

What Terminal-Bench Reveals

The v2.1 leaderboard tells a story that SWE-Bench cannot. Terminal-Bench measures operational competence — the ability to navigate a filesystem, parse error messages, install dependencies, debug build failures, and recover from mistakes. These are the skills that determine whether a coding agent is useful in production, not just whether it can produce diffs for known issues.

Three conclusions for AI engineers:

Harness is a first-class variable. Spreads of up to 8.1 points between agents on the same model mean that investing in agent infrastructure (error recovery, retry logic, output parsing) can yield larger gains than upgrading the underlying model.
Terminal tasks still discriminate. The 24.7-point spread from top to bottom means that CLI coding remains unsolved for most models. This is a harder problem than bug-fixing for current architectures.
Test-time scaling works. The 12.2-point gain from PDR on v2.0 suggests that inference-time compute allocation is a legitimate alternative to model upgrades. It avoids the per-token cost of a larger model, but parallel attempts and refinement passes add their own compute and latency per task.

References

[1] Terminal-Bench v2.1 Leaderboard, tbench.ai. https://www.tbench.ai/leaderboard/terminal-bench/2.1

[2] Terminal-Bench v2.1 Benchmark Leaderboard, Artificial Analysis. https://artificialanalysis.ai/evaluations/terminalbench-v2-1

[3] “Terminal-Bench 2.1 Leaderboard: AI CLI Coding Ranked (2026)”, CodingFleet, June 2026. https://codingfleet.com/blog/terminal-bench-leaderboard-2026/

[4] PricePerToken TerminalBench Leaderboard, June 2026. https://pricepertoken.com/leaderboards/benchmark/terminalbench

[5] Kim, J. et al. “Scaling Test-Time Compute for Agentic Coding.” arXiv:2604.16529, April 2026. https://arxiv.org/abs/2604.16529

[6] Wu, Y. et al. “Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models.” arXiv:2408.00724, August 2024 (revised March 2025). https://arxiv.org/abs/2408.00724