What VS Code's Coding Harness Teaches About Agent Evaluation
Last week, Julia Kasper from the VS Code team published a deep engineering post on the coding harness behind GitHub Copilot inside VS Code. It is one of the most concrete, production-grounded descriptions of agent eval infrastructure I’ve seen.
This isn’t a summary — it’s an extraction of what anyone building eval harnesses can take from the way Microsoft’s editor evaluates its AI coding agent.
The Harness Is the Product
VS Code’s post makes a claim that should be framed on every eval team’s wall:
“The model gets better at filling in the blanks, but the harness defines what the blanks are.”
The harness has three layers:
- Context assembly: builds the prompt from the system message, workspace structure, open editors, conversation history, tool results, custom instructions, and memory. Each turn rebuilds the prompt from scratch, so edits from three rounds ago are included in fresh context.
- Tool exposure: declares tools with JSON schema. Claude uses replace_string_in_file; GPT uses apply_patch. Different models get different tool sets and system prompts, tuned per-model by the harness.
- Tool execution loop: think → act → observe → think again. The harness tracks turns (user exchanges), rounds (LLM loop iterations), and runs (all rounds for a turn). Cancellation checks run between every round.
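To make the turn/round/run vocabulary concrete, here is a minimal sketch of such a loop. It is not VS Code's implementation: the names Run, build_prompt, run_turn, and fake_model, the dict-shaped model response, and the max_rounds cap are all assumptions made for illustration. Only the two ideas come from the post: the prompt is rebuilt from scratch each round, and cancellation is checked between rounds.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Run:
    """All rounds executed for one user turn (hypothetical structure)."""
    turn_id: int
    rounds: int = 0
    cancelled: bool = False
    tool_results: list = field(default_factory=list)


def build_prompt(system: str, history: list, tool_results: list) -> str:
    # Rebuilt from scratch every round, so earlier tool results are always included.
    return "\n".join([system, *history, *tool_results])


def run_turn(turn_id: int, user_msg: str, model: Callable[[str], dict],
             tools: dict, is_cancelled: Callable[[], bool],
             max_rounds: int = 10) -> Run:
    run = Run(turn_id=turn_id)
    history = [f"user: {user_msg}"]
    while run.rounds < max_rounds:
        if is_cancelled():  # cancellation check between rounds
            run.cancelled = True
            break
        prompt = build_prompt("system: coding agent", history, run.tool_results)
        action = model(prompt)  # think
        run.rounds += 1
        if action["tool"] is None:  # no tool requested: the model is done
            break
        result = tools[action["tool"]](action["args"])  # act
        run.tool_results.append(f"{action['tool']} -> {result}")  # observe
    return run


def fake_model(prompt: str) -> dict:
    # Stand-in for an LLM: read a file once, then stop.
    if "read_file ->" in prompt:
        return {"tool": None, "args": None}
    return {"tool": "read_file", "args": "a.py"}


tools = {"read_file": lambda path: f"contents of {path}"}
run = run_turn(1, "fix the failing test", fake_model, tools, is_cancelled=lambda: False)
print(run.rounds, run.tool_results)
```

Counting rounds on the Run object rather than inside the model call keeps the effort metric available to an eval suite even when a run is cancelled partway through.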
The key insight: the harness abstracts model differences so users don’t relearn the product per model. This is product engineering, but it’s also eval design. If swapping models forces your users to retune their workflow, it’s not a harness yet.
VSC-Bench: Evaluation Specific to the Product
Public benchmarks (SWE-bench, Terminal-Bench) are a starting point, but VS Code found they don’t cover real-world workflows — scaffolding projects, migrating codebases, refactoring across files. So they built VSC-Bench, an offline eval suite that:
- Launches VS Code inside a containerized workspace
- Sends prompts and evaluates both text output and tool calls
- Measures solution correctness, agent effort, token efficiency, and latency
- Covers multi-language coding (TypeScript, Python, C++), agent modes, MCP/tool use, browser interaction, and multi-turn conversations
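A rough sketch of what recording those four measurements per task might look like. The post does not describe VSC-Bench's data model, so TaskResult, its fields, and the use of round count as a proxy for agent effort are assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One eval task outcome (hypothetical schema, not VSC-Bench's)."""
    task_id: str
    passed: bool       # solution correctness
    rounds: int        # agent effort, proxied here by LLM loop iterations
    tokens: int        # token efficiency
    latency_s: float   # wall-clock latency in seconds


def summarize(results: list) -> dict:
    # Aggregate the per-task measurements into suite-level numbers.
    return {
        "pass_rate": mean([1.0 if r.passed else 0.0 for r in results]),
        "mean_rounds": mean([r.rounds for r in results]),
        "mean_tokens": mean([r.tokens for r in results]),
        "mean_latency_s": mean([r.latency_s for r in results]),
    }


results = [
    TaskResult("scaffold-ts-app", passed=True, rounds=4, tokens=1000, latency_s=10.0),
    TaskResult("migrate-py-module", passed=False, rounds=6, tokens=3000, latency_s=30.0),
]
summary = summarize(results)
print(summary)
```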
This is the pattern that matters: generic benchmarks give you a baseline; product-specific evals tell you if you’re ready to ship.
The PR Evaluation Workflow
When a PR touches core tools, system prompts, or agent behavior, VS Code runs a labeled eval workflow:
- Adding ~requires-eval-assessment triggers an automated pipeline
- The pipeline runs VSC-Bench on the PR branch and compares results against main
- Regressions block merge; improvements are documented in the PR
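The comparison step can be sketched as a small gate function. The post does not say how VS Code defines a regression, so the metric names, the 2% relative tolerance, and the find_regressions helper below are all assumptions.

```python
def find_regressions(main: dict, pr: dict, higher_is_better: dict,
                     tolerance: float = 0.02) -> list:
    """Return metrics where the PR branch is worse than main beyond a relative tolerance.

    The 2% tolerance is an illustrative default, not a documented threshold.
    """
    regressions = []
    for metric, base in main.items():
        delta = pr[metric] - base
        # Normalize so that a positive value always means "got worse".
        worse_by = -delta if higher_is_better[metric] else delta
        if worse_by > tolerance * abs(base):
            regressions.append(metric)
    return regressions


main_scores = {"pass_rate": 0.60, "mean_tokens": 2000.0}
pr_scores = {"pass_rate": 0.55, "mean_tokens": 1900.0}
direction = {"pass_rate": True, "mean_tokens": False}

blocked = find_regressions(main_scores, pr_scores, direction)
print("block merge" if blocked else "ok to merge", blocked)
```

A tolerance band of some kind is usually worth having, since agent runs are noisy and a gate that fires on every small fluctuation tends to get ignored.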
This is a strong standard for eval-as-engineering — not a one-time benchmark score, but a continuous gate that catches regressions before they reach users.
Lessons for Eval Harness Builders
The VS Code post confirms several patterns we’ve been tracking:
Your harness is your product’s interface to the model. The model is a commodity; the harness is the differentiator. On SWE-Bench Pro, Claude Opus 4.5 scored 45.9% with standardized scaffolding and 55.4% with Anthropic’s custom scaffold: a 9.5-point gap (55.4 − 45.9) from tooling alone, as covered here yesterday.
Measure the loop, not just the output. Token efficiency, latency per round, tool-call accuracy — these operational metrics matter more than raw task resolution for product quality.
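As an illustration of loop-level measurement, the sketch below derives latency per round and tool-call accuracy from a single run's trace. RoundTrace and its fields are hypothetical, and "accuracy" here simply means the share of tool calls that executed without error, which is one possible definition among several.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoundTrace:
    """One LLM loop iteration within a run (hypothetical trace format)."""
    latency_s: float
    tool_name: Optional[str]  # None when the round made no tool call
    tool_ok: bool = False     # whether the tool call executed without error


def loop_metrics(rounds: list) -> dict:
    calls = [r for r in rounds if r.tool_name is not None]
    ok_calls = [r for r in calls if r.tool_ok]
    return {
        "rounds": len(rounds),
        "latency_per_round_s": sum(r.latency_s for r in rounds) / len(rounds),
        # Undefined (None) when the run made no tool calls at all.
        "tool_call_accuracy": len(ok_calls) / len(calls) if calls else None,
    }


trace = [
    RoundTrace(latency_s=2.0, tool_name="read_file", tool_ok=True),
    RoundTrace(latency_s=4.0, tool_name="apply_patch", tool_ok=False),
    RoundTrace(latency_s=3.0, tool_name=None),
]
metrics = loop_metrics(trace)
print(metrics)
```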
Benchmarks need private, product-specific tasks. Public benchmarks have a shelf life. OpenAI stopped reporting SWE-Bench Verified after finding contamination across all frontier models. VS Code’s answer is VSC-Bench: tasks tied to actual product workflows, run in containerized environments.
Per-model tuning is not optional. Claude and GPT need different tools, different prompts, different reasoning-effort controls. A harness that treats all models identically is a harness that optimizes for none.
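One way to keep per-model tuning explicit is a profile table keyed by model family. In this sketch only the two edit-tool names come from the VS Code post; ModelProfile, the prompt placeholders, the prefix matching, and profile_for are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Per-model harness settings (hypothetical structure)."""
    edit_tool: str
    system_prompt: str


# Tool names are from the post; the prompt text is a placeholder.
PROFILES = {
    "claude": ModelProfile("replace_string_in_file", "<prompt tuned for Claude>"),
    "gpt": ModelProfile("apply_patch", "<prompt tuned for GPT>"),
}


def profile_for(model_id: str) -> ModelProfile:
    # Match on a family prefix; fail loudly rather than fall back to a generic profile.
    for family, profile in PROFILES.items():
        if model_id.lower().startswith(family):
            return profile
    raise KeyError(f"no harness profile for model {model_id!r}")


print(profile_for("claude-example").edit_tool)
print(profile_for("GPT-example").edit_tool)
```

Raising on an unknown model is a deliberate choice in this sketch: a silent default profile would be exactly the "treats all models identically" behavior the paragraph above warns against.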
The Takeaway
The VS Code harness post is worth reading in full. It shows that agent evaluation is shifting from “what’s the best model?” to “what’s the best harness around the model?” — and that product-specific eval suites are replacing generic benchmarks as the ground truth for shipping decisions.
References
[1] Julia Kasper, “The Coding Harness Behind GitHub Copilot in VS Code,” VS Code Blog, May 15, 2026. https://code.visualstudio.com/blogs/2026/05/15/agent-harnesses-github-copilot-vscode
[2] OpenAI, “Why SWE-bench Verified no longer measures frontier coding capabilities,” Feb 23, 2026. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
[3] “SWE-Bench Verified Is Dead — Long Live SWE-Bench Pro,” CodeIntel Log, May 23, 2026. https://codeintel.xyz/blog/2026-05-22-swe-bench-pro-eclipse/