The Self-Correction Ceiling: Why Agentic Code Repair Hits Diminishing Returns at Three Iterations
Studies confirm a three-iteration wall: 7B models peak at 56% (from 38% initial) then decline, larger models hit 74% (from 62%). Bottlenecks:…

The canonical agent coding loop looks elegant on paper: generate code, run tests, feed errors back to the model, regenerate. Rinse and repeat until green.
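The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: `naive_repair_loop`, `toy_generate`, and `toy_run_tests` are hypothetical names, and the "model" is a stub that returns a buggy function first and a corrected one once it receives feedback.

```python
from __future__ import annotations

from typing import Callable


def naive_repair_loop(
    generate: Callable[[str, str | None], str],
    run_tests: Callable[[str], list[str]],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, int, bool]:
    """Generate, test, feed errors back, regenerate; stop when green or out of budget."""
    feedback: str | None = None
    code = ""
    for attempt in range(1, max_rounds + 1):
        code = generate(task, feedback)
        errors = run_tests(code)
        if not errors:
            return code, attempt, True
        # The whole error list is fed back verbatim, as in the naive loop.
        feedback = "\n".join(errors)
    return code, max_rounds, False


# Hypothetical stand-ins for a model call and a test runner.
def toy_generate(task: str, feedback: str | None) -> str:
    if feedback is None:
        return "def add(a, b): return a - b"  # first attempt is buggy
    return "def add(a, b): return a + b"      # corrected after feedback


def toy_run_tests(code: str) -> list[str]:
    namespace: dict = {}
    exec(code, namespace)
    return [] if namespace["add"](2, 3) == 5 else ["add(2, 3) expected 5"]


code, attempts, ok = naive_repair_loop(toy_generate, toy_run_tests, "add two numbers")
```

Note that nothing in this loop remembers which tests passed on earlier attempts, so each round is free to break what the previous one got right.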
Production teams who deploy this loop at scale discover a frustrating pattern. Resolution improves dramatically from attempt 1 to attempt 2. It improves modestly from attempt 2 to attempt 3. And after attempt 3? Performance flatlines — or, worse, degrades. Correct solutions get mutated into incorrect ones. The model over-corrects, second-guesses working logic, and introduces new bugs while fixing old ones.
This isn’t anecdotal. Three 2026 papers independently confirm the phenomenon, each measuring it from a different angle: Arimbur et al. on iterative self-repair, the Denoising Self-Correction study, and the Graph-Based Regression work.
The Three-Iteration Wall
A systematic study across seven models spanning three families, measuring self-repair on HumanEval, MBPP, and APPS, found that the marginal benefit of additional correction rounds drops to near zero after 3 iterations. For smaller models (7B–13B parameters), success rate actually decreases after the third attempt, as the model begins overriding correct sub-expressions; the larger tier in the table below slips too, though by less.
| Iteration | Success Rate (7B models) | Success Rate (70B+ models) |
|---|---|---|
| 1 | 38% | 62% |
| 2 | 52% | 71% |
| 3 | 56% | 74% |
| 4 | 54% | 73% |
| 5 | 53% | 72% |
The ceiling isn’t an artifact of the test setup. It’s a property of the feedback loop itself.
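The per-round gains implied by the table can be computed directly. The numbers below are copied from the table; `SUCCESS_RATE`, `marginal_gains`, and `best_iteration` are illustrative names only.

```python
# Success rates (percent) by iteration, copied from the table above.
SUCCESS_RATE = {
    "7B": [38, 52, 56, 54, 53],
    "70B+": [62, 71, 74, 73, 72],
}


def marginal_gains(rates: list) -> list:
    """Change in success rate (percentage points) from each iteration to the next."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


def best_iteration(rates: list) -> int:
    """1-indexed iteration at which the success rate peaks."""
    return rates.index(max(rates)) + 1


for tier, rates in SUCCESS_RATE.items():
    print(tier, marginal_gains(rates), "peak at iteration", best_iteration(rates))
# 7B [14, 4, -2, -1] peak at iteration 3
# 70B+ [9, 3, -1, -1] peak at iteration 3
```

Both tiers gain most of their improvement in the first correction round, gain a few points in the second, and lose ground from iteration 4 onward.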
Why Self-Correction Bottlenecks
Three root causes explain the ceiling:
1. Correct-Path Cannibalization
When a model generates a mostly-correct solution (passing 80% of tests), the error feedback targets the failing 20%. But the temperature-based regeneration resamples the entire solution, not just the failing section. The model may “fix” the broken part while accidentally breaking working code elsewhere. Larger models are less susceptible — their attention heads better isolate relevant context — but they aren’t immune.
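One way to make cannibalization visible is to compare the sets of passing tests before and after a regeneration rather than the pass counts. The sketch below assumes test results are available as sets of test names; `diff_iterations` and the test names are hypothetical.

```python
from __future__ import annotations


def diff_iterations(before: set[str], after: set[str]) -> dict[str, set[str]]:
    """Classify tests by how their status changed between two iterations."""
    return {
        "fixed": after - before,      # failing before, passing now
        "regressed": before - after,  # passing before, failing now
        "kept": before & after,       # passing in both
    }


# Iteration 1 passes 4 of 5 tests (80%); t5 fails.
iteration_1 = {"t1", "t2", "t3", "t4"}
# Iteration 2 fixes t5 but breaks t3, so the pass rate is still 80%.
iteration_2 = {"t1", "t2", "t4", "t5"}

change = diff_iterations(iteration_1, iteration_2)
```

In this example the aggregate pass rate is unchanged, yet `change["regressed"]` contains `t3`: a loop that only watches the total would not notice that working code was broken.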
2. Feedback Over-Absorption
Models that receive verbose execution traces or long compiler error logs tend to over-index on the last error in the trace. A bug in a utility function at line 80 can cause the model to rewrite the entire utility layer, introducing cascading failures that the original solution avoided. The Denoising Iterative Self-Correction paper (June 2026) showed that structured verification loops, which compress execution feedback into targeted assertions rather than raw traces, raised the ceiling by roughly 12 percentage points across all model tiers.
3. Distributional Drift in the Error Surface
Each regeneration step moves the current solution further from the model’s pretraining distribution. Original code draws from the full learned manifold. After one correction, the code is a hybrid of native generation and error-aware patching. After three corrections, the code is a patchwork of interleaved edits that fits none of the training distribution’s modes well. The model is effectively operating out-of-distribution on its own output, and prompting alone rarely recovers the original coherence.
Architectures That Break Through
Not all feedback loops are created equal. The teams that push past the 3-iteration ceiling share two architectural patterns:
Graph-Based Regression Guarding
The most effective countermeasure — from the Reducing Code Regressions paper — is a regression-aware search strategy that tracks which tests passed in each iteration and refuses regenerations that would break previously-passing tests. This raised resolution from 12% to 60% on a 10-instance subset, an effective 5× improvement that directly targets the correct-path cannibalization problem.
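    class RegressionAwareLoop:
        """Accept a regenerated solution only if it breaks no previously-passing test."""

        def __init__(self):
            self.passing_tests = set()    # names of tests the accepted solution passes
            self.current_solution = None  # last accepted solution, if any

        def propose_fix(self, failing_tests: list[str]) -> str | None:
            """Only accept regenerations that preserve all previously-passing tests."""
            for candidate in self._sample_candidates(failing_tests, n=5):
                result = self._run_all_tests(candidate)
                # Superset check: every test that passed before must still pass.
                if result.passing >= self.passing_tests:
                    self.passing_tests = set(result.passing)
                    self.current_solution = candidate  # remember what was accepted
                    return candidate
            return self.current_solution  # No regression-free fix found

        def _sample_candidates(self, failing_tests, n):
            raise NotImplementedError  # hook: ask the model for n candidate fixes

        def _run_all_tests(self, candidate):
            raise NotImplementedError  # hook: return an object whose .passing is a set of test names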
The critical insight: prefer no change over a change that regresses coverage. This simple guardrail prevents the correct-path cannibalization that undermines multi-iteration loops.
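To see the guard reject a regressing candidate, here is a self-contained sketch of a single guarded step with the model and test runner replaced by a lookup table. It is a variation on the class above, not the paper's implementation: `guarded_step` and the `RESULTS` table are hypothetical, and it assumes a strict-superset rule (no regressions and at least one newly passing test), which is slightly stricter than the `>=` check.

```python
from __future__ import annotations

from typing import Callable, Iterable


def guarded_step(
    current: str,
    passing: frozenset[str],
    candidates: Iterable[str],
    run_tests: Callable[[str], frozenset[str]],
) -> tuple[str, frozenset[str]]:
    """Return the first candidate that keeps every passing test and adds one; else keep current."""
    for candidate in candidates:
        now_passing = run_tests(candidate)
        if now_passing > passing:  # strict superset: no regression, some progress
            return candidate, now_passing
    return current, passing  # prefer no change over a regressing change


# Hypothetical test outcomes for three versions of a solution.
RESULTS = {
    "v0": frozenset({"t1", "t2"}),
    "v1": frozenset({"t1", "t3"}),        # fixes t3 but breaks t2
    "v2": frozenset({"t1", "t2", "t3"}),  # fixes t3 with no regression
}

chosen, now_passing = guarded_step("v0", RESULTS["v0"], ["v1", "v2"], RESULTS.__getitem__)
unchanged, _ = guarded_step("v0", RESULTS["v0"], ["v1"], RESULTS.__getitem__)
```

Here `v1` is rejected even though it passes as many tests as `v0`, `v2` is accepted, and when only `v1` is on offer the step keeps `v0`.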
Structured Denoising Verification
The Denoising Iterative Self-Correction approach replaces raw execution traces with structured verification:
- Execute the solution once.
- For each failing assertion, generate a minimal failing input (not the full trace).
- Feed only the minimal reproducer and the expected output back to the model.
This compresses the feedback signal to exactly what the model needs to correct, reducing over-absorption by 40% in controlled trials.
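The feedback-compression step might look like the sketch below. It assumes each failure has already been reduced to a minimal reproducing call (the input-minimization step itself is not shown), and the `Failure` record and `compress_feedback` function are hypothetical names rather than anything from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    test_name: str
    call: str      # minimal reproducing call, e.g. "add(2, 3)"
    expected: str
    actual: str
    trace: str     # raw execution trace; deliberately NOT forwarded to the model


def compress_feedback(failures: list[Failure], limit: int = 3) -> str:
    """Turn failures into one short line each: reproducer, actual, expected."""
    lines = [
        f"{f.test_name}: {f.call} returned {f.actual}, expected {f.expected}"
        for f in failures[:limit]  # cap the number of failures reported per round
    ]
    return "\n".join(lines)


failures = [
    Failure(
        test_name="test_add",
        call="add(2, 3)",
        expected="5",
        actual="-1",
        trace="Traceback (most recent call last):\n  ... many frames ...",
    )
]
feedback = compress_feedback(failures)
print(feedback)  # test_add: add(2, 3) returned -1, expected 5
```

The `limit` parameter is an assumption of this sketch; the idea is simply that the model sees a handful of targeted reproducers instead of the full trace.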
Choosing Your Iteration Budget
The optimal number of correction rounds depends on model scale and task complexity:
| Model Scale | Optimal Iterations | Diminishing Returns After |
|---|---|---|
| < 15B params | 2 | Iteration 2 |
| 15B–70B params | 3 | Iteration 3 |
| 70B+ params | 3–4 | Iteration 4 |
| Multi-agent (orchestrated) | 5+ | Iteration 6+ |
Multi-agent setups outperform single-model loops because the specialist agents (critic, coder, tester) maintain narrower distributions — each agent stays in-distribution for its role, and the orchestration layer handles the regression guard.
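As a starting point, the table can be encoded as a small lookup. This is only a sketch of the table's recommendations: `iteration_budget` is a hypothetical helper, the boundary handling (exactly 70B counted as "70B+") is an assumption, and ranges are resolved to a single number as noted in the comments.

```python
def iteration_budget(params_billion: float, multi_agent: bool = False) -> int:
    """Suggested maximum number of correction rounds, following the table above."""
    if multi_agent:
        return 5  # table says "5+"; 5 is used here as the floor
    if params_billion < 15:
        return 2
    if params_billion < 70:
        return 3
    return 4      # table says "3–4"; the upper end is used here


for size in (7, 34, 70):
    print(f"{size}B -> {iteration_budget(size)} rounds")
# 7B -> 2 rounds
# 34B -> 3 rounds
# 70B -> 4 rounds
```

Treat the returned value as a cap to tune against your own pass-rate curve rather than a fixed rule.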
Key Takeaways
- Stop at three. Most agent coding loops should terminate after 3 correction rounds unless they have regression guards. More iterations without structural improvements waste tokens and degrade quality.
- Compress your feedback. Raw execution traces overload the model. Convert errors to minimal failing inputs or targeted assertions.
- Guard against regression. A regression-free guard is the single highest-ROI addition to any agent coding loop. It directly neutralizes the correct-path cannibalization problem.
- Scale beyond the ceiling with orchestration. Multi-agent systems (separate critic + coder + tester roles) push the ceiling higher by keeping each component’s generation distribution narrow.
The self-correction ceiling isn’t a fundamental limit — it’s a signal that the naive generate-fix loop architecture has maxed out. The next generation of agent coding systems will look less like a single model talking to itself and more like a distributed team with structured handoffs and regression-aware search.
Citations:
[1] Arimbur et al., “How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation,” arXiv:2604.10508, Apr 2026.
[2] “Denoising Iterative Self-Correction: Structured Verification Loops,” arXiv:2606.21724, Jun 2026.
[3] “Reducing Code Regressions in AI Coding Agents via Graph-Based Search,” arXiv:2603.17973v1, Mar 2026.