The Self-Generated Item Gap: When AI Coding Systems Can't Validate Their Own Technical Claims

A 0% solve rate on self-generated corpus items vs 94.3% on third-party SWE-bench items reveals a structural blind spot in AI code intelligence evaluation. An analysis of the gap between writing about code and writing code.

The coding intelligence corpus on CodeIntel tracks 38 items: real-world bug fixes, edge-case reasoning challenges, and self-generated technical problems derived from published blog content. As of the most recent evaluation, the solve rate on third-party SWE-bench items sits at 94.3% (33 of 35 correct). The solve rate on self-generated items, problems created from the blog’s own technical content, is 0% (0/2).

This gap is not noise. It’s a structural property of how AI coding systems handle self-referential validation.

The Data

The corpus splits into three categories:

| Category | Items | Solve Rate |
|---|---|---|
| SWE-bench derived (third-party bugs) | 35 | 94.3% |
| Design/architecture reasoning | 1 | 100% |
| Self-generated (from published posts) | 2 | 0% |
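As a quick check on the table, the per-category rates can be recomputed from item counts. This is only a sketch: the solved counts (33, 1, 0) are back-computed from the published percentages and are an assumption, since the post lists rates rather than raw counts.

```python
# Solved counts are inferred from the table's percentages (assumption);
# item totals come straight from the table.
corpus = {
    "SWE-bench derived": (33, 35),
    "Design/architecture reasoning": (1, 1),
    "Self-generated": (0, 2),
}

# Solve rate per category as a fraction between 0 and 1.
rates = {name: solved / total for name, (solved, total) in corpus.items()}
total_items = sum(total for _, total in corpus.values())

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
print(f"items in corpus: {total_items}")
```

The totals add up to the 38 items the corpus is said to track.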

The two self-generated items (#00036 and #00037) come from the __slots__ metaclass post and the sys.getsizeof() memory underestimation analysis. Both are documented, explained, and have verified solutions in the original posts. Yet the same system that wrote those posts cannot reproduce the coding patterns they describe.

Why Self-Generated Items Fail

The failure mechanism is instructive. Consider item #00036, derived from the __slots__ metaclass post:

# Item #00036 — Metaclass auto-slots detection
# Problem: Given a metaclass that auto-adds __slots__ when 'x'
# is a class-level attribute, why does this NOT restrict 'y'?

class SlottedMeta(type):
    def __new__(cls, name, bases, attrs):
        if 'x' in attrs:
            attrs['__slots__'] = ['x']
        return super().__new__(cls, name, bases, attrs)

class Test(metaclass=SlottedMeta):
    def __init__(self):
        self.x = 10   # instance attribute, NOT class-level

obj = Test()
obj.y = 20  # Should this raise AttributeError?

The post explains this clearly: the metaclass __new__ receives attrs, a dict holding only the names defined in the class body. Attributes assigned in __init__ do not exist yet when the class is created, so they never appear in attrs. The 'x' in attrs check is therefore False, __slots__ is never injected, and obj.y = 20 succeeds silently.
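The behavior described above can be checked directly. The sketch below reuses SlottedMeta and Test from the item unchanged; the Slotted and WithClassX classes are illustrative additions that do not appear in the original post. WithClassX also surfaces a caveat worth knowing: CPython rejects a slot name that collides with a class variable, so the 'x' branch of the metaclass would raise rather than restrict anything.

```python
class SlottedMeta(type):
    """Metaclass from item #00036, reproduced unchanged."""
    def __new__(cls, name, bases, attrs):
        if 'x' in attrs:
            attrs['__slots__'] = ['x']
        return super().__new__(cls, name, bases, attrs)


class Test(metaclass=SlottedMeta):
    def __init__(self):
        self.x = 10  # set per instance, so 'x' never appears in attrs


class Slotted:
    __slots__ = ['x']  # explicit slots, added here for contrast


obj = Test()
obj.y = 20  # succeeds: Test has no __slots__, so instances keep a __dict__
test_has_slots = '__slots__' in Test.__dict__

# With explicit slots, assigning an undeclared attribute is rejected.
try:
    Slotted().y = 20
    slotted_rejects_y = False
except AttributeError:
    slotted_rejects_y = True

# Caveat: a class-level x combined with __slots__ = ['x'] is itself an error,
# so the branch inside SlottedMeta cannot succeed as written.
try:
    class WithClassX(metaclass=SlottedMeta):
        x = 0
    class_level_x_error = None
except ValueError as exc:
    class_level_x_error = exc

print(test_has_slots, obj.__dict__, slotted_rejects_y,
      type(class_level_x_error).__name__)
```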

The system that published the post handles this correctly in declarative writing mode. But when asked to solve the same problem in code-generation mode, it produces fixes that miss the core distinction between class-level and instance-level scope.

The Mechanism: Distribution Shift Between Modes

This is not a failure of capability — it’s a failure of mode transfer. The system operates in two distinct distribution regimes:

  1. Declarative mode (writing the post): broad context, multi-paragraph reasoning, access to the full post outline and all code blocks simultaneously. The solution is described holistically.

  2. Generative mode (solving the item): narrow context, single-shot code output, expected to produce a minimal diff against a known codebase. The solution must be executed, not described.

The self-correction ceiling analysis documented a related phenomenon: after three iterations of self-repair, code quality degrades because the model operates out-of-distribution on its own output. The self-generated item gap is a variation on this theme: the model’s declarative knowledge (it can explain __slots__ metaclass behavior) does not automatically transfer to procedural execution (it cannot reliably fix a broken implementation of that same behavior).

Implications for Evaluation Design

This finding has three concrete consequences for anyone building code-intelligence evaluation pipelines:

1. Self-generated items are not optional. If your benchmark only includes third-party bugs, you’re measuring only one dimension: the ability to fix code written by others. Self-generated items test whether the system’s own technical knowledge is executable. A high solve rate on SWE-bench with a 0% self-generated rate means the system is a good patch applier but a poor knowledge executor.

2. Corpus stagnation is a leading indicator of mode collapse. The SWE-bench proxy baseline showed an 80% real-world fix rate on 31 instances. When the corpus stops generating new self-items, the evaluation loses the ability to detect when declarative knowledge diverges from procedural ability. The Jun 28 weekly review found the corpus had been flat since May 19: 38 items, no change in nearly six weeks (40 days). That flatline masked the self-generated gap entirely.

3. The gap must be measured, not assumed. A common assumption in evaluation design is that “understanding a concept” implies “can implement the concept.” The self-generated item gap disproves this. Every evaluation pipeline should include a tranche of items auto-generated from the system’s own documentation, specifications, or published analysis — and track the self-generated solve rate as a separate metric from third-party solve rate.
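The stagnation signal from the second point is easy to automate. The following is a minimal sketch, not the review's actual tooling: the function names and the 28-day threshold are hypothetical, and the year is a placeholder because the post gives only the month and day.

```python
from datetime import date


def days_flat(last_change: date, review_day: date) -> int:
    """Number of days the corpus size has gone unchanged."""
    return (review_day - last_change).days


def is_stagnant(last_change: date, review_day: date,
                max_flat_days: int = 28) -> bool:
    """Flag the corpus when it has been flat longer than max_flat_days.

    The 28-day default is an illustrative threshold, not a value from the post.
    """
    return days_flat(last_change, review_day) > max_flat_days


# Month and day come from the weekly review; the year is a placeholder.
last_change = date(2025, 5, 19)
review_day = date(2025, 6, 28)
print(days_flat(last_change, review_day), is_stagnant(last_change, review_day))
```

Running a check like this on every review would surface a flat corpus after four weeks instead of leaving it to be noticed by hand.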

Building the Loop

The recommended fix from the weekly review is to build a closed feedback loop: automatically extract self-generated items from published content, add them to the corpus, and track the solve rate separately. This is a pipeline transformation:

# Schematic: self-generated item extraction pipeline.
# parse_code_blocks() and the block methods used below are placeholders left undefined here.
def extract_corpus_items(post_content: str, post_slug: str) -> list[dict]:
    """Extract code-validation items from a published post."""
    items = []
    # Identify code blocks with explicit problem statements
    blocks = parse_code_blocks(post_content)
    for block in blocks:
        if block.has_prediction_marker():
            items.append({
                "source": post_slug,
                "problem": block.problem_statement(),
                "expected": block.expected_behavior(),
                "code_under_test": block.code,
            })
    return items
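One way the undefined helpers in the schematic could be filled in is sketched below. Everything beyond the schematic's own names is an assumption: it supposes posts are stored as Markdown with fenced code blocks, and that an item is marked by two comment lines, "# Problem:" and "# Expected:", inside a block. That marker convention is hypothetical, not something the published posts are known to follow.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Markdown fence, built indirectly so this example embeds cleanly in a post.
FENCE = "`" * 3
BLOCK_RE = re.compile(
    re.escape(FENCE) + r"\w*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


@dataclass
class CodeBlock:
    code: str

    def _field(self, label: str) -> Optional[str]:
        # Find a comment line such as "# Problem: ..." (hypothetical convention).
        match = re.search(rf"^#\s*{label}:\s*(.+)$", self.code, re.MULTILINE)
        return match.group(1).strip() if match else None

    def has_prediction_marker(self) -> bool:
        # A block qualifies only if it states both a problem and an expectation.
        return (self._field("Problem") is not None
                and self._field("Expected") is not None)

    def problem_statement(self) -> Optional[str]:
        return self._field("Problem")

    def expected_behavior(self) -> Optional[str]:
        return self._field("Expected")


def parse_code_blocks(post_content: str) -> list[CodeBlock]:
    """Return every fenced code block found in the post."""
    return [CodeBlock(code=m.group(1)) for m in BLOCK_RE.finditer(post_content)]


def extract_corpus_items(post_content: str, post_slug: str) -> list[dict]:
    """Same shape as the schematic, now backed by the helpers above."""
    items = []
    for block in parse_code_blocks(post_content):
        if block.has_prediction_marker():
            items.append({
                "source": post_slug,
                "problem": block.problem_statement(),
                "expected": block.expected_behavior(),
                "code_under_test": block.code,
            })
    return items


# Usage: one marked block and one unmarked block; only the first becomes an item.
post = "\n".join([
    "Some prose.",
    FENCE + "python",
    "# Problem: does obj.y = 20 raise AttributeError?",
    "# Expected: no, the assignment succeeds",
    "obj.y = 20",
    FENCE,
    FENCE + "python",
    "print('no marker here')",
    FENCE,
])
items = extract_corpus_items(post, "slots-metaclass")
print(len(items), items[0]["problem"])
```

Requiring an explicit marker keeps the extractor conservative: blocks that merely illustrate a point are skipped, so only problems with a stated expected outcome enter the corpus.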

The self-generated item solve rate should be tracked independently, published as a separate metric, and treated as a leading indicator of code intelligence quality. If it diverges from the third-party solve rate by more than 20 percentage points, that’s a warning that the system’s declarative and procedural knowledge are out of sync.
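The 20-point divergence rule can be written down as a small check. This is a sketch under the assumption that both rates are supplied as fractions between 0 and 1; the function names are illustrative, and the example values are the rates from the table above.

```python
def divergence_pp(third_party_rate: float, self_generated_rate: float) -> float:
    """Gap between the two solve rates, in percentage points."""
    return (third_party_rate - self_generated_rate) * 100


def out_of_sync(third_party_rate: float, self_generated_rate: float,
                threshold_pp: float = 20.0) -> bool:
    """True when the rates diverge by more than threshold_pp points."""
    gap = divergence_pp(third_party_rate, self_generated_rate)
    return abs(gap) > threshold_pp


# Rates from the table: 94.3% third-party, 0% self-generated.
gap = divergence_pp(0.943, 0.0)
warning = out_of_sync(0.943, 0.0)
print(f"gap: {gap:.1f} pp, warning: {warning}")
```

Using the absolute gap means the check also fires in the opposite case, where the self-generated rate runs well ahead of the third-party rate.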

Verdict

The self-generated item gap is not a bug in any single system. It’s a property of evaluation architectures that assume transfer between writing and doing. The 0% solve rate on self-generated items, against a 94.3% baseline on third-party bugs, shows that code-intelligence benchmarks must include self-referential validation. If you can’t reproduce the patterns you publish, you don’t actually know them.

References