Building a Structured Diff Analysis Pipeline for PR Review — A Weekend Build Log

Code review is the most manually intensive step in the development cycle. A single non-trivial PR can take 30-60 minutes of reviewer attention. Over a weekend, I built a pipeline that ingests git diffs, enriches them with structural context, and produces ranked review comments — then benchmarked three chunking strategies and three models to understand where the quality ceiling sits.

This build log covers the architecture, the implementation decisions, the surprising results, and the hard numbers.

The Problem Space

PR review has a specific shape that makes it amenable to automation:

The input is structured: a git diff has exact line ranges, context hunks, and file paths. It’s not free-form text.
The output is structured: review comments target specific lines, have a severity, and fall into a taxonomy (correctness, security, performance, style, maintainability).
The ground truth exists: merged PRs with accepted review comments provide a reference signal to evaluate against.

The edge case that makes this non-trivial: context dependency. A change in one file might break a contract in another file the diff doesn’t touch. A naive per-hunk LLM call will miss cross-file issues entirely.

Architecture Overview

┌─────────────┐     ┌───────────────┐     ┌─────────────┐     ┌──────────────┐
│  git diff   │────▶│  Diff Parser  │────▶│  Context    │────▶│  Chunk       │
│  (raw)      │     │  (hunks +     │     │  Enricher   │     │  Strategy    │
│             │     │   metadata)   │     │  (AST + git)│     │  Selector    │
└─────────────┘     └───────────────┘     └─────────────┘     └──────────────┘
                                                                      │
                                                                      ▼
                    ┌─────────────┐     ┌───────────────┐     ┌──────────────┐
                    │  Ranked     │◀────│  LLM Review   │◀────│  Prompt      │
                    │  Comments   │     │  (3 models)   │     │  Builder     │
                    └─────────────┘     └───────────────┘     └──────────────┘

The pipeline has five stages, each producing structured data consumed by the next.

Stage 1: Diff Parser

The raw git diff output is line-oriented text designed for human reading, not programmatic consumption. I wrote a parser that extracts structured hunks:

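# diff_parser.py
from dataclasses import dataclass
import re

@dataclass
class Hunk:
    file_path: str
    old_start: int
    old_count: int
    new_start: int
    new_count: int
    lines: list[str]  # hunk body only, each line prefixed with +, -, or space
    additions: int = 0
    deletions: int = 0

@dataclass
class Diff:
    files: list[Hunk]  # flat list of hunks across all files

# A count may be omitted ("@@ -12 +14,3 @@"), in which case it means 1.
HUNK_HEADER_RE = re.compile(
    r'^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@'
)

def _finalize_hunk(file_path: str, hunk_lines: list[str]) -> Hunk:
    """Build a Hunk from its '@@' header line followed by its body lines."""
    m = HUNK_HEADER_RE.match(hunk_lines[0])
    body = hunk_lines[1:]
    return Hunk(
        file_path=file_path,
        old_start=int(m[1]),
        old_count=int(m[2]) if m[2] else 1,
        new_start=int(m[3]),
        new_count=int(m[4]) if m[4] else 1,
        lines=body,
        additions=sum(ln.startswith('+') for ln in body),
        deletions=sum(ln.startswith('-') for ln in body),
    )

def parse_diff(raw_diff: str) -> Diff:
    files = []
    current_file = None
    current_hunk_lines = []  # stays empty until a '@@' header is seen

    def flush():
        if current_hunk_lines and current_file:
            files.append(_finalize_hunk(current_file, current_hunk_lines))

    for line in raw_diff.splitlines():
        if line.startswith('diff --git'):
            flush()
            # "diff --git a/path b/path": take the new-side path.
            # removeprefix drops the exact "b/" prefix; lstrip('a/') would strip
            # any leading 'a' or '/' characters and mangle names like "api/x.py".
            # Paths containing spaces are not handled here.
            current_file = line.split()[-1].removeprefix('b/')
            current_hunk_lines = []
        elif HUNK_HEADER_RE.match(line):
            flush()
            current_hunk_lines = [line]
        elif current_hunk_lines and not line.startswith('\\'):
            # Inside a hunk. Skips the "\ No newline at end of file" marker.
            current_hunk_lines.append(line)
        # Otherwise: file-level metadata (index, ---, +++) outside any hunk.

    flush()
    return Diff(files=files)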

The critical design choice: hunks as first-class units. Each hunk corresponds to a coherent change region in a file. Later stages decide whether to merge adjacent hunks or keep them separate.
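
To make the hunk structure concrete, here is a minimal sketch of how the lines of one hunk map back to old-file and new-file line numbers. The helper name `number_hunk_lines` is hypothetical and not part of the pipeline shown here; it assumes the hunk body is passed without its `@@` header line.

```python
from typing import Iterator, Optional


def number_hunk_lines(
    lines: list[str], old_start: int, new_start: int
) -> Iterator[tuple[Optional[int], Optional[int], str]]:
    """Yield (old_lineno, new_lineno, text) for each line of a hunk body."""
    old_no, new_no = old_start, new_start
    for line in lines:
        if line.startswith('+'):
            # Added line: exists only in the new file.
            yield (None, new_no, line[1:])
            new_no += 1
        elif line.startswith('-'):
            # Deleted line: exists only in the old file.
            yield (old_no, None, line[1:])
            old_no += 1
        elif line.startswith('\\'):
            # "\ No newline at end of file" marker, not a source line.
            continue
        else:
            # Context line: present on both sides.
            yield (old_no, new_no, line[1:])
            old_no += 1
            new_no += 1


body = [" def f(x):", "-    return x", "+    y = x + 1", "+    return y", " "]
numbered = list(number_hunk_lines(body, old_start=10, new_start=10))
```

Only added and context lines advance the new-file counter, which is the numbering a review comment on the post-change file has to use.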

Stage 2: Context Enricher

Raw hunks lack the structural information an LLM needs for meaningful review. I built an enricher that adds two kinds of context:

AST Context (Python, JavaScript/TypeScript)

For each hunk, determine what function, class, or scope it falls within. The Python path, built on the standard ast module, looks like this:

# context_enricher.py
import ast

from .diff_parser import Hunk

class ScopeFinder(ast.NodeVisitor):
    def __init__(self, target_lines: set[int]):
        self.target_lines = target_lines
        self.containing_scopes = []

    def visit_FunctionDef(self, node):
        if any(line in range(node.lineno, node.end_lineno + 1)
               for line in self.target_lines):
            self.containing_scopes.append({
                "type": "function",
                "name": node.name,
                "lines": (node.lineno, node.end_lineno)
            })
        self.generic_visit(node)

    # "async def" is a separate node type; treat it like a regular function
    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_ClassDef(self, node):
        if any(line in range(node.lineno, node.end_lineno + 1)
               for line in self.target_lines):
            self.containing_scopes.append({
                "type": "class",
                "name": node.name,
                "lines": (node.lineno, node.end_lineno)
            })
        self.generic_visit(node)

def enrich_hunk(hunk: Hunk, source_code: str) -> dict:
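    # Collect the new-file line numbers this hunk covers. Deleted lines (and a
    # '@@' header or '\ No newline' marker, if present) have no new-file line
    # number, so they are skipped and must not advance the counter.
    target_lines = set()
    line_num = hunk.new_start
    for line in hunk.lines:
        if line.startswith(('-', '@@', '\\')):
            continue
        target_lines.add(line_num)
        line_num += 1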

    try:
        tree = ast.parse(source_code)
        finder = ScopeFinder(target_lines)
        finder.visit(tree)
    except SyntaxError:
        finder = ScopeFinder(target_lines)
        finder.containing_scopes = []

    return {
        "file": hunk.file_path,
        "hunk_range": (hunk.new_start, hunk.new_start + hunk.new_count - 1),
        "scopes": finder.containing_scopes,
        "additions": hunk.additions,
        "deletions": hunk.deletions,
    }
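
Because nested definitions produce several entries in `scopes` (a method inside a class yields both the class and the function), a later stage needs to pick one scope to label a chunk with. A minimal sketch, assuming the innermost scope, meaning the one with the smallest line range, is the one wanted. `innermost_scope` is a hypothetical helper, not something shown in this log.

```python
from typing import Optional


def innermost_scope(scopes: list[dict]) -> Optional[dict]:
    """Return the scope with the smallest line span, or None for module level."""
    if not scopes:
        return None
    return min(scopes, key=lambda s: s["lines"][1] - s["lines"][0])


scopes = [
    {"type": "class", "name": "Cache", "lines": (10, 80)},
    {"type": "function", "name": "get", "lines": (22, 35)},
]
chosen = innermost_scope(scopes)
```

If a hunk straddles two sibling functions, this picks the shorter one, which may not be the right label; a real implementation would probably want to keep both.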

Git Blame Context

For each changed line, determine when it was last modified and by whom. This helps the reviewer understand whether the change touches recently modified code (high risk) or stale code (low risk):

import subprocess
import time

def get_blame_context(file_path: str, line_range: tuple[int, int]) -> list[dict]:
    """Return the commit age and author for each line in range."""
    cmd = [
        "git", "blame", "-L",
        f"{line_range[0]},{line_range[1]}",
        "--porcelain", "--", file_path
    ]
    # check=True surfaces git failures (bad range, untracked file) as exceptions
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=5, check=True)
    now = time.time()
    # _parse_porcelain is expected to yield one dict per blamed line,
    # with committer_time already converted to a number (Unix seconds)
    return [
        {
            "line": entry["line"],
            "author": entry["author"],
            "committer_time": entry["committer_time"],
            "age_days": (now - entry["committer_time"]) / 86400,
        }
        for entry in _parse_porcelain(result.stdout)
    ]
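
The `_parse_porcelain` helper is called above but not shown in this log. The following is a sketch of what such a parser might look like, not the original implementation; the name `parse_blame_porcelain` is made up for illustration. It relies on one property of `git blame --porcelain` output: commit metadata (author, committer-time, and so on) is printed only the first time a commit appears, so it has to be cached per commit hash.

```python
import re
from typing import Iterator

# Entry header: "<sha> <orig_line> <final_line> [<num_lines_in_group>]"
HEADER_RE = re.compile(r'^([0-9a-f]{40,64}) (\d+) (\d+)(?: (\d+))?$')


def parse_blame_porcelain(output: str) -> Iterator[dict]:
    """Yield one dict per blamed line from `git blame --porcelain` output."""
    commits: dict[str, dict[str, str]] = {}  # sha -> metadata seen so far
    current_sha = None
    final_line = None
    for raw in output.splitlines():
        if raw.startswith('\t'):
            # A tab-prefixed content line closes the current entry.
            if current_sha is None:
                continue
            meta = commits[current_sha]
            yield {
                "line": final_line,
                "author": meta.get("author", ""),
                "committer_time": int(meta.get("committer-time", 0)),
            }
            continue
        m = HEADER_RE.match(raw)
        if m:
            current_sha = m[1]
            final_line = int(m[3])
            commits.setdefault(current_sha, {})
            continue
        if current_sha is not None:
            # Metadata line such as "author Alice" or "committer-time 1700000100".
            key, _, value = raw.partition(' ')
            commits[current_sha][key] = value


sha = "a" * 40
sample = (
    f"{sha} 10 12 2\n"
    "author Alice\n"
    "author-time 1700000000\n"
    "committer Alice\n"
    "committer-time 1700000100\n"
    "summary Fix thing\n"
    "filename app.py\n"
    "\tx = 1\n"
    f"{sha} 11 13\n"
    "\ty = 2\n"
)
entries = list(parse_blame_porcelain(sample))
```

The second entry in the sample has no metadata of its own and picks up the author and timestamp cached from the first.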

Stage 3: Chunking Strategies

This is where the pipeline architecture paid off. I implemented three chunking strategies and compared them:

Strategy	Description	Typical Tokens/Chunk
Full-file	Feed the entire diff (all hunks, all files) in a single LLM call	4K-8K
Per-hunk	One LLM call per hunk, results merged	200-800
Per-scope	Merge hunks that fall within the same function/class scope, one call per scope	500-2K

# chunking.py
def chunk_by_strategy(diff: Diff, source_files: dict, strategy: str) -> list[dict]:
    match strategy:
        case "full-file":
            return [_build_full_chunk(diff, source_files)]
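        case "per-hunk":
            # The builder returns one chunk per hunk (a list), so don't wrap it again
            return _build_hunk_chunks(diff, source_files)
        case "per-scope":
            # Likewise a list: one chunk per function/class scope
            return _build_scope_chunks(diff, source_files)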
        case _:
            raise ValueError(f"Unknown strategy: {strategy}")

The per-scope strategy required the AST enricher to run first. Hunks in the same function get merged; hunks in different functions stay separate.
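
The merge rule can be sketched as a grouping step over the enricher's output. This is an illustration under assumptions, not the pipeline's `_build_scope_chunks`: the function name `group_by_scope` is hypothetical, the input is assumed to be the dicts returned by `enrich_hunk`, and keeping module-level hunks as separate chunks is my guess at a reasonable default.

```python
from collections import defaultdict


def group_by_scope(enriched_hunks: list[dict]) -> list[dict]:
    """Merge hunks sharing the same innermost scope in the same file."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for index, hunk in enumerate(enriched_hunks):
        scopes = hunk["scopes"]
        if scopes:
            # Innermost scope = smallest line span among containing scopes.
            inner = min(scopes, key=lambda s: s["lines"][1] - s["lines"][0])
            key = (hunk["file"], inner["type"], inner["name"], inner["lines"])
        else:
            # Assumption: module-level hunks are not merged with each other.
            key = (hunk["file"], "module", index)
        groups[key].append(hunk)
    # dicts preserve insertion order, so chunks come out in diff order.
    return [
        {"file": key[0], "scope": key[1:], "hunks": hunks}
        for key, hunks in groups.items()
    ]


fn_load = {"type": "function", "name": "load", "lines": (10, 40)}
fn_save = {"type": "function", "name": "save", "lines": (50, 70)}
enriched = [
    {"file": "store.py", "hunk_range": (12, 15), "scopes": [fn_load]},
    {"file": "store.py", "hunk_range": (30, 33), "scopes": [fn_load]},
    {"file": "store.py", "hunk_range": (55, 58), "scopes": [fn_save]},
    {"file": "store.py", "hunk_range": (1, 3), "scopes": []},
]
chunks = group_by_scope(enriched)
```

Here the two hunks inside `load` end up in one chunk, while the `save` hunk and the module-level hunk each get their own.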

Stage 4: Prompt Builder

Each chunk needs a structured prompt that elicits ranked review comments. I settled on this template after several iterations:

You are reviewing a code change. Below is a structured diff chunk.

File: {file_path}
Scope: {scope_name} ({scope_type}, lines {scope_lines})
Changed: +{additions} / -{deletions} lines

Context surrounding this change:
{surrounding_code}

Diff hunks:
{diff_hunk}

Provide review comments in this JSON format:
{
  "comments": [
    {
      "line": int,
      "severity": "blocking" | "significant" | "minor",
      "category": "correctness" | "security" | "performance" | "style" | "maintainability",
      "message": "exact concern or suggestion"
    }
  ]
}

Rules:
- Only flag issues you are confident about
- Prefer blocking > significant > minor
- Do not comment on formatting unless it affects readability
- If the change looks correct, return {"comments": []}

The key constraint: only flag issues you are confident about. This reduced the false-positive rate significantly.
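
Two practical details follow from this template. First, it contains literal JSON braces, so filling it with `str.format` would raise an error on `{"comments": []}`. Second, the JSON schema has no file field, while the deduplication stage compares `c["file"]`, so the file path has to be attached after parsing. The sketch below shows one way to handle both; `fill_template` and `parse_comments` are hypothetical names, and the shortened template is only a stand-in for the full one above.

```python
import json
import re

PLACEHOLDER_RE = re.compile(r"\{(\w+)\}")
ALLOWED_SEVERITIES = {"blocking", "significant", "minor"}

TEMPLATE = """File: {file_path}
Changed: +{additions} / -{deletions} lines

Diff hunks:
{diff_hunk}

If the change looks correct, return {"comments": []}
"""


def fill_template(template: str, values: dict) -> str:
    """Replace {name} placeholders in one pass, leaving other braces alone."""
    def substitute(match: re.Match) -> str:
        name = match[1]
        return str(values[name]) if name in values else match[0]
    return PLACEHOLDER_RE.sub(substitute, template)


def parse_comments(raw: str, file_path: str) -> list[dict]:
    """Parse a model response, drop malformed comments, attach the file path."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    items = payload.get("comments") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []
    comments = []
    for item in items:
        if (
            isinstance(item, dict)
            and isinstance(item.get("line"), int)
            and item.get("severity") in ALLOWED_SEVERITIES
            and isinstance(item.get("message"), str)
        ):
            comments.append({**item, "file": file_path})
    return comments


prompt = fill_template(TEMPLATE, {
    "file_path": "app.py",
    "additions": 2,
    "deletions": 1,
    "diff_hunk": "+x = '{file_path}'",
})
response = '{"comments": [{"line": 7, "severity": "minor", "category": "style", "message": "unused variable"}, {"line": "7", "severity": "fatal"}]}'
parsed = parse_comments(response, "app.py")
```

Because substitution happens in a single pass, placeholder-like text inside the diff itself (the `{file_path}` in the example hunk) is left untouched.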

Stage 5: Ranking and Deduplication

Multiple chunks produce overlapping comments. A deduplication pass groups comments that refer to the same logical issue:

# dedup.py
from difflib import SequenceMatcher

SEVERITY_RANK = {"blocking": 3, "significant": 2, "minor": 1}

def deduplicate(comments: list[dict]) -> list[dict]:
    """Merge near-duplicate comments, keeping the highest severity."""
    merged = []
    for c in comments:
        for existing in merged:
            if (
                c["file"] == existing["file"]
                and abs(c["line"] - existing["line"]) <= 2
                and SequenceMatcher(None, c["message"], existing["message"]).ratio() > 0.6
            ):
                # Duplicate: keep the existing comment, upgraded to the higher severity
                if SEVERITY_RANK.get(c["severity"], 0) > SEVERITY_RANK.get(existing["severity"], 0):
                    existing["severity"] = c["severity"]
                break
        else:
            # No near-duplicate found among the comments kept so far
            merged.append(c)
    # Highest severity first; unknown severities sort last instead of raising KeyError
    return sorted(merged, key=lambda c: -SEVERITY_RANK.get(c["severity"], 0))

The deduplication uses fuzzy matching on message text plus a 2-line proximity threshold. This caught about 40% of duplicates in testing.

Benchmarks

I ran the pipeline against 25 PRs from a ~100K-line Python monorepo (12 contributors, 3 months of history). Each PR had at least one human review with 3+ comments. The metric: comment-level recall — what fraction of human-written review comments were also flagged by the pipeline.
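
For readers who want to reproduce the metric, here is a sketch of how comment-level recall and precision could be computed. The matching rule is an assumption on my part, since the criterion is not spelled out above: a pipeline comment counts as matching a human comment when both are on the same file and within `line_tolerance` lines, and each human comment can be matched at most once. The function name and the tolerance of 3 are hypothetical.

```python
def score_comments(
    human: list[dict], pipeline: list[dict], line_tolerance: int = 3
) -> tuple[float, float]:
    """Return (recall, precision) using greedy one-to-one matching."""
    matched_human: set[int] = set()
    matched_pipeline: set[int] = set()
    for p_idx, p in enumerate(pipeline):
        for h_idx, h in enumerate(human):
            if h_idx in matched_human:
                continue  # each human comment is matched at most once
            same_file = p["file"] == h["file"]
            close = abs(p["line"] - h["line"]) <= line_tolerance
            if same_file and close:
                matched_human.add(h_idx)
                matched_pipeline.add(p_idx)
                break
    recall = len(matched_human) / len(human) if human else 0.0
    precision = len(matched_pipeline) / len(pipeline) if pipeline else 0.0
    return recall, precision


human_comments = [
    {"file": "a.py", "line": 10},
    {"file": "a.py", "line": 50},
    {"file": "b.py", "line": 5},
]
pipeline_comments = [
    {"file": "a.py", "line": 11},
    {"file": "a.py", "line": 200},
    {"file": "b.py", "line": 7},
    {"file": "c.py", "line": 1},
]
recall, precision = score_comments(human_comments, pipeline_comments)
```

In this toy example two of three human comments are matched and two of four pipeline comments hit, giving a recall of 2/3 and a precision of 1/2. Matching on location alone is lenient; a stricter variant would also compare the comment text.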

Strategy Comparison (GPT-4o)

Strategy	Recall	Precision	Latency (per PR)	Cost (per PR)
Full-file	0.41	0.22	18.4s	$0.58
Per-hunk	0.53	0.31	32.7s	$0.92
Per-scope	0.61	0.38	24.1s	$0.71

The per-scope strategy wins on both quality metrics, recall and precision, and is faster and cheaper than per-hunk. Full-file is still the fastest and cheapest option (18.4s, $0.58 per PR), but it suffers from the “lost in the middle” problem: the model misses issues in the later hunks of a large diff. Per-hunk loses cross-hunk patterns (e.g., “you moved this function but forgot to update the call site in hunk 3”).

Model Comparison (per-scope strategy)

Model	Recall	Precision	Latency	Cost/PR
GPT-4o	0.61	0.38	24.1s	$0.71
Claude 3.5 Sonnet	0.64	0.41	28.3s	$0.68
Gemini 2.0 Pro	0.52	0.33	19.2s	$0.42

Claude 3.5 Sonnet slightly edges GPT-4o on recall and precision, but the gap is narrower than I expected. Gemini 2.0 Pro is the cheapest and fastest, but its recall is 9 to 12 points lower (0.52 vs 0.61 and 0.64).

What the Pipeline Missed

I categorized 47 false negatives (human comments the pipeline missed):

Missing issue category breakdown:
  Cross-file interface contracts:    16 (34%)
  Domain-specific business logic:    12 (26%)
  Test coverage gaps:                  8 (17%)
  Future compatibility concerns:       6 (13%)
  Performance of multi-step ops:       5 (11%)

The largest gap is cross-file interface contracts — changes in one file that break assumptions in another file. The per-scope strategy helps within a file but can’t see across file boundaries. This is the hardest problem to solve with a per-PR pipeline.

False Positive Analysis

31% of pipeline-generated comments were rejected by human reviewers. The most common categories:

False positive categories:
  Nitpicking style the team had agreed to:     38%
  Incorrect reading of the diff:               29%
  Suggesting refactors out of scope:           22%
  Missing business context:                     11%

The style false positives were the easiest to fix — adding a project-specific style guide to the prompt eliminated most of them.

Surprise: The Token Ceiling of Full-File Strategy

I ran token-count profiling on the 25 PRs and found that the full-file strategy’s average prompt was 6,200 tokens for a 500-line diff. Once prompts reached about 8K tokens, full-file recall dropped from the 0.41 average to 0.28. The per-scope strategy stayed nearly flat because each chunk was around 2K tokens or less. Context window management matters more than model choice for this task.

# Token profiling revealed this clearly
profile_data = {
    "full-file": {"max_tokens": 8700, "recall_corr": -0.73},
    "per-hunk":  {"max_tokens": 800,  "recall_corr": 0.02},
    "per-scope": {"max_tokens": 2100, "recall_corr": -0.11},
}
# Correlation between max_tokens and recall:
# full-file: strong negative (r = -0.73)
# per-scope: negligible (r = -0.11)
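
For reference, a correlation like `recall_corr` can be computed with a plain Pearson coefficient. The sketch below assumes one (prompt tokens, recall) pair per PR; the sample values are invented to exercise the function and are not the measurements behind the numbers above.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-PR data: prompt size in tokens and recall on that PR.
tokens = [2000, 4000, 6000, 8000]
recalls = [0.60, 0.50, 0.40, 0.30]
r = pearson_r(tokens, recalls)
```

The invented data falls on a straight line, so `r` comes out at essentially -1; real per-PR data would be noisier, as the -0.73 for full-file suggests.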

Implementation: 412 Lines of Python

The full pipeline, including the model provider wrappers, is 412 lines (the per-module counts below sum to that figure):

pipeline/
├── __init__.py          # Orchestration
├── diff_parser.py       # 62 lines: git diff → structured hunks
├── context_enricher.py  # 98 lines: AST + git blame context
├── chunking.py          # 44 lines: three chunking strategies
├── prompt_builder.py    # 76 lines: structured prompt templates
├── dedup.py             # 52 lines: fuzzy comment deduplication
└── reviewers/           # 80 lines: model provider wrappers
    ├── openai.py
    ├── anthropic.py
    └── gemini.py

The reviewer wrappers are thin: every model receives the same structured prompt. Making the model swappable took one afternoon to implement.

Key Takeaways

Per-scope chunking beats per-hunk and full-file on recall and precision: it preserves cross-hunk context within a function while staying under the model’s effective token ceiling. 61% recall vs 41% for full-file.
Cross-file issues are the hard ceiling: 34% of missed issues span file boundaries. Solving this likely requires a multi-stage approach, with per-file review followed by a cross-file synthesis pass.
Style false positives are eliminable: style nitpicks made up 38% of false positives, and most of them disappeared when I added the project’s .editorconfig and ruff configuration to the prompt. Feed the linter config, don’t let the model guess.
Latency is dominated by model time: the pipeline overhead (parsing, AST analysis, dedup) adds 0.8 seconds, so against the 18-33 second per-PR latencies measured above, the model calls account for roughly 96-98% of the total.
The recall ceiling for a single pass is ~65%: beyond that, you need either retrieval-augmented context (similar past PRs, linked issues) or multi-agent debate (two reviewers cross-checking each other’s comments).

The full pipeline is open-source and runs as a pre-commit hook or CI step. For any team shipping more than 5 PRs per week, the math works: at $0.71/PR, the pipeline pays for itself if it catches even one production bug per month. The weekend project turned into something I’d actually use.