When Heaps Lie: Debugging Phantom Memory Leaks in vLLM Production

A systematic root cause analysis of three real vLLM production memory failures — how malloc profiling, scheduler tracing, and KV cache fragmentation analysis revealed bugs that standard monitoring could not detect.

Production LLM inference is a memory-constrained environment. A single H100 with 80 GB of HBM3 must simultaneously hold model weights (about 14 GB for a 7B-parameter model in FP16; a 70B-parameter model needs roughly 140 GB in FP16, so it fits on one card only when quantized to about 70 GB or less), KV cache for active sequences, CUDA graphs, intermediate activations, and framework overhead. When something goes wrong with memory, it manifests as OOMKilled with exit code 137: a pod death that pages the on-call engineer at 3 AM.
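
To make that budget concrete, here is a back-of-the-envelope sketch. The architecture numbers (32 layers, 8 KV heads, head dimension 128, FP16 values) are assumed to describe a Llama-3.1-8B-style model, the 8-billion parameter count is rounded, and the function names are illustrative. The estimate ignores CUDA graphs, activations, and framework overhead, so treat the result as an upper bound.

```python
GIB = 1024**3


def weight_bytes(num_params, bytes_per_param=2):
    """Bytes needed for the weights (2 bytes per parameter in FP16)."""
    return num_params * bytes_per_param


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


weights = weight_bytes(8_000_000_000)      # assumed, rounded parameter count
per_token = kv_bytes_per_token(32, 8, 128)  # assumed architecture
usable = int(0.90 * 80 * GIB)               # assumed 90% of an 80 GB card
max_cached_tokens = (usable - weights) // per_token

print(f"weights: {weights / GIB:.1f} GiB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"upper bound on cached tokens: {max_cached_tokens}")
```

Under these assumptions, every 8192-token sequence pins about 1 GiB of KV cache, which is why a handful of long prompts can dominate the memory budget.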

But not all memory failures are leaks. Some are fragmentation. Some are allocation-order bugs. Some are measurement artifacts that look like leaks. Over the past year, three separate engineering teams published detailed postmortems of vLLM memory failures that defy conventional debugging. This post synthesizes their findings into a systematic methodology for diagnosing memory failures in production LLM serving.

The Three Case Studies

Three documented failure patterns, each with a distinct root cause:

Case 1 (Mistral AI): A “memory leak” in vLLM workers that turned out to be a measurement artifact — RSS grew over time, but heap profiling showed no actual leak. The root cause was CUDA graph memory allocation that appeared in resident_set_size_bytes but was invisible to malloc-level profilers like Heaptrack. [1]

Case 2 (Kubenatives): A vLLM pod that OOMed only between 2-4 AM when traffic was at its lowest. The pattern defied every hypothesis — not a bad request, not a slow leak, not a noisy neighbor. Root cause: low-QPS conditions changed vLLM’s batching behavior, exposing KV cache fragmentation from prefill/decode allocation asymmetry. [2]

Case 3 (AI21 Labs): A state corruption bug in vLLM’s Mamba implementation where a single token could corrupt an entire sequence’s hidden state. The bug lived at the intersection of vLLM’s scheduler and Mamba’s recurrence-based architecture — the scheduler would reorder tokens within a batch, and Mamba’s state machine was not designed for non-sequential token processing. [3]

Why Standard Debugging Fails

The standard memory debugging playbook assumes a leak is a leak: RSS grows → malloc tracking finds the culprit → fix it. In vLLM production, this pipeline breaks in three ways:

1. RSS != malloc

vLLM uses CUDA memory pools, caching allocators, and pre-allocated KV cache blocks. When container_memory_working_set_bytes climbs, it could be:

  • A real heap leak (Python objects, tokenizer prefixes, etc.)
  • CUDA memory allocator fragmentation (the torch.cuda.memory pool grows but never shrinks)
  • Cache warmup for CUDA graphs (vLLM V1 compiles graphs lazily, which shows as RSS growth during inference)
  • KV cache block fragmentation (internal fragmentation within pre-allocated blocks)

Mistral’s debugging team found that Heaptrack (a malloc-level profiler) showed a completely flat memory profile while RSS climbed by 2+ GB per hour. [1] The “leak” was actually CUDA graph memory being allocated for new prompt lengths — something malloc profilers cannot see.

# You cannot detect CUDA memory fragmentation with heaptrack
# Device memory is managed by PyTorch's caching allocator on top of CUDA, not by malloc
import random

import torch

# This allocation pattern can fragment the allocator's pool; the effect shows up in
# nvidia-smi and torch.cuda.memory_reserved(), not in a malloc-level profiler
for _ in range(100):
    # Variable-size tensors fragment the CUDA caching allocator
    t = torch.empty(random.randint(100, 1000), device='cuda')
    del t  # Returns the block to PyTorch's pool, not to the driver or the OS

# Releases only cached blocks that are completely unused: partial release at best
torch.cuda.empty_cache()
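
The same blind spot can be reproduced on the host without a GPU. The sketch below is an analogy, not the method from the Mistral postmortem: it maps and touches anonymous memory with mmap, which bypasses Python's allocators, and then asks tracemalloc how much it noticed. The function name and the 16 MiB size are illustrative.

```python
import mmap
import tracemalloc


def traced_growth_for_mmap(size_bytes):
    """Map and touch anonymous memory; return the bytes tracemalloc attributes to it."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    region = mmap.mmap(-1, size_bytes)  # anonymous mapping, not a Python allocation
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        region[offset] = 1  # touch every page so it becomes resident
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    region.close()
    return after - before


mapped = 16 * 1024**2
traced = traced_growth_for_mmap(mapped)
print(f"mapped and touched: {mapped} bytes")
print(f"seen by tracemalloc: {traced} bytes")
```

The mapped pages count toward the process's resident memory, yet the allocation-level tracer reports almost nothing. A flat heap profile next to a rising RSS is therefore evidence about where the memory is not, rather than proof that nothing is growing.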

2. Low-Traffic Is Highest Risk

The Kubenatives case is the most counterintuitive. The OOMs happened at 3 AM when QPS was under 5% of peak. [2] The root cause:

At low QPS, vLLM’s continuous batching (iteration-level scheduling) processes fewer concurrent sequences per iteration. Each sequence goes through a prefill phase (compute KV cache for the prompt) followed by a decode phase (generate tokens one at a time). The prefill phase allocates memory in large chunks (the entire prompt’s worth of KV cache at once), while decode extends incrementally.

At low concurrency, the prefill allocation pattern dominates — each new request triggers a large KV cache allocation for the prompt, which the CUDA allocator services by splitting existing blocks. Over a 24-hour period with repeated prefill-deallocate cycles, the CUDA allocator’s block list fragments until no single contiguous block can satisfy the next prefill. OOM at 3 AM.

# Simulating KV cache fragmentation at low QPS
# At low concurrency: prefill dominates => fragmentation
# At high concurrency: decode dominates => steady-state reuse

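class SimulatedAllocator:
    def __init__(self, total=80 * 1024**3):  # 80 GB HBM3
        self.free_blocks = [total]  # Sizes of free contiguous regions
        self.allocated = []         # Sizes of live allocations

    def prefill_allocate(self, size):
        # Prefill grabs a large contiguous chunk (first fit)
        for i, block in enumerate(self.free_blocks):
            if block >= size:
                self.free_blocks[i] = block - size  # Split the free region
                self.allocated.append(size)
                return size
        # No single region is large enough, even if sum(self.free_blocks) >= size:
        # OOM from fragmentation, not exhaustion
        return None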

Raising the memory limit only delayed the problem because it gave more headroom before the allocator hit the fragmentation wall. The fix required aligning KV cache block sizes to the allocator’s chunk size.
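
The fragmentation wall can be reproduced with a toy first-fit allocator that tracks offsets and merges adjacent free regions. This is a minimal sketch for illustration only: the class, its methods, and the unit sizes are hypothetical and do not model the real CUDA caching allocator.

```python
class FirstFitAllocator:
    """Toy allocator over a linear address space, with coalescing on release."""

    def __init__(self, total):
        self.free = [(0, total)]  # sorted list of (offset, size) free regions

    def allocate(self, size):
        for i, (offset, region) in enumerate(self.free):
            if region >= size:
                if region == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, region - size)
                return offset
        return None  # no single free region is large enough

    def release(self, offset, size):
        merged = []
        for off, sz in sorted(self.free + [(offset, size)]):
            if merged and merged[-1][0] + merged[-1][1] == off:
                merged[-1] = (merged[-1][0], merged[-1][1] + sz)  # coalesce neighbours
            else:
                merged.append((off, sz))
        self.free = merged

    def total_free(self):
        return sum(size for _, size in self.free)

    def largest_free(self):
        return max((size for _, size in self.free), default=0)


alloc = FirstFitAllocator(100)
offsets = [alloc.allocate(10) for _ in range(10)]  # fill the space with ten blocks
for offset in offsets[::2]:
    alloc.release(offset, 10)  # free every other block

print("total free:", alloc.total_free())
print("largest contiguous free:", alloc.largest_free())
print("allocate(20):", alloc.allocate(20))
```

Half of the space is free, but no request larger than one block can be served. A request of exactly one block still succeeds, which is the intuition behind aligning allocation sizes to the allocator's chunk size.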

3. Scheduler Reordering Exposes Model Bugs

AI21’s Mamba bug is a class of failure unique to state-space models in vLLM. [3] Mamba’s recurrence means each token’s output depends on the previous token’s state within the same sequence. vLLM’s scheduler, however, operates on the principle that tokens within a batch are independent — it can reorder, preempt, or pause individual sequences without affecting others.

The bug: when the scheduler reordered tokens in a Mamba batch, the state machine carried forward a stale hidden state from the wrong sequence position. One token could corrupt the internal state of an entirely different sequence. The observable symptom was output quality degradation — not a crash, not an OOM, just wrong answers.

# Simplified Mamba state corruption from scheduler reorder
class MambaSequence:
    def __init__(self):
        self.state = 1  # Recurrent state, sequence-dependent

    def step(self, token, prev_state):
        # state = f(token, prev_state): depends on sequential order
        self.state = token * prev_state + 1  # Simplified recurrence
        return self.state

seq_A, seq_B = MambaSequence(), MambaSequence()
token_5, token_3 = 5, 3  # Illustrative token values

# Each batch entry must carry the state of its own sequence
batch = [
    (seq_A, token_5, seq_A.state),   # From sequence A
    (seq_B, token_3, seq_B.state),   # From sequence B: correct
]
# BUG pattern: after a reorder, a token gets paired with a state from the
# wrong position or sequence. The scheduler optimizes for GPU utilization,
# not state ordering.
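
A runnable toy version of this failure class is sketched below. It is not AI21's actual bug: the recurrence, the schedule, and the idea of keying state by batch slot are all assumptions chosen to show how a reorder can silently mix states between sequences, and how a comparison against a sequential reference run exposes it.

```python
MODULUS = 1_000_003


def step(state, token):
    """Toy recurrence: the new state depends on the previous state and the token."""
    return (state * 31 + token) % MODULUS


def run_sequential(tokens):
    """Reference result: process one sequence strictly in order."""
    state = 0
    for token in tokens:
        state = step(state, token)
    return state


def run_batched(sequences, schedule, key_by_sequence_id):
    """Process one token per sequence per iteration, in the scheduler's order.

    With key_by_sequence_id=False the state is stored per batch slot, which is
    only correct while the scheduler never reorders the batch.
    """
    states, last_key = {}, {}
    cursor = {seq_id: 0 for seq_id in sequences}
    for batch in schedule:
        for slot, seq_id in enumerate(batch):
            key = seq_id if key_by_sequence_id else slot
            token = sequences[seq_id][cursor[seq_id]]
            cursor[seq_id] += 1
            states[key] = step(states.get(key, 0), token)
            last_key[seq_id] = key
    return {seq_id: states[last_key[seq_id]] for seq_id in sequences}


sequences = {"A": [5, 7, 11], "B": [3, 2, 13]}
schedule = [["A", "B"], ["B", "A"], ["A", "B"]]  # second iteration is reordered

reference = {seq_id: run_sequential(toks) for seq_id, toks in sequences.items()}
correct = run_batched(sequences, schedule, key_by_sequence_id=True)
buggy = run_batched(sequences, schedule, key_by_sequence_id=False)

print("reference:", reference)
print("keyed by sequence id:", correct)
print("keyed by batch slot:", buggy)
```

Nothing crashes in the slot-keyed run; the final states are simply different from the reference, which mirrors the quiet quality degradation described above.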

A Systematic Debugging Methodology

From these three case studies, a repeatable methodology emerges:

Phase 1: Separate Measurement from Reality

When RSS grows, first ask: is this a real leak or a measurement artifact?

$ python3 -c "
import torch
# Check CUDA allocator state
print(torch.cuda.memory_summary(device='cuda:0'))
# vs
import os, psutil
proc = psutil.Process(os.getpid())
print(f'RSS: {proc.memory_info().rss / 1024**3:.1f} GB')
"

If RSS grows but torch.cuda.memory_summary() shows flat allocated memory, you have fragmentation — not a leak. If both grow but the CUDA tracker shows different numbers than nvidia-smi, you have CUDA driver memory (CUDA graphs, compiled kernels) leaking outside the allocator’s accounting.
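
The triage rules above can be written down as a small decision function. This is a sketch under assumptions: the function name, the 64 MiB tolerance, and the wording of the verdicts are illustrative, and the final branch (growth fully tracked by the allocator) is my own addition for completeness. All inputs are byte deltas measured over the same time window.

```python
def classify_memory_growth(rss_delta, allocated_delta, driver_delta,
                           tolerance=64 * 1024**2):
    """Classify growth from three byte deltas taken over the same window.

    rss_delta:       change in process RSS
    allocated_delta: change in allocated bytes reported by the PyTorch allocator
    driver_delta:    change in used device memory reported by nvidia-smi
    tolerance:       assumed noise floor below which a delta counts as flat
    """
    if rss_delta <= tolerance:
        return "no significant growth"
    if allocated_delta <= tolerance:
        return "fragmentation suspected: RSS grows, allocated memory is flat"
    if abs(driver_delta - allocated_delta) > tolerance:
        return "memory outside the allocator's accounting: CUDA graphs or compiled kernels"
    return "growth tracked by the allocator: continue with heap and tensor profiling"


GIB = 1024**3
print(classify_memory_growth(2 * GIB, 0, 0))
print(classify_memory_growth(2 * GIB, 1 * GIB, 3 * GIB))
print(classify_memory_growth(2 * GIB, 2 * GIB, 2 * GIB))
```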

Phase 2: Profile by Request Phase

vLLM’s scheduler has three distinct phases: prefill, decode, and idle. Each phase has a different memory profile. To find the fragmenting phase:

  1. Set VLLM_LOG_STATS_INTERVAL to a short interval (the launch command below uses 5) to log running-queue, waiting-queue, and cache hit stats at that cadence
  2. Set VLLM_LOGGING_LEVEL=DEBUG for scheduler-level event tracing
  3. Correlate memory metrics with phase transitions using Grafana or your observability backend

# Launch with debugging
VLLM_LOG_STATS_INTERVAL=5 \
VLLM_LOGGING_LEVEL=DEBUG \
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

The Kubenatives team found their OOMs correlated with extended idle periods followed by a burst of concurrent prefills — the exact pattern that exposes fragmentation. [2]
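
Once memory samples are labelled with the scheduler phase, attributing growth to a phase is simple bookkeeping. The sketch below assumes a hypothetical sample format of (timestamp in seconds, phase label, memory in MiB) that you would build from your own logs and metrics; it is not a vLLM log format, and the numbers are made up.

```python
from collections import defaultdict


def growth_by_phase(samples):
    """Sum memory deltas per phase.

    The change between two consecutive samples is attributed to the phase
    that was active at the earlier sample.
    """
    totals = defaultdict(int)
    for (_, phase, memory), (_, _, next_memory) in zip(samples, samples[1:]):
        totals[phase] += next_memory - memory
    return dict(totals)


# Hypothetical samples: (timestamp_s, phase, memory_mib)
samples = [
    (0, "idle", 100),
    (60, "prefill", 100),
    (120, "decode", 130),
    (180, "idle", 132),
    (240, "prefill", 132),
    (300, "decode", 165),
]

for phase, growth in sorted(growth_by_phase(samples).items()):
    print(f"{phase}: {growth:+d} MiB")
```

In this made-up trace nearly all growth lands in the prefill phase, which is the shape you would expect if prefill allocations are the fragmenting ones.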

Phase 3: Allocator-Level Tracing

When the CUDA allocator is the suspect, enable allocator tracing:

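import torch

# Private PyTorch API: names and arguments may change between releases
# Start recording allocator events with stack traces
torch.cuda.memory._record_memory_history(max_entries=100000)
# ... run the workload under investigation ...
# Write a snapshot that can be loaded at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot('snapshot.pkl')
# Stop recording
torch.cuda.memory._record_memory_history(enabled=None)

Loaded into the PyTorch memory visualizer, the snapshot shows every allocated block, its size, and its device address. Fragmentation shows as a large number of small free blocks between large allocated blocks: the standard “Swiss cheese” pattern where total free space far exceeds the largest contiguous free block.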
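
The “Swiss cheese” pattern can also be quantified rather than eyeballed. The sketch below computes total free bytes, the largest free block, and a simple fragmentation ratio from a list of segments. The input shape (segments containing blocks with "size" and "state" keys, where "inactive" marks a free block) is assumed to mirror the private snapshot format of the PyTorch allocator and may differ between releases; the function name and the sample data are hypothetical.

```python
MIB = 1024**2


def free_block_stats(segments):
    """Summarize free ("inactive") blocks across allocator segments."""
    free_sizes = [
        block["size"]
        for segment in segments
        for block in segment["blocks"]
        if block["state"] == "inactive"
    ]
    total_free = sum(free_sizes)
    largest_free = max(free_sizes, default=0)
    # 0.0 means all free space is one block; values near 1.0 mean it is scattered
    fragmentation = 0.0 if total_free == 0 else 1 - largest_free / total_free
    return {
        "free_blocks": len(free_sizes),
        "total_free": total_free,
        "largest_free": largest_free,
        "fragmentation": fragmentation,
    }


# Synthetic segments for illustration only
segments = [
    {"blocks": [
        {"size": 6 * MIB, "state": "active_allocated"},
        {"size": 2 * MIB, "state": "inactive"},
        {"size": 8 * MIB, "state": "active_allocated"},
        {"size": 2 * MIB, "state": "inactive"},
    ]},
    {"blocks": [
        {"size": 4 * MIB, "state": "inactive"},
        {"size": 16 * MIB, "state": "active_allocated"},
    ]},
]

stats = free_block_stats(segments)
print(stats)
```

Here 8 MiB is free in total, but the largest request that can be served from a single block is 4 MiB, so the ratio comes out at 0.5.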

Root Cause Taxonomy

Based on these three real cases plus public vLLM issue tracker data, here are the most common production memory failure modes and their signatures:

| Pattern | Signature | Detection | Fix |
| --- | --- | --- | --- |
| CUDA graph memory creep | RSS grows, torch.cuda.memory flat | nvidia-smi vs PyTorch memory tracking mismatch | Limit graph prompt length variation, prefetch common lengths |
| KV cache fragmentation | OOM at low QPS, memory limit adjustment delays but doesn’t prevent | Duration since last prefill vs memory fragmentation ratio | Align block sizes, pool KV cache per model config |
| Python object leak | Both RSS and heap profiler show growth | Heaptrack or tracemalloc | Fix the leaking object (TokenizerPrefixTreeNode in lm-format-enforcer) [4] |
| NIXL lazy init | Spiky growth at first inference, then flat | Profile with nvidia-smi during warmup | Pre-initialize communication libraries at startup |
| Scheduler-state mismatch | Wrong output, no OOM, non-deterministic | Sequence-level checksum comparison | Fix scheduler to preserve sequence order in stateful models |

Prevention: Building Memory-Aware Monitoring

The hardest lesson from all three case studies is that standard container-level memory monitoring (container_memory_working_set_bytes) is insufficient for vLLM. You need GPU-level, allocator-level, and scheduler-phase-level observability.

A practical monitoring stack:

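# Prometheus rules for vLLM memory health
# Metric names depend on your GPU exporter and vLLM version; adjust them to your setup.
# If the metrics carry different label sets, add on(...) or ignoring(...) matching.
groups:
  - name: vllm_memory
    rules:
      - alert: VLLMHighFragmentationRatio
        # Model weights also count as non-KV-cache memory: subtract their
        # footprint or raise the threshold so they alone do not trip the alert
        expr: |
          (nvidia_smi_memory_used_bytes - vllm_kv_cache_used_bytes)
          / nvidia_smi_memory_total_bytes > 0.15
        for: 5m
        annotations:
          summary: "Non-KV-cache memory >15% of total: possible fragmentation"

      - alert: VLLMFragmentationSpikeOnIdle
        # deriv() rather than rate(): used memory is a gauge, not a counter
        expr: |
          deriv(nvidia_smi_memory_used_bytes[15m])
          > 0 and rate(http_requests_total[15m]) < 0.01
        annotations:
          summary: "Memory growing during idle: KV cache fragmentation or leak"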

The key metric is the ratio of allocated-but-unused GPU memory to total GPU memory. When this spikes during idle, you have fragmentation. When it grows monotonically during active serving, you have a leak. When both grow and the ML framework’s allocator disagrees with nvidia-smi, you have a measurement artifact.
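
The idle-growth condition can be prototyped offline before it goes into an alert rule. The sketch below fits a least-squares slope to memory samples, which is the kind of trend estimate PromQL's deriv() provides for a gauge, and combines it with a request-rate threshold. The function names, the 0.01 requests-per-second threshold, and the sample values are illustrative assumptions.

```python
def slope_per_second(samples):
    """Least-squares slope of (timestamp_s, value) pairs; 0.0 if undefined."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator if denominator else 0.0


def growing_while_idle(memory_samples, requests_per_second, idle_threshold=0.01):
    """True when memory trends upward while traffic is below the idle threshold."""
    return slope_per_second(memory_samples) > 0 and requests_per_second < idle_threshold


# Hypothetical samples: (timestamp_s, used_memory_mib)
memory_samples = [(0, 100), (60, 160), (120, 220)]

print("slope (MiB/s):", slope_per_second(memory_samples))
print("idle growth:", growing_while_idle(memory_samples, requests_per_second=0.0))
print("busy growth:", growing_while_idle(memory_samples, requests_per_second=5.0))
```

A regression slope is less sensitive to a single noisy sample than the difference between the first and last point, which matters for memory gauges that jitter during serving.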

What I Got Wrong

Early versions of this analysis assumed that “RSS growth = memory leak” was always true for CUDA workloads. The Mistral case proved otherwise — and the debugging team wasted weeks chasing a leak that didn’t exist because they trusted RSS as a proxy for heap memory. The correct mental model: CUDA memory has three layers (driver-level, allocator-level, framework-level), and each layer has its own accounting. Always measure at the layer you’re debugging.

The second wrong assumption was that low traffic is safe. The Kubenatives case shows the opposite — low concurrency maximizes the prefill/decode allocation asymmetry that causes fragmentation. If you’re serving LLMs with highly variable traffic patterns, your OOM risk is highest during the quiet hours.

References

[1] Mistral AI. “Heaps do lie: debugging a memory leak in vLLM.” Mistral AI Blog, 2025. https://mistral.ai/news/debugging-memory-leak-in-vllm/

[2] Sharon Sahadevan. “Production Case Study: The vLLM Pod That Only OOMed at 3 AM.” Kubenatives, April 2026. https://www.kubenatives.com/p/vllm-production-case-study-3am-oom-investigation

[3] Asaf Gardin. “One Token to Corrupt Them All: A vLLM Debugging Tale.” AI21 Labs Blog, January 2026. https://www.ai21.com/blog/vllm-debugging-mamba-bug/

[4] vLLM Project. “Bug: memory leak — Issue #8629.” GitHub, 2025. https://github.com/vllm-project/vllm/issues/8629
