Streaming Architecture for Large-Scale LLM Inference

Streaming is the standard delivery mechanism for every major LLM provider: Anthropic, OpenAI, and Google all stream tokens when the request asks for it (for example, by setting stream: true). But the protocol choice (SSE) is the easy part. The hard problems live in the surrounding infrastructure: reverse proxy buffering, backpressure under load, connection management at scale, and coordinating streaming with caching.

This post covers the architectural patterns that make streaming inference work in production, with concrete tradeoffs drawn from deployed systems.

The Transport Decision: SSE vs WebSocket vs gRPC

Every major provider converges on Server-Sent Events (SSE) for browser-facing token delivery [1]. The wire format is minimal:

Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

event: content_block_delta
data: {"type":"text_delta","text":"Hello"}

event: content_block_delta
data: {"type":"text_delta","text":" world"}

event: message_stop
data: {"type":"message_stop"}

SSE works through any HTTP-compliant proxy, reconnects natively, and requires no special infrastructure. But it has one critical limitation: the browser EventSource API only supports GET requests with no custom headers. For POST-based LLM APIs, you must use the fetch() API with ReadableStream instead [2].
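Whichever client reads the POST response body, it has to split the byte stream into events itself, because chunk boundaries do not line up with event boundaries. The following is a minimal sketch of that parsing step, written in Python rather than browser JavaScript so it can run standalone. The function names iter_sse_events and collect_text are hypothetical, and the event names are taken from the wire-format example above.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(chunks: Iterable[bytes]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from raw byte chunks of an SSE body.

    A chunk may end in the middle of an event, so bytes are buffered until
    a blank line (the SSE event terminator) has been received.
    """
    buffer = b""
    for chunk in chunks:
        # Normalise CRLF so a single terminator check covers both endings.
        buffer = (buffer + chunk).replace(b"\r\n", b"\n")
        while b"\n\n" in buffer:
            raw_event, buffer = buffer.split(b"\n\n", 1)
            event_name = "message"  # SSE default when no "event:" field is sent
            data_lines = []
            for line in raw_event.decode("utf-8").split("\n"):
                if line.startswith(":"):
                    continue  # comment line, commonly used as a keep-alive
                field, _, value = line.partition(":")
                if value.startswith(" "):
                    value = value[1:]  # one optional space follows the colon
                if field == "event":
                    event_name = value
                elif field == "data":
                    data_lines.append(value)
            if data_lines:
                yield event_name, "\n".join(data_lines)


def collect_text(chunks: Iterable[bytes]) -> str:
    """Concatenate the text deltas carried by content_block_delta events."""
    parts = []
    for event_name, data in iter_sse_events(chunks):
        if event_name == "content_block_delta":
            parts.append(json.loads(data)["text"])
    return "".join(parts)
```

In a browser the same loop would sit on top of the reader returned by the response body stream; only the source of the chunks changes.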

When WebSocket Wins

WebSocket is overkill for unidirectional token delivery. Its value emerges in hybrid control-plane patterns:

Voice interaction: streaming audio from user while receiving text tokens from the model
Real-time steering: submitting tool results mid-generation without polling
Cancellation: sending a kill signal for an in-flight generation

The production pattern is SSE for the data plane, WebSocket for the control plane [1]. Anthropic’s API, for example, sends content via SSE typed events while the client can cancel via a separate channel.

gRPC for Internal Mesh

For service-to-service streaming (orchestrator → model backend), gRPC over HTTP/2 is reported to deliver a 40–60% reduction in connection overhead and 25–35% lower streaming latency compared to REST/JSON [1]. The caveat is TCP-level head-of-line blocking: a single lost TCP packet stalls every stream multiplexed on that connection. HTTP/3 over QUIC removes this by recovering losses per stream; global adoption sits at ~35% as of late 2025 [1].

Rule of thumb: SSE for browser-facing delivery, gRPC for the internal inference mesh, WebSocket only for bidirectional control.

The Reverse Proxy Problem

This is the most common production failure mode for streaming. Many reverse proxies (NGINX, Envoy, cloud load balancers) buffer complete upstream responses by default, which silently breaks streaming [2]. Signs of this failure:

Tokens arrive in bursts instead of a steady stream
The first token is delayed by the full generation time
Connections hang or time out mid-stream

Fixes:

Disable proxy buffering on streaming routes: proxy_buffering off; in NGINX
Disable compression middleware (it buffers)
Set appropriate timeouts: a 60s default (NGINX's proxy_read_timeout, for example) is often too short for long generations
For Cloudflare, ensure the proxy is in DNS-only mode or use Argo Smart Routing with streaming-aware rules

Once HTTP 200 is sent, you cannot use status codes for errors. All errors must be delivered as stream events, and the client must distinguish network drops from server-reported errors [2].
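One way to make that distinction explicit on the client is to classify how each stream ended. The sketch below assumes the server reports failures with an event named "error" and signals normal completion with "message_stop"; both names, along with StreamOutcome, consume_stream and flaky_events, are illustrative assumptions rather than any provider's documented contract.

```python
import json
from enum import Enum
from typing import Iterable, Iterator, Tuple


class StreamOutcome(Enum):
    COMPLETED = "completed"        # terminal event received
    SERVER_ERROR = "server_error"  # server reported a failure in-stream
    NETWORK_DROP = "network_drop"  # transport died or ended with no terminal event


def consume_stream(events: Iterable[Tuple[str, str]]) -> Tuple[StreamOutcome, str]:
    """Drain (event_name, data) pairs and report how the stream ended.

    Returns the outcome together with whatever text arrived before the end,
    so the caller can decide whether to retry, resume, or show partial output.
    """
    parts = []
    try:
        for event_name, data in events:
            if event_name == "error":  # assumed name for server-reported errors
                return StreamOutcome.SERVER_ERROR, "".join(parts)
            if event_name == "message_stop":
                return StreamOutcome.COMPLETED, "".join(parts)
            if event_name == "content_block_delta":
                parts.append(json.loads(data)["text"])
    except (ConnectionError, TimeoutError):
        return StreamOutcome.NETWORK_DROP, "".join(parts)
    # The iterator ended quietly without a terminal event: treat it as a drop.
    return StreamOutcome.NETWORK_DROP, "".join(parts)


def flaky_events() -> Iterator[Tuple[str, str]]:
    """Hypothetical stream that loses its connection after one token."""
    yield "content_block_delta", '{"type":"text_delta","text":"Hi"}'
    raise ConnectionResetError("peer reset")
```

A server-reported error is usually not worth retrying immediately, while a network drop is a candidate for reconnect and resume, which is why the two are kept as separate outcomes.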

Backpressure in Streaming Inference

Backpressure in LLM inference is asymmetric: the GPU generates tokens slower than the network can deliver them, but burst arrival patterns from speculative decoding or continuous batching create micro-bursts that overwhelm downstream consumers [3].

Three Backpressure Strategies

1. Token Bucket at the Gateway: Limit token emission rate per client. This protects downstream parsers (especially JSON mode parsers that accumulate partial tokens) and prevents a single fast client from consuming disproportionate proxy connection slots.

2. Stream-Level Flow Control: With gRPC streaming, the HTTP/2 flow control window regulates how many bytes the server sends before waiting for a client WINDOW_UPDATE. Tuning this window is critical for long generations: too small and throughput drops, too large and memory balloons [1].

3. Application-Level Backpressure: For structured output (JSON mode, tool calls), the streaming layer must signal the inference engine when the client cannot keep up. This is the hardest pattern, because it requires a closed-loop signal from the consumer back to the batching scheduler. Few systems implement this; most rely on buffer-and-drop.
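The first strategy is the simplest to illustrate. The following is a minimal single-threaded sketch of a per-client token bucket, assuming one bucket per connection and an injectable clock so the behaviour can be checked without sleeping. TokenBucket and its methods are hypothetical names, not part of any gateway product.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, then a steady `rate_per_sec`."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so a new stream is not delayed
        self.last = clock()

    def _refill(self) -> None:
        # Credit the bucket for the time elapsed since the last check.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_emit(self, n: int = 1) -> bool:
        """Return True and spend n credits if n tokens may be sent now."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def seconds_until(self, n: int = 1) -> float:
        """How long the gateway should hold n tokens before sending them."""
        self._refill()
        return max(0.0, (n - self.tokens) / self.rate)
```

A gateway would call try_emit before forwarding each delta and, when it returns False, wait for seconds_until before retrying. The capacity sets how large a micro-burst passes through untouched.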

The recommended architecture is a decoupled design with an intermediate store (Redis, Kafka) between inference and delivery [2]. The inference engine writes tokens to the store, and a per-connection reader (a goroutine, in a Go service) streams them out. Because the tokens live outside any single process, any backend can resume a disconnected session.
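The resume behaviour of the decoupled design can be sketched with an in-memory stand-in for the store. TokenLog below is a hypothetical class that mimics only the append-and-read-from-offset shape of a log. It is not a Redis or Kafka client, and a real deployment would also need expiry and cross-process access.

```python
import threading
from typing import Dict, List, Tuple


class TokenLog:
    """In-memory stand-in for an append-only token store, keyed by session."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tokens: Dict[str, List[str]] = {}
        self._done: Dict[str, bool] = {}

    def append(self, session_id: str, token: str) -> int:
        """Called by the inference side; returns the offset after the token."""
        with self._lock:
            log = self._tokens.setdefault(session_id, [])
            log.append(token)
            return len(log)

    def finish(self, session_id: str) -> None:
        """Mark the generation as complete so readers know when to stop."""
        with self._lock:
            self._tokens.setdefault(session_id, [])
            self._done[session_id] = True

    def read_from(self, session_id: str, offset: int) -> Tuple[List[str], int, bool]:
        """Called by a delivery reader; returns (new_tokens, next_offset, done)."""
        with self._lock:
            new_tokens = self._tokens.get(session_id, [])[offset:]
            done = self._done.get(session_id, False)
            return new_tokens, offset + len(new_tokens), done
```

A client that reconnects sends the last offset it received, and whichever backend picks up the connection calls read_from with that offset. No backend needs to have seen the session before.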

Continuous Batching Under Streaming

vLLM’s continuous batching is the dominant production pattern for streaming inference. Instead of waiting for a full batch to form, the scheduler adds and removes sequences at every forward pass [4]. This creates a specific challenge for streaming: the interval between tokens for any one sequence is not constant, because each step's duration depends on what else is in the batch at that moment, such as prefill work for newly admitted sequences.

PagedAttention solves the memory side by dividing GPU memory into fixed-size pages (16 tokens each) that are allocated on demand as a sequence grows. This eliminates the 60–80% of KV cache memory that contiguous allocation wastes on fragmentation [4]. In the cited comparison, vLLM achieves 793 tokens/second against Ollama’s 41 tokens/second at equivalent configurations, with P99 latency of 80 ms vs 673 ms [4].
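The memory argument comes down to simple arithmetic, sketched below. The 16-token page size is from the text. The 2048-token reservation in the usage note is an illustrative assumption, and the function names are hypothetical rather than vLLM APIs.

```python
import math

PAGE_TOKENS = 16  # page size quoted in the text


def pages_needed(seq_len: int, page_tokens: int = PAGE_TOKENS) -> int:
    """Pages allocated for a sequence that currently holds seq_len tokens."""
    return math.ceil(seq_len / page_tokens)


def paged_waste(seq_len: int, page_tokens: int = PAGE_TOKENS) -> int:
    """Unused token slots under paging: only the tail of the last page."""
    return pages_needed(seq_len, page_tokens) * page_tokens - seq_len


def contiguous_waste(seq_len: int, max_len: int) -> int:
    """Unused token slots when max_len slots are reserved up front."""
    return max_len - seq_len
```

For a 300-token sequence, paging allocates 19 pages and leaves 4 slots unused, whereas an assumed up-front reservation of 2048 slots would leave 1748 unused. With paging the waste is bounded by one page per sequence regardless of how long the sequence could have grown.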

But the throughput number alone misses the architectural point: PagedAttention enables memory sharing across streaming requests. Identical prompt prefixes share KV cache pages, which means a shared system prompt in a chat application gives you 400%+ utilization improvements on standardized prompts [4].
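The effect of prefix sharing on page count can be estimated the same way. This sketch assumes that only completely filled prefix pages are shared and that each request keeps its own pages for everything after them. That is a simplification chosen for illustration, not a description of vLLM's exact bookkeeping, and both function names are hypothetical.

```python
import math
from typing import Sequence

PAGE_TOKENS = 16  # page size quoted in the text


def pages_without_sharing(prefix_len: int, suffix_lens: Sequence[int]) -> int:
    """Every request stores its own copy of the prefix plus its suffix."""
    return sum(math.ceil((prefix_len + s) / PAGE_TOKENS) for s in suffix_lens)


def pages_with_sharing(prefix_len: int, suffix_lens: Sequence[int]) -> int:
    """Full prefix pages are stored once; the rest is private per request."""
    shared_pages = prefix_len // PAGE_TOKENS
    leftover = prefix_len % PAGE_TOKENS  # partial prefix page is not shared here
    private_pages = sum(math.ceil((leftover + s) / PAGE_TOKENS) for s in suffix_lens)
    return shared_pages + private_pages
```

With a 1024-token system prompt and eight concurrent requests of 100 further tokens each, the unshared layout needs 568 pages and the shared layout 120. The saving grows with the number of concurrent requests that share the prefix.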

Streaming + Caching: The Natural Tension

Streaming emits tokens incrementally; caching needs a complete response to store. The production pattern resolves this by doing both [2]:

On cache miss: stream tokens in real time, asynchronously store the full response once the stream finishes
On cache hit: return the complete cached response instantly — no streaming needed
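A minimal sketch of the two paths follows, using a plain dict as the cache and an exact-match key. The name stream_with_cache and the generate callable are hypothetical, and the write happens inline after the last token rather than asynchronously, to keep the example short.

```python
from typing import Callable, Dict, Iterable, Iterator


def stream_with_cache(
    prompt: str,
    cache: Dict[str, str],
    generate: Callable[[str], Iterable[str]],
) -> Iterator[str]:
    """Yield a response for prompt, from the cache if possible.

    Hit: the stored response is yielded as a single piece.
    Miss: tokens are yielded as they are produced, and the joined response
    is stored only after the generator has been fully consumed.
    """
    cached = cache.get(prompt)
    if cached is not None:
        yield cached
        return

    parts = []
    for token in generate(prompt):
        parts.append(token)
        yield token  # the client sees each token immediately

    # Reached only when the stream ran to completion, so an aborted or
    # disconnected stream never leaves a truncated response in the cache.
    cache[prompt] = "".join(parts)
```

Writing only on completion is the important design choice: if the consumer stops iterating early, the generator is closed at the yield and the cache write is skipped.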

Semantic caching (e.g., Redis LangCache) extends this to meaning-based matching: queries are converted to vector embeddings and compared against the embeddings of cached queries, so a paraphrased question can still produce a hit. Benchmarks show cache hits are up to 15× faster and reduce LLM inference costs by up to 73% [2]. Serving-engine changes can yield savings of a similar size: Stripe reportedly cut inference costs by 73% after migrating to vLLM, processing 50M daily API calls on one third of the GPU fleet [4].
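The matching step can be sketched as a nearest-neighbour lookup with a similarity threshold. This is an illustration of the idea only, not the LangCache API: SemanticCache is a hypothetical class, the 0.9 threshold is an arbitrary assumption, embeddings are supplied by the caller, and the linear scan stands in for a real vector index.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored response when a query embedding is close enough."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed value; tune per application
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_response = -1.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        # Below the threshold the query counts as a miss and goes to the model.
        return best_response if best_score >= self.threshold else None
```

The threshold is the main tuning knob: set too low, different questions receive the same cached answer; set too high, paraphrases miss and the cache adds lookup cost without savings.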

Connection Management at Scale

Each streaming client holds an open HTTP connection for the duration of generation. At 10,000 concurrent users generating 30-second responses, that’s 10,000 open connections. The problems:

State accumulation: every open connection pins state for its whole lifetime, including a send buffer on the proxy and the sequence's KV cache pages on the GPU
Reconnection: a reconnecting client may land on a different backend instance and lose its session
Connection draining: rolling updates must wait for in-flight streams to complete or be gracefully migrated

Recommended patterns:

Decoupled architecture with an intermediate store — write tokens to Redis as they arrive; any backend can serve any client [2]
Kubernetes HPA with custom metrics — horizontal pod autoscaling based on queue depth and P99 latency, not CPU [4]
Graceful shutdown — reverse proxy should drain connections before terminating backend pods
Session affinity via prefix routing — route requests with identical prompt prefixes to the same backend to maximize KV cache reuse [4]
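Prefix routing, the last pattern in the list, can be approximated by hashing the start of the prompt. The sketch below assumes the prefix is measured in characters rather than tokens and that the backend list is stable. The name route_by_prefix and the 256-character default are hypothetical choices.

```python
import hashlib
from typing import Sequence


def route_by_prefix(prompt: str, backends: Sequence[str], prefix_chars: int = 256) -> str:
    """Pick a backend from a hash of the first prefix_chars of the prompt.

    Requests that share a system prompt at least prefix_chars long hash to
    the same backend, where that prefix's KV cache pages are likely warm.
    """
    if not backends:
        raise ValueError("at least one backend is required")
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    # The first 8 bytes of the digest give a stable, well-spread integer.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

Plain modulo hashing remaps most prefixes whenever the backend count changes, so a deployment that scales frequently would likely want consistent hashing instead. It also ignores load, so a very popular prefix can overload one backend.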

The Hybrid Architecture: What Production Looks Like

Pulling it together, a production streaming inference system has four layers:

Layer	Component	Responsibility
Gateway	Envoy / NGINX + LiteLLM	TLS termination, SSE passthrough (buffering off), request- and token-based rate limiting
Orchestration	Kubernetes + KEDA	Autoscaling, rolling deployments, health checks
Inference	vLLM / SGLang	Continuous batching, PagedAttention, prefix caching
State	Redis / LMCache	Cross-instance KV cache sharing, streaming buffer, session state

Each layer has clear interfaces: the gateway speaks HTTP/SSE to clients and gRPC to the orchestration layer; the orchestration layer routes to inference engines via HTTP/2 streaming; the state layer is accessed by both orchestration and inference [1][3][5].

Disaggregated serving (separating prefill and decode into different GPU pools) is the next frontier — NVIDIA Dynamo + Blackwell delivers up to 30× throughput improvement over Hopper-era architectures [5].

Summary of Architectural Decisions

Decision	Recommendation	Why
Client-facing transport	SSE	Universal, works through proxies, no infra changes
Internal transport	gRPC/HTTP/2	25–35% lower latency for service-to-service
Control plane	WebSocket	Bidirectional for cancellation, tool injection
Proxy buffering	Off for streaming routes	Silently broken streaming is the most common failure mode
Backpressure	Token bucket + decoupled store	Protects consumers, enables reconnection
Caching	Streaming + async storage	Cache hits bypass model; misses stream real-time
GPU memory	PagedAttention	Avoids the 60–80% KV cache waste of contiguous allocation
Scaling	Queue-depth HPA, not CPU	CPU doesn’t correlate with in-flight generation

Sources

[1] Zylos Research, “LLM Output Streaming and Real-Time Token Delivery Architectures,” March 2026. https://zylos.ai/research/2026-03-28-llm-output-streaming-token-delivery-architectures/

[2] Redis Blog (Jim Allen Wallace), “Streaming LLM Responses: Make Your AI App Feel Fast,” April 2026. https://redis.io/blog/streaming-llm-responses/

[3] RunPod, “AI Model Serving Architecture: Building Scalable Inference APIs for Production Applications,” 2026. https://www.runpod.io/articles/guides/ai-model-serving-architecture-building-scalable-inference-apis-for-production-applications

[4] Introl Blog, “vLLM Production Deployment,” 2026. https://introl.com/blog/vllm-production-deployment-inference-serving-architecture

[5] NVIDIA, “Engineering Real-World LLM Inference: Bridging Open-Source and Production Systems,” GTC 2026. https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s82026/
