Streaming Architecture for Large-Scale LLM Inference
A deep dive into production streaming patterns for LLM inference: SSE vs WebSocket vs gRPC, backpressure strategies, reverse proxy pitfalls, and the architectures that keep token delivery fast at scale.
Streaming is the standard delivery mechanism for every major LLM provider: Anthropic, OpenAI, and Google all stream tokens when stream: true is set. The protocol choice (SSE) is the easy part. The hard problems live in the surrounding infrastructure: reverse proxy buffering, backpressure under load, connection management at scale, and coordinating streaming with caching.
This post covers the architectural patterns that make streaming inference work in production, with concrete tradeoffs drawn from deployed systems.
The Transport Decision: SSE vs WebSocket vs gRPC
Every major provider converges on Server-Sent Events (SSE) for browser-facing token delivery [1]. The wire format is minimal:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

event: content_block_delta
data: {"type":"text_delta","text":"Hello"}

event: content_block_delta
data: {"type":"text_delta","text":" world"}

event: message_stop
data: {"type":"message_stop"}
SSE works through any HTTP-compliant proxy, reconnects natively, and requires no special infrastructure. But it has one critical limitation: the browser EventSource API only supports GET requests with no custom headers. For POST-based LLM APIs, you must use the fetch() API with ReadableStream instead [2].
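Without EventSource, the client has to split the byte stream into events itself. The sketch below shows roughly what that involves, in Python for illustration. The helper name parse_sse is hypothetical, and it handles only the event: and data: fields plus comment lines, not id: or retry:.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from already-decoded SSE lines.

    A blank line terminates an event; multiple data: lines are joined
    with newlines. Only event: and data: fields are handled here.
    """
    event = "message"  # default event type when no event: field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # dispatch only events that carried data
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One optional leading space after the colon is not part of the value.
            data.append(value[1:] if value.startswith(" ") else value)


stream = [
    "event: content_block_delta",
    'data: {"type":"text_delta","text":"Hello"}',
    "",
    "event: message_stop",
    'data: {"type":"message_stop"}',
    "",
]
events = list(parse_sse(stream))
```

In a real client the lines would come from incrementally decoding the response body, so a chunk boundary can fall in the middle of a line. Buffering until a newline arrives is left out of this sketch.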
When WebSocket Wins
WebSocket is overkill for unidirectional token delivery. Its value emerges in hybrid control-plane patterns:
- Voice interaction: streaming audio from user while receiving text tokens from the model
- Real-time steering: submitting tool results mid-generation without polling
- Cancellation: sending a kill signal for an in-flight generation
The production pattern is SSE for the data plane, WebSocket for the control plane [1]. Anthropic’s API, for example, sends content via SSE typed events while the client can cancel via a separate channel.
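A minimal sketch of the data-plane/control-plane split, using asyncio in place of real SSE and WebSocket connections: the generator stands in for the token stream, and an asyncio.Event stands in for a cancel message arriving on the control channel. The names generate and demo are hypothetical and no network I/O is involved.

```python
import asyncio
from typing import AsyncIterator, List


async def generate(tokens: List[str], cancel: asyncio.Event) -> AsyncIterator[str]:
    """Data plane: emit tokens until done or until cancel is set."""
    for tok in tokens:
        if cancel.is_set():
            return  # stop generating; a real engine would also free its slot
        yield tok
        await asyncio.sleep(0)  # give the control plane a chance to run


async def demo() -> List[str]:
    cancel = asyncio.Event()  # stands in for the control channel
    received: List[str] = []
    async for tok in generate(["a", "b", "c", "d"], cancel):
        received.append(tok)
        if len(received) == 2:
            cancel.set()  # the client asks to stop after two tokens
    return received


result = asyncio.run(demo())
```

The cancel check runs between tokens, so cancellation takes effect at the next token boundary rather than instantly.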
gRPC for Internal Mesh
For service-to-service streaming (orchestrator → model backend), gRPC streaming over HTTP/2 delivers a 40–60% reduction in connection overhead and 25–35% lower streaming latency compared to REST/JSON [1]. The caveat is TCP head-of-line blocking: a single lost packet stalls every stream multiplexed on that connection. HTTP/3 over QUIC removes this limitation, and global adoption sits at ~35% as of late 2025 [1].
Rule of thumb: SSE for browser-facing delivery, gRPC for the internal inference mesh, WebSocket only for bidirectional control.
The Reverse Proxy Problem
This is the most common production failure mode for streaming. Many reverse proxies (NGINX, Envoy, cloud load balancers) buffer complete upstream responses by default, which silently breaks streaming [2]. Signs of this failure:
- Tokens arrive in bursts instead of a steady stream
- The first token is delayed by the full generation time
- Connections hang or timeout mid-stream
Fixes:
- Disable proxy buffering on streaming routes: proxy_buffering off; in NGINX
- Disable compression middleware (it buffers)
- Set appropriate timeouts — default 60s is often too short for long generations
- For Cloudflare, ensure the proxy is in DNS-only mode or use Argo Smart Routing with streaming-aware rules
Once HTTP 200 is sent, you cannot use status codes for errors. All errors must be delivered as stream events, and the client must distinguish network drops from server-reported errors [2].
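One way a client might separate the three outcomes is sketched below. The event names "error" and "message_stop" and the use of ConnectionError as the transport failure are assumptions for illustration, and payloads are treated as plain text rather than JSON to keep the example short.

```python
from typing import Iterable, Tuple


def classify_stream(events: Iterable[Tuple[str, str]]) -> Tuple[str, str]:
    """Return (outcome, text_so_far) for a stream of (event_name, payload).

    outcome is "completed", "server_error", or "network_drop".
    """
    chunks = []
    try:
        for name, payload in events:
            if name == "error":
                # The server reported a failure inside the stream.
                return "server_error", "".join(chunks)
            if name == "message_stop":
                return "completed", "".join(chunks)
            if name == "content_block_delta":
                chunks.append(payload)
    except ConnectionError:
        # The transport failed mid-stream.
        return "network_drop", "".join(chunks)
    # The stream ended without a terminal event: also treated as a drop.
    return "network_drop", "".join(chunks)


def dropped_stream():
    yield ("content_block_delta", "Hi")
    raise ConnectionError("connection reset")
```

The point of the design is that a clean end-of-stream is not enough to declare success: only an explicit terminal event is. A retry policy can then treat "network_drop" as resumable and "server_error" as something to surface to the user.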
Backpressure in Streaming Inference
Backpressure in LLM inference is asymmetric: the GPU generates tokens slower than the network can deliver them, but burst arrival patterns from speculative decoding or continuous batching create micro-bursts that overwhelm downstream consumers [3].
Three Backpressure Strategies
1. Token Bucket at the Gateway: Limit token emission rate per client. This protects downstream parsers (especially JSON mode parsers that accumulate partial tokens) and prevents a single fast client from consuming disproportionate proxy connection slots.
2. Stream-Level Flow Control: With gRPC streaming, the HTTP/2 flow control window regulates how many bytes the server sends before waiting for a client WINDOW_UPDATE. Tuning this window is critical for long generations: too small and throughput drops, too large and memory balloons [1].
3. Application-Level Backpressure: For structured output (JSON mode, tool calls), the streaming layer must signal the inference engine when the client cannot keep up. This is the hardest pattern: it requires a closed-loop signal from the consumer back to the batching scheduler. Few systems implement this; most rely on buffer-and-drop.
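The first strategy is the easiest to make concrete. Below is a minimal token bucket sketch. The class name, the rate and capacity parameters, and the injectable clock are illustrative choices, not taken from any particular gateway.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` units per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = capacity  # start full
        self.last = clock()

    def try_consume(self, n: float = 1.0) -> bool:
        """Take n units if available; return False when the caller must wait."""
        now = self.clock()
        # Refill for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


fake_now = [0.0]
bucket = TokenBucket(rate=10, capacity=5, clock=lambda: fake_now[0])
burst = [bucket.try_consume() for _ in range(6)]  # sixth call exceeds the burst
```

A gateway would keep one bucket per client and, when try_consume returns False, hold the token in a buffer rather than drop it. How large that buffer may grow is the real policy decision.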
The recommended architecture is a decoupled design with an intermediate store (Redis, Kafka) between inference and delivery [2]. The inference engine writes tokens to the store, and a reader goroutine per connection streams them out. This allows any backend to resume a disconnected session.
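The resume behaviour can be sketched with an in-memory stand-in for the store. TokenLog and its methods are hypothetical names for this example. They are not the Redis or Kafka client API, though an append-only stream in either system plays the same role.

```python
from typing import Dict, List, Tuple


class TokenLog:
    """In-memory stand-in for an append-only token store keyed by session."""

    def __init__(self) -> None:
        self._streams: Dict[str, List[str]] = {}

    def append(self, session_id: str, token: str) -> int:
        """Writer side: the inference engine appends a token, gets its offset."""
        stream = self._streams.setdefault(session_id, [])
        stream.append(token)
        return len(stream) - 1

    def read_from(self, session_id: str, offset: int) -> List[Tuple[int, str]]:
        """Reader side: return (offset, token) pairs at or after `offset`."""
        stream = self._streams.get(session_id, [])
        return [(i, tok) for i, tok in enumerate(stream) if i >= offset]


log = TokenLog()
for tok in ["Hel", "lo", " world"]:
    log.append("session-1", tok)

# The client received two tokens, then its connection dropped.
delivered = log.read_from("session-1", 0)[:2]
next_offset = delivered[-1][0] + 1

# Any backend holding the same log can resume from the last delivered offset.
resumed = log.read_from("session-1", next_offset)
```

Because the reader only needs a session id and an offset, the reconnecting client does not have to land on the instance that started the generation.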
Continuous Batching Under Streaming
vLLM’s continuous batching is the dominant production pattern for streaming inference. Instead of waiting for a full batch to form, the scheduler adds and removes sequences at every forward pass [4]. This creates a specific challenge for streaming: each sequence in the batch generates tokens at different rates depending on position in the KV cache and the model’s internal attention patterns.
PagedAttention solves the memory side — dividing GPU memory into fixed-size pages (16 tokens each) with dynamic allocation that grows only as the sequence grows. This eliminates the 60–80% KV cache fragmentation waste of contiguous allocation [4]. vLLM achieves 793 tokens/second vs Ollama’s 41 TPS at equivalent configurations, with P99 latency of 80ms vs 673ms [4].
But the throughput number alone misses the architectural point: PagedAttention enables memory sharing across streaming requests. Identical prompt prefixes share KV cache pages, which means a shared system prompt in a chat application gives you 400%+ utilization improvements on standardized prompts [4].
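The arithmetic behind paging and prefix sharing is easy to check by hand. The sketch below uses the 16-token page size from the text. The prompt and response lengths are made-up illustration values, and it assumes for simplicity that only whole pages of a common prefix are shared.

```python
import math
from typing import List

PAGE_TOKENS = 16  # page size quoted in the text


def pages_needed(num_tokens: int) -> int:
    """Pages required to hold a sequence of the given length."""
    return math.ceil(num_tokens / PAGE_TOKENS)


def contiguous_waste(num_tokens: int, reserved_tokens: int) -> float:
    """Fraction of a max-length contiguous reservation that goes unused."""
    return 1 - num_tokens / reserved_tokens


def pages_with_shared_prefix(prefix_tokens: int, suffix_tokens: List[int]) -> int:
    """Total pages when full pages of a common prefix are stored once.

    Simplifying assumption: a partially filled last prefix page is not
    shared and is counted again for every request.
    """
    shared = prefix_tokens // PAGE_TOKENS
    leftover = prefix_tokens % PAGE_TOKENS
    private = sum(pages_needed(leftover + s) for s in suffix_tokens)
    return shared + private


# Hypothetical workload: a 512-token system prompt, four 100-token replies.
unshared = 4 * pages_needed(512 + 100)
shared = pages_with_shared_prefix(512, [100, 100, 100, 100])
waste = contiguous_waste(512 + 100, reserved_tokens=2048)
```

With these invented numbers the four requests need 156 pages without sharing and 60 with it, and a 2,048-token contiguous reservation would sit about 70% empty. The figures only show the mechanism; they are not a benchmark.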
Streaming + Caching: The Natural Tension
Streaming emits tokens incrementally; caching needs a complete response to store. The production pattern resolves this by doing both [2]:
- On cache miss: stream tokens in real time, asynchronously store the full response once the stream finishes
- On cache hit: return the complete cached response instantly — no streaming needed
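The two branches fit in one generator. This is a sketch under simplifying assumptions: the cache is a plain dict with exact-match keys, the store happens synchronously at the end of the stream rather than asynchronously, and stream_with_cache and fake_generate are hypothetical names.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def stream_with_cache(
    prompt: str,
    cache: Dict[str, str],
    generate: Callable[[str], Iterable[str]],
) -> Iterator[str]:
    """Stream tokens on a miss and store the full text; replay on a hit."""
    cached = cache.get(prompt)
    if cached is not None:
        yield cached  # cache hit: the whole response in a single chunk
        return
    chunks: List[str] = []
    for token in generate(prompt):
        chunks.append(token)
        yield token  # cache miss: forward each token as it arrives
    # Reached only if the stream ran to completion, so a response cut
    # short by a disconnect is never cached.
    cache[prompt] = "".join(chunks)


calls: List[str] = []


def fake_generate(prompt: str) -> Iterable[str]:
    """Stand-in for the model call; records how often it is invoked."""
    calls.append(prompt)
    return iter(["Hel", "lo"])


cache: Dict[str, str] = {}
first = list(stream_with_cache("hi", cache, fake_generate))
second = list(stream_with_cache("hi", cache, fake_generate))
```

Placing the store after the loop matters: if the consumer closes the generator early, the final line never runs and no partial response is stored.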
Semantic caching (e.g., Redis LangCache) extends this to meaning-based matching. Queries are converted to vector embeddings and compared against cached queries. Benchmarks show cache hits are up to 15× faster and reduce LLM inference costs by up to 73% [2]. The same 73% figure appears in a different context: Stripe reportedly cut inference costs by 73% after migrating to vLLM, processing 50M daily API calls on 1/3 the GPU fleet [4]. That result reflects the serving engine rather than caching, so the two numbers should not be read as the same effect.
Connection Management at Scale
Each streaming client holds an open HTTP connection for the duration of generation. At 10,000 concurrent users generating 30-second responses, that’s 10,000 open connections. The problems:
- State accumulation: each connection holds a buffer position in the KV cache
- Reconnection: a reconnecting client may land on a different backend instance and lose its session
- Connection draining: rolling updates must wait for in-flight streams to complete or be gracefully migrated
Recommended patterns:
- Decoupled architecture with an intermediate store — write tokens to Redis as they arrive; any backend can serve any client [2]
- Kubernetes HPA with custom metrics — horizontal pod autoscaling based on queue depth and P99 latency, not CPU [4]
- Graceful shutdown — reverse proxy should drain connections before terminating backend pods
- Session affinity via prefix routing — route requests with identical prompt prefixes to the same backend to maximize KV cache reuse [4]
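Prefix routing, the last pattern in the list, can be sketched as a hash of the leading characters of the prompt. The function name, the 256-character window, and the backend names are assumptions for illustration. A production router would more likely hash token ids and use consistent hashing.

```python
import hashlib
from typing import List


def route_by_prefix(prompt: str, backends: List[str], prefix_chars: int = 256) -> str:
    """Pick a backend from a hash of the prompt's leading characters.

    Requests that share the first `prefix_chars` characters (for example a
    common system prompt) always map to the same backend.
    """
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["vllm-0", "vllm-1", "vllm-2"]  # hypothetical backend names
system_prompt = "You are a helpful assistant. " * 20  # longer than the window
a = route_by_prefix(system_prompt + "What is SSE?", backends)
b = route_by_prefix(system_prompt + "Explain backpressure.", backends)
```

Plain modulo hashing remaps most prefixes whenever the backend count changes, which throws away warm KV cache during scale-up. That is the main reason to prefer consistent hashing once autoscaling is involved.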
The Hybrid Architecture: What Production Looks Like
Pulling it together, a production streaming inference system has four layers:
| Layer | Component | Responsibility |
|---|---|---|
| Gateway | Envoy / NGINX + LiteLLM | TLS termination, SSE passthrough (buffering off), request- and token-based rate limiting |
| Orchestration | Kubernetes + KEDA | Autoscaling, rolling deployments, health checks |
| Inference | vLLM / SGLang | Continuous batching, PagedAttention, prefix caching |
| State | Redis / LMCache | Cross-instance KV cache sharing, streaming buffer, session state |
Each layer has clear interfaces: the gateway speaks HTTP/SSE to clients and gRPC to the orchestration layer; the orchestration layer routes to inference engines via HTTP/2 streaming; the state layer is accessed by both orchestration and inference [1][3][5].
Disaggregated serving (separating prefill and decode into different GPU pools) is the next frontier — NVIDIA Dynamo + Blackwell delivers up to 30× throughput improvement over Hopper-era architectures [5].
Summary of Architectural Decisions
| Decision | Recommendation | Why |
|---|---|---|
| Client-facing transport | SSE | Universal, works through proxies, no infra changes |
| Internal transport | gRPC/HTTP/2 | 25–35% lower latency for service-to-service |
| Control plane | WebSocket | Bidirectional for cancellation, tool injection |
| Proxy buffering | Off for streaming routes | Silently broken streaming is the most common failure mode |
| Backpressure | Token bucket + decoupled store | Protects consumers, enables reconnection |
| Caching | Streaming + async storage | Cache hits bypass model; misses stream real-time |
| GPU memory | PagedAttention | 60–80% less KV cache waste |
| Scaling | Queue-depth HPA, not CPU | CPU doesn’t correlate with in-flight generation |
Sources
[1] Zylos Research, “LLM Output Streaming and Real-Time Token Delivery Architectures,” March 2026. https://zylos.ai/research/2026-03-28-llm-output-streaming-token-delivery-architectures/
[2] Redis Blog (Jim Allen Wallace), “Streaming LLM Responses: Make Your AI App Feel Fast,” April 2026. https://redis.io/blog/streaming-llm-responses/
[3] RunPod, “AI Model Serving Architecture: Building Scalable Inference APIs for Production Applications,” 2026. https://www.runpod.io/articles/guides/ai-model-serving-architecture-building-scalable-inference-apis-for-production-applications
[4] Introl Blog, “vLLM Production Deployment,” 2026. https://introl.com/blog/vllm-production-deployment-inference-serving-architecture
[5] NVIDIA, “Engineering Real-World LLM Inference: Bridging Open-Source and Production Systems,” GTC 2026. https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s82026/