LLM Serving Benchmark: vLLM vs SGLang — Throughput, Latency, and Architecture Tradeoffs

The choice of LLM serving backend is one of the highest-leverage infrastructure decisions a production AI team makes. The wrong choice means paying 3–10x more per token in user-facing latency, or leaving throughput on the table [1]. But benchmark data is fragmented, vendor-driven, and rarely reproduced.

This post presents an empirical comparison of vLLM and SGLang — the two dominant open-source LLM serving frameworks — on production-relevant metrics: time-to-first-token (TTFT), inter-token latency (ITL), end-to-end latency, and throughput under both online and offline workloads.

Methodology

All benchmark results reproduced here are derived from the SGLang project’s published comparison (v0.3.0 vs vLLM v0.6.0) [1], augmented with reproduction notes from the vLLM project’s own benchmarking infrastructure [2] and the MLPerf Inference reference methodology [3].

Test configuration:

Model A: Meta Llama 3.1 8B Instruct — single NVIDIA A100 80G
Model B: Meta Llama 3.1 70B Instruct — 4 × NVIDIA H100 80G (tensor parallel)
Dataset: ShareGPT (real user conversations, 4096 max model length)
Workload: Online (bounded request rate) + Offline (max throughput, unlimited concurrency)
vLLM flags: --num-scheduler-steps 10, default gpu_memory_utilization (0.9)
SGLang flags: --disable-radix-cache, --enable-torch-compile (8B only)
Metrics: Median TTFT, Median ITL, Median TPOT, Median E2E Latency, Output Token Throughput

Both servers were started with OpenAI-compatible API endpoints and benchmarked using the same client harness (sglang.bench_serving) with identical request payloads.

Online Serving: Latency at Production Load

The online scenario is the one that matters most for user-facing chat applications. Requests arrive at a fixed rate (RPS = 4 or 8), and the server must schedule them onto GPU time without stalling.

Llama 3.1 8B — Single A100 80G

Metric	SGLang (RPS=4)	vLLM (RPS=4)	Factor	SGLang (RPS=8)	vLLM (RPS=8)	Factor
Median E2E Latency (ms)	1564.17	1691.97	1.08×	2175.02	2137.16	0.98×
Median TTFT (ms)	31.98	100.48	3.1×	35.68	120.39	3.4×
Median TPOT (ms)	13.17	14.14	1.07×	17.85	17.09	0.96×
Median ITL (ms)	11.93	129.32	10.8×	14.41	158.63	11.0×

Llama 3.1 70B — 4 × H100 80G

Metric	SGLang (RPS=4)	vLLM (RPS=4)	Factor	SGLang (RPS=8)	vLLM (RPS=8)	Factor
Median E2E Latency (ms)	3005.24	2915.60	0.97×	4064.98	3752.38	0.92×
Median TTFT (ms)	53.94	179.15	3.3×	58.11	207.12	3.6×
Median TPOT (ms)	25.03	23.58	0.94×	33.07	29.15	0.88×
Median ITL (ms)	21.67	231.23	10.7×	24.45	275.32	11.3×

Key finding: SGLang consistently delivers 3× lower TTFT and 10–11× lower ITL than vLLM under identical online workloads. End-to-end latencies are similar (within 10%) because both frameworks spend roughly the same time on the generation phase — the gap is entirely in scheduling and iteration overhead. [1]

The ITL gap (10–11×) is the most striking. Median inter-token latency measures how long the engine takes between producing one token and the next within a single request’s generation loop. vLLM’s scheduler and CUDA graph recompilation overhead add nearly 10× per-token latency in this metric [1].

Offline Serving: Maximum Throughput

The offline scenario removes request rate limits and maximizes GPU utilization. This is relevant for batch inference pipelines, evaluation suites, and offline data processing.

Model	Engine	Request Throughput (req/s)	Output Token Throughput (tok/s)	Advantage
Llama 3.1 8B	SGLang	22.03	4281.51	—
Llama 3.1 8B	vLLM	21.27	4132.37	SGLang +3.6%
Llama 3.1 70B	SGLang	19.84	3856.01	—
Llama 3.1 70B	vLLM	19.04	3700.64	SGLang +4.2%

Under maximum throughput, the engines converge. SGLang holds a 3–4% advantage in both request and token throughput, but this gap is within measurement noise for most production environments [1]. When latency isn’t a concern, either engine delivers comparable peak throughput.

Architectural Roots of the Latency Gap

The 3–10× latency difference isn’t an implementation bug — it’s baked into each framework’s scheduling architecture.

vLLM’s Scheduler Design

vLLM uses a centralized scheduler with a CUDA graph replay mechanism. On each iteration:

The scheduler computes the next batch by inspecting all waiting requests
It replays the appropriate CUDA graphs for prefill and decode
CUDA graph replay incurs ~10–15µs per graph launch overhead [4]

The --num-scheduler-steps flag (set to 10 in these benchmarks) batches multiple scheduling steps into a single graph replay, but the default implementation still pays per-graph overhead on every iteration. At small batch sizes — common in online serving — this overhead dominates per-token latency.

vLLM v1 architecture (currently in preview) replaces this with a continuous batching design that eliminates per-iteration graph recompilation. Early benchmarks show ITL reductions of 3–4× over v0.6.x [5], but v1 is not yet stable enough for production deployment.

SGLang’s Zero-Overhead Scheduler

SGLang’s scheduler avoids CUDA graph replay overhead by using a radix attention prefix cache and a zero-overhead iteration loop [6]. Key architectural differences:

Radix-based prefix caching: Incoming prompts are matched against cached prefixes, skipping prefill computation for shared segments. Even with --disable-radix-cache in these benchmarks, the underlying iteration loop avoids scheduler stalls.
FlashInfer-backed kernels: SGLang uses FlashInfer for attention, which supports dynamic batch sizes without CUDA graph recompilation within a running batch [6].
Torch.compile integration: For smaller models (8B), torch.compile JIT-compiles the model forward pass, reducing Python-to-CUDA bridge overhead.

Speculative Decoding and Disaggregated Serving

Both frameworks support speculative decoding, but with different performance profiles:

SGLang integrates speculative decoding via EAGLE draft models with adaptive speculative step counts [1]. In benchmark tests, adaptive speculation yields 1.5–2× throughput improvements without degrading acceptance rates [1].
vLLM supports speculative decoding via draft models and n-gram speculation. Recent patches (PR #45100) fix race conditions in async speculative decoding that could cause acceptance count mismatches [7].

Disaggregated prefill/decode — where separate GPU pools handle prompt processing and token generation — is an active area for both projects. SGLang ships it in production; vLLM’s implementation is available as an experimental feature.

Caveats and Methodology Considerations

Benchmark provenance

These results were published by the SGLang team as part of their v0.3.0 release comparison [1]. While the vLLM team has not independently verified these numbers, the reproduction instructions are fully documented. The benchmark script (sglang.bench_serving) is shared code — both engines are tested through the same client harness, reducing measurement bias.

Configuration gaps

Three variables significantly affect the outcome:

GPU memory utilization — vLLM defaulted to 0.9, SGLang to 0.85–0.88. Higher memory utilization can increase TTFT (less room for batching). vLLM was adjusted to 0.88 in offline benchmarks for fairness, but online tests used defaults.
Multi-step scheduling — vLLM’s --num-scheduler-steps 10 was intended to close the gap. It didn’t, suggesting the bottleneck is elsewhere (CUDA graph replay, not scheduler overhead).
Torch.compile — SGLang enabled torch.compile for the 8B model. This is a free win for small models but doesn’t scale to 70B+ due to compilation time.

What these numbers don’t tell you

Hardware portability — These benchmarks ran on NVIDIA A100/H100. Performance on AMD MI300X, Intel Gaudi, or lower-tier GPUs may diverge significantly.
Stability under burst — RPS=4 and RPS=8 are moderate loads. Neither framework has published P99 latency curves under extreme load (RPS > model’s max concurrency).
Cost per token — Throughput-per-dollar depends on GPU utilization, batch packing efficiency, and memory overhead. Neither framework publishes this metric.
Model support breadth — vLLM supports more model architectures out of the box. If you’re serving a niche model, vLLM may be the only option.

Recommendations

Workload pattern	Recommended engine	Rationale
User-facing chat (latency-sensitive)	SGLang	3–10× lower TTFT and ITL
Batch inference (throughput-bound)	Either (vLLM for model support)	Within 4%, choose based on model availability
Mixed online/offline	SGLang	Single engine for both workloads
Niche/experimental models	vLLM	Broader model support matrix
Heavy speculative decoding	SGLang	EAGLE integration with adaptive steps

If you’re running a latency-sensitive online service — particularly with models under 70B — SGLang’s architectural advantages translate directly to user-facing responsiveness. For batch-only workloads or environments where vLLM’s model support is decisive, the throughput gap is negligible.

The real takeaway: LLM serving is still a young field. The 10× ITL gap between these two engines shows how much performance is left on the table by suboptimal scheduling — and how quickly the landscape can shift. Both projects are actively converging toward disaggregated architectures, and today’s winner could be tomorrow’s baseline.

How to Choose: Decision Framework

Follow these steps to pick the right engine for your workload:

Profile your latency requirements: If P99 TTFT under 100ms matters, benchmark SGLang first — its 3–10× ITL advantage is decisive for user-facing chat
Check model support: If you’re serving a niche architecture (MoE variants, custom models), verify vLLM supports it before committing — vLLM has broader out-of-the-box coverage
Run your own reproduction benchmark: Don’t trust vendor numbers — deploy both engines on your hardware, test with your actual prompt distributions using sglang.bench_serving or vllm bench
Evaluate operational complexity: SGLang requires more tuning (radix cache, torch.compile flags); vLLM is more plug-and-play for standard deployments
Plan for migration: Both frameworks support OpenAI-compatible API endpoints — abstract your client layer so you can switch engines without code changes

References

[1] SGLang Project. “Benchmark Results: SGLang v0.3.0 vs vLLM v0.6.0.” GitHub, sgl-project/sglang/benchmark/benchmark_vllm_060. https://github.com/sgl-project/sglang/tree/main/benchmark/benchmark_vllm_060

[2] vLLM Project. “Benchmark CLI Documentation.” vLLM Docs. https://docs.vllm.ai/en/latest/benchmarking/cli/

[3] Reddi, V.J. et al. “MLPerf Inference Benchmark.” arXiv:1911.02549, ISCA 2020. https://arxiv.org/abs/1911.02549

[4] Kwon, W. et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. https://dl.acm.org/doi/10.1145/3600006.3613165

[5] vLLM Project. “v1 Engine Architecture.” GitHub, vllm-project/vllm. https://github.com/vllm-project/vllm/issues/39749

[6] Zheng, L. et al. “SGLang: Efficient Execution of Structured Language Model Programs.” arXiv:2312.07104. https://arxiv.org/abs/2312.07104

[7] vLLM Project. “PR #45100: Avoid racy accepted counts in async spec decode.” GitHub. https://github.com/vllm-project/vllm/pull/45100

References

[1] (citation needed)
[2] (citation needed)
[3] (citation needed)
[4] (citation needed)
[5] (citation needed)
[6] (citation needed)
[7] (citation needed)