LLM Serving Benchmark: vLLM vs SGLang — Throughput, Latency, and Architecture Tradeoffs
Empirical comparison of vLLM and SGLang on production serving metrics: TTFT, ITL, throughput, and the architectural decisions that drive 3–10x latency differences. Full methodology disclosed.
The choice of LLM serving backend is one of the highest-leverage infrastructure decisions a production AI team makes. The wrong choice can mean 3–10× higher user-facing latency per token, or throughput left on the table [1]. Yet benchmark data is fragmented, vendor-driven, and rarely reproduced.
This post presents an empirical comparison of vLLM and SGLang — the two dominant open-source LLM serving frameworks — on production-relevant metrics: time-to-first-token (TTFT), inter-token latency (ITL), end-to-end latency, and throughput under both online and offline workloads.
Methodology
All benchmark results reproduced here are derived from the SGLang project’s published comparison (v0.3.0 vs vLLM v0.6.0) [1], augmented with reproduction notes from the vLLM project’s own benchmarking infrastructure [2] and the MLPerf Inference reference methodology [3].
Test configuration:
- Model A: Meta Llama 3.1 8B Instruct — single NVIDIA A100 80G
- Model B: Meta Llama 3.1 70B Instruct — 4 × NVIDIA H100 80G (tensor parallel)
- Dataset: ShareGPT (real user conversations, 4096 max model length)
- Workload: Online (bounded request rate) + Offline (max throughput, unlimited concurrency)
- vLLM flags: --num-scheduler-steps 10, default gpu_memory_utilization (0.9)
- SGLang flags: --disable-radix-cache, --enable-torch-compile (8B only)
- Metrics: Median TTFT, Median ITL, Median TPOT, Median E2E Latency, Output Token Throughput
Both servers were started with OpenAI-compatible API endpoints and benchmarked using the same client harness (sglang.bench_serving) with identical request payloads.
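To make the four latency metrics concrete, here is a minimal sketch of how a client could derive them from its own timestamps. The function and field names are hypothetical, and the TPOT formula (post-first-token time divided by the remaining token count) is a common convention assumed here, not something quoted from the harness.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestMetrics:
    ttft_ms: float  # time to first token
    itl_ms: list    # gaps between consecutive streamed chunks
    tpot_ms: float  # mean time per output token after the first
    e2e_ms: float   # request sent to last token received


def compute_metrics(send_ms, chunk_arrivals_ms, output_tokens):
    """Derive per-request latency metrics from client-side timestamps.

    send_ms: when the request was sent
    chunk_arrivals_ms: arrival time of every streamed chunk, in order
    output_tokens: total number of generated tokens
    """
    if not chunk_arrivals_ms:
        raise ValueError("need at least one streamed chunk")
    ttft = chunk_arrivals_ms[0] - send_ms
    e2e = chunk_arrivals_ms[-1] - send_ms
    # One ITL sample per gap between consecutive chunks.
    itl = [b - a for a, b in zip(chunk_arrivals_ms, chunk_arrivals_ms[1:])]
    # Assumed convention: spread the post-first-token time over the other tokens.
    tpot = (e2e - ttft) / (output_tokens - 1) if output_tokens > 1 else 0.0
    return RequestMetrics(ttft, itl, tpot, e2e)


m = compute_metrics(send_ms=0.0,
                    chunk_arrivals_ms=[30.0, 42.0, 54.0, 66.0],
                    output_tokens=4)
print(m.ttft_ms, median(m.itl_ms), m.tpot_ms, m.e2e_ms)
```

Note that ITL is sampled per streamed chunk while TPOT is an average over the whole response, so the two can diverge when a server delivers several tokens per chunk.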
Online Serving: Latency at Production Load
The online scenario is the one that matters most for user-facing chat applications. Requests arrive at a fixed rate (RPS = 4 or 8), and the server must schedule them onto GPU time without stalling.
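A bounded request rate means the client sends on its own schedule regardless of how quickly the server responds (an open-loop load generator). The sketch below builds such a schedule. The function name is hypothetical, and whether the harness spaces requests evenly or draws Poisson arrivals is not stated here, so both options are shown as assumptions.

```python
import random


def arrival_offsets(num_requests, rps, poisson=False, seed=0):
    """Return send-time offsets in seconds for an open-loop load generator."""
    if rps <= 0:
        raise ValueError("rps must be positive")
    rng = random.Random(seed)
    offsets, t = [], 0.0
    for _ in range(num_requests):
        offsets.append(t)
        # Constant spacing gives exactly `rps` requests per second;
        # exponential gaps give the same mean rate with Poisson arrivals.
        t += rng.expovariate(rps) if poisson else 1.0 / rps
    return offsets


fixed = arrival_offsets(5, rps=4)
bursty = arrival_offsets(5, rps=4, poisson=True)
print(fixed)
print(bursty)
```

Poisson arrivals produce occasional clusters of requests, which stresses the scheduler more than evenly spaced traffic at the same average rate.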
Llama 3.1 8B — Single A100 80G
| Metric | SGLang (RPS=4) | vLLM (RPS=4) | Factor | SGLang (RPS=8) | vLLM (RPS=8) | Factor |
|---|---|---|---|---|---|---|
| Median E2E Latency (ms) | 1564.17 | 1691.97 | 1.08× | 2175.02 | 2137.16 | 0.98× |
| Median TTFT (ms) | 31.98 | 100.48 | 3.1× | 35.68 | 120.39 | 3.4× |
| Median TPOT (ms) | 13.17 | 14.14 | 1.07× | 17.85 | 17.09 | 0.96× |
| Median ITL (ms) | 11.93 | 129.32 | 10.8× | 14.41 | 158.63 | 11.0× |
Llama 3.1 70B — 4 × H100 80G
| Metric | SGLang (RPS=4) | vLLM (RPS=4) | Factor | SGLang (RPS=8) | vLLM (RPS=8) | Factor |
|---|---|---|---|---|---|---|
| Median E2E Latency (ms) | 3005.24 | 2915.60 | 0.97× | 4064.98 | 3752.38 | 0.92× |
| Median TTFT (ms) | 53.94 | 179.15 | 3.3× | 58.11 | 207.12 | 3.6× |
| Median TPOT (ms) | 25.03 | 23.58 | 0.94× | 33.07 | 29.15 | 0.88× |
| Median ITL (ms) | 21.67 | 231.23 | 10.7× | 24.45 | 275.32 | 11.3× |
Key finding: SGLang consistently delivers 3.1–3.6× lower median TTFT and 10.7–11.3× lower median ITL than vLLM under identical online workloads. End-to-end latencies are similar (within 10%) because both frameworks spend roughly the same time on the generation phase; the gap lies in scheduling and iteration overhead.
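The Factor columns in the tables above are the vLLM value divided by the SGLang value, so a number above 1 favors SGLang. A short check using the Llama 3.1 8B, RPS=4 column (the dictionary and function names are illustrative only):

```python
# Median latencies in ms for Llama 3.1 8B at RPS=4, copied from the table above.
sglang = {"E2E": 1564.17, "TTFT": 31.98, "TPOT": 13.17, "ITL": 11.93}
vllm = {"E2E": 1691.97, "TTFT": 100.48, "TPOT": 14.14, "ITL": 129.32}


def factor(metric):
    """vLLM latency divided by SGLang latency; above 1 means SGLang is faster."""
    return vllm[metric] / sglang[metric]


for name in sglang:
    print(f"{name}: {factor(name):.2f}x")
```

The same division reproduces every Factor cell in both tables to the precision shown.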
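The ITL gap (10–11×) is the most striking. Median inter-token latency measures the time between one token and the next within a single request’s generation loop. The gap is attributed to vLLM’s scheduler and CUDA graph overhead [1]. Note that median TPOT differs by at most about 12% between the engines, while vLLM’s median ITL is roughly nine to ten times its own TPOT, so the penalty shows up in when tokens are delivered rather than in the average compute time per token.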
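One way a median ITL can be far larger than TPOT is for tokens to reach the client in groups. The toy model below illustrates that effect and nothing more: it assumes a fixed 14 ms per generated token, one ITL sample per streamed chunk, and a server that flushes either every token or every 10 tokens. None of these assumptions is confirmed by the benchmark source, and the function names are hypothetical.

```python
from statistics import median


def chunk_arrivals(num_tokens, step_ms, tokens_per_chunk):
    """Arrival times of streamed chunks when tokens are flushed in groups."""
    arrivals = []
    for produced in range(1, num_tokens + 1):
        # Flush once `tokens_per_chunk` tokens are ready, or when generation ends.
        if produced % tokens_per_chunk == 0 or produced == num_tokens:
            arrivals.append(produced * step_ms)
    return arrivals


def median_itl(arrivals):
    """Median gap between consecutive chunks."""
    return median(b - a for a, b in zip(arrivals, arrivals[1:]))


def tpot(arrivals, num_tokens):
    """Time after the first chunk, averaged over the remaining tokens."""
    return (arrivals[-1] - arrivals[0]) / (num_tokens - 1)


per_token = chunk_arrivals(200, step_ms=14.0, tokens_per_chunk=1)
grouped = chunk_arrivals(200, step_ms=14.0, tokens_per_chunk=10)
print(median_itl(per_token), round(tpot(per_token, 200), 2))
print(median_itl(grouped), round(tpot(grouped, 200), 2))
```

In this model both configurations finish at the same time and have nearly the same TPOT, but the grouped one reports a median ITL ten times larger, which is the same shape as the pattern in the tables.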
Offline Serving: Maximum Throughput
The offline scenario removes request rate limits and maximizes GPU utilization. This is relevant for batch inference pipelines, evaluation suites, and offline data processing.
| Model | Engine | Request Throughput (req/s) | Output Token Throughput (tok/s) | Advantage |
|---|---|---|---|---|
| Llama 3.1 8B | SGLang | 22.03 | 4281.51 | — |
| Llama 3.1 8B | vLLM | 21.27 | 4132.37 | SGLang +3.6% |
| Llama 3.1 70B | SGLang | 19.84 | 3856.01 | — |
| Llama 3.1 70B | vLLM | 19.04 | 3700.64 | SGLang +4.2% |
Under maximum throughput, the engines converge. SGLang holds a 3–4% advantage in both request and token throughput, but this gap is within measurement noise for most production environments [1]. When latency isn’t a concern, either engine delivers comparable peak throughput.
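The Advantage column and a quick sanity check on the workload can both be recomputed from the table above. In this sketch (names are illustrative), dividing token throughput by request throughput gives the mean output length per request, which should match across engines if both served the same requests.

```python
# (request throughput req/s, output token throughput tok/s) from the table above.
results = {
    ("Llama 3.1 8B", "SGLang"): (22.03, 4281.51),
    ("Llama 3.1 8B", "vLLM"): (21.27, 4132.37),
    ("Llama 3.1 70B", "SGLang"): (19.84, 3856.01),
    ("Llama 3.1 70B", "vLLM"): (19.04, 3700.64),
}


def advantage_pct(model):
    """Percentage by which SGLang's token throughput exceeds vLLM's."""
    s = results[(model, "SGLang")][1]
    v = results[(model, "vLLM")][1]
    return (s / v - 1.0) * 100.0


def tokens_per_request(model, engine):
    """Mean output tokens per request implied by the two throughput figures."""
    req_s, tok_s = results[(model, engine)]
    return tok_s / req_s


for model in ("Llama 3.1 8B", "Llama 3.1 70B"):
    print(model, f"SGLang +{advantage_pct(model):.1f}%")
for key in results:
    print(key, f"{tokens_per_request(*key):.1f} tokens/request")
```

All four rows imply about 194 output tokens per request, which is consistent with both engines having processed the same request set.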
Architectural Roots of the Latency Gap
The 3–10× latency difference isn’t an implementation bug — it’s baked into each framework’s scheduling architecture.
vLLM’s Scheduler Design
vLLM uses a centralized scheduler with a CUDA graph replay mechanism. On each iteration:
- The scheduler computes the next batch by inspecting all waiting requests
- It replays the appropriate CUDA graphs for prefill and decode
- CUDA graph replay incurs ~10–15µs per graph launch overhead [4]
The --num-scheduler-steps flag (set to 10 in these benchmarks) lets the engine run several decode steps per scheduler invocation, which amortizes scheduling cost, but the default implementation still pays per-graph overhead on every iteration. At small batch sizes, which are common in online serving, this overhead accounts for a larger share of per-token latency.
The vLLM v1 architecture (currently in preview) replaces this with a continuous batching design that eliminates per-iteration graph recompilation. Early benchmarks show ITL reductions of 3–4× over v0.6.x [5], but v1 is not yet stable enough for production deployment.
SGLang’s Zero-Overhead Scheduler
SGLang’s scheduler avoids CUDA graph replay overhead by using a radix attention prefix cache and a zero-overhead iteration loop [6]. Key architectural differences:
- Radix-based prefix caching: Incoming prompts are matched against cached prefixes, skipping prefill computation for shared segments. Even with --disable-radix-cache in these benchmarks, the underlying iteration loop avoids scheduler stalls.
- FlashInfer-backed kernels: SGLang uses FlashInfer for attention, which supports dynamic batch sizes without CUDA graph recompilation within a running batch [6].
- Torch.compile integration: For smaller models (8B), torch.compile JIT-compiles the model forward pass, reducing Python-to-CUDA bridge overhead.
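The prefix-matching idea behind radix-based caching can be sketched with a plain token trie. This is a toy for illustration, not SGLang's implementation: a real radix tree compresses single-child chains and stores KV-cache references with eviction, and the class and method names here are hypothetical.

```python
class PrefixCache:
    """Toy token-level trie that records which prompt prefixes were seen."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a processed token sequence so later prompts can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        """Number of leading tokens already present in the cache."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]           # shared prefix, e.g. a system prompt
cache.insert(system_prompt + [5, 6])      # first request
new_prompt = system_prompt + [9, 9, 9]    # second request, same prefix
hit = cache.match_len(new_prompt)
print(f"reuse {hit} cached tokens, prefill {len(new_prompt) - hit} new tokens")
```

Only the unmatched suffix needs prefill computation, which is why shared system prompts and multi-turn chats benefit most from this kind of cache.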
Speculative Decoding and Disaggregated Serving
Both frameworks support speculative decoding, but with different performance profiles:
- SGLang integrates speculative decoding via EAGLE draft models with adaptive speculative step counts [1]. In benchmark tests, adaptive speculation yields 1.5–2× throughput improvements without degrading acceptance rates [1].
- vLLM supports speculative decoding via draft models and n-gram speculation. Recent patches (PR #45100) fix race conditions in async speculative decoding that could cause acceptance count mismatches [7].
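For readers new to speculative decoding, the core accept-or-reject step can be sketched as follows, assuming greedy decoding. This is a generic illustration, not either framework's code: target_next_token is a hypothetical stand-in for the target model, and real engines score all draft positions in one batched forward pass rather than calling the model token by token.

```python
def verify_draft(context, draft_tokens, target_next_token):
    """Greedy verification: accept draft tokens while the target model agrees.

    target_next_token(seq) is a hypothetical callable returning the target
    model's greedy next token for `seq`. Returns the tokens to append:
    the accepted drafts plus exactly one token chosen by the target.
    """
    accepted = []
    seq = list(context)
    for tok in draft_tokens:
        expected = target_next_token(seq)
        if tok != expected:
            # First disagreement: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        seq.append(tok)
    # Every draft token matched; the target contributes one bonus token.
    accepted.append(target_next_token(seq))
    return accepted


def toy_target(seq):
    """Stand-in target model that always continues with last token + 1."""
    return seq[-1] + 1


partial = verify_draft([1, 2, 3], [4, 5, 9, 10], toy_target)
full = verify_draft([1, 2, 3], [4, 5], toy_target)
print(partial, full)
```

Each verification round yields at least one token and at most one more than the draft length, so the speedup depends on how often the draft model agrees with the target.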
Disaggregated prefill/decode — where separate GPU pools handle prompt processing and token generation — is an active area for both projects. SGLang ships it in production; vLLM’s implementation is available as an experimental feature.
Caveats and Methodology Considerations
Benchmark provenance
These results were published by the SGLang team as part of their v0.3.0 release comparison [1]. While the vLLM team has not independently verified these numbers, the reproduction instructions are fully documented. The benchmark script (sglang.bench_serving) is shared code — both engines are tested through the same client harness, reducing measurement bias.
Configuration gaps
Three variables significantly affect the outcome:
- GPU memory utilization: vLLM defaulted to 0.9, SGLang to 0.85–0.88. Higher memory utilization can increase TTFT (less room for batching). vLLM was adjusted to 0.88 in offline benchmarks for fairness, but online tests used defaults.
- Multi-step scheduling: vLLM’s --num-scheduler-steps 10 was intended to close the gap. It didn’t, suggesting the bottleneck is elsewhere (CUDA graph replay, not scheduler overhead).
- Torch.compile: SGLang enabled torch.compile for the 8B model. This is a free win for small models but doesn’t scale to 70B+ due to compilation time.
What these numbers don’t tell you
- Hardware portability — These benchmarks ran on NVIDIA A100/H100. Performance on AMD MI300X, Intel Gaudi, or lower-tier GPUs may diverge significantly.
- Stability under burst — RPS=4 and RPS=8 are moderate loads. Neither framework has published P99 latency curves under extreme load (RPS > model’s max concurrency).
- Cost per token — Throughput-per-dollar depends on GPU utilization, batch packing efficiency, and memory overhead. Neither framework publishes this metric.
- Model support breadth — vLLM supports more model architectures out of the box. If you’re serving a niche model, vLLM may be the only option.
Recommendations
| Workload pattern | Recommended engine | Rationale |
|---|---|---|
| User-facing chat (latency-sensitive) | SGLang | About 3× lower TTFT and 10–11× lower ITL |
| Batch inference (throughput-bound) | Either (vLLM for model support) | Within 4%, choose based on model availability |
| Mixed online/offline | SGLang | Single engine for both workloads |
| Niche/experimental models | vLLM | Broader model support matrix |
| Heavy speculative decoding | SGLang | EAGLE integration with adaptive steps |
If you’re running a latency-sensitive online service — particularly with models under 70B — SGLang’s architectural advantages translate directly to user-facing responsiveness. For batch-only workloads or environments where vLLM’s model support is decisive, the throughput gap is negligible.
The real takeaway: LLM serving is still a young field. The 10× ITL gap between these two engines shows how much performance is left on the table by suboptimal scheduling — and how quickly the landscape can shift. Both projects are actively converging toward disaggregated architectures, and today’s winner could be tomorrow’s baseline.
References
[1] SGLang Project. “Benchmark Results: SGLang v0.3.0 vs vLLM v0.6.0.” GitHub, sgl-project/sglang/benchmark/benchmark_vllm_060. https://github.com/sgl-project/sglang/tree/main/benchmark/benchmark_vllm_060
[2] vLLM Project. “Benchmark CLI Documentation.” vLLM Docs. https://docs.vllm.ai/en/latest/benchmarking/cli/
[3] Reddi, V.J. et al. “MLPerf Inference Benchmark.” arXiv:1911.02549, ISCA 2020. https://arxiv.org/abs/1911.02549
[4] Kwon, W. et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. https://dl.acm.org/doi/10.1145/3600006.3613165
[5] vLLM Project. “v1 Engine Architecture.” GitHub, vllm-project/vllm. https://github.com/vllm-project/vllm/issues/39749
[6] Zheng, L. et al. “SGLang: Efficient Execution of Structured Language Model Programs.” arXiv:2312.07104. https://arxiv.org/abs/2312.07104
[7] vLLM Project. “PR #45100: Avoid racy accepted counts in async spec decode.” GitHub. https://github.com/vllm-project/vllm/pull/45100