LLM Router Architecture — Production Routing for Multi-Model Systems

Every production AI system with more than one model endpoint needs a router. Not a load balancer — a router. The difference matters: load balancers distribute identical requests across homogeneous backends, while routers decide which model should handle which request based on cost, complexity, domain, and latency constraints.

Enterprise LLM API spend passed $8.4B in 2025, and the gap between an un-routed system and an intelligently routed one is typically 40–70% cost savings without measurable quality degradation [1]. This essay examines the engineering architecture behind production LLM routers — the algorithms, the data structures, the system topologies, and the hard tradeoffs that determine whether a routing layer actually helps or becomes another source of latency.

The Routing Taxonomy

Production routers fall into five strategy classes, each with distinct architectural implications:

Complexity-based routing classifies each request’s difficulty and sends simple queries to cheap models and hard queries to frontier models. The classifier is typically a lightweight embedding model (BERT-tiny, ~15M params) trained on a labeled dataset of queries with ground-truth difficulty scores. Inference cost per query: ~2–5ms on CPU, ~0.3ms on GPU.

Cost-based routing selects the cheapest model that clears a quality bar for the task type. This requires a cost-per-model matrix and a per-task quality threshold — both maintained through continuous eval pipelines. The router becomes a constraint-satisfaction problem: minimize cost subject to quality(model, task) >= threshold.

Cascading routing tries the cheapest model first, then escalates only if the output fails a quality check. This is architecturally distinct because it requires synchronous verification — the quality gate must evaluate the response before returning it, adding 50–200ms of latency per escalation hop. Cascading works well only when the cheap model succeeds on >80% of requests, keeping the escalation path a rare event [1].

Semantic routing maps queries to model endpoints through embedding similarity. A BERT-based encoder converts the query into a 768-d vector, and a nearest-neighbor search against precomputed task or model embeddings selects the target [2]. Red Hat’s semantic router implements exactly this pattern: an Envoy ExtProc filter intercepts OpenAI-compatible requests, generates embeddings via Rust’s Candle library running a distilled BERT model, and compares against stored task vectors to route to math-specific, code-specific, or general-purpose backends [2].

Domain routing is the simplest — it uses explicit classification (code → code models, vision → multimodal, creative writing → GPT-4o). Implementation is a hash map or a multiclass classifier. Low overhead, low flexibility.

Gateway Architectures: The Production Layer

The router doesn’t exist in isolation. It sits inside a gateway that adds caching, failover, observability, and governance. The gateway architecture determines the router’s real-world performance.

Bifrost: Go-Native, Microsecond Overhead

Bifrost, written in Go and open-sourced under Apache 2.0, claims 11 microseconds of added latency at 5,000 requests per second [3]. This is not marketing — it’s achievable because Go’s goroutine model handles concurrent I/O without the GIL contention that plagues Python-based gateways. Bifrost’s architecture is multi-layered:

Request → TLS termination → Auth → Rate Limiter → Router → Provider Pool
                                                      ↓
                                              Semantic Cache (in-memory)
                                                      ↓
                                              Fallback Chain (ordered)

At 5K RPS, Python-based gateways (LiteLLM, Portkey) add hundreds of microseconds to low single-digit milliseconds per request, with latency spikes under load from GC pressure [3]. The Go advantage is structural, not situational — it comes from zero-copy connection handling and lock-free cache access patterns.

The semantic cache is the most latency-sensitive component. Bifrost’s cache uses cosine similarity search over an in-memory FAISS index, returning hits in ~5ms versus 2,000ms+ for a full round-trip to GPT-4o [3]. Cache eviction uses a TTL-based policy (configurable per route), not LRU — because semantically similar queries tend to cluster in time, TTL eviction produces better hit rates for most production workloads.

Red Hat’s Semantic Router: Envoy + Rust + Go

Red Hat’s semantic processor targets a different architectural niche: existing Envoy-based service meshes. The implementation uses an External Processor (ExtProc) filter that operates on HTTP request bodies:

Envoy intercepts the HTTP request at the gateway layer
The ExtProc filter (Go server) extracts the user prompt
A Rust FFI call generates BERT embeddings via the Candle library
Embeddings are compared against task vectors using cosine similarity
The router modifies the upstream Host header to target the chosen model backend
Prometheus metrics fire on every decision path: model selection, cache hit/miss, latency percentile, token count [2]

The Rust+Go split is deliberate: Rust via Candle provides CPU-efficient BERT inference (no Python dependency, no GIL), while Go handles protocol-level ExtProc communication. The system exposes Prometheus metrics for every decision path — model selection distribution, cache hit ratios, response latencies by route, and token usage per backend.

LLMRouter: The Training Pipeline

LLMRouter from UIUC takes a different approach — it pre-trains routing models on benchmark data rather than relying on at-runtime embedding similarity. The training pipeline spans 11 datasets (MMLU, GSM8K, MATH, HumanEval, etc.) and generates per-model performance vectors for every candidate LLM [4].

The router models themselves are diverse: KNN, SVM, MLP, matrix factorization, Elo rating, graph-based, and BERT-based classifiers. Each router type captures a different signal. MLP routers learn feature interactions between query embeddings and model performance vectors. Elo-based routers treat model selection as a pairwise comparison problem — useful when the candidate pool changes frequently.

The key architectural insight from LLMRouter is that training data generation dominates the system cost. For each of the 11 datasets, every candidate model must generate responses, which are then evaluated against ground truth. With 30+ candidate models, this is 330 model-dataset evaluations. The pipeline parallelizes via 100 worker threads and takes hours even on GPU clusters [4].

Fallback Topologies

Routing without fallback is a single point of failure. Production gateways implement one of three fallback patterns:

Linear fallback (most common): Provider pool with ordered preferences. On timeout or 429, retry with exponential backoff through the chain. If all providers fail, return a 503 with a “models overloaded” error. LiteLLM and Bifrost both default to this topology.

Parallel fallback: Fan out the same request to two providers simultaneously; return the first complete response. More expensive (2× token cost) but minimizes latency tail. Used for latency-critical paths where provider diversity is high.

Degradation fallback: If the best model is down, route to a weaker model but flag the response as degraded. The application layer can then log, alert, or re-route through a slower path. This is the safest pattern — it never drops requests, never doubles cost, and surfaces degradation explicitly in observability.

The Bifrost benchmark of 11μs overhead at 5K RPS assumes linear fallback with in-memory state. Parallel fallback adds at least 2× the upstream latency because the slowest provider in the fan-out set determines response time.

Semantic Caching: The Hidden Lever

Semantic caching is the most impactful non-routing feature in production gateways. The economics are straightforward: if a semantically similar query was answered recently, return the cached response instead of hitting a model [2][3].

The implementation challenge is index freshness. A FAISS index of query embeddings, updated on every cache miss, requires re-indexing once the insert buffer exceeds ~10K entries. Different implementations handle this differently:

Bifrost uses a fixed-size in-memory HNSW graph with TTL-based eviction — no re-indexing needed, but recall degrades slowly as entries age out.
Red Hat’s router delegates caching to Envoy’s dedicated cache layer, keeping the semantic processor stateless.
LLMRouter doesn’t cache at the routing layer — it caches per-model responses using standard prompt caching.

The right cache size depends on request diversity. For a customer support gateway serving 50K daily requests with ~60% repeated intents, a 5K-entry cache hits 35–40% of queries [1]. For a code generation gateway where queries are highly diverse, cache hit rates rarely exceed 15%.

Measuring Router Quality

A router that degrades output quality is worse than no router. The standard eval framework treats the router as a decision model and measures:

Routing accuracy: For each decision class (easy/hard, domain A/B), does the router select the correct model? Measured against a held-out labeled dataset. LLMRouter’s KNN router achieves ~89% accuracy on MMLU routing tasks [4].

Quality delta: For the same query, does the routed model’s output score within 5% of the frontier model’s output on task-specific metrics (BLEU, pass@k, F1)? If not, the routing strategy is too aggressive. [1]

Cost efficiency: Total token cost before and after routing. The industry benchmark is 40–70% savings without quality degradation [1]. Below 30% savings, the routing layer may not justify its operational overhead.

Latency budget: P50 and P99 routing latency. At 5K RPS, routing overhead adds 11μs for Go-based gateways [3], 500μs–3ms for Python-based ones. For interactive workloads, the gateway should add less than 5% to total request latency.

The Build vs. Buy Decision

The gateway market has matured past the “write a Python script” era. Three production-grade open-source options exist: LiteLLM (Python, 33K GitHub stars), Bifrost (Go, Apache 2.0), and Kong AI Gateway (Lua/Go, enterprise). Managed options include OpenRouter, Cloudflare AI Gateway, and Portkey [5].

The decision framework is straightforward:

Use LiteLLM for low-throughput prototyping, early-stage startups, and teams already in the Python ecosystem. For workloads under 500 RPS, the Python overhead is acceptable.

Use Bifrost for high-throughput production workloads (>1K RPS), latency-sensitive paths (interactive chat, real-time agents), and any deployment that needs air-gapped or VPC-isolated operation. The 11μs overhead matters at scale.

Build your own only when you have unique routing logic (e.g., custom quality models, data-residency constraints that prevent using any external gateway). Even then, fork an open-source gateway rather than starting from scratch [1]. The reference architecture is straightforward: a Go or Rust proxy with an embedding model, a semantic cache, and a provider pool.

The Routing Taxonomy

Gateway Architectures: The Production Layer

Bifrost: Go-Native, Microsecond Overhead

Red Hat’s Semantic Router: Envoy + Rust + Go

LLMRouter: The Training Pipeline

Fallback Topologies

Semantic Caching: The Hidden Lever

Measuring Router Quality

The Build vs. Buy Decision

References

Related References

Prompt Caching in Production: Architecture Patterns for AI Systems

The Hidden Architecture of LLM Routing: From if-else to Production Gateways

Automated Test Generation with LLMs: Production Patterns and Empirical Quality Benchmarks