Multi-Modal Inference Architecture: Serving Vision, Audio, and Text at Scale

Multi-modal LLMs are no longer experimental curiosities. As of mid-2026, every major open-weight model line — Llama 4, Qwen3-VL/Qwen3.5-Omni, Gemma 4 — ships with native vision capabilities, and several support audio and video. But serving these models in production is architecturally distinct from text-only inference [1].

The difference isn’t just “add an image encoder.” Multi-modal inference introduces an entirely new set of serving constraints: image inputs that expand into variable-length token sequences, often 10–400× longer than the accompanying text, heterogeneous pipeline stages (encoding, prefill, decode) that scale independently, and GPU memory profiles that shift by an order of magnitude depending on whether a request includes an image [2].

This post maps the production architecture for multi-modal inference: the three model paradigms, the disaggregated serving patterns that accommodate them, and the gateway infrastructure that routes across modalities.

The Three Multi-Modal Architecture Paradigms

Every multi-modal LLM fits into one of three serving paradigms. The architecture decision directly determines your GPU topology, scheduling strategy, and memory budget.

Paradigm	Example Models	Vision Integration	GPU Topology	Memory Overhead
Adapter (projection-based)	LLaVA, Pixtral 12B, Qwen2.5-VL	Separately-trained vision encoder + MLP adapter projects into LLM embedding space	Single GPU or TP: encoder and LLM on same GPU	+15–35% VRAM (encoder weights + adapter)
Early fusion (MoE)	Llama 4 Scout/Maverick, Gemma 4	Vision tokens fused at input layer; MoE experts specialize per modality	1–2 GPUs (Scout: 1× H100), 4–8 GPUs (Maverick: 4× H100)	Already accounted in model weights; no separate encoder
Unified omni	Qwen3.5-Omni, Gemma 4 E4B	Single architecture handles text, image, audio, and video inputs through shared multimodal encoder	2–4 GPUs with TP=2–4	Highest: encoder weights + audio processing pipeline + larger KV cache

Adapter-Based Serving (LLaVA, Pixtral, Qwen2.5-VL)

The adapter paradigm dominates because it requires no retraining of the LLM. A pretrained vision encoder (SigLIP, ViT) extracts visual features, an MLP adapter projects them into the LLM’s embedding dimension, and the LLM processes them as if they were text tokens [3].

The serving cost is an additional forward pass through the vision encoder for every image-triggered request. A single 640×640 image produces approximately 400 visual tokens. The exact count depends on the encoder's patch size, processing resolution, and any token merging: for example, 16×16-pixel patches give 40×40 = 1,600 patches, which a 2×2 merge reduces to 400. A 4K image at full resolution can produce 4,000+ visual tokens [2].
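The token counts above can be approximated with a few lines of arithmetic. The sketch below is illustrative only: the function name, the 16-pixel patch size, and the 2×2 merge factor are assumptions chosen to reproduce the 400-token figure, not the parameters of any specific encoder.

```python
import math

def estimate_visual_tokens(width: int, height: int, patch: int = 16, merge: int = 2) -> int:
    """Estimate how many visual tokens one image contributes.

    Assumes square patches of `patch` pixels and an encoder that merges
    `merge` x `merge` neighbouring patches into a single token. Both
    defaults are illustrative assumptions.
    """
    patches_w = math.ceil(width / patch)
    patches_h = math.ceil(height / patch)
    return math.ceil(patches_w / merge) * math.ceil(patches_h / merge)

print(estimate_visual_tokens(640, 640))     # 400
print(estimate_visual_tokens(3840, 2160))   # 8160
```

Real encoders often resize or tile images before patching, so a production estimate should use the model's own preprocessing rather than this formula.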

A request with a single image therefore adds 400–4,000 tokens to the prefill-phase context. The cost of processing the image is dominated by the prefill compute for those tokens, not by the encoder forward pass: the encoder pass takes 5–15ms on an H100 for a single image, which is negligible next to the 200–800ms prefill for 4,000 additional tokens [2].

The architectural implication: For adapter-based models, the bottleneck shifts from GPU compute to KV cache memory. Each image adds 400–4,000 slots to the KV cache that persist through decode. For long-context multi-turn conversations with images in every turn, KV cache grows 2–3× faster than text-only sessions [2].
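To make the KV cache cost of an image concrete, the per-token footprint can be computed from the model shape. This is a rough sketch: the layer count, KV head count, and head dimension below describe a hypothetical model, and the function names are mine.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector
    per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def image_kv_overhead_mib(visual_tokens: int, layers: int, kv_heads: int,
                          head_dim: int, dtype_bytes: int = 2) -> float:
    """KV cache added by one image's visual tokens, in MiB."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
    return visual_tokens * per_token / 2**20

# Hypothetical shape: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, FP16 cache entries.
print(kv_bytes_per_token(32, 8, 128))            # 131072 bytes = 128 KiB
print(image_kv_overhead_mib(400, 32, 8, 128))    # 50.0
print(image_kv_overhead_mib(4000, 32, 8, 128))   # 500.0
```

Under these assumptions a single high-resolution image occupies as much cache as several thousand words of text, and it stays resident for the whole decode phase.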

Early Fusion (Llama 4)

Llama 4’s architecture is a fundamental departure from adapter-based approaches. Vision and text tokens are fused at the input embedding layer — not projected into an existing embedding space [1]. The Mixture-of-Experts (MoE) layers naturally develop modality-specific specialization: certain experts activate preferentially for visual tokens, others for text tokens, without explicit routing labels [4].

Llama 4 Scout (17B active parameters, 109B total across 16 experts) fits on a single H100 only with Int4 quantization; in FP16 the weights alone are roughly 218GB. At Int4 the weights take roughly 55GB, leaving about 25GB of an 80GB H100 for KV cache. At 10M-token context (supported via iRoPE), the KV cache requirement alone reaches ~1.2TB, which requires CPU offloading or context-culling strategies [1].

For production serving, the key implication is scheduling homogeneity. Because Scout has no separate vision encoder, every request — text-only or image-triggered — runs through exactly the same model pipeline. The scheduler sees no modality-dependent pipeline stages. This simplifies deployment but means you cannot independently scale encoding capacity when image-heavy traffic spikes [5].

Unified Omni Models (Qwen3.5-Omni, Gemma 4 E4B)

These models process text, image, audio, and video through a shared multimodal encoder. Qwen3.5-Omni uses a three-component architecture: a multimodal encoder, a Thinker decoder, and a Talker decoder for real-time speech output [6].

This creates the most complex serving topology: audio processing requires additional GPU memory for Whisper-style encoders or dedicated audio decoders, and real-time speech generation demands streaming with low first-token latency. A single Qwen3.5-Omni instance with TP=4 on H100s allocates approximately 140GB for weights (70B-class total parameters) across the three components [6].

EPD Disaggregation: The Production Pattern for Multi-Modal Serving

The single most important architectural insight for multi-modal inference at scale is stage disaggregation. The encoder, prefill, and decode stages have fundamentally different compute profiles, memory requirements, and scaling behaviors [2][7].

The Disparity

Stage	Compute Profile	Memory Profile	GPU Utilization	Latency Sensitivity
Encoder (vision)	Compute-bound (attention over patches)	Low (no KV cache)	40–60%	Low-medium
Encoder (audio)	Compute-bound (convolution + transformer)	Medium (audio features)	50–70%	High (streaming)
Prefill	Compute-bound (parallel attention)	Medium-high (KV cache writes)	70–90%	Medium (TTFT)
Decode	Memory-bound (cache reads)	High (KV cache reads)	30–50%	High (TPOT)

These profiles cannot be optimized on the same GPU. Saturating decode throughput requires high memory bandwidth (H100 SXM: 3.35 TB/s). Saturating encoder throughput requires high compute throughput (H100 SXM: roughly 989 TFLOPS dense FP16; the often-quoted 1,979 TFLOPS figure assumes structured sparsity). Binding them together means one stage starves the other [7].
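A simple roofline calculation shows why the stages pull in different directions. The sketch below assumes a dense FP16 peak of about 989 TFLOPS for the H100 SXM (the 1,979 TFLOPS figure includes structured sparsity), and the per-stage arithmetic intensities are illustrative placeholders rather than measurements.

```python
def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs per byte above which a kernel is compute-bound on this device."""
    return peak_flops / mem_bandwidth

def limiting_resource(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a stage by comparing its arithmetic intensity to the ridge point."""
    if intensity >= ridge_point(peak_flops, mem_bandwidth):
        return "compute"
    return "memory bandwidth"

H100_BANDWIDTH = 3.35e12     # bytes/s, as cited in the text
H100_DENSE_FP16 = 989e12     # FLOP/s, assumed dense (non-sparse) peak

# Illustrative arithmetic intensities in FLOPs per byte (assumptions).
STAGE_INTENSITY = {"encoder": 600.0, "prefill": 1000.0, "decode": 2.0}

print(round(ridge_point(H100_DENSE_FP16, H100_BANDWIDTH)))   # 295
for stage, intensity in STAGE_INTENSITY.items():
    print(stage, "->", limiting_resource(intensity, H100_DENSE_FP16, H100_BANDWIDTH))
```

With these numbers, any stage below roughly 295 FLOPs per byte is limited by memory bandwidth, which is where small-batch decode sits.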

EPD Architecture

The EPD (Encoder-Prefill-Decode) disaggregation pattern separates each stage into independent instance pools, connected by a shared KV cache transport [7]:

Request with image
    │
    ▼
┌─────────────┐
│ Encoder Pool │  ← N instances, auto-scaled by image queue depth
│ (2–4 GPUs)   │    Each instance: vision encoder + embedding projection
└──────┬──────┘
       │ (image embeddings sent via IPC / RDMA)
       ▼
┌─────────────┐
│ Prefill Pool │  ← M instances, auto-scaled by prefill queue depth
│ (4–8 GPUs)   │    Computes full KV cache including image embeddings
└──────┬──────┘
       │ (KV cache transferred to decode pool)
       ▼
┌─────────────┐
│ Decode Pool  │  ← P instances, stable count (long-running)
│ (8–32 GPUs)  │    Autoregressive generation, KV cache reads
└─────────────┘

vLLM’s experimental EPD disaggregation, announced in December 2025, demonstrates the viability of this pattern. The vision encoder runs as a separate worker process that communicates image embeddings to the LLM worker via shared memory or RDMA. The LLM worker never handles raw image data — only the pre-computed embeddings [7].
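The flow in the diagram can be mimicked with three queue-connected worker pools. This is a toy sketch using asyncio, not vLLM's implementation: the `Request` class, the `stage` worker, and the sleep-based "work" are all hypothetical stand-ins. It shows only the routing idea, where text-only requests bypass the encoder pool.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    text_tokens: int
    images: int = 0
    trace: list = field(default_factory=list)   # stages this request visited

async def stage(name, inbox, outbox, delay):
    """Generic pool worker: take a request, simulate work, pass it on."""
    while True:
        req = await inbox.get()
        await asyncio.sleep(delay)               # stand-in for GPU work
        req.trace.append(name)
        await outbox.put(req)

async def serve(requests):
    enc_q, pre_q, dec_q, done = (asyncio.Queue() for _ in range(4))
    workers = [
        asyncio.create_task(stage("encode", enc_q, pre_q, 0.001)),
        asyncio.create_task(stage("prefill", pre_q, dec_q, 0.001)),
        asyncio.create_task(stage("decode", dec_q, done, 0.001)),
    ]
    for req in requests:
        # Text-only requests skip the encoder pool entirely.
        await (enc_q if req.images else pre_q).put(req)
    results = [await done.get() for _ in requests]
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(results, key=lambda r: r.rid)

results = asyncio.run(serve([Request(1, 200), Request(2, 50, images=1)]))
for r in results:
    print(r.rid, r.trace)
```

In a real deployment each queue would be a network transport carrying embeddings or KV cache blocks, and each stage would be a pool of instances scaled on its own queue depth.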

Results from the ModServe paper (128-GPU cluster): EPD disaggregation with modality-aware autoscaling achieves 3.3–5.5× throughput improvement and 25–41.3% cost savings compared to monolithic serving, while meeting P99 latency SLOs on production multi-modal inference traces [2].

When EPD Makes Sense

Deployment Size	EPD Recommended?	Rationale
< 4 GPUs	No	Overhead of IPC/RDMA transport exceeds benefit
4–16 GPUs	Yes, encoder pool only	Separate 1–2 GPUs for vision encoding
16–64 GPUs	Yes, full EPD	Separate encoder + prefill + decode pools
> 64 GPUs	Yes, with hierarchy	Multi-tier disaggregation per workload class
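The table can be encoded as a small lookup for capacity-planning scripts. The function name and return strings are mine, and because the table's ranges overlap at 16 and 64 GPUs, assigning those boundary values to the smaller tier is an assumption.

```python
def epd_recommendation(num_gpus: int) -> str:
    """Map a deployment size to the disaggregation tier from the table."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be positive")
    if num_gpus < 4:
        return "monolithic"           # transport overhead exceeds benefit
    if num_gpus <= 16:
        return "encoder pool only"    # 1-2 GPUs dedicated to vision encoding
    if num_gpus <= 64:
        return "full EPD"             # separate encoder, prefill, decode pools
    return "hierarchical EPD"         # multi-tier, per workload class

for n in (2, 8, 32, 128):
    print(n, "->", epd_recommendation(n))
```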

GPU Memory Strategies Across Modalities

Multi-modal serving introduces a new memory management challenge: modality-dependent memory profiles. A Llama 4 Scout instance serving text-only requests at 4K context needs only a few gigabytes of VRAM beyond its weights. The same instance serving an image-triggered request at 128K context can fill an 80GB GPU or exceed it [1].

Dynamic KV Cache Allocation

The critical insight: multi-modal KV cache size cannot be predicted from request metadata such as modality alone; it depends on the total token count after image expansion. A text-only request with 128K tokens uses ~1.2GB of KV cache per batch slot. An image-rich request with 4K text + 4K image tokens uses only ~75MB, 16× less, but its prefill phase is compute-bound on the image token embeddings rather than memory-bound [2].

This means static KV cache reservation (pre-allocating fixed blocks) wastes 60–80% of VRAM on text-heavy workloads and under-allocates on image-heavy workloads. vLLM’s PagedAttention handles this naturally — cache pages (16 tokens each) are allocated dynamically — but the page table overhead grows significantly when images add thousands of additional tokens per request [8].
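The gap between static reservation and paged allocation is easy to quantify. The sketch below uses the 16-token page size mentioned above; the 4,096-token static reservation and the request size are hypothetical, and the helper names are mine rather than vLLM's.

```python
import math

BLOCK_TOKENS = 16   # page size cited in the text

def blocks_needed(tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Number of fixed-size cache pages needed to hold `tokens`."""
    return math.ceil(tokens / block_tokens)

def static_waste_fraction(actual_tokens: int, reserved_tokens: int) -> float:
    """Fraction of a fixed per-request reservation left unused."""
    return 1 - actual_tokens / reserved_tokens

def paged_waste_fraction(actual_tokens: int, block_tokens: int = BLOCK_TOKENS) -> float:
    """Fraction wasted under paging: only the tail of the last page."""
    allocated = blocks_needed(actual_tokens, block_tokens) * block_tokens
    return 1 - actual_tokens / allocated

# Hypothetical request: 1,000 text tokens plus one 400-token image.
tokens = 1000 + 400
print(blocks_needed(tokens))                            # 88
print(round(static_waste_fraction(tokens, 4096), 3))    # 0.658
print(round(paged_waste_fraction(tokens), 4))           # 0.0057
```

The page count also shows the flip side: every image adds dozens to hundreds of page-table entries per request.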

Modality-Aware Memory Pooling

Production systems serving mixed-modality workloads should use separate memory pools for vision encoder weights, LLM weights, and KV cache, with dynamic rebalancing:

Pool	Contents	Sizing Rule
Vision encoder	SigLIP/ViT weights, adapter MLP	Static: 2–8GB per encoder model
LLM weights	Transformer weights	Static: model-size dependent
KV cache	Key/value tensors per active sequence	Dynamic: adjusts for current batch
Scratch	Intermediate activations, audio features	Dynamic: freed after each forward pass
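One way to reason about these pools is as a per-GPU budget in which the KV cache receives whatever the static pools leave over. The sketch below is a planning aid under assumed numbers (an 80GB GPU, 40GB of LLM weights, a 4GB encoder, 6GB of scratch); the class and function names are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GpuMemoryPlan:
    total_gb: float
    llm_weights_gb: float
    encoder_gb: float
    scratch_gb: float
    encoder_resident: bool = True

    def kv_cache_budget_gb(self) -> float:
        """VRAM left for KV cache after the other pools are accounted for."""
        used = self.llm_weights_gb + self.scratch_gb
        if self.encoder_resident:
            used += self.encoder_gb
        return max(self.total_gb - used, 0.0)

def rebalance(plan: GpuMemoryPlan, images_in_batch: int) -> GpuMemoryPlan:
    """Keep the vision encoder on the GPU only when the batch needs it."""
    return replace(plan, encoder_resident=images_in_batch > 0)

plan = GpuMemoryPlan(total_gb=80, llm_weights_gb=40, encoder_gb=4, scratch_gb=6)
print(rebalance(plan, images_in_batch=3).kv_cache_budget_gb())   # 30
print(rebalance(plan, images_in_batch=0).kv_cache_budget_gb())   # 34
```

A real implementation would also need hysteresis, since moving encoder weights between host and device on every batch would cost more than it saves.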

Hugging Face’s vLLM integration for Qwen3.5-27B demonstrates the principle: the --enable-vision-encoder-offloading flag offloads the vision encoder to CPU when no image requests are in the batch, freeing 4–6GB of VRAM for additional KV cache slots [9].

Multi-Modal AI Gateway Patterns

A production multi-modal serving stack needs a gateway layer that handles modality-aware routing, fallback, and cost tracking, not just a text-model proxy [10].

Unified API Surface

The gateway normalizes the request interface across providers with different modality support:

OpenAI-compatible: base64-encoded image data in the content array
Anthropic: Image blocks in the content array with source.type = "base64" or "url"
Google Gemini: Inline data parts or file URI references
vLLM (open-source): multi_modal_data dict with PIL images, NumPy arrays, or PyTorch tensors

Portkey’s AI Gateway, Kong’s AI Gateway 3.13, and other infrastructure tools expose a unified /v1/chat/completions endpoint that transparently converts between these formats [10][11].
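As an illustration of the conversion a gateway performs, the sketch below translates one OpenAI-style content part into an Anthropic-style block. The field names reflect my understanding of the two public message formats and should be checked against current provider documentation; the function name is hypothetical.

```python
def openai_part_to_anthropic(part: dict) -> dict:
    """Convert one OpenAI-style content part to an Anthropic-style block."""
    if part["type"] == "text":
        return {"type": "text", "text": part["text"]}
    if part["type"] == "image_url":
        url = part["image_url"]["url"]
        if url.startswith("data:"):
            # Data URL layout: data:<media type>;base64,<payload>
            header, data = url.split(",", 1)
            media_type = header[len("data:"):].split(";", 1)[0]
            return {
                "type": "image",
                "source": {"type": "base64", "media_type": media_type, "data": data},
            }
        return {"type": "image", "source": {"type": "url", "url": url}}
    raise ValueError(f"unsupported content part type: {part['type']!r}")

openai_content = [
    {"type": "text", "text": "What is in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
]
anthropic_content = [openai_part_to_anthropic(p) for p in openai_content]
print(anthropic_content[1]["source"]["media_type"])   # image/png
```

Raising on unknown part types is deliberate: a gateway that silently drops an image produces exactly the failure mode modality-aware routing is meant to prevent.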

Modality-Aware Routing

The gateway must route requests based on modality because not all backends support all modalities:

Scenario	Route	Rationale
Text-only, < 4K tokens	Fast route (Llama 4 Scout or GPT-5.2 mini)	Cheapest, lowest latency
Text + image	VLM-capable backend (Llama 4, Qwen3-VL, GPT-5.2)	Image must reach a model with vision encoder
Text + audio	Audio-capable backend (Qwen3.5-Omni, Gemini 2.5 Pro)	Audio processing requires specialized encoder
Multi-turn with images	Session-pinned backend	KV cache warm across turns; image re-encoding wastes prefill

A request with a single image should never hit a text-only backend — it either gets a modality error or, worse, silently fails. The gateway should implement a modality capabilities registry that maps each backend to its supported input types and maximum image resolution [12].
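A capabilities registry can be as simple as a list of backends with their supported modalities and image limits. The sketch below is one possible shape, with hypothetical backend names, limits, and cost ranks. It picks the cheapest capable backend, honors session pinning when the pinned backend is still capable, and fails loudly instead of falling back to a text-only model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Backend:
    name: str
    modalities: frozenset
    max_image_px: int = 0      # longest image side accepted; 0 means no images
    cost_rank: int = 0         # lower is cheaper

# Hypothetical registry entries for illustration.
REGISTRY = [
    Backend("text-fast", frozenset({"text"}), cost_rank=0),
    Backend("vlm", frozenset({"text", "image"}), max_image_px=4096, cost_rank=1),
    Backend("omni", frozenset({"text", "image", "audio"}), max_image_px=2048, cost_rank=2),
]

class UnsupportedModality(Exception):
    pass

def route(modalities: set, image_px: int = 0, pinned: Optional[str] = None) -> Backend:
    """Return the cheapest backend that supports every requested modality."""
    capable = [
        b for b in REGISTRY
        if modalities <= b.modalities and image_px <= b.max_image_px
    ]
    if not capable:
        raise UnsupportedModality(f"no backend supports {sorted(modalities)}")
    for b in capable:
        if b.name == pinned:   # keep multi-turn sessions on a warm KV cache
            return b
    return min(capable, key=lambda b: b.cost_rank)

print(route({"text"}).name)                                  # text-fast
print(route({"text", "image"}, image_px=1024).name)          # vlm
print(route({"text", "audio"}).name)                         # omni
print(route({"text"}, pinned="vlm").name)                    # vlm
```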

Cost Attribution per Modality

The cost per request varies dramatically by modality. A text-only request costing $0.003 might cost $0.045 with a single high-resolution image (15× cost multiplier). Per-modality cost tracking at the gateway level enables accurate chargeback and informs routing decisions [2]:

Request cost breakdown (GPT-5.2, 4K tokens + image):
├── Text tokens (4K input)    : $0.0020
├── Image tokens (~400 visual): $0.0080  ← 4× the text cost
├── Output tokens (500 avg)   : $0.0050
└── Total gateway overhead    : $0.0002
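The breakdown above can be reproduced with a small attribution function. The per-million-token rates below are back-derived from the example figures purely for illustration and are not real provider pricing; the names are hypothetical.

```python
# Rates in USD per million tokens, derived from the example breakdown
# (assumptions, not published prices).
PRICE_PER_MTOK = {"text_in": 0.50, "image_in": 20.00, "text_out": 10.00}
GATEWAY_OVERHEAD_USD = 0.0002

def cost_breakdown(text_in: int, image_in: int, text_out: int) -> dict:
    """Attribute the cost of one request to each modality."""
    costs = {
        "text_in": text_in * PRICE_PER_MTOK["text_in"] / 1e6,
        "image_in": image_in * PRICE_PER_MTOK["image_in"] / 1e6,
        "text_out": text_out * PRICE_PER_MTOK["text_out"] / 1e6,
        "gateway": GATEWAY_OVERHEAD_USD,
    }
    costs["total"] = sum(costs.values())
    return costs

costs = cost_breakdown(text_in=4000, image_in=400, text_out=500)
for name, usd in costs.items():
    print(f"{name:9s} ${usd:.4f}")
print(f"image share: {costs['image_in'] / costs['total']:.0%}")
```

In this example the 400 visual tokens account for just over half of the request's total cost, which is the kind of signal a chargeback report or routing policy would act on.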

Putting It Together: A Production Multi-Modal Architecture

The complete stack layers EPD disaggregation behind a modality-aware gateway:

Client request (text / image / audio)
    │
    ▼
┌──────────────────────────────┐
│ Gateway Layer                 │
│ • Unified /v1/chat/completions│
│ • Modality capabilities check │
│ • Backend routing by modality │
│ • Cost tracking + chargeback  │
└──────────┬───────────────────┘
           │ (routed to appropriate pool)
           ▼
┌──────────────────────────────┐
│ Encoder Pool                  │
│ • Vision encoder (SigLIP/ViT) │ ← Scaled by image queue depth
│ • Audio encoder (Whisper)     │ ← Scaled separately
│ • Embedding projection        │
└──────────┬───────────────────┘
           │ (embeddings via RDMA)
           ▼
┌──────────────────────────────┐
│ Prefill Pool                  │
│ • Full KV cache computation   │ ← Scaled by prefill queue
│ • Image + text attention      │
│ • Session establishment       │
└──────────┬───────────────────┘
           │ (KV cache transfer)
           ▼
┌──────────────────────────────┐
│ Decode Pool                   │
│ • Autoregressive generation   │ ← Stable count per workload
│ • KV cache reads              │
│ • Streaming via SSE           │
└──────────┬───────────────────┘
           │ (tokens to client)
           ▼
     Streaming response

This architecture handles the three failure modes that crash monolithic multi-modal serving:

Encoder overloaded → scale encoder pool independently (add GPUs without touching LLM instances)
KV cache exhaustion → scale decode pool or offload to CPU-backed cache (LMCache)
Mixed-modality queue head-of-line blocking → modality-aware scheduling prioritizes text-only requests during image processing bursts [2][5]
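The third remedy, modality-aware scheduling, can be sketched with two queues and a simple preference rule. This is a minimal illustration with hypothetical names, not the scheduler of any particular serving system: when the encoder pool is busy, text-only requests are dispatched first, and otherwise requests leave in arrival order.

```python
from collections import deque
import itertools

class ModalityAwareQueue:
    """Two FIFO queues with a preference for text-only work during image bursts."""

    def __init__(self):
        self._text = deque()
        self._image = deque()
        self._seq = itertools.count()   # global arrival order

    def push(self, rid: str, has_image: bool) -> None:
        queue = self._image if has_image else self._text
        queue.append((next(self._seq), rid))

    def pop(self, encoder_busy: bool):
        # During an image burst, let text-only requests jump ahead.
        if encoder_busy and self._text:
            return self._text.popleft()[1]
        # Otherwise serve whichever queue holds the oldest request.
        nonempty = [q for q in (self._text, self._image) if q]
        if not nonempty:
            return None
        return min(nonempty, key=lambda q: q[0][0]).popleft()[1]

q = ModalityAwareQueue()
for rid, has_image in [("img-1", True), ("txt-2", False), ("img-3", True), ("txt-4", False)]:
    q.push(rid, has_image)

print(q.pop(encoder_busy=True))    # txt-2
print(q.pop(encoder_busy=False))   # img-1
print(q.pop(encoder_busy=True))    # txt-4
print(q.pop(encoder_busy=True))    # img-3
```

This version can starve image requests under sustained text load; a production scheduler would add aging or a maximum wait time.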

Summary of Architectural Decisions

Decision	Recommendation	Why
Model paradigm	Adapter-based for flexibility; early fusion for throughput	Adapter: mix and match encoders. Early fusion: simpler pipeline.
GPU topology	EPD disaggregation for 4+ GPUs	3.3–5.5× throughput improvement at scale
Vision encoder placement	Separate worker process (vLLM EPD)	Frees LLM GPU from image processing, enables independent scaling
KV cache strategy	Dynamic allocation with PagedAttention	Static allocation wastes 60–80% on mixed workloads
Memory pooling	Separate pools with encoder offloading	Qwen3.5-27B frees 4–6GB via vision encoder CPU offloading
Gateway routing	Modality capabilities registry	Prevents silent failures from text-only backends
Cost tracking	Per-modality granularity	Image requests cost 10–15× more than text-only

Multi-modal inference is not text-only inference with an image encoder bolted on. The modality-aware patterns — EPD disaggregation, dynamic memory pooling, modality-gated routing — are the difference between a system that works at 100 requests per second and one that collapses under the weight of its own images.

References

[1] Meta AI, “Llama 4: The Llama 4 family of multimodal AI models,” Apr. 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/
[2] C. Zhang et al., “ModServe: Efficient Multi-Modal Serving for Large Language Models,” arXiv, 2025. https://arxiv.org/abs/2401.09088
[3] LLaVA Project, “Large Language and Vision Assistant,” 2024. https://llava-vl.github.io/
[4] A. Q. Jiang et al., “Mixtral of Experts,” arXiv, Jan. 2024. https://arxiv.org/abs/2401.04088
[5] vLLM Project, “vLLM: Easy, fast, and cheap LLM serving for open-source models,” https://github.com/vllm-project/vllm
[6] Qwen Team, “Qwen3-VL: Open Vision Language Models,” Hugging Face, 2025. https://huggingface.co/Qwen
[7] vLLM Project, “Multimodal Inputs — vLLM Documentation,” https://docs.vllm.ai/en/stable/features/multimodal_inputs/
[8] W. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” SOSP 2023. https://arxiv.org/abs/2309.06180
[9] Hugging Face, “LLaVA-NeXT — Transformers Documentation,” https://huggingface.co/docs/transformers/en/model_doc/llava_next
[10] Portkey, “AI Gateway: Unified API for 250+ LLMs,” https://portkey.ai/
[11] Kong, “AI Proxy Plugin — Kong Developer Hub,” https://developer.konghq.com/plugins/ai-proxy/
[12] Portkey, “AI Gateway Architecture — Routing & Fallbacks,” https://portkey.ai/docs
