Multi-Modal Inference Architecture: Serving Vision, Audio, and Text at Scale
A production architecture deep dive on multi-modal LLM serving — adapter vs early fusion vs unified architectures, EPD disaggregation for vision encoders, GPU memory strategies across modalities, and the gateway patterns that unify text, image, and audio inference.
Multi-modal LLMs are no longer experimental curiosities. As of mid-2026, every major open-weight model line — Llama 4, Qwen3-VL/Qwen3.5-Omni, Gemma 4 — ships with native vision capabilities, and several support audio and video. But serving these models in production is architecturally distinct from text-only inference [1].
The difference isn’t just “add an image encoder.” Multi-modal inference introduces an entirely new set of serving constraints: variable-length image token sequences that are 10–400× larger than text tokens, heterogeneous pipeline stages (encoding, prefill, decode) that scale independently, and GPU memory profiles that shift by an order of magnitude depending on whether a request includes an image or not [2].
This post maps the production architecture for multi-modal inference: the three model paradigms, the disaggregated serving patterns that accommodate them, and the gateway infrastructure that routes across modalities.
The Three Multi-Modal Architecture Paradigms
Every multi-modal LLM fits into one of three serving paradigms. The architecture decision directly determines your GPU topology, scheduling strategy, and memory budget.
| Paradigm | Example Models | Vision Integration | GPU Topology | Memory Overhead |
|---|---|---|---|---|
| Adapter (projection-based) | LLaVA, Pixtral 12B, Qwen2.5-VL | Separately-trained vision encoder + MLP adapter projects into LLM embedding space | Single GPU or TP: encoder and LLM on same GPU | +15–35% VRAM (encoder weights + adapter) |
| Early fusion (MoE) | Llama 4 Scout/Maverick, Gemma 4 | Vision tokens fused at input layer; MoE experts specialize per modality | 1–2 GPUs (Scout: 1× H100), 4–8 GPUs (Maverick: 4× H100) | Already accounted in model weights; no separate encoder |
| Unified omni | Qwen3.5-Omni, Gemma 4 E4B | Single architecture handles text, image, audio, and video inputs through shared multimodal encoder | 2–4 GPUs with TP=2–4 | Highest: encoder weights + audio processing pipeline + larger KV cache |
Adapter-Based Serving (LLaVA, Pixtral, Qwen2.5-VL)
The adapter paradigm dominates because it requires no retraining of the LLM. A pretrained vision encoder (SigLIP, ViT) extracts visual features, an MLP adapter projects them into the LLM’s embedding dimension, and the LLM processes them as if they were text tokens [3].
The serving cost is an additional forward pass through the vision encoder for every image-triggered request. A single 640×640 image produces approximately 400 visual tokens after the vision encoder’s patch embedding (16×16 patches at 384px processing resolution, plus positional encoding overhead). A 4K image at full resolution can produce 4,000+ visual tokens [2].
This means: a request with a single image increases the prefill-phase context by 400–4,000 tokens. At typical LLM pricing, the image processing cost is dominated by the prefill compute for those tokens, not the encoder forward pass itself. The encoder pass takes 5–15ms on an H100 for a single image — negligible compared to the 200–800ms prefill for 4K additional tokens [2].
The architectural implication: For adapter-based models, the bottleneck shifts from GPU compute to KV cache memory. Each image adds 400–4,000 slots to the KV cache that persist through decode. For long-context multi-turn conversations with images in every turn, KV cache grows 2–3× faster than text-only sessions [2].
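The token counts above can be approximated with a small estimator. This is a sketch under stated assumptions, not any model's actual preprocessing: it assumes 16-pixel square patches followed by a 2×2 spatial merge, which happens to reproduce the ~400-token figure for a 640×640 image. Real encoders differ in patch size, tiling, and resizing rules. The per-turn token counts in `session_context` are likewise hypothetical.

```python
import math


def visual_tokens(width: int, height: int, patch: int = 16, merge: int = 2) -> int:
    """Estimate visual tokens for one image.

    Assumes square patches of `patch` pixels and a merge x merge spatial
    merge of adjacent patches. Both defaults are illustrative assumptions.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    # Each merge x merge group of patches collapses into a single token.
    return math.ceil(cols / merge) * math.ceil(rows / merge)


def session_context(text_per_turn: int, image_per_turn: int, turns: int) -> int:
    """Total tokens resident in the KV cache after `turns` conversation turns."""
    return turns * (text_per_turn + image_per_turn)


if __name__ == "__main__":
    print(visual_tokens(640, 640))     # 400
    print(visual_tokens(3840, 2160))   # 8160
    # Ten turns of 300 text tokens, with and without a 400-token image per turn.
    print(session_context(300, 400, 10) / session_context(300, 0, 10))
```

With one small image per turn, the session in this example holds a little over twice as many tokens as its text-only equivalent, which is in line with the 2–3× growth described above.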
Early Fusion (Llama 4)
Llama 4’s architecture is a fundamental departure from adapter-based approaches. Vision and text tokens are fused at the input embedding layer — not projected into an existing embedding space [1]. The Mixture-of-Experts (MoE) layers naturally develop modality-specific specialization: certain experts activate preferentially for visual tokens, others for text tokens, without explicit routing labels [4].
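Llama 4 Scout (17B active parameters, 109B total across 16 experts) fits on a single H100 only with Int4 quantization [1]; in 16-bit precision the weights alone exceed one 80GB card. Whatever VRAM remains after the quantized weights is what serves the KV cache. At 10M-token context (supported via iRoPE), the KV cache requirement alone reaches ~1.2TB, which requires CPU offloading or context-culling strategies [1].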
For production serving, the key implication is scheduling homogeneity. Because Scout has no separate vision encoder, every request — text-only or image-triggered — runs through exactly the same model pipeline. The scheduler sees no modality-dependent pipeline stages. This simplifies deployment but means you cannot independently scale encoding capacity when image-heavy traffic spikes [5].
Unified Omni Models (Qwen3.5-Omni, Gemma 4 E4B)
These models process text, image, audio, and video through a shared multimodal encoder. Qwen3.5-Omni uses a three-component architecture: a multimodal encoder, a Thinker decoder, and a Talker decoder for real-time speech output [6].
This creates the most complex serving topology: audio processing requires additional GPU memory for Whisper-style encoders or dedicated audio decoders, and real-time speech generation demands streaming with low first-token latency. A single Qwen3.5-Omni instance with TP=4 on H100s allocates approximately 140GB for weights (70B-class total parameters) across the three components [6].
EPD Disaggregation: The Production Pattern for Multi-Modal Serving
The single most important architectural insight for multi-modal inference at scale is stage disaggregation. The encoder, prefill, and decode stages have fundamentally different compute profiles, memory requirements, and scaling behaviors [2][7].
The Disparity
| Stage | Compute Profile | Memory Profile | GPU Utilization | Latency Sensitivity |
|---|---|---|---|---|
| Encoder (vision) | Compute-bound (attention over patches) | Low (no KV cache) | 40–60% | Low-medium |
| Encoder (audio) | Compute-bound (convolution + transformer) | Medium (audio features) | 50–70% | High (streaming) |
| Prefill | Compute-bound (parallel attention) | Medium-high (KV cache writes) | 70–90% | Medium (TTFT) |
| Decode | Memory-bound (cache reads) | High (KV cache reads) | 30–50% | High (TPOT) |
These profiles cannot be optimized on the same GPU. Saturating decode throughput requires high memory bandwidth (H100 SXM: 3.35 TB/s). Saturating encoder throughput requires high compute throughput (H100 SXM: roughly 989 TFLOPS dense FP16; the often-quoted 1,979 TFLOPS figure assumes structured sparsity). Binding them together means one stage starves the other [7].
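A rough roofline check makes the compute-bound versus memory-bound split concrete. The sketch below is illustrative only. It assumes FLOPs ≈ 2 × parameters × tokens and bytes moved ≈ 2 × parameters (FP16 weights read once per step), and it ignores KV cache and activation traffic. The 7B parameter count is a hypothetical example, and the hardware figures are the dense FP16 and bandwidth numbers for an H100 SXM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    peak_flops: float      # FLOP/s
    mem_bandwidth: float   # bytes/s

    @property
    def ridge(self) -> float:
        """Arithmetic intensity (FLOP/byte) where compute and memory limits meet."""
        return self.peak_flops / self.mem_bandwidth


def bound(gpu: Gpu, flops: float, bytes_moved: float) -> str:
    """Classify a step as compute-bound or memory-bound on the given GPU."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= gpu.ridge else "memory-bound"


# Dense FP16 tensor throughput (no sparsity) and HBM bandwidth for H100 SXM.
H100_SXM = Gpu(peak_flops=989e12, mem_bandwidth=3.35e12)

if __name__ == "__main__":
    params = 7e9  # hypothetical model size
    # Decode: one token per step, so every weight byte does about one FLOP.
    print(bound(H100_SXM, flops=2 * params * 1, bytes_moved=2 * params))
    # Prefill over 4,000 tokens: the same weight bytes are reused 4,000 times.
    print(bound(H100_SXM, flops=2 * params * 4000, bytes_moved=2 * params))
```

Under these assumptions the ridge point sits near 295 FLOP/byte, so single-token decode lands far on the memory-bound side while a multi-thousand-token prefill lands on the compute-bound side.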
EPD Architecture
The EPD (Encoder-Prefill-Decode) disaggregation pattern separates each stage into independent instance pools, connected by a shared KV cache transport [7]:
Request with image
       │
       ▼
┌──────────────┐
│ Encoder Pool │  ← N instances, auto-scaled by image queue depth
│ (2–4 GPUs)   │    Each instance: vision encoder + embedding projection
└──────┬───────┘
       │  (image embeddings sent via IPC / RDMA)
       ▼
┌──────────────┐
│ Prefill Pool │  ← M instances, auto-scaled by prefill queue depth
│ (4–8 GPUs)   │    Computes full KV cache including image embeddings
└──────┬───────┘
       │  (KV cache transferred to decode pool)
       ▼
┌──────────────┐
│ Decode Pool  │  ← P instances, stable count (long-running)
│ (8–32 GPUs)  │    Autoregressive generation, KV cache reads
└──────────────┘
vLLM’s experimental EPD disaggregation, announced in December 2025, demonstrates the viability of this pattern. The vision encoder runs as a separate worker process that communicates image embeddings to the LLM worker via shared memory or RDMA. The LLM worker never handles raw image data — only the pre-computed embeddings [7].
Results from the ModServe paper (128-GPU cluster): EPD disaggregation with modality-aware autoscaling achieves 3.3–5.5× throughput improvement and 25–41.3% cost savings compared to monolithic serving, while meeting P99 latency SLOs on production multi-modal inference traces [2].
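Scaling each pool by its own queue depth can be sketched as a simple target-tracking rule. This is a minimal illustration, not the ModServe or vLLM autoscaler: `PoolPolicy`, its fields, and the thresholds are hypothetical names and values chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolPolicy:
    target_queue_per_instance: int  # queued requests one instance should absorb
    min_instances: int
    max_instances: int


def desired_instances(queue_depth: int, policy: PoolPolicy) -> int:
    """Instances needed to keep per-instance queue depth at or below target."""
    wanted = math.ceil(queue_depth / policy.target_queue_per_instance)
    # Clamp so the pool never scales to zero or past its GPU budget.
    return max(policy.min_instances, min(policy.max_instances, wanted))


if __name__ == "__main__":
    encoder = PoolPolicy(target_queue_per_instance=8, min_instances=1, max_instances=4)
    for depth in (0, 20, 100):
        print(depth, desired_instances(depth, encoder))
```

Because each pool gets its own policy, an image burst raises only the encoder pool's instance count and leaves the decode pool untouched.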
When EPD Makes Sense
| Deployment Size | EPD Recommended? | Rationale |
|---|---|---|
| < 4 GPUs | No | Overhead of IPC/RDMA transport exceeds benefit |
| 4–16 GPUs | Yes, encoder pool only | Separate 1–2 GPUs for vision encoding |
| 16–64 GPUs | Yes, full EPD | Separate encoder + prefill + decode pools |
| > 64 GPUs | Yes, with hierarchy | Multi-tier disaggregation per workload class |
GPU Memory Strategies Across Modalities
Multi-modal serving introduces a new memory management challenge: modality-dependent memory profiles. A Llama 4 Scout instance serving text-only requests at 4K context uses approximately 35GB of VRAM. The same instance serving an image-triggered request at 128K context uses 80GB+ [1].
Dynamic KV Cache Allocation
The critical insight: multi-modal KV cache size is hard to predict from request metadata alone, because the image token count depends on resolution and is only known after preprocessing. Memory pressure and compute pressure also diverge. A text-only request with 128K tokens uses ~1.2GB KV cache per batch slot. An image-rich request with 4K text + 4K image tokens uses only ~75MB, 16× less, but its prefill phase is compute-bound on the image token embeddings, not memory-bound [2].
This means static KV cache reservation (pre-allocating fixed blocks) wastes 60–80% of VRAM on text-heavy workloads and under-allocates on image-heavy workloads. vLLM’s PagedAttention handles this naturally — cache pages (16 tokens each) are allocated dynamically — but the page table overhead grows significantly when images add thousands of additional tokens per request [8].
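The gap between static reservation and paged allocation can be estimated with a few lines of arithmetic. The sketch below is an accounting model only, not vLLM's allocator: it uses the 16-token page size quoted above, and the request lengths and the 8,192-token static reservation are hypothetical.

```python
import math
from typing import Sequence

BLOCK_TOKENS = 16  # page size quoted in the text


def blocks_needed(tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Pages required to hold `tokens` KV entries."""
    return math.ceil(tokens / block_tokens)


def static_waste(request_tokens: Sequence[int], reserved_tokens: int) -> float:
    """Fraction of statically reserved KV slots left unused."""
    if any(t > reserved_tokens for t in request_tokens):
        raise ValueError("a request exceeds the static reservation")
    reserved = reserved_tokens * len(request_tokens)
    return 1 - sum(request_tokens) / reserved


def paged_waste(request_tokens: Sequence[int], block_tokens: int = BLOCK_TOKENS) -> float:
    """Fraction of paged KV slots left unused (only the last page per request)."""
    allocated = sum(blocks_needed(t, block_tokens) for t in request_tokens) * block_tokens
    return 1 - sum(request_tokens) / allocated


if __name__ == "__main__":
    # Two text-only requests and two image-bearing requests (hypothetical mix).
    batch = [500, 900, 4400, 8000]
    print(f"static: {static_waste(batch, 8192):.1%}")
    print(f"paged:  {paged_waste(batch):.1%}")
```

In this example the static scheme leaves well over half of its reserved slots idle, while paging wastes only the partially filled last page of each request.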
Modality-Aware Memory Pooling
Production systems serving mixed-modality workloads should use separate memory pools for vision encoder weights, LLM weights, and KV cache, with dynamic rebalancing:
| Pool | Contents | Sizing Rule |
|---|---|---|
| Vision encoder | SigLIP/ViT weights, adapter MLP | Static: 2–8GB per encoder model |
| LLM weights | Transformer weights | Static: model-size dependent |
| KV cache | Key/value tensors per active sequence | Dynamic: adjusts for current batch |
| Scratch | Intermediate activations, audio features | Dynamic: freed after each forward pass |
Hugging Face’s vLLM integration for Qwen3.5-27B demonstrates the principle: the --enable-vision-encoder-offloading flag offloads the vision encoder to CPU when no image requests are in the batch, freeing 4–6GB of VRAM for additional KV cache slots [9].
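The effect of encoder offloading on the KV cache pool follows directly from the pool table. The sketch below is a budgeting illustration with hypothetical sizes (an 80GB card, 54GB of LLM weights, a 5GB encoder, 4GB of scratch), not measurements from any deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryBudget:
    total_gb: float
    llm_weights_gb: float
    encoder_gb: float
    scratch_gb: float

    def kv_pool_gb(self, encoder_resident: bool) -> float:
        """VRAM left for KV cache once static pools and scratch are carved out."""
        static = self.llm_weights_gb + (self.encoder_gb if encoder_resident else 0.0)
        free = self.total_gb - static - self.scratch_gb
        if free < 0:
            raise ValueError("static pools exceed device memory")
        return free


if __name__ == "__main__":
    # All sizes below are hypothetical.
    budget = MemoryBudget(total_gb=80, llm_weights_gb=54, encoder_gb=5, scratch_gb=4)
    print(budget.kv_pool_gb(encoder_resident=True))   # 17.0
    print(budget.kv_pool_gb(encoder_resident=False))  # 22.0
```

The trade-off is latency: the first image request after an offload pays the cost of copying encoder weights back to the GPU.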
Multi-Modal AI Gateway Patterns
A production multi-modal serving stack needs a gateway layer that handles modality-aware routing, fallback, and cost tracking — not just a text-model proxy [10].
Unified API Surface
The gateway normalizes the request interface across providers with different modality support:
- OpenAI-compatible: base64-encoded image data in the content array
- Anthropic: Image blocks in the content array with source.type = "base64" or "url"
- Google Gemini: Inline data parts or file URI references
- vLLM (open-source): multi_modal_data dict with numpy arrays or Hugging Face tensors
Portkey’s AI Gateway, Kong’s AI Gateway 3.13, and other infrastructure tools expose a unified /v1/chat/completions endpoint that transparently converts between these formats [10][11].
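One such conversion can be sketched for a single content part. This is an illustration of the idea, not how Portkey or Kong implement it. The field names follow the public OpenAI-style (`image_url` with a data URL) and Anthropic-style (`image` block with a `source`) formats as commonly documented, so check them against the current provider references before relying on them. `openai_part_to_anthropic` is a hypothetical helper name.

```python
import re

# Matches data URLs such as "data:image/png;base64,AAAA".
_DATA_URL = re.compile(
    r"^data:(?P<media>[\w.+-]+/[\w.+-]+);base64,(?P<data>.+)$", re.DOTALL
)


def openai_part_to_anthropic(part: dict) -> dict:
    """Convert one OpenAI-style content part into an Anthropic-style block."""
    if part["type"] == "text":
        return {"type": "text", "text": part["text"]}
    if part["type"] == "image_url":
        url = part["image_url"]["url"]
        match = _DATA_URL.match(url)
        if match:
            # Inline image: split the data URL into media type and payload.
            return {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": match["media"],
                    "data": match["data"],
                },
            }
        # Remote image: pass the URL through unchanged.
        return {"type": "image", "source": {"type": "url", "url": url}}
    raise ValueError(f"unsupported content part type: {part['type']!r}")


if __name__ == "__main__":
    part = {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}}
    print(openai_part_to_anthropic(part))
```

Raising on unknown part types keeps the gateway from silently dropping a modality it does not know how to translate.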
Modality-Aware Routing
The gateway must route requests based on modality because not all backends support all modalities:
| Scenario | Route | Rationale |
|---|---|---|
| Text-only, < 4K tokens | Fast route (Llama 4 Scout or GPT-5.2 mini) | Cheapest, lowest latency |
| Text + image | VLM-capable backend (Llama 4, Qwen3-VL, GPT-5.2) | Image must reach a model with vision encoder |
| Text + audio | Audio-capable backend (Qwen3.5-Omni, Gemini 2.5 Pro) | Audio processing requires specialized encoder |
| Multi-turn with images | Session-pinned backend | KV cache warm across turns; image re-encoding wastes prefill |
A request with a single image should never hit a text-only backend — it either gets a modality error or, worse, silently fails. The gateway should implement a modality capabilities registry that maps each backend to its supported input types and maximum image resolution [12].
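A capabilities registry of this kind can be as small as a list of backends plus a subset check. The sketch below is one possible shape, not a description of any gateway product: the backend names, pixel limits, and cost ranks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    modalities: frozenset
    max_image_pixels: int = 0   # 0 means no image support
    cost_rank: int = 0          # lower is cheaper


class CapabilityRegistry:
    def __init__(self, backends):
        self._backends = list(backends)

    def route(self, required, image_pixels=0):
        """Return the cheapest backend that supports every required modality."""
        candidates = [
            b for b in self._backends
            if set(required) <= b.modalities
            and ("image" not in required or image_pixels <= b.max_image_pixels)
        ]
        if not candidates:
            # Fail loudly instead of letting a text-only backend drop the image.
            raise LookupError(f"no backend supports {sorted(required)}")
        return min(candidates, key=lambda b: b.cost_rank)


# Hypothetical backends for illustration.
REGISTRY = CapabilityRegistry([
    Backend("text-fast", frozenset({"text"}), cost_rank=0),
    Backend("vlm", frozenset({"text", "image"}), max_image_pixels=4_000_000, cost_rank=1),
    Backend("omni", frozenset({"text", "image", "audio"}), max_image_pixels=4_000_000, cost_rank=2),
])

if __name__ == "__main__":
    print(REGISTRY.route({"text"}).name)
    print(REGISTRY.route({"text", "image"}, image_pixels=640 * 640).name)
    print(REGISTRY.route({"text", "audio"}).name)
```

The explicit `LookupError` is the important part: a request that no backend can serve is rejected at the gateway rather than forwarded to a backend that would ignore the image.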
Cost Attribution per Modality
The cost per request varies dramatically by modality. A text-only request costing $0.003 might cost $0.045 with a single high-resolution image (15× cost multiplier). Per-modality cost tracking at the gateway level enables accurate chargeback and informs routing decisions [2]:
Request cost breakdown (GPT-5.2, 4K tokens + image):
├── Text tokens (4K input) : $0.0020
├── Image tokens (~400 visual): $0.0080 ← 4× the text cost
├── Output tokens (500 avg) : $0.0050
└── Total gateway overhead : $0.0002
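Per-modality attribution like the breakdown above reduces to multiplying token counts by per-modality rates. The sketch below is an assumption-laden illustration: the rates are back-calculated from the example breakdown and are not a real price list, and `Rates` and `attribute_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    text_in_per_1k: float    # USD per 1,000 text input tokens
    image_in_per_1k: float   # USD per 1,000 visual tokens
    text_out_per_1k: float   # USD per 1,000 output tokens
    gateway_flat: float      # USD per request


def attribute_cost(text_in: int, image_tokens: int, text_out: int, rates: Rates) -> dict:
    """Split one request's cost by modality so it can be charged back."""
    breakdown = {
        "text_input": text_in / 1000 * rates.text_in_per_1k,
        "image_input": image_tokens / 1000 * rates.image_in_per_1k,
        "output": text_out / 1000 * rates.text_out_per_1k,
        "gateway": rates.gateway_flat,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Rates inferred from the example breakdown above; illustrative only.
EXAMPLE_RATES = Rates(text_in_per_1k=0.0005, image_in_per_1k=0.02,
                      text_out_per_1k=0.01, gateway_flat=0.0002)

if __name__ == "__main__":
    costs = attribute_cost(text_in=4000, image_tokens=400, text_out=500, rates=EXAMPLE_RATES)
    for key, usd in costs.items():
        print(f"{key:12s} ${usd:.4f}")
```

Emitting the breakdown as a dictionary lets the gateway attach it to request logs, where it can be aggregated per team or per modality.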
Putting It Together: A Production Multi-Modal Architecture
The complete stack layers EPD disaggregation behind a modality-aware gateway:
Client request (text / image / audio)
            │
            ▼
┌────────────────────────────────┐
│ Gateway Layer                  │
│ • Unified /v1/chat/completions │
│ • Modality capabilities check  │
│ • Backend routing by modality  │
│ • Cost tracking + chargeback   │
└───────────┬────────────────────┘
            │  (routed to appropriate pool)
            ▼
┌────────────────────────────────┐
│ Encoder Pool                   │
│ • Vision encoder (SigLIP/ViT)  │  ← Scaled by image queue depth
│ • Audio encoder (Whisper)      │  ← Scaled separately
│ • Embedding projection         │
└───────────┬────────────────────┘
            │  (embeddings via RDMA)
            ▼
┌────────────────────────────────┐
│ Prefill Pool                   │
│ • Full KV cache computation    │  ← Scaled by prefill queue
│ • Image + text attention       │
│ • Session establishment        │
└───────────┬────────────────────┘
            │  (KV cache transfer)
            ▼
┌────────────────────────────────┐
│ Decode Pool                    │
│ • Autoregressive generation    │  ← Stable count per workload
│ • KV cache reads               │
│ • Streaming via SSE            │
└───────────┬────────────────────┘
            │  (tokens to client)
            ▼
Streaming response
This architecture handles the three failure modes that crash monolithic multi-modal serving:
- Encoder overloaded → scale encoder pool independently (add GPUs without touching LLM instances)
- KV cache exhaustion → scale decode pool or offload to CPU-backed cache (LMCache)
- Mixed-modality queue head-of-line blocking → modality-aware scheduling prioritizes text-only requests during image processing bursts [2][5]
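The third mitigation, letting text-only requests bypass a backlog of image requests, can be sketched with a priority queue. This is a deliberately simplified illustration and not the scheduler from the cited work. Priority is fixed at enqueue time, there is no aging to prevent image requests from starving, and `ModalityAwareQueue` is a hypothetical name.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    priority: int
    seq: int                                  # preserves FIFO order within a priority
    request_id: str = field(compare=False)


class ModalityAwareQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, request_id: str, has_image: bool, encoder_busy: bool) -> None:
        # While the encoder is saturated, image requests wait behind text-only
        # ones. Otherwise every request shares one priority and order is FIFO.
        priority = 1 if (has_image and encoder_busy) else 0
        heapq.heappush(self._heap, _Entry(priority, next(self._seq), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap).request_id

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    q = ModalityAwareQueue()
    q.push("img-1", has_image=True, encoder_busy=True)
    q.push("txt-1", has_image=False, encoder_busy=True)
    q.push("txt-2", has_image=False, encoder_busy=True)
    print([q.pop() for _ in range(len(q))])  # ['txt-1', 'txt-2', 'img-1']
```

A production scheduler would also bound how long an image request can be deferred, for example by raising its priority once it has waited past a latency budget.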
Summary of Architectural Decisions
| Decision | Recommendation | Why |
|---|---|---|
| Model paradigm | Adapter-based for flexibility; early fusion for throughput | Adapter: mix and match encoders. Early fusion: simpler pipeline. |
| GPU topology | EPD disaggregation at 4+ GPUs | 3.3–5.5× throughput improvement at scale |
| Vision encoder placement | Separate worker process (vLLM EPD) | Frees LLM GPU from image processing, enables independent scaling |
| KV cache strategy | Dynamic allocation with PagedAttention | Static allocation wastes 60–80% on mixed workloads |
| Memory pooling | Separate pools with encoder offloading | Qwen3.5-27B frees 4–6GB via vision encoder CPU offloading |
| Gateway routing | Modality capabilities registry | Prevents silent failures from text-only backends |
| Cost tracking | Per-modality granularity | Image requests cost 10–15× more than text-only |
Multi-modal inference is not text-only inference with an image encoder bolted on. The modality-aware patterns — EPD disaggregation, dynamic memory pooling, modality-gated routing — are the difference between a system that works at 100 requests per second and one that collapses under the weight of its own images.
References
[1] Meta AI, “The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI,” April 2025. ai.meta.com/blog/llama-4-multimodal-intelligence/
[2] Qiu et al., “ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving,” arXiv:2502.00937, February 2025. arxiv.org/abs/2502.00937
[3] BentoML Blog, “Multimodal AI: A Guide to Open-Source Vision Language Models,” December 2025. bentoml.com/blog/multimodal-ai-a-guide-to-open-source-vision-language-models
[4] Red Hat Developers, “Llama 4 Herd Is Here with Day 0 Inference Support in vLLM,” April 2025. developers.redhat.com/articles/2025/04/05/llama-4-herd-here-day-zero-inference-support-vllm
[5] Spheron Network, “Deploy Vision Language Models on GPU Cloud: Qwen3-VL, Llama 4 Scout,” April 2026. spheron.network/blog/deploy-vision-language-models-gpu-cloud/
[6] Spheron Network, “Deploy Qwen3.5-Omni on GPU Cloud: Self-Host Real-Time Multimodal AI,” April 2026. spheron.network/blog/deploy-qwen3-5-omni-gpu-cloud/
[7] vLLM Blog, “Encoder Disaggregation for Scalable Multimodal Model Serving,” December 2025. vllm.ai/blog/2025-12-15-vllm-epd
[8] vLLM GitHub, “RFC: Prototype Separating Vision Encoder to Its Own Worker,” Issue #20799, July 2025. github.com/vllm-project/vllm/issues/20799
[9] Hugging Face, “Qwen/Qwen3.5-27B Model Card,” March 2026. huggingface.co/Qwen/Qwen3.5-27B
[10] Portkey Blog, “Bringing Multimodal Models to Production with an AI Gateway,” June 2025. portkey.ai/blog/multimodal-models-to-production-with-an-ai-gateway/
[11] Kong Inc., “Move More Agentic Workloads to Production with AI Gateway 3.13,” December 2025. konghq.com/blog/product-releases/ai-gateway-3-13
[12] Truefoundry, “Multi-Model Routing: Optimize AI Tasks Efficiently,” 2026. truefoundry.com/blog/multi-model-routing