LLM Serving in Production: vLLM, Triton, and the Token Throughput Wars


Introduction

There's a large gap between running a language model and serving a language model. Running it means calling an inference API and getting a response. Serving it means operating a system that handles thousands of concurrent requests, manages GPU memory efficiently, routes traffic intelligently, meets latency SLAs, and delivers economically viable cost per token — at scale, reliably, continuously.

Most teams start in the first category. As usage grows, they eventually encounter the economics and operational complexity of the second. The transition is more substantial than it appears.

This post is for engineers who've moved beyond prototype AI applications and need to understand the infrastructure layer: how LLM serving frameworks work, what the key metrics are, how the major options compare, and how to design a serving architecture that doesn't become the bottleneck or the budget line item that kills the project.

LLM Serving Architecture Overview

The Serving Problem

Language model inference has unusual resource characteristics compared to traditional application serving.

Memory dominates compute for large models. A 70B parameter model in fp16 requires approximately 140GB of GPU memory just to store the weights — before any batching or KV cache. Even a 7B model requires ~14GB. GPU memory is expensive and limited: an H100 80GB card costs $30,000–$40,000. Memory efficiency is directly correlated with cost efficiency.

The KV cache is the performance bottleneck. During inference, the model computes key-value attention tensors for each token in the context. These need to be retained across generation steps (so you don't recompute from scratch for every new token). The KV cache grows linearly with sequence length and batch size. On a long context with a large batch, the KV cache can consume more memory than the model weights themselves. Managing KV cache efficiently is the central challenge of production LLM serving.
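To make that growth concrete, KV cache size follows directly from the model's architecture. A minimal sketch — the Llama-3-8B-style numbers (32 layers, 8 KV heads via grouped-query attention, head dim 128) are illustrative assumptions:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """KV cache footprint: 2 tensors (K and V) per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Llama-3-8B-style config in fp16: 128 KiB per token
per_seq = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=1)
print(f"{per_seq / 2**30:.1f} GiB per 8K-token sequence")
```

At batch size 32, the same arithmetic gives 32 GiB of KV cache — more than the ~16 GB of weights for an 8B model, which is exactly the situation described above.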

Throughput and latency are in tension. Batching multiple requests together dramatically improves GPU utilization and throughput. But it adds latency — a request that arrives while the server is busy waits to be scheduled into a batch. The optimal batching strategy depends on your workload and SLAs.

Generation is autoregressive — one token at a time. Standard inference generates one token per forward pass. A 500-token response requires 500 forward passes. This is fundamentally different from classification or embedding models that produce output in a single pass. It means time-to-first-token (TTFT) and inter-token latency (ITL) are the metrics users experience, not just total response time.

The Key Metrics

Before comparing serving frameworks, understand the metrics that matter:

Time to First Token (TTFT): how long after the request arrives before the first token is generated. Determined primarily by the prefill phase (processing the input). Critical for interactive use cases where users are watching text stream in.

Inter-Token Latency (ITL): time between each successive token during generation. Determines the "typing speed" experience. Should be consistent — variable ITL feels choppy to users.

Throughput (tokens/second): total tokens generated per second across all concurrent requests. The production efficiency metric — higher throughput means lower cost per token.

Request latency (P95/P99): end-to-end time for complete requests at the 95th and 99th percentile. The SLA metric. P95 matters more than mean for real user experience.

GPU memory utilization: what fraction of available GPU memory is actively used. Low utilization means you're paying for capacity you're not using. High utilization reduces room for traffic spikes.
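All of these per-request metrics fall out of per-token timestamps on a streamed response. A small sketch (the timestamp values in the example are made up for illustration):

```python
def streaming_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, mean ITL, and tokens/sec from token arrival timestamps."""
    ttft = token_times[0] - request_start
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    duration = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "mean_itl_s": sum(itls) / len(itls) if itls else 0.0,
        "tokens_per_s": len(token_times) / duration,
    }

# Illustrative: first token at t=0.4s, then one token every 25 ms
m = streaming_metrics(0.0, [0.4 + 0.025 * i for i in range(5)])
```

Aggregate the per-request values into P95/P99 across requests to get the SLA view; a mean alone hides exactly the tail that users notice.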

vLLM: The Efficient Memory Management Baseline

vLLM, released by the Berkeley Sky Computing Lab in 2023, became the de facto standard for open-source LLM serving by solving the KV cache management problem elegantly.

PagedAttention: vLLM's core innovation, borrowed from operating system paging. Instead of pre-allocating a contiguous block of GPU memory for each request's KV cache (which leads to massive internal fragmentation — you reserve space for the maximum sequence length even when requests are short), PagedAttention allocates memory in small, fixed-size pages and maps non-contiguous physical memory to logical attention positions.

The result: near-zero internal fragmentation. A server that previously wasted 60-80% of KV cache memory on fragmentation now uses that memory productively. The practical effect is 2–4× more requests served concurrently on the same hardware.
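The idea can be illustrated with a toy block table — a simplified sketch of the concept, not vLLM's actual implementation. Each request draws pages from a shared free pool only as tokens are produced, instead of reserving space up front for the maximum sequence length:

```python
class PagedKVAllocator:
    """Toy paged KV allocator: grow each request's cache one page at a time."""
    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.tables: dict[str, list[int]] = {}   # request -> physical pages
        self.lengths: dict[str, int] = {}        # request -> tokens stored

    def append_token(self, req: str) -> None:
        n = self.lengths.get(req, 0)
        if n % self.page_size == 0:              # current page full (or none yet)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted -> request must wait")
            self.tables.setdefault(req, []).append(self.free_pages.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        self.free_pages.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = PagedKVAllocator(num_pages=4)
for _ in range(20):                              # 20 tokens -> only 2 pages of 16
    alloc.append_token("req-A")
```

A short response holds only the pages it actually filled, so the worst-case waste per request is less than one page rather than an entire max-length reservation.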

Continuous batching: traditional (static) batching fixes the batch when it's formed and waits for every request in it to complete before admitting new ones. Continuous batching inserts new requests into the batch dynamically — when a request in the current batch finishes, a new request immediately takes its slot. This dramatically improves throughput under variable request lengths.
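The difference shows up even in a toy simulation of decode steps (a sketch with made-up request lengths, not a real serving loop):

```python
from collections import deque

def static_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Static batching: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))

def continuous_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Continuous batching: a finished request's slot is refilled immediately."""
    waiting = deque(lengths)          # remaining tokens per queued request
    running: list[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:   # admit from the queue
            running.append(waiting.popleft())
        steps += 1                                    # one decode step for all
        running = [r - 1 for r in running if r > 1]   # drop finished requests
    return steps

# One long request batched with several short ones
static = static_batching_steps([100, 10, 10, 10], max_batch=2)      # 110 steps
continuous = continuous_batching_steps([100, 10, 10, 10], max_batch=2)  # 100 steps
```

The short requests ride along in slots freed next to the long one instead of forcing new batches, which is why the gains grow with variance in output length.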

# Running vLLM with OpenAI-compatible API
# pip install vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,      # Number of GPUs for tensor parallelism
    gpu_memory_utilization=0.90, # Reserve 10% for overhead
    max_model_len=8192,          # Maximum context length
    dtype="bfloat16",            # Compute dtype (bf16 is usually fastest on H100/A100)
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

prompts = [
    "Explain the difference between a mutex and a semaphore.",
    "What are the SOLID principles in object-oriented design?",
    "How does a transformer's attention mechanism work?",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {output.outputs[0].text}")

Production deployment: vLLM ships an OpenAI-compatible REST API server:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --port 8000

This makes it a drop-in replacement for the OpenAI API in applications that use the OpenAI SDK.

Speculative Decoding: The 2026 Throughput Multiplier

Speculative decoding is the most impactful throughput improvement in production serving over the past two years. The technique exploits the asymmetry between generation speed (slow — sequential, one token at a time) and verification speed (fast — can verify many tokens in a single forward pass).

A small draft model (typically 1B-3B parameters) quickly generates a speculative sequence of several tokens. The large target model then verifies the entire speculative sequence in a single parallel forward pass. Tokens that match are accepted. The first rejected token and all subsequent tokens are discarded, and the target model generates the correct continuation from the rejection point.

In practice, speculative decoding achieves 2–4× throughput improvement for structured outputs (code, JSON, formulaic text) where the draft model's guesses are frequently correct. For highly creative or open-ended generation, the improvement is smaller (the draft model's guesses are rejected more often).
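The accept/reject logic can be sketched with greedy verification — a toy illustration of the mechanism; real implementations (e.g. the DeepMind speculative sampling paper) also handle stochastic sampling correctly:

```python
def verify_draft(draft: list[int], target: list[int]) -> list[int]:
    """Greedy verification sketch. target[i] is the target model's greedy
    token at position i, computed for all positions in ONE forward pass."""
    out = []
    for d, t in zip(draft, target):
        if d != t:
            out.append(t)            # target's correction at the rejection point
            return out
        out.append(d)                # draft token accepted
    return out                       # every speculative token accepted

# Draft guessed 5 tokens; target agrees on the first two only
assert verify_draft([1, 2, 3, 4, 5], [1, 2, 9, 4, 5]) == [1, 2, 9]
```

Note that every verification pass yields at least one token (the correction), so the worst case degrades to ordinary decoding plus the draft model's overhead — never slower generation of wrong output.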

# vLLM with speculative decoding
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",             # Target model
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",  # Draft model
    num_speculative_tokens=5,    # How many tokens to speculate ahead
    tensor_parallel_size=4,
)
# Note: newer vLLM releases take these settings via a `speculative_config`
# dict instead; check the docs for the version you run.

sequenceDiagram
    participant C as Client
    participant D as Draft Model (fast)
    participant T as Target Model (slow)
    C->>T: Request: "Generate code..."
    loop Speculative decoding
        T->>D: Current context
        D-->>T: Speculative tokens [a, b, c, d, e]
        T->>T: Verify all 5 tokens in ONE forward pass
        alt All accepted
            T-->>C: Stream tokens a, b, c, d, e
        else Partial accept (a, b accepted, c rejected)
            T-->>C: Stream tokens a, b
            T->>T: Generate correct token at position c
            T-->>C: Stream corrected token
        end
    end

Triton Inference Server: The Enterprise Option

NVIDIA's Triton Inference Server takes a different approach from vLLM. Where vLLM is focused specifically on LLM serving efficiency, Triton is a general-purpose model serving platform that supports PyTorch, TensorFlow, ONNX, TensorRT, and custom backends — including LLMs through its TensorRT-LLM backend.

TensorRT-LLM: NVIDIA's LLM inference library, tightly optimized for NVIDIA hardware. Uses quantization (FP8, INT8, INT4), kernel fusion, and hardware-specific optimizations that extract maximum performance from H100 and A100 GPUs. In benchmarks, TensorRT-LLM typically achieves 20-40% higher throughput than vLLM on NVIDIA hardware, at the cost of significantly higher operational complexity.

When to choose Triton over vLLM:

  • You need to serve multiple model types (not just LLMs) from a single serving platform
  • You're on NVIDIA hardware and need to extract maximum performance
  • You have the operational capacity to handle more complex deployment pipelines
  • You're doing serious quantization (INT4, INT8, FP8) and need the hardware-level optimizations

When vLLM is the right choice:

  • Team is optimizing for deployment speed and operational simplicity
  • Hardware-agnostic deployment is required (vLLM supports AMD ROCm, some Intel XPU)
  • The memory efficiency improvements (PagedAttention) are more valuable than compute efficiency

Serving Architecture Patterns

Single-Server Deployment

For moderate traffic, a single high-memory server with multiple GPUs handles most use cases. An H100 SXM5 node with 8 GPUs (640GB total memory) can serve 70B models with excellent throughput. Tensor parallelism distributes the model across GPUs within the server; no network communication between servers required.

Multi-Server Deployment with Load Balancer

For high-traffic applications, deploy multiple inference servers behind a load balancer. Each server runs the full model (or uses tensor parallelism across its local GPUs). The load balancer distributes requests using least-connections or round-robin routing.

Key consideration: sticky sessions. Requests with the same prefix (system prompt + conversation history) benefit from prefix caching if routed to the same server. A load balancer that's aware of prefixes can route related requests to the same server, improving cache hit rates.

# Example: prefix-aware load balancing
import hashlib
from typing import Optional

def get_server_for_request(
    conversation_id: Optional[str],
    system_prompt_hash: str,
    servers: list[str]
) -> str:
    """
    Route requests with the same conversation to the same server
    for better prefix cache utilization.
    """
    routing_key = conversation_id or system_prompt_hash
    server_index = int(hashlib.md5(routing_key.encode()).hexdigest(), 16) % len(servers)
    return servers[server_index]

Disaggregated Prefill and Decode

An emerging architecture for optimizing both TTFT and throughput simultaneously: separate the prefill phase (compute-intensive, processes the entire input context in parallel) from the decode phase (memory-bandwidth intensive, generates tokens sequentially).

Prefill servers use compute-optimized instances (H100 SXM). Decode servers use memory-optimized instances. Requests hit a prefill server to process the input, then transfer their KV cache to a decode server for generation. This allows each server type to be sized independently and allows more aggressive batching of the prefill phase.

This architecture is operationally complex but can achieve 30-50% cost reduction at high scale. Companies like Groq are exploring even more extreme disaggregation with dedicated hardware.

Quantization: Trading Precision for Efficiency

Quantization reduces model weight precision from 16-bit floating point (fp16/bf16) to lower precision (INT8, INT4, FP8). The model weights consume less memory, more of the model fits in GPU memory, and memory bandwidth requirements drop — resulting in higher throughput at the cost of small quality degradation.

| Format | Memory (70B model) | Quality | Throughput |
|--------|--------------------|---------|------------|
| fp16 | ~140GB | Baseline | Baseline |
| INT8 | ~70GB | ≈98% of fp16 | ~1.5× |
| INT4 (GPTQ/AWQ) | ~35GB | ≈95-97% of fp16 | ~2.5× |
| INT4 + fp16 KV cache | ~35GB weights + variable KV | ≈95-97% | ~2.5× |
| FP8 (H100 native) | ~70GB | ≈99% of fp16 | ~1.8× on H100 |

For production workloads, INT8 quantization is typically the right default — minimal quality degradation with meaningful efficiency gains. INT4 is appropriate where cost matters more than maximum quality (summarization, classification, code completion where re-runs are cheap).
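The memory column in the table follows directly from bits per weight — a quick sanity check, ignoring small overheads like embeddings and quantization scales:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Weights-only memory footprint in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# A 70B model at each precision in the table above
for fmt, bits in [("fp16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{fmt:>8}: {weight_memory_gb(70e9, bits):.0f} GB")
```

Note this covers weights only — the KV cache is sized separately and, as discussed earlier, can dominate at long contexts regardless of how aggressively the weights are quantized.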

Cost Analysis: Build vs. Buy

The classic infrastructure decision. For LLM serving, the break-even calculation is relatively clear:

API providers (Anthropic, OpenAI, Google): no infrastructure management, no GPU capital expenditure, pay per token. Makes sense for low-to-moderate volume, for accessing frontier models you can't run yourself, and when your team doesn't have ML infrastructure expertise.

Self-hosted open-source models: high upfront investment (GPU hardware or cloud GPU reservations), ongoing operational cost (engineering time, monitoring). Economically viable only at sustained high volume — against a low-cost API model, break-even can sit in the hundreds of millions of tokens per day per instance — where the savings justify the operational complexity.

Break-even calculation (rough):
API cost for 1M tokens/day (claude-haiku-4-5): ~$0.25-$0.80/day
Self-hosted H100 instance cost: ~$30-35/hr → ~$720-840/day
Break-even: ~900M-3.4B tokens/day per instance

Conclusion: self-hosting is economically compelling only at very high volume,
or for use cases requiring data privacy or very low latency.
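The same arithmetic as a reusable helper — the prices plugged in are the illustrative figures from the calculation above, not current quotes:

```python
def break_even_tokens_per_day(gpu_cost_per_day: float,
                              api_cost_per_million_tokens: float) -> float:
    """Daily token volume above which self-hosting beats the API on raw cost.
    Ignores engineering time, which in practice raises the bar further."""
    return gpu_cost_per_day / api_cost_per_million_tokens * 1e6

# ~$720/day instance vs ~$0.80 per 1M tokens -> roughly 900M tokens/day
volume = break_even_tokens_per_day(720, 0.80)
```

Rerun it with your own model's API pricing and your actual GPU reservation rate; the break-even point moves by an order of magnitude between cheap and frontier API models.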

Production Considerations

Health checks and graceful degradation: model loading takes minutes. Implement liveness probes that distinguish "model loading" from "model unhealthy." Under load, implement circuit breakers that reject new requests rather than queuing indefinitely when latency exceeds thresholds.
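A latency-based circuit breaker can be as simple as tracking a sliding window of recent request latencies — a minimal sketch with illustrative thresholds, not a production implementation:

```python
from collections import deque

class LatencyCircuitBreaker:
    """Reject new requests when recent ~P95 latency exceeds a threshold."""
    def __init__(self, threshold_s: float = 5.0, window: int = 100):
        self.threshold_s = threshold_s
        self.latencies: deque = deque(maxlen=window)

    def record(self, latency_s: float) -> None:
        self.latencies.append(latency_s)

    def allow_request(self) -> bool:
        if len(self.latencies) < 20:              # not enough signal yet
            return True
        ordered = sorted(self.latencies)
        p95 = ordered[int(0.95 * (len(ordered) - 1))]
        return p95 <= self.threshold_s
```

When allow_request() returns False, the server should fail fast with a 503 (ideally with a Retry-After hint) rather than let the queue grow without bound.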

Prompt caching: for applications with shared prefixes (system prompts, few-shot examples, document contexts that don't change), enable prefix caching. vLLM supports automatic prefix caching. The KV cache for repeated prefixes is computed once and reused, dramatically reducing TTFT and compute cost for the decode phase.
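In recent vLLM versions this is a server flag — flag names have shifted across releases, so check the docs for the version you run. A sketch matching the launch command shown earlier:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --port 8000
```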

Monitoring and alerting: beyond standard application metrics, monitor model-specific signals: token generation rate, KV cache utilization, queue depth and wait time, output quality scores from automated evaluation. Set alerts on KV cache utilization above 90% (performance degrades sharply as cache fills).

Conclusion

LLM serving in production is infrastructure that requires deliberate design. PagedAttention, continuous batching, speculative decoding, and quantization are not optimizations to add later — they're the foundation of economically viable serving at scale.

For most teams: start with the OpenAI-compatible vLLM API server with PagedAttention and continuous batching enabled. Add prefix caching immediately if your workload has repeated prefixes. Evaluate speculative decoding once your traffic patterns are established. Consider quantization when memory pressure or cost become constraints.

The teams that get this right are those that treat model serving as infrastructure — not as an API call that will scale automatically.

Sources & References

1. Kwon et al. (UC Berkeley) — "Efficient Memory Management for Large Language Model Serving with PagedAttention" — https://arxiv.org/abs/2309.06180

2. NVIDIA — "TensorRT-LLM" — https://developer.nvidia.com/tensorrt

3. Chen et al. (DeepMind) — "Accelerating Large Language Model Decoding with Speculative Sampling" — https://arxiv.org/abs/2302.01318

4. Anyscale — "How continuous batching enables 23× throughput" — https://www.anyscale.com/blog/continuous-batching-llm-inference

5. vLLM Documentation — https://docs.vllm.ai/

