LLM Deployment and Optimization — Production Serving at Scale
April 1, 2026 | Blog | LLM Engineering | 16 min read


An engineering team fine-tunes an open-source 70B parameter model that outperforms GPT-4 on their domain-specific tasks. The model runs perfectly during evaluation on their development cluster. Then they attempt to serve it in production: GPU costs exceed $15,000 per month, latency spikes to 8 seconds for simple queries, and the system crashes under moderate concurrent load because the KV cache exhausts GPU memory. The model that was supposed to reduce their dependency on expensive API providers now costs more to operate than the APIs it replaced.

This scenario plays out repeatedly because LLM deployment optimization is a fundamentally different discipline from model development. Training a model that produces good outputs is only half the problem. Serving that model to production users at acceptable latency, throughput, and cost requires a distinct set of engineering skills spanning GPU memory management, inference optimization, quantization techniques, and serving infrastructure design.

At ESS ENN Associates, our AI engineering team deploys LLMs in production environments where performance and cost requirements are non-negotiable. This guide covers the serving frameworks, optimization techniques, infrastructure patterns, and monitoring strategies that determine whether an LLM deployment succeeds or becomes an unsustainable cost center.

Serving Frameworks: vLLM, TGI, and TensorRT-LLM

The serving framework is the foundation of any LLM deployment optimization strategy. Choosing the right framework based on your performance requirements, hardware constraints, and operational complexity tolerance is the first critical decision.

vLLM has emerged as the most widely adopted open-source LLM serving framework, and for good reason. Its core innovation is PagedAttention, which manages the KV cache using a virtual memory system inspired by operating system page tables. Instead of allocating contiguous memory blocks for each request's KV cache (which leads to significant fragmentation and waste), PagedAttention allocates memory in fixed-size pages that can be scattered across GPU memory. This reduces KV cache memory waste by 60-80% compared to static allocation, enabling vLLM to serve 2-4x more concurrent requests on the same hardware. vLLM also implements continuous batching, which dynamically adds new requests to in-progress batches rather than waiting for an entire batch to complete before starting the next. Support for tensor parallelism across multiple GPUs, a wide range of model architectures, and an OpenAI-compatible API make vLLM the default choice for most production deployments.

Text Generation Inference (TGI) from HuggingFace provides a production-ready serving solution with built-in support for safety features, token streaming, and watermarking. TGI implements flash attention and continuous batching for competitive performance. Its integration with the HuggingFace ecosystem makes it straightforward to deploy models directly from the Hub. TGI includes built-in support for quantization with bitsandbytes, GPTQ, and AWQ. For teams already invested in the HuggingFace ecosystem, TGI offers the smoothest deployment path with good performance characteristics.

TensorRT-LLM from NVIDIA delivers the highest single-GPU inference performance through aggressive hardware-specific optimizations. It applies kernel fusion (combining multiple operations into single GPU kernel launches), custom CUDA kernels optimized for each GPU architecture, in-flight batching, and INT8/FP8 quantization that leverages Tensor Cores. TensorRT-LLM can achieve 30-50% higher throughput than vLLM on the same NVIDIA hardware, but requires NVIDIA GPUs exclusively and involves a more complex build and deployment process. For high-volume production workloads where every percentage of throughput improvement translates to meaningful cost savings, TensorRT-LLM justifies the additional operational complexity.

Ollama and llama.cpp serve a different use case: local and edge deployment of quantized models. llama.cpp provides highly optimized CPU inference for GGUF-format models, enabling LLMs to run on laptops, workstations, and edge servers without GPU hardware. Ollama wraps llama.cpp in a user-friendly interface with model management and API serving. These tools are ideal for development, testing, privacy-sensitive deployments where data cannot leave local infrastructure, and applications with low throughput requirements where GPU infrastructure is not cost-justified.

Quantization: GPTQ, AWQ, and GGUF

Quantization is the single most impactful LLM deployment optimization technique. By reducing the numerical precision of model weights, quantization cuts GPU memory requirements, increases inference speed, and enables larger models to run on smaller hardware configurations.

GPTQ (post-training quantization for GPT-style models) applies one-shot weight quantization using approximate second-order information to minimize the quantization error. GPTQ reduces model weights from FP16 (16 bits) to INT4 (4 bits), cutting memory requirements by approximately 4x. A 70B parameter model that requires 140GB in FP16 fits in approximately 35GB with 4-bit GPTQ, enabling it to run on a single 80GB A100 GPU instead of requiring two. GPTQ quantization typically preserves 95-98% of the original model quality on standard benchmarks. The quantization process requires a calibration dataset and takes 1-4 hours depending on model size, but produces a permanently quantized model that loads and runs efficiently without additional runtime overhead.
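As a sanity check on these numbers, a back-of-envelope sketch (sizes cover model weights only, excluding KV cache, activations, and framework overhead):

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone; excludes KV cache,
    activations, and framework overhead."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(70, 16)   # 140.0 GB -> needs two 80GB GPUs
gptq4_gb = weight_memory_gb(70, 4)   # 35.0 GB  -> fits one 80GB GPU
```

Real deployments should budget extra headroom on top of these figures for the KV cache and runtime buffers.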

AWQ (Activation-Aware Weight Quantization) improves on GPTQ by observing that not all weights are equally important for model quality. AWQ identifies weights that correspond to large activation magnitudes — the weights that matter most for model output quality — and preserves these critical weights at higher precision while aggressively quantizing less important weights. This activation-aware approach achieves better quality preservation than GPTQ at the same bit-width, particularly for smaller models (7B-13B) where the accuracy impact of quantization is more pronounced. AWQ has become the preferred quantization method for many production deployments.

GGUF (GPT-Generated Unified Format) is the quantization format used by llama.cpp and Ollama for CPU-optimized inference. GGUF supports multiple quantization levels from Q2 (2-bit) through Q8 (8-bit), with Q4_K_M being the most common choice for balancing quality and efficiency. GGUF models run efficiently on CPUs using AVX2/AVX-512 instructions and can leverage Apple Silicon's unified memory for GPU-accelerated inference on Mac hardware. For edge deployments, development environments, and applications where GPU hardware is unavailable, GGUF provides the most practical path to running LLMs locally.

FP8 quantization is emerging as a hardware-accelerated alternative on NVIDIA Hopper and Ada Lovelace GPUs. FP8 uses 8-bit floating-point representation rather than integer quantization, preserving the dynamic range of floating-point arithmetic while halving memory from FP16. Because FP8 computation is natively supported by Tensor Cores on these GPUs, it delivers both memory savings and throughput improvements without the quality trade-offs of more aggressive INT4 quantization.

KV Cache Optimization: Managing the Memory Bottleneck

The KV (key-value) cache is the dominant memory consumer during LLM inference and the primary bottleneck limiting concurrent request capacity. Understanding and optimizing KV cache management is essential for effective LLM deployment optimization.

Why the KV cache matters. During autoregressive generation, each transformer layer stores key and value tensors for all previously generated tokens. These cached tensors prevent redundant computation — without caching, generating each new token would require reprocessing the entire sequence from scratch. However, the KV cache grows linearly with both sequence length and batch size. For a 70B model with 80 layers and 128K context length, the KV cache for a single request can consume over 40GB of GPU memory. With multiple concurrent requests, KV cache memory quickly exceeds model weight memory as the primary constraint on system capacity.
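The 40GB figure can be reproduced with the standard KV cache size formula. The sketch below assumes a Llama-2-70B-style configuration (80 layers, grouped-query attention with 8 KV heads, head dimension 128) and an FP16 cache:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim
    x tokens x element size, per sequence in the batch."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# One request at 128K context, FP16 cache:
gqa_gb = kv_cache_bytes(80, 8, 128, 128 * 1024) / 1e9    # ~42.9 GB with GQA
mha_gb = kv_cache_bytes(80, 64, 128, 128 * 1024) / 1e9   # ~8x larger without GQA
```

The second line also previews why grouped-query attention (discussed below) matters so much: with full multi-head attention the same single request would need roughly 8x more cache memory.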

PagedAttention in vLLM addresses KV cache fragmentation by allocating cache memory in fixed-size pages rather than contiguous blocks. When a request's context grows beyond its current allocation, new pages are allocated on demand from a pool. When a request completes, its pages are returned to the pool immediately. This eliminates the internal and external fragmentation that wastes 50-70% of allocated KV cache memory in naive implementations.
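A toy allocator illustrates the page-pool idea (a sketch of the concept, not vLLM's actual implementation):

```python
class PagedKVAllocator:
    """Toy page-pool allocator in the spirit of PagedAttention: each
    request takes fixed-size pages on demand instead of reserving one
    contiguous worst-case block up front."""

    def __init__(self, total_pages: int, page_size: int = 16):
        self.page_size = page_size                 # tokens per page
        self.free_pages = list(range(total_pages))
        self.page_tables = {}                      # request id -> page indices
        self.token_counts = {}                     # request id -> tokens stored

    def append_token(self, req_id: str) -> None:
        n = self.token_counts.get(req_id, 0)
        if n % self.page_size == 0:                # current page full: take a new one
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            self.page_tables.setdefault(req_id, []).append(self.free_pages.pop())
        self.token_counts[req_id] = n + 1

    def release(self, req_id: str) -> None:
        """Return all pages to the pool the moment a request finishes."""
        self.free_pages.extend(self.page_tables.pop(req_id, []))
        self.token_counts.pop(req_id, None)

# 100 tokens at 16 tokens/page -> 7 pages; internal waste is bounded by
# one partially filled page per request rather than a worst-case reservation.
alloc = PagedKVAllocator(total_pages=64)
for _ in range(100):
    alloc.append_token("req-1")
```

Pages freed by one request are immediately available to any other, which is what eliminates the external fragmentation of contiguous allocation.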

KV cache compression techniques reduce the per-token memory footprint of the cache. Grouped-query attention (GQA), used in Llama 2/3 and Mistral models, shares key and value heads across multiple query heads, reducing KV cache size by 4-8x compared to standard multi-head attention. Multi-query attention (MQA) takes this further by using a single key-value head, but with greater quality impact. Some serving systems implement dynamic KV cache quantization, storing older cache entries at lower precision than recent ones on the assumption that distant tokens contribute less to current generation.

Prefix caching reuses KV cache entries across requests that share common prefixes, such as system prompts. If 100 concurrent requests all use the same 2,000-token system prompt, prefix caching stores the KV cache for that prompt once and shares it across all requests, saving both memory and the computation needed to process the common prefix. vLLM's automatic prefix caching handles this transparently for workloads with shared prefixes.
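The mechanism can be sketched as a hash-keyed cache over prompt prefixes (illustrative only; vLLM's real implementation operates on KV blocks inside the engine):

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: map a hash of the shared prompt prefix to its
    already computed KV state so identical prefixes are processed once."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix_tokens: tuple, compute_kv):
        key = hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = compute_kv(prefix_tokens)
        return self.store[key]

cache = PrefixCache()
system_prompt = tuple(range(2000))          # stand-in for a 2,000-token prompt
for _ in range(100):                        # 100 requests sharing the prefix
    cache.get_or_compute(system_prompt, lambda toks: f"kv[{len(toks)} tokens]")
# the shared prefix is computed once and reused 99 times
```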

Batching Strategies: Maximizing GPU Utilization

GPUs achieve their highest efficiency when processing large batches of data in parallel. Effective batching strategies are critical for maximizing the throughput and cost efficiency of LLM serving.

Static batching collects a fixed number of requests, processes them together, and returns all results before accepting the next batch. This is the simplest approach but wastes GPU time when requests have different generation lengths: short-response requests complete and sit idle while the GPU finishes generating the longest response in the batch. Static batching also introduces queuing latency as requests wait for a full batch to accumulate.

Continuous batching (also called in-flight batching) dynamically manages the batch by immediately releasing completed requests and inserting new ones into the active batch. When a request finishes generation, its KV cache memory is freed and a waiting request takes its place in the batch without interrupting the other in-progress generations. Continuous batching improves throughput by 2-3x over static batching and reduces average latency because new requests begin processing as soon as a batch slot opens rather than waiting for the entire batch to complete.
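A small simulation makes the throughput difference concrete (token counts are arbitrary; one step equals one decoding iteration across the batch):

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: finished requests are swapped out for
    waiting ones at every step, so no slot idles while work remains."""
    pending = list(lengths)
    active, steps = [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))   # backfill freed slots
        steps += 1
        active = [n - 1 for n in active if n > 1]
    return steps

# A mix of short and long generations, batch size 4:
lengths = [10, 200, 15, 180, 12, 160, 20, 190]
```

With this mix, static batching burns many idle slot-steps waiting on the longest request in each batch, while continuous batching keeps all four slots busy.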

Speculative decoding is an emerging optimization that uses a small draft model to generate candidate tokens quickly, then verifies multiple candidates simultaneously with the large target model. When the draft model's predictions match the target model's output (which happens 60-80% of the time for a well-chosen draft model), multiple tokens are confirmed per forward pass of the large model. Speculative decoding can improve per-request latency by 2-3x without any quality degradation because the target model always verifies the output. The overhead is running the small draft model, which typically adds less than 10% to total compute.
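The accept/verify loop can be sketched with toy integer "models" standing in for the draft and target (greedy decoding only; real implementations verify against the target's probability distribution rather than exact token matches):

```python
def speculative_generate(target, draft, n_tokens, k=4):
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    one target forward pass verifies them, the matched prefix is kept,
    and the target's own token replaces the first mismatch."""
    out, target_passes = [], 0
    while len(out) < n_tokens:
        ctx, proposals = list(out), []
        for _ in range(k):                 # cheap draft autoregression
            t = draft(ctx)
            proposals.append(t)
            ctx.append(t)
        target_passes += 1                 # one large-model pass per round
        accepted = []
        for t in proposals:
            correct = target(out + accepted)
            if t == correct:
                accepted.append(t)
            else:
                accepted.append(correct)   # target wins on mismatch
                break
        out.extend(accepted)
    return out[:n_tokens], target_passes

# Toy models: the target greedily emits the context length; the draft
# agrees everywhere except at every 5th position.
target = lambda ctx: len(ctx)
draft = lambda ctx: -1 if len(ctx) % 5 == 4 else len(ctx)
```

Because the target model verifies every token, the output is identical to plain greedy decoding; only the number of large-model forward passes shrinks.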

"The difference between a well-optimized and a naively deployed LLM is not 10 or 20 percent. It is typically a 3-5x difference in throughput and cost efficiency. Teams that treat deployment as an afterthought to model development end up spending more on inference infrastructure than they would have spent on commercial APIs, while delivering worse performance."

— Karan Checker, Founder, ESS ENN Associates

GPU Infrastructure and Cost Management

GPU selection and infrastructure design directly determine the cost structure of LLM serving. Making informed hardware decisions is a core component of LLM deployment optimization.

GPU selection by workload. The NVIDIA A10G (24GB) handles 7B-13B parameter models cost-effectively at approximately $400-600/month on cloud providers. The A100 (80GB) remains the workhorse for 70B models and high-throughput serving at $3,000-6,000/month. The H100 (80GB) delivers 2-3x the throughput of A100 for LLM inference with FP8 support, justifying its higher cost for high-volume workloads. For development and testing, T4 GPUs (16GB) provide the most economical option for smaller models. AMD MI300X offers competitive performance-per-dollar for teams willing to work with the ROCm software stack.
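Cost per generated token follows directly from hourly price and sustained throughput; the numbers below are illustrative, not vendor quotes:

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """USD per 1M generated tokens at a given sustained throughput."""
    return hourly_usd / (tokens_per_sec * 3600) * 1e6

# Illustrative: a GPU at $4/hr sustaining 2,000 tok/s across its batch
# lands around $0.56 per million tokens.
cost_per_million_tokens(4.0, 2000)
```

The same formula shows why batching matters so much: doubling sustained throughput on the same instance halves the cost per token.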

Multi-GPU serving. Models too large for a single GPU require tensor parallelism, which distributes the model's weight matrices across multiple GPUs. Tensor parallelism splits each layer across GPUs with all-reduce communication between them, adding latency proportional to inter-GPU bandwidth. NVLink provides the highest inter-GPU bandwidth (900 GB/s on H100) and is essential for latency-sensitive multi-GPU serving. Pipeline parallelism offers an alternative that splits different layers across different GPUs, reducing communication overhead but complicating batch management.

Cost optimization strategies. Right-sizing GPU instances to actual utilization prevents the common pattern of over-provisioning. Autoscaling based on request queue depth adds capacity during peak traffic and removes it during idle periods. Spot instances on AWS, GCP, or Azure reduce costs by 60-70% for fault-tolerant workloads. Model routing directs simple queries to smaller, cheaper models while reserving large models for complex queries, reducing average per-request cost. Caching common responses eliminates redundant inference for repeated queries.
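A minimal routing sketch, using a deliberately naive length-and-keyword heuristic and hypothetical model names (production routers typically use a trained classifier or an LLM judge):

```python
def route_model(prompt: str, max_small_words: int = 200) -> str:
    """Naive router: short queries without complexity markers go to the
    small model. The heuristic and model names are illustrative only."""
    hard_markers = ("explain", "analyze", "compare", "step by step")
    is_simple = (len(prompt.split()) <= max_small_words
                 and not any(m in prompt.lower() for m in hard_markers))
    return "small-8b" if is_simple else "large-70b"  # hypothetical names
```

Even a crude router like this can shift a large share of traffic to the cheaper model; the interesting engineering work is in measuring where the small model's quality actually holds up.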

Self-hosted vs API trade-offs. Commercial APIs (OpenAI, Anthropic, Google) charge per token with zero infrastructure management overhead. Self-hosted serving requires significant engineering investment in infrastructure, optimization, and monitoring. The break-even point depends on volume: at fewer than 1 million tokens per day, APIs are almost always more cost-effective. At 10+ million tokens per day with sustained traffic, self-hosted optimized serving can reduce costs by 50-80% compared to API pricing. The decision also involves data privacy requirements, latency control, model customization needs, and the team's infrastructure engineering capabilities.
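The break-even point is simple arithmetic once you fix an API price and a monthly infrastructure cost (both figures below are illustrative, and engineering time is ignored):

```python
def break_even_tokens_per_day(monthly_infra_usd: float,
                              api_usd_per_million: float) -> float:
    """Daily token volume above which fixed self-hosted infrastructure
    beats per-token API pricing (engineering cost not included)."""
    daily_infra = monthly_infra_usd / 30
    return daily_infra / api_usd_per_million * 1e6

# e.g. $4,000/month of GPU capacity vs an API at $10 per 1M tokens:
break_even_tokens_per_day(4000, 10.0)   # ~13.3M tokens/day
```

Note how sensitive the result is to both inputs: halving infrastructure cost through quantization, or comparing against a cheaper API tier, moves the break-even point by millions of tokens per day.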

Monitoring and Observability for Production LLMs

Production LLM serving requires comprehensive monitoring across performance, resource utilization, quality, and cost dimensions.

Latency metrics. Track time to first token (TTFT), which measures the delay before the first response token appears and directly affects perceived responsiveness. Track inter-token latency (ITL), the time between consecutive tokens, which determines streaming speed. Track end-to-end latency at p50, p95, and p99 percentiles. Set SLA thresholds appropriate to your use case: conversational applications typically require TTFT under 500ms and ITL under 50ms, while batch processing applications can tolerate higher latencies.
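Given per-token arrival timestamps, TTFT and ITL fall out directly; a minimal sketch:

```python
def latency_metrics(token_times: list) -> dict:
    """Derive TTFT, mean inter-token latency, and end-to-end latency from
    per-token arrival times (seconds since the request was sent)."""
    ttft = token_times[0]
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_ms": ttft * 1000,
        "mean_itl_ms": (sum(itls) / len(itls)) * 1000 if itls else 0.0,
        "e2e_ms": token_times[-1] * 1000,
    }

# 400ms to first token, then one token every 30ms for 50 tokens:
m = latency_metrics([0.4 + 0.03 * i for i in range(50)])
```

In production these values would be aggregated per request and tracked at p50/p95/p99 rather than as single means.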

Throughput and utilization. Monitor requests per minute, tokens generated per second, and GPU utilization (both compute and memory). Low GPU compute utilization with high memory utilization indicates the system is memory-bound and would benefit from quantization or KV cache optimization. Low utilization across both dimensions indicates over-provisioning. Tracking KV cache occupancy helps predict when the system will start rejecting requests due to memory pressure.
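That triage logic can be captured in a few lines (the thresholds here are assumptions for illustration, not tuned recommendations):

```python
def diagnose(compute_util: float, memory_util: float,
             high: float = 0.8, low: float = 0.3) -> str:
    """Coarse triage of GPU utilization readings (fractions in 0-1).
    Thresholds are illustrative placeholders."""
    if compute_util < low and memory_util > high:
        return "memory-bound: consider quantization or KV cache optimization"
    if compute_util < low and memory_util < low:
        return "over-provisioned: consider smaller or fewer GPUs"
    if compute_util > high and memory_util > high:
        return "saturated: scale out or shed load"
    return "balanced"
```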

Quality monitoring. LLM output quality can degrade silently due to infrastructure issues, model updates, or drift in input distributions. Monitor output length distributions (sudden changes may indicate generation issues), error and timeout rates, safety filter trigger rates, and user feedback signals. Implement automated evaluation pipelines that regularly test the deployed model against a benchmark set to detect quality regressions early.

Cost tracking. Attribute inference costs to specific product features, customer segments, or use cases. Track cost per request, cost per thousand tokens, and cost per user session. This granular cost attribution enables informed decisions about model routing, feature prioritization, and infrastructure optimization investments. Our AI engineering services team implements comprehensive monitoring dashboards that give engineering and business stakeholders real-time visibility into LLM serving performance and costs.

Frequently Asked Questions

What is the best framework for serving LLMs in production?

vLLM is the most popular choice for high-throughput serving, offering PagedAttention and continuous batching for 2-4x higher throughput than naive serving. TensorRT-LLM provides the highest NVIDIA GPU performance through kernel fusion and hardware optimizations. TGI from HuggingFace offers a balanced approach with easy deployment. For local/edge deployment, Ollama and llama.cpp serve GGUF quantized models on CPU hardware.

How does model quantization reduce LLM serving costs?

Quantization reduces weight precision from 16-bit to 4-bit or 8-bit, cutting GPU memory by 2-4x and increasing throughput by 2-3x. A 70B model requiring two A100 GPUs in FP16 fits on a single GPU with 4-bit AWQ or GPTQ quantization, cutting costs by 50-75% while preserving 95-99% of model quality. GGUF enables CPU inference for smaller models without any GPU hardware.

What is KV cache and why does it matter for LLM performance?

The KV cache stores intermediate attention computations from previous tokens, making generation linear rather than quadratic with sequence length. However, KV cache memory grows with both sequence length and batch size, often consuming more GPU memory than model weights for long-context models. PagedAttention in vLLM reduces KV cache memory waste by 60-80% through page-based allocation.

How much does it cost to serve LLMs in production?

A 7B model on A10G costs $800-1,200/month serving 50-200 requests per minute. A 70B model on 2x A100 costs $6,000-12,000/month. Optimization (quantization, continuous batching, speculative decoding) reduces costs by 50-70%. Self-hosted serving becomes cost-effective versus APIs at roughly 1-5 million tokens per day. Contact our team for a cost analysis based on your specific workload.

What monitoring should I implement for production LLM serving?

Monitor four dimensions: performance (TTFT, tokens/second, p50/p95/p99 latency), resource utilization (GPU memory, compute, KV cache occupancy), quality (output distributions, error rates, user feedback), and cost (per-request, per-token, per-session costs). Set alerts for latency spikes, memory pressure, error rate increases, and throughput drops.

For teams evaluating whether their deployed models are meeting quality standards, our guide on LLM evaluation and benchmarking covers the metrics and frameworks for measuring model quality in production. For organizations building RAG-based systems that depend on optimized LLM serving, see our guide on LLM-powered enterprise search.

At ESS ENN Associates, our AI engineering services team deploys and optimizes LLMs for production workloads, delivering the throughput, latency, and cost targets that make self-hosted LLM serving viable. We bring deep expertise in serving frameworks, quantization, GPU infrastructure, and monitoring to every deployment. If you are deploying LLMs and need to optimize performance and cost, contact us for a free technical assessment.

Tags: LLM Deployment vLLM TensorRT-LLM Quantization KV Cache GPU Inference Model Optimization

Ready to Optimize Your LLM Deployment?

From model quantization and serving infrastructure to GPU optimization and cost management — our AI engineering team deploys production-grade LLM systems that deliver the performance and cost efficiency your business requires. 30+ years of IT services. ISO 9001 and CMMI Level 3 certified.

Get a Free Consultation