Disaggregated KV Storage: A New Tier for Efficient, Scalable LLM Inference
As generative AI models continue to grow in size and complexity, the infrastructure costs of inference, particularly GPU memory and power consumption, have become a limiting factor. This session presents a disaggregated key-value (KV) storage architecture designed to offload KV-cache tensors efficiently, reducing pressure on GPU memory and compute while maintaining low-latency, high-throughput inference.
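To make the offloading idea concrete, below is a minimal sketch of a prefix-keyed external KV store in PyTorch. It is an illustration only; the names (ExternalKVStore, prefix_key, put/get) are hypothetical placeholders and do not reflect the API of the system presented in this session.

import hashlib
from typing import Optional

import torch


def prefix_key(token_ids: list[int], layer: int) -> str:
    """Content-address a KV block by the token prefix it covers and its layer."""
    digest = hashlib.sha256(repr(token_ids).encode("utf-8")).hexdigest()
    return f"{digest}:{layer}"


class ExternalKVStore:
    """Stand-in for a shared, disaggregated KV tier (here just a host-memory dict)."""

    def __init__(self) -> None:
        self._blocks: dict[str, torch.Tensor] = {}

    def put(self, key: str, kv_block: torch.Tensor) -> None:
        # Offload: move the block off the GPU so its HBM can be reclaimed.
        self._blocks[key] = kv_block.detach().to("cpu")

    def get(self, key: str, device: str) -> Optional[torch.Tensor]:
        # On a prefix hit, reload the cached KV instead of recomputing its prefill.
        block = self._blocks.get(key)
        return None if block is None else block.to(device)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    store = ExternalKVStore()
    tokens = [101, 7592, 2088]                               # toy prompt prefix
    kv = torch.randn(2, 8, len(tokens), 64, device=device)   # (K/V, heads, seq, head_dim)
    store.put(prefix_key(tokens, layer=0), kv)
    reused = store.get(prefix_key(tokens, layer=0), device)
    assert reused is not None and reused.shape == kv.shape

The key point is that a KV block computed once during prefill can be stored outside the GPU and restored for any later request sharing the same prefix, trading a data transfer for repeated prefill computation.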
We introduce the first end-to-end, shared-storage system for KV-cache offloading. It integrates with production-scale orchestration frameworks such as Dynamo and Production Stack, enabling scalable deployment across distributed GPU clusters. We provide both theoretical analysis and empirical evaluation, comparing our approach against state-of-the-art inference engines such as vLLM.
Our benchmarks demonstrate 5–8× higher request throughput and 5–7× lower prefill latency compared to baseline systems. Experiments cover a range of GPU types and LLMs, including DeepSeek-V3, and simulate diverse use cases such as multi-turn conversations, long-context generation, and agentic workloads.
Traditional block and file storage systems are not optimized for the fine-grained, high-frequency access patterns of LLM workloads. Our stateless external KV store, by contrast, enables direct GPU-initiated I/O and overlaps compute with data access, improving efficiency at the infrastructure level.
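As a rough illustration of the compute/transfer overlap, the hypothetical sketch below prefetches the next KV block on a side CUDA stream while the current block is consumed. It uses a pinned-host-memory stand-in rather than the direct GPU-initiated I/O path discussed in the session, and the attention step is a placeholder.

import torch


def run_layers(q: torch.Tensor, host_kv_blocks: list[torch.Tensor]) -> torch.Tensor:
    """Apply a stand-in per-block op while prefetching the next KV block."""
    device = q.device
    if device.type != "cuda":
        # CPU fallback: no streams, just run sequentially.
        for kv in host_kv_blocks:
            q = q + kv.mean(dim=0, keepdim=True)
        return q

    copy_stream = torch.cuda.Stream(device=device)
    with torch.cuda.stream(copy_stream):
        next_kv = host_kv_blocks[0].to(device, non_blocking=True)  # prefetch block 0

    for i in range(len(host_kv_blocks)):
        # Make the compute stream wait until block i has landed on the GPU.
        torch.cuda.current_stream(device).wait_stream(copy_stream)
        kv = next_kv
        if i + 1 < len(host_kv_blocks):
            # Overlap: start copying block i+1 while block i is being consumed.
            with torch.cuda.stream(copy_stream):
                next_kv = host_kv_blocks[i + 1].to(device, non_blocking=True)
        # Stand-in for the real attention kernel over (q, kv). Production code
        # would also call kv.record_stream(...) to guard against allocator reuse.
        q = q + kv.mean(dim=0, keepdim=True)
    return q


if __name__ == "__main__":
    use_cuda = torch.cuda.is_available()
    dev = "cuda" if use_cuda else "cpu"
    blocks = [torch.randn(16, 64).pin_memory() if use_cuda else torch.randn(16, 64)
              for _ in range(4)]
    out = run_layers(torch.randn(1, 64, device=dev), blocks)
    print(out.shape)  # torch.Size([1, 64])

Keeping the copy engine busy while compute kernels run is what lets data movement hide behind computation instead of extending the critical path.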
This session will provide technical insights into system design, performance characteristics, and practical deployment lessons. It is intended for engineers, system architects, and infrastructure practitioners seeking storage-centric approaches to improving the efficiency and elasticity of LLM inference at scale.