Scaling Inference with KV Cache Storage Offload and RDMA Accelerated Architecture

Grand Mesa F

Wed Apr 29 | 2:10pm

Abstract

As LLMs become central to applications such as conversational AI, document processing, agentic workflows, and RAG, inference systems must support longer context windows and deeper interaction patterns. A major scalability bottleneck in these workloads is the KV cache, which grows with context length and can quickly exceed aggregate GPU memory capacity.
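
To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch (illustrative only; the model dimensions and workload sizes are assumptions, not figures from the talk) of how KV cache footprint grows with context length and concurrency:

    # Rough KV cache size estimate for a transformer decoder.
    # Model dimensions and workload sizes below are illustrative assumptions.

    def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
        # 2x accounts for the separate key and value tensors stored per layer.
        return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

    # Example: a 70B-class model (80 layers, 8 KV heads of dim 128, fp16 cache)
    # with a 128k-token context, single request vs. 32 concurrent requests.
    per_request = kv_cache_bytes(80, 8, 128, 128_000, 1)
    total = kv_cache_bytes(80, 8, 128, 128_000, 32)
    print(f"per request: {per_request / 1e9:.1f} GB, 32 requests: {total / 1e9:.1f} GB")

Under these assumptions a single 128k-token request already consumes tens of gigabytes of KV cache, and a modest batch of concurrent requests exceeds the memory of an entire GPU server, which is what motivates offloading the cache to a storage tier.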
In this talk, we focus on the KV cache and present how storage-backed KV cache offloading, leveraging high-performance networked storage systems, enables scalable inference beyond GPU memory limits. We highlight how RDMA-enabled data paths and low-latency storage significantly accelerate cached data movement, reducing end-to-end inference latency while supporting larger, more complex workloads.
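
As a minimal sketch of the idea (hypothetical interfaces, not any specific product API from the talk): GPU-resident KV blocks are spilled to a remote storage tier under memory pressure, and on a prefix-cache hit they are fetched back over an assumed RDMA-capable storage path instead of being recomputed by a prefill pass.

    from collections import OrderedDict

    class TieredKVCache:
        """Sketch of GPU-memory KV cache with a storage-backed spill tier."""

        def __init__(self, gpu_capacity_blocks, storage):
            self.gpu = OrderedDict()          # block_id -> KV block, kept in LRU order
            self.capacity = gpu_capacity_blocks
            self.storage = storage            # assumed put/get interface over an RDMA path

        def put(self, block_id, kv_block):
            self.gpu[block_id] = kv_block
            self.gpu.move_to_end(block_id)
            while len(self.gpu) > self.capacity:
                victim_id, victim = self.gpu.popitem(last=False)
                self.storage.put(victim_id, victim)   # offload the coldest block

        def get(self, block_id):
            if block_id in self.gpu:                  # hot: already in GPU memory
                self.gpu.move_to_end(block_id)
                return self.gpu[block_id]
            kv_block = self.storage.get(block_id)     # warm: fetch from the storage tier
            if kv_block is not None:
                self.put(block_id, kv_block)
            return kv_block                           # None -> miss, prefill must recompute

The design choice this illustrates is that a storage fetch only has to beat recomputation of the corresponding prefill, which is why low-latency, RDMA-accelerated data paths matter for the end-to-end numbers.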
We also share multi-turn inference benchmarking results that expose the challenges of context accumulation in real-world interaction sequences, including human-computer multi-turn interactions. As LLMs increasingly rely on iterative, multi-step interactions, evaluating these real-world workloads is essential for understanding system behavior and designing scalable inference architectures.
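
The benchmarking pattern can be sketched as follows (a hypothetical streaming client API, not the talk's actual harness): each turn appends to the accumulated conversation and measures time-to-first-token, which surfaces how context accumulation and KV cache reuse affect latency turn by turn.

    import time

    def run_multi_turn(client, turns):
        history = []                      # accumulated conversation context
        per_turn_ttft = []
        for user_msg in turns:
            history.append({"role": "user", "content": user_msg})
            start = time.perf_counter()
            first_token_at = None
            reply_chunks = []
            for chunk in client.stream_chat(history):   # assumed streaming interface
                if first_token_at is None:
                    first_token_at = time.perf_counter()
                reply_chunks.append(chunk)
            per_turn_ttft.append(first_token_at - start)
            history.append({"role": "assistant", "content": "".join(reply_chunks)})
        return per_turn_ttft   # with KV cache reuse, later turns should avoid full re-prefill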

Download PDF