KV Cache as Distributed Storage: What Disaggregated Inference Demands from the Storage Stack

Abstract

Disaggregating prefill and decode — running them on separate node pools — is now production reality. Mooncake, SGLang, and vLLM all ship disaggregated backends. The architecture works.

What disaggregation introduces, however, is a set of storage-system requirements that are still being worked out. When the KV cache must cross a network, be shared across decode nodes, survive eviction to lower tiers, be looked up by future requests, and eventually be freed, it behaves as a distributed storage object. Managing it demands protocols, indexing, consistency models, and garbage-collection schemes, all of which are active areas of development across both the inference-systems and storage communities.

A node holding KV cache between user turns is occupying accelerator memory for a storage function — keeping state alive — regardless of what compute is happening on that same node. Once you see that, the questions that follow are storage questions: where does the data live, how do you find it, when do you free it. 
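To make that concrete, here is a minimal sketch, assuming nothing beyond the lifecycle described above, of a KV cache block viewed as a storage object rather than as activations. Every name is hypothetical, not taken from any particular engine, but the fields line up with the three questions: tier and node say where the data lives, content_hash is how you find it, and pinned feeds into when you can free it.

    from dataclasses import dataclass
    from enum import Enum

    class Tier(Enum):
        HBM = "hbm"    # accelerator memory: fastest, scarcest
        DRAM = "dram"  # host memory: the usual first eviction target
        SSD = "ssd"    # local flash: survives longer idle periods

    @dataclass
    class KVBlockDescriptor:
        # Hypothetical field names for illustration only.
        content_hash: str  # identity: hash of the token prefix the block encodes
        tier: Tier         # where the bytes live right now
        node: str          # which node holds them
        pinned: bool       # held by an in-flight request; ineligible for eviction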

Moving KV cache across nodes — the transfer layer — is where those questions begin. Different protocols make different trade-offs on latency, compatibility, and how efficiently they handle KV cache that is spread across non-contiguous memory blocks. But transfer alone answers only the first question. The rest belong to the storage system.
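One concrete illustration of the non-contiguity point: a paged KV pool leaves a given request's blocks scattered, so a transfer engine either issues one small send per block or pays a copy to gather them into a contiguous staging buffer for a single bulk send. A toy sketch of the gather option, with a made-up block size:

    BLOCK_BYTES = 16 * 4096  # bytes per KV block; illustrative, not a real layout

    def gather_blocks(pool: bytearray, block_ids: list[int]) -> bytes:
        # Coalesce scattered KV blocks into one contiguous buffer so the
        # transfer becomes a single bulk send instead of many small ones,
        # trading an extra copy for fewer network operations.
        out = bytearray(len(block_ids) * BLOCK_BYTES)
        for i, bid in enumerate(block_ids):
            src = bid * BLOCK_BYTES
            out[i * BLOCK_BYTES:(i + 1) * BLOCK_BYTES] = pool[src:src + BLOCK_BYTES]
        return bytes(out)

Zero-copy scatter-gather (for example, RDMA with a scatter-gather list) avoids that copy at the cost of protocol and hardware compatibility, which is exactly the trade-off space the transfer layer lives in.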

The bulk of the talk maps four storage problems that disaggregated inference surfaces: tiering and eviction policy for inference access patterns, distributed prefix indexing with eviction consistency, consistency semantics for immutable content-addressed cache objects, and distributed garbage collection for shared cache blocks with multiple concurrent holders. For each, we survey what the research literature already offers, which of those results are only partial fits, and where the open questions remain.
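To ground the second and fourth of these, here is a deliberately naive single-node sketch of a content-addressed prefix index with refcount-based reclamation. Everything in it is hypothetical, and it ignores exactly the parts that make the problems hard: the index is not distributed, never races with eviction, and the refcounts never cross a node boundary.

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class CacheEntry:
        node: str        # decode node currently holding the block
        ref_count: int   # concurrent holders; the block is freed only at zero

    class PrefixIndex:
        def __init__(self) -> None:
            self._index: dict[str, CacheEntry] = {}

        @staticmethod
        def key(token_ids: list[int]) -> str:
            # Immutability lets the content hash double as the object's identity.
            return hashlib.sha256(repr(token_ids).encode()).hexdigest()

        def publish(self, token_ids: list[int], node: str) -> None:
            # Register a freshly computed prefix so future requests can reuse it.
            self._index.setdefault(self.key(token_ids), CacheEntry(node, 0))

        def acquire(self, token_ids: list[int]) -> CacheEntry | None:
            # Cache hit: take a reference so reclamation cannot free the
            # block out from under a concurrent holder.
            entry = self._index.get(self.key(token_ids))
            if entry is not None:
                entry.ref_count += 1
            return entry

        def release(self, token_ids: list[int]) -> None:
            # Drop a reference; reclaim only when the last holder is done.
            k = self.key(token_ids)
            entry = self._index[k]
            entry.ref_count -= 1
            if entry.ref_count == 0:
                del self._index[k]  # a real system would also free the bytes

Distributing this map, and keeping it truthful while evictions and transfers change tiers underneath it, is where the open questions sit.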

The storage and distributed systems community has prior art on all four. The goal of this talk is to make that connection legible.