KV-Cache Storage Offloading for Efficient Inference in LLMs
As LLMs serve more users and generate longer outputs, the growing memory demands of the Key-Value (KV) cache quickly exceed GPU capacity, creating a major bottleneck for large-scale inference systems.
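To make the scale of the problem concrete, the short sketch below estimates per-request KV-cache size from standard transformer dimensions; the model configuration (80 layers, 8 KV heads, head dimension 128, 32k-token context) is an illustrative assumption, not a figure from the talk.

```python
# Back-of-the-envelope estimate of KV-cache memory for one request.
# Model dimensions below are assumed for illustration only.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2 accounts for storing both keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 70B-class model with grouped-query attention, fp16 cache.
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=8,
                             head_dim=128, seq_len=32_768)
print(f"~{per_request / 2**30:.1f} GiB of KV cache for one 32k-token request")
```

At these assumed dimensions a single long-context request already consumes roughly 10 GiB, so a few concurrent requests can saturate a GPU's memory on cache alone.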
In this talk, we discuss KV-cache storage offloading, a technique that accelerates inference by relocating attention cache data to high-speed, low-latency storage tiers. This approach alleviates GPU memory constraints and unlocks new levels of scalability for serving large models.
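As a rough illustration of the idea, the sketch below keeps hot KV blocks on the GPU and parks cold ones in a slower tier (host memory here, standing in for a storage tier). It assumes PyTorch; the class, method names, and block layout are invented for this example and do not describe any particular serving system.

```python
# Minimal sketch of KV-cache offloading between a fast (GPU) and a slow
# (host-memory) tier. Names and structure are illustrative assumptions.
import torch

class OffloadedKVCache:
    """Keeps hot KV blocks on the GPU and parks cold ones in host memory."""

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.on_device = {}   # block_id -> (key, value) tensors on the GPU
        self.offloaded = {}   # block_id -> (key, value) tensors in host memory

    def add(self, block_id, key, value):
        # New blocks start in the fast tier so attention can read them directly.
        self.on_device[block_id] = (key.to(self.device), value.to(self.device))

    def offload(self, block_id):
        # Evict a cold block from GPU memory to the slower tier.
        key, value = self.on_device.pop(block_id)
        self.offloaded[block_id] = (key.cpu(), value.cpu())

    def fetch(self, block_id):
        # Reload a previously offloaded block before attention needs it again.
        if block_id in self.offloaded:
            key, value = self.offloaded.pop(block_id)
            self.on_device[block_id] = (key.to(self.device),
                                        value.to(self.device))
        return self.on_device[block_id]

# Example: cache one 16-token block of keys/values, evict it, then reload it.
cache = OffloadedKVCache()
cache.add(0, torch.randn(16, 8, 128), torch.randn(16, 8, 128))
cache.offload(0)
key, value = cache.fetch(0)
```

Production systems layer asynchronous transfers, prefetching, and faster storage tiers on top of this basic evict-and-reload pattern, but the core trade-off is the same: spend transfer bandwidth to reclaim scarce GPU memory.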