As LLM inference scales to multi-GPU deployments, the KV cache becomes a critical bottleneck. Each GPU maintains its own isolated cache that can spill to host DRAM, so identical prompt prefixes processed by different devices can result in redundant prefill computation. CXL fabric-attached memory (FAM) offers a shared, byte-addressable pool accessible by multiple GPUs, enabling KV cache entries computed by one device to be reused by others.
In this talk, we present our experience integrating CXL FAM-based KV cache sharing into an open-source multi-GPU inference framework. We describe the architecture: a block-level cache manager that allocates in CXL memory, sequence-hash-based cross-GPU cache discovery, an asynchronous transfer engine that moves blocks between CXL and GPU memory without stalling the inference scheduler, and the CUDA host-registration workflow that enables GPU-CXL DMA. We then focus on the engineering challenges that shaped the design: wrapping a C++ CXL allocator for safe use in an async Rust runtime, avoiding scheduler-thread deadlocks when performing blocking CUDA and CXL operations under GPU pressure, and a subtle batch-splitting bug that silently dropped sequence hashes and broke the cross-GPU lookup path entirely. Early results show that CXL FAM matches host DRAM baseline performance and that enabling sharing meaningfully reduces time to first token, particularly at tail latencies. Attendees should have familiarity with LLM inference concepts (prefill, decode, KV cache) and basic CXL memory semantics.
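
To give a concrete flavor of the cross-GPU lookup path, the following is a minimal Rust sketch of chained sequence hashing and prefix-match discovery. The block size, the chained-hash scheme, and the `HashMap` standing in for the shared CXL-resident index are illustrative assumptions, not the framework's actual implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Tokens per KV cache block (hypothetical; the real block size is a framework setting).
const BLOCK_SIZE: usize = 16;

/// Chained hash over token blocks: each block's hash folds in the previous
/// block's hash, so a hit on block N implies the whole prefix up to N matches.
fn sequence_hashes(tokens: &[u32]) -> Vec<u64> {
    let mut hashes = Vec::new();
    let mut parent: u64 = 0;
    for block in tokens.chunks_exact(BLOCK_SIZE) {
        let mut h = DefaultHasher::new();
        parent.hash(&mut h);
        block.hash(&mut h);
        parent = h.finish();
        hashes.push(parent);
    }
    hashes
}

/// Stand-in for the shared index kept in CXL memory: maps a sequence hash to
/// the offset of the corresponding KV block in the fabric-attached pool.
fn longest_shared_prefix(index: &HashMap<u64, usize>, hashes: &[u64]) -> Vec<usize> {
    hashes
        .iter()
        .map_while(|h| index.get(h).copied()) // stop at the first miss
        .collect()
}

fn main() {
    let tokens: Vec<u32> = (0..64).collect();
    let hashes = sequence_hashes(&tokens);
    // Pretend another GPU already published the first two blocks.
    let mut index = HashMap::new();
    index.insert(hashes[0], 0);
    index.insert(hashes[1], 4096);
    println!("reusable blocks: {:?}", longest_shared_prefix(&index, &hashes));
}
```

The chaining is why a hit on block N guarantees the whole prefix up to N matches; it is also why the batch-splitting bug that silently dropped sequence hashes disabled cross-GPU reuse entirely.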
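The host-registration step can likewise be shown in a few lines. The sketch below uses a hand-rolled FFI binding to the real CUDA runtime calls `cudaHostRegister` and `cudaHostUnregister`, with a page-aligned heap buffer standing in for the mmap'd CXL region; the buffer, sizes, and flag choice are assumptions for illustration, while the real workflow registers the CXL allocator's mapping.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::os::raw::{c_int, c_uint, c_void};

// Minimal FFI to the CUDA runtime (link against libcudart). The C signature is:
//   cudaError_t cudaHostRegister(void *ptr, size_t size, unsigned int flags);
#[link(name = "cudart")]
extern "C" {
    fn cudaHostRegister(ptr: *mut c_void, size: usize, flags: c_uint) -> c_int;
    fn cudaHostUnregister(ptr: *mut c_void) -> c_int;
}

const CUDA_HOST_REGISTER_DEFAULT: c_uint = 0; // cudaHostRegisterDefault

fn main() {
    // Stand-in for an mmap'd CXL FAM region: a 1 MiB page-aligned host buffer.
    let layout = Layout::from_size_align(1 << 20, 4096).unwrap();
    let region = unsafe { alloc(layout) };
    assert!(!region.is_null());

    // Pin the region so the GPU can DMA to/from it directly; the same call is
    // applied to the CXL mapping once the allocator hands out a block.
    let rc = unsafe {
        cudaHostRegister(region as *mut c_void, layout.size(), CUDA_HOST_REGISTER_DEFAULT)
    };
    assert_eq!(rc, 0, "cudaHostRegister failed: {rc}");

    // ... enqueue copies or kernels that read the registered region ...

    unsafe {
        cudaHostUnregister(region as *mut c_void);
        dealloc(region, layout);
    }
}
```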
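Finally, one way to keep blocking CUDA and CXL calls off the scheduler thread, assuming a Tokio runtime (the abstract only promises an async Rust runtime), is to push them onto the blocking thread pool and have the scheduler task await the completion handle. The function names and the simulated copy below are hypothetical placeholders, not the framework's transfer engine.

```rust
// Requires the tokio crate (features "rt-multi-thread" and "macros").
use std::time::Duration;

// Stand-in for a blocking CXL -> GPU block copy (cudaMemcpy plus allocator calls).
fn blocking_copy_block_to_gpu(block_offset: usize) -> Result<(), String> {
    std::thread::sleep(Duration::from_millis(2)); // placeholder for the real blocking work
    println!("copied block at CXL offset {block_offset}");
    Ok(())
}

/// Fetch a shared block without stalling the scheduler: the blocking work runs
/// on Tokio's blocking pool, and the scheduler task only awaits the handle.
async fn fetch_shared_block(block_offset: usize) -> Result<(), String> {
    tokio::task::spawn_blocking(move || blocking_copy_block_to_gpu(block_offset))
        .await
        .map_err(|e| format!("transfer task panicked: {e}"))?
}

#[tokio::main]
async fn main() -> Result<(), String> {
    // The scheduler can keep batching decodes while transfers complete off-thread.
    let transfers: Vec<_> = (0..4usize)
        .map(|i| tokio::spawn(fetch_shared_block(i * 4096)))
        .collect();
    for t in transfers {
        t.await.map_err(|e| e.to_string())??;
    }
    Ok(())
}
```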