Capacity Begets Intelligence at Scale | SNIA

Abstract

Large language model inference is rapidly turning into a memory-hierarchy problem. A single GPU's HBM holds only seconds of working set; the moment a returning session, a long-context prompt, or a parallel chain-of-thought spills out of HBM, the system pays a recompute tax measured in seconds of time-to-first-token and tens of milliseconds of inter-token latency. Extending the hierarchy with DRAM, local NVMe, and a shared cross-node KV tier turns that recompute tax into a storage retrieval and changes the central question from "how fast can we compute attention?" to "how fast can we remember?"
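
The trade between recomputing and retrieving can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the talk: the model shape, tier bandwidth, and prefill rate are hypothetical placeholders, and prefill cost is treated as linear in context length, which understates the cost of attention on long contexts. The only standard ingredient is the KV-cache size per token (one key and one value tensor per layer).

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache footprint of one token: a key and a value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def retrieval_seconds(n_tokens: int, bytes_per_token: int,
                      read_gb_per_s: float) -> float:
    """Time to stream a cached context back from a storage tier."""
    return n_tokens * bytes_per_token / (read_gb_per_s * 1e9)


def recompute_seconds(n_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the same context by re-running prefill.

    Assumption: linear in n_tokens (a simplification)."""
    return n_tokens / prefill_tokens_per_s


if __name__ == "__main__":
    # All values below are hypothetical, chosen only to exercise the formulas.
    per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    context = 100_000
    fetch = retrieval_seconds(context, per_token, read_gb_per_s=10.0)
    redo = recompute_seconds(context, prefill_tokens_per_s=10_000.0)
    print(f"KV bytes per token: {per_token}")
    print(f"retrieve: {fetch:.2f} s, recompute: {redo:.2f} s")
```

Retrieval wins whenever the tier's effective read bandwidth, divided by the per-token KV footprint, exceeds the prefill rate in tokens per second; measured values should replace the placeholders before drawing any conclusion.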

Prior single-GPU work introduced the CRAFT framework (Comprehension, Recall, Adaptability, Fluency, and Tenacity) to score the inference stack the way cognitive science scores a human reasoner. On a single GPU with one NVMe tier under HBM, that work reported up to 78× faster time-to-first-token on long contexts, 10× throughput recovery once the working set exceeded HBM, 21× lower median inter-token latency under load, and a jump from 7% to 83% on the AIME-2024 math benchmark when the reasoning budget was no longer constrained by capacity.

This talk takes CRAFT off the single-GPU bench and onto a real distributed inference fleet. We deploy a multi-node vLLM cluster on a shared NVMe-class KV tier (an NVIDIA CMX architecture, as introduced at GTC 2026) in a working AI inference lab, with disaggregated prefill and decode workers and three production workloads: agentic tool loops, long-context multi-turn assistants, and asynchronous batch reasoning. We re-measure each CRAFT dimension at fleet scale, alongside hit rate, tail latency, and decode goodput, across realistic levels of multi-initiator fan-in into the shared KV pool.

From the measured workload we then derive the storage QoS profile that this class of system actually demands, expressed in the language storage architects use: large-block random reads at modest per-drive queue depth; p99.99 read-tail latency under sustained concurrent write pressure; and the fan-in rule that keeps the system out of queueing collapse.
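
The abstract does not state the fan-in rule itself, so the following is only a generic queueing illustration of why such a rule exists. It assumes an M/M/1 model, in which mean response time grows as 1/(1 - utilization), and uses hypothetical function and parameter names; the per-drive read rate stands for whatever a drive sustains at the target block size and queue depth.

```python
import math


def utilization(initiators: int, reads_per_s_per_initiator: float,
                drives: int, reads_per_s_per_drive: float) -> float:
    """Offered read load divided by aggregate drive capacity."""
    offered = initiators * reads_per_s_per_initiator
    capacity = drives * reads_per_s_per_drive
    return offered / capacity


def mm1_latency_multiplier(rho: float) -> float:
    """Mean response time relative to unloaded service time (M/M/1).

    Assumption: Poisson arrivals and exponential service; real SSD tails
    under concurrent writes can be considerably worse than this."""
    if rho >= 1.0:
        return math.inf  # queue grows without bound
    return 1.0 / (1.0 - rho)


def max_initiators(target_rho: float, reads_per_s_per_initiator: float,
                   drives: int, reads_per_s_per_drive: float) -> int:
    """Largest fan-in that keeps utilization at or below target_rho."""
    capacity = target_rho * drives * reads_per_s_per_drive
    return math.floor(capacity / reads_per_s_per_initiator)


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    rho = utilization(8, 1000.0, 4, 4000.0)
    print(rho, mm1_latency_multiplier(rho))
    print(max_initiators(0.5, 1000.0, 4, 4000.0))
```

The multiplier climbs steeply as utilization approaches 1, which is the queueing collapse a fan-in limit is meant to avoid. A rule derived from measured p99.99 latency would replace the M/M/1 assumption with the observed latency-versus-load curve.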

Attendees will leave with a vendor-neutral measurement framework they can apply to any inference stack, a reference workload profile useful to SSD architects and shared-KV-tier designers, and a concrete picture of how storage decisions propagate all the way up into model throughput, decode smoothness, and end-to-end answer quality.

This is a developer talk. All measurements, configurations, and analysis methodology will be presented in sufficient detail to be reproduced.

Kapil Karkra

Senior Principal Engineer, AI Storage and Software Solutions