As LLMs serve more users and generate longer outputs, the growing memory demands of the Key-Value (KV) cache quickly exceed GPU capacity, creating a major bottleneck for large-scale inference systems.
In this talk, we discuss KV-cache storage offloading, a novel technique that accelerates inference by relocating attention cache data to high-speed, low-latency storage tiers. This approach alleviates GPU memory constraints and unlocks new levels of scalability for serving large models.
We’ll dive deep into the architecture of inference workloads, explain the structure and role of the KV-cache, and walk through how storage offloading works in practice.
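To make the memory pressure concrete, here is a back-of-envelope sketch of how quickly the KV cache grows. The model dimensions used (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions roughly matching a 7B-parameter model, not figures from any specific deployment.

```python
def kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2):
    # One Key and one Value vector per layer per KV head, stored in fp16 (2 bytes each).
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_cache_gib(seq_len, batch_size):
    # Total KV-cache footprint for a batch of sequences, in GiB.
    return kv_cache_bytes_per_token() * seq_len * batch_size / 2**30

# A single 4,096-token sequence already needs ~2 GiB of KV cache on this model;
# 32 concurrent sequences need ~64 GiB, more than most single GPUs can spare
# alongside the model weights.
print(f"1 x 4096-token sequence  : {kv_cache_gib(4096, 1):.1f} GiB")
print(f"32 x 4096-token sequences: {kv_cache_gib(4096, 32):.1f} GiB")
```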
Attendees will gain a clear understanding of:
1. Why external storage is increasingly essential for modern inference workloads
2. What the KV-cache is and why it becomes a bottleneck in large-scale deployments
3. How and when KV-cache storage offloading can improve inference performance
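As a rough illustration of the third point, the sketch below models offloading at the block level: a fixed-size pool of GPU-resident KV blocks spills its least-recently-used blocks to a slower storage tier and pulls them back on a later hit instead of recomputing the prefill. The class and method names (KVBlockManager, put, get) are hypothetical and not the API of any particular inference engine or storage system.

```python
from collections import OrderedDict

class KVBlockManager:
    """Toy manager that keeps hot KV blocks on the GPU and spills cold ones to storage.

    'gpu' and 'storage' are plain dicts standing in for device memory and an
    external tier (CPU RAM, NVMe, or a remote cache); block payloads are opaque.
    """

    def __init__(self, gpu_capacity_blocks):
        self.gpu_capacity = gpu_capacity_blocks
        self.gpu = OrderedDict()   # block_id -> kv payload, kept in LRU order
        self.storage = {}          # block_id -> kv payload

    def put(self, block_id, kv_payload):
        """Insert a KV block, evicting the coldest blocks to storage if the GPU is full."""
        self.gpu[block_id] = kv_payload
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_payload = self.gpu.popitem(last=False)  # least recently used
            self.storage[cold_id] = cold_payload                  # offload instead of discarding

    def get(self, block_id):
        """Return a KV block, reloading it from the storage tier on a hit there."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.storage:
            # Hit in the storage tier: reloading the block is typically much
            # cheaper than recomputing the prefill for those tokens.
            self.put(block_id, self.storage.pop(block_id))
            return self.gpu[block_id]
        return None  # miss: the caller must recompute this block
```

In a real serving stack the payloads would be attention tensors and the storage tier would be CPU memory, local NVMe, or a shared remote cache, so whether offloading helps depends on whether reload latency beats recompute time for the workload.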
Understand the role of the KV-cache in inference and the need for external storage in modern inference workloads
Explore how inference engines work and how KV-cache offloading enhances their performance
Learn how and when KV-cache storage offloading can improve inference performance