SNIA Developer

Toward GPU‑Agnostic Storage: RDMA-Accelerated Data Movement Across Heterogeneous GPU Systems

Stevens Creek

Mon Sep 28 | 8:30am

Abstract

AI/ML and data-intensive workloads are increasingly constrained by CPU-mediated data movement between storage and GPUs. While GPU-direct approaches reduce overhead, they are largely vendor-specific, limiting portability across heterogeneous environments.

This session presents a practical approach to GPU-agnostic storage using RDMA-enabled data paths to move data directly between disaggregated storage and GPU memory across NVIDIA, AMD, and Intel platforms. We focus on designing storage as a first-class participant in GPU data pipelines, eliminating host staging buffers to reduce latency, CPU overhead, and data copies.

We will cover key architecture considerations, including NFS over RDMA and emerging S3 over RDMA models, GPU memory registration and addressability, and secure, efficient exposure of GPU buffers in a vendor-neutral framework. On the client side, we compare cuFile, ROCm, and oneAPI/Level Zero and present patterns for building portable abstractions.

We also examine performance tradeoffs across vendors, including memory pinning overheads, DMA behavior, and impacts on throughput, latency, and GPU utilization.
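One way to picture the "portable abstraction" idea above is a small backend interface with a registry keyed by vendor. The sketch below is illustrative only: every class and method name is hypothetical, it is not the speaker's design, and the in-memory backend is a stand-in that never touches cuFile, ROCm, or Level Zero.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    """Hypothetical handle for a GPU buffer that has been made RDMA-addressable."""
    vendor: str
    address: int
    length: int


class GpuMemoryBackend(ABC):
    """Hypothetical vendor-neutral interface; one implementation per GPU runtime."""
    vendor: str

    @abstractmethod
    def register(self, address: int, length: int) -> Registration:
        """Pin and expose a device buffer, returning a handle for later I/O."""

    @abstractmethod
    def deregister(self, reg: Registration) -> None:
        """Release a previously registered buffer."""


class InMemoryBackend(GpuMemoryBackend):
    """Stand-in backend that only tracks registrations; no real GPU is involved."""

    def __init__(self, vendor: str) -> None:
        self.vendor = vendor
        self.active: set[Registration] = set()

    def register(self, address: int, length: int) -> Registration:
        if length <= 0:
            raise ValueError("length must be positive")
        reg = Registration(self.vendor, address, length)
        self.active.add(reg)
        return reg

    def deregister(self, reg: Registration) -> None:
        self.active.discard(reg)


class BackendRegistry:
    """Maps a vendor name to its backend so callers stay vendor-agnostic."""

    def __init__(self) -> None:
        self._backends: dict[str, GpuMemoryBackend] = {}

    def add(self, backend: GpuMemoryBackend) -> None:
        self._backends[backend.vendor] = backend

    def get(self, vendor: str) -> GpuMemoryBackend:
        try:
            return self._backends[vendor]
        except KeyError:
            raise LookupError(f"no backend registered for {vendor!r}") from None
```

In a real system each backend would wrap the vendor runtime's own registration calls; the storage client would then depend only on `GpuMemoryBackend`.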

Jason Goldschmidt

Distinguished Engineer

Dell

Related Sessions

Data Movement & Placement

Concepts for HDD Data-Placement Optimization for Data Lakes

As hard disk drives (HDDs) continue to scale in capacity and complexity, traditional block-based host storage abstractions increasingly obscure device-level characteristics that are critical to performance, efficiency, and reliability. This session examines HDD virtualization through the lens of host-side intelligence, focusing on how explicit host-provided hints and data placement awareness can bridge this abstraction gap. We explore mechanisms such as logical zoning, data placement hints, and Flexible Data Placement (FDP) to enable more informed interactions between the host and the drive. By exposing workload intent, such as data temperature, update frequency, and locality, the host can guide layout decisions that better align logical data structures with physical media behavior. The discussion emphasizes the host-side architecture and software considerations required to generate, manage, and convey these hints, and outlines how corresponding drive-side capabilities can act on them. Together, these techniques enable more predictable performance, improved write efficiency, and enhanced longevity in large-scale HDD deployments, particularly in virtualized and cloud and object storage environments.

Curtis Stevens

Technologist

Seagate Technology

Data Movement & Placement

FDP: The Data Placement Promise of Modern NVMe SSDs

NVMe Flexible Data Placement (FDP) is rapidly emerging in commercial SSDs and early hyperscale deployments as the next evolution of host-visible SSD placement control. Unlike prior approaches that relied on strict sequential write constraints or host-managed garbage collection, FDP introduces Reclaim Units that enable explicit data placement while preserving the conventional block interface and backward compatibility with existing software stacks. As adoption accelerates, developers and storage practitioners need a clearer understanding of how FDP behaves on real hardware and how existing systems can exploit it effectively.

This talk presents a hands-on evaluation of production FDP-capable NVMe SSDs, examining their placement guarantees, reclaim behavior, and performance characteristics across both raw block devices and filesystem-based deployments on Linux. We further demonstrate how lifetime-aware data separation can be integrated into two widely deployed open-source storage systems, MySQL and RocksDB, with minimal architectural changes and without requiring application rewrites.

Using synthetic and real-world workloads, we show that FDP can deliver up to 4× improvements in write amplification factor and throughput, along with up to 5× reductions in read and write latency. We will discuss the practical lessons learned while deploying FDP in real software stacks, the current state of Linux and open-source ecosystem support, and the implications for future storage system design.

Data Movement & Placement

Resiliency for AI Workloads

When your AI jobs fail at scale, where does the blame really lie—the model, the hardware, or the layers underneath that move and manage data? This session dives into how modern storage and execution infrastructure can turn fragile AI pipelines into resilient, self-healing systems.

AI Checkpointing and Storage Hierarchy

AI training state spans terabytes across thousands of devices, far beyond traditional workloads. Effective checkpointing should ideally exploit the storage hierarchy (DRAM, NVMe, parallel file systems, object storage), with an intelligent layer that automates tiering and asynchronous flushing and ensures consistency across application ranks.
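As a rough illustration of tiering with asynchronous flushing, the sketch below writes each checkpoint to a fast tier synchronously and copies it to a durable tier in the background. It is a single-process toy under stated assumptions: `TieredCheckpointer` and `demo` are hypothetical names, two local directories stand in for the NVMe and object-storage tiers, and cross-rank consistency is not addressed.

```python
import os
import shutil
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor


class TieredCheckpointer:
    """Hypothetical two-tier checkpointer: fast tier first, durable tier later."""

    def __init__(self, fast_dir: str, durable_dir: str) -> None:
        self.fast_dir = fast_dir
        self.durable_dir = durable_dir
        # A single worker keeps background flushes in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def save(self, step: int, payload: bytes) -> Future:
        """Write to the fast tier now; return a Future for the durable copy."""
        name = f"ckpt-{step:08d}.bin"
        fast_path = os.path.join(self.fast_dir, name)
        tmp_path = fast_path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # Atomic rename: readers never observe a partially written checkpoint.
        os.replace(tmp_path, fast_path)
        return self._pool.submit(self._flush, fast_path, name)

    def _flush(self, fast_path: str, name: str) -> str:
        """Copy one checkpoint to the durable tier, again via temp file + rename."""
        dst = os.path.join(self.durable_dir, name)
        tmp_dst = dst + ".tmp"
        shutil.copyfile(fast_path, tmp_dst)
        os.replace(tmp_dst, dst)
        return dst

    def close(self) -> None:
        """Wait for outstanding flushes to finish."""
        self._pool.shutdown(wait=True)


def demo(payload: bytes = b"model-state") -> bytes:
    """Save one checkpoint and read it back from the durable tier."""
    with tempfile.TemporaryDirectory() as fast, tempfile.TemporaryDirectory() as durable:
        ckpt = TieredCheckpointer(fast, durable)
        durable_path = ckpt.save(1, payload).result()
        ckpt.close()
        with open(durable_path, "rb") as f:
            return f.read()
```

The training loop only blocks on the fast-tier write; the returned `Future` lets a coordinator confirm durability before older checkpoints are discarded.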

Checkpointing Alone Isn't Enough

At scale, coarse checkpoint intervals risk losing hours of compute; fine intervals create continuous I/O pressure. Fine-grained retry isolates failures to a shard or pipeline stage, avoiding full restarts—but requires the execution framework and storage system to jointly maintain recovery state.

Shared Responsibility

Resilience spans the entire AI pipeline—from data ingestion to model serving. Storage, execution frameworks, communication libraries, and applications must each provide well-defined guarantees. We'll outline what an ideal resilient AI storage and execution stack should look like.

Clarete Crasta

Principal Engineer

HPE

Data Movement & Placement

GPU-Initiated NVMe: Cutting the CPU Out of the Accelerator-to-Storage Data Path

Modern AI and HPC workloads move terabytes between GPUs and storage, yet every transfer typically bounces through the CPU -- adding latency, consuming host cycles, and creating a serialization bottleneck. What if the GPU could submit NVMe commands directly from shader code?

This talk presents the architecture of an open-source framework that enables GPU kernels to construct, submit, and complete NVMe I/O operations entirely from device code -- no CPU on the data path.

We describe how GPU threads build Submission Queue Entries in GPU-resident memory, ring NVMe doorbells via memory-mapped BAR regions, and poll Completion Queue Entries without host intervention. A companion Linux kernel module handles the impedance mismatch between GPU physical addresses and NVMe Physical Region Page requirements through a lightweight kprobe-based injection mechanism.
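To make the Submission Queue Entry and doorbell steps concrete, here is a host-side sketch that packs a 64-byte NVMe Read command and computes the submission queue tail doorbell offset, following the field layout in the NVMe base specification. It is not the framework's code: the function names are hypothetical, it runs on the CPU rather than in device code, and it handles only the PRP fields with no metadata pointer or fused operations.

```python
import struct

NVME_OPC_READ = 0x02  # NVM command set Read opcode
SQE_SIZE = 64         # every submission queue entry is 64 bytes


def build_read_sqe(cid: int, nsid: int, prp1: int, prp2: int,
                   slba: int, nblocks: int) -> bytes:
    """Pack a Read command into a little-endian 64-byte SQE (hypothetical helper)."""
    if not 0 <= cid <= 0xFFFF:
        raise ValueError("command identifier must fit in 16 bits")
    if not 1 <= nblocks <= 0x10000:
        raise ValueError("nblocks must be between 1 and 65536")
    cdw0 = NVME_OPC_READ | (cid << 16)  # opcode in bits 7:0, CID in bits 31:16
    cdw12 = nblocks - 1                 # NLB field is zero-based
    return struct.pack(
        "<IIQQQQQIIII",
        cdw0,     # DW0: opcode, flags, command identifier
        nsid,     # DW1: namespace identifier
        0,        # DW2-3: reserved for this command
        0,        # DW4-5: metadata pointer (unused here)
        prp1,     # DW6-7: PRP entry 1
        prp2,     # DW8-9: PRP entry 2
        slba,     # DW10-11: starting LBA
        cdw12,    # DW12: number of logical blocks (zero-based)
        0, 0, 0,  # DW13-15: unused in this sketch
    )


def sq_tail_doorbell_offset(qid: int, dstrd: int) -> int:
    """BAR0 offset of the tail doorbell for submission queue `qid`.

    `dstrd` is the doorbell stride field (CAP.DSTRD) read from the controller.
    """
    return 0x1000 + (2 * qid) * (4 << dstrd)
```

In the GPU-initiated design described above, the equivalent of `build_read_sqe` would run in device code and write into a GPU-resident queue, followed by a store of the new tail index to the doorbell offset within the mapped BAR.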

Beyond NVMe, the framework generalizes GPU-initiated I/O through a unified endpoint abstraction that extends to RDMA NICs and hardware DMA engines, giving developers a consistent API whether moving data to storage, across the network, or between accelerators. We cover the practical challenges -- fine-grained memory coherence across PCIe, wavefront-cooperative command batching for throughput, and address translation for GPU-resident buffers -- along with solutions validated on datacenter hardware.

Attendees will leave understanding how to architect accelerator-initiated storage I/O, the trade-offs between GPU-resident and host-resident queues, and how a pluggable endpoint model can unify heterogeneous I/O paths under a single developer-facing API.

Stephen Bates

Fellow

AMD

Data Movement & Placement

Characterizing and Emulating FDP SSDs with WARP

Flexible Data Placement (FDP) is the NVMe interface that hyperscalers such as Google and Meta have championed to reduce write amplification without the invasive application changes that OpenChannel and ZNS required. Major vendors are now shipping FDP-enabled SSDs. But FDP is a best-effort interface, not a guarantee: in our measurements, an adversarial three-stream workload with FDP enabled reaches 4.49× WAF on one commercial drive and 2.58× on another, for reasons rooted in vendor-specific firmware policy that the host cannot see.

This talk closes that visibility gap. Drawing on the first cross-device, cross-workload study of commercial FDP SSDs — two PCIe Gen5, NVMe 2.1 drives from different vendors, evaluated across synthetic microbenchmarks, CacheLib production traces from Meta, and F2FS filesystem workloads — we show when FDP delivers near-1 WAF and when it fails.

We identify two previously unreported behaviors. Noisy RUH: invalidations concentrated in one reclaim unit handle inflate write amplification across other handles, breaking the isolation FDP is meant to provide. We observe this pattern on both commercial devices. Save Sequential: firmware GC heuristics can prematurely reclaim long sequential streams, so even capacity-dominant sequential traffic can end up as the largest contributor to WAF. Together, these effects show how FDP's benefits can erode even when host classification looks correct.

We also report a result relevant to F2FS users: in our ten-hour Fileserver runs, 99% of user data writes were tagged WARM and funneled into a single RUH, collapsing FDP back to conventional SSD behavior. F2FS does separate node and metadata segments into different RUHs, but this separation alone is insufficient — user data needs finer-grained classification for FDP's benefit to materialize. For CacheLib, the picture is different: FDP holds WAF near 1.0 without degrading hit ratio, and a simple small-RU optimization further reduces CacheLib's WAF from 1.37 to 1.16 at 40% SOC.
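For readers less familiar with the metric, write amplification factor is the ratio of bytes written to the flash media to bytes written by the host. The short sketch below applies that standard definition to the CacheLib figures quoted above; the helper names are hypothetical and the calculation is ours, not a result reported by the authors.

```python
def waf(media_bytes_written: float, host_bytes_written: float) -> float:
    """Write amplification factor: media writes divided by host writes."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return media_bytes_written / host_bytes_written


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Going from WAF 1.37 to 1.16 for the same host traffic:
total_media_write_drop = relative_reduction(1.37, 1.16)         # all media writes
extra_write_drop = relative_reduction(1.37 - 1.0, 1.16 - 1.0)   # writes beyond host data
```

Under this arithmetic, total media writes fall by about 15%, while the device-generated writes beyond the host's own data fall by about 57%.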

To make these effects reproducible and to explore policies real firmware does not expose, we built WARP (Write Amplification Research Platform), the first open FDP emulator. WARP reproduces the WAF trends of both commercial devices while exposing configurable policies hidden in real hardware: Initially Isolated vs. Persistently Isolated semantics, RU size, over-provisioning ratio, RUH count, and GC policy. Using WARP, we map the II vs. PI tradeoff and show that PI outperforms II only above a device-dependent OP threshold (~7–9% for 256 MB RUs); below that, II is more resilient under limited slack. WARP is upstreamed to FEMU and available for community use.

This work was done in collaboration with Samsung Electronics and Western Digital. Attendees working on flash caches, filesystems, and SSD firmware will leave with concrete, measurement-backed guidance on when FDP helps, when it does not, and how configuration choices shape the outcome.

SNIA Developer Conference September 28-30, 2026

SDC 2026 is brought to you by SNIA. SNIA is an industry association committed to its mission of worldwide leadership developing and promoting architectures, standards, education and vendor-neutral collaboration.