GPU-Initiated NVMe: Cutting the CPU Out of the Accelerator-to-Storage Data Path

Abstract

Modern AI and HPC workloads move terabytes between GPUs and storage, yet transfers typically bounce through the CPU -- adding latency, consuming host cycles, and creating a serialization bottleneck. What if the GPU could submit NVMe commands directly from shader code?

This talk presents the architecture of an open-source framework that enables GPU kernels to construct, submit, and complete NVMe I/O operations entirely from device code -- no CPU on the data path. 

We describe how GPU threads build Submission Queue Entries in GPU-resident memory, ring NVMe doorbells via memory-mapped BAR regions, and poll Completion Queue Entries without host intervention. A companion Linux kernel module handles the impedance mismatch between GPU physical addresses and NVMe Physical Region Page requirements through a lightweight kprobe-based injection mechanism.
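The submit/doorbell/poll sequence described above can be sketched in a few lines. What follows is a host-side C++ simulation under stated assumptions, not the framework's API: `QueuePair`, `submit_read`, `poll_completion`, `fake_controller_step`, and `simulate_roundtrip` are hypothetical names, the doorbells are ordinary variables rather than BAR0 registers, and a stand-in function plays the controller. Only the 64-byte SQE and 16-byte CQE layouts and the phase-tag polling rule are taken from the NVMe specification.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// 64-byte NVMe Submission Queue Entry (common fields, NVM Read/Write usage).
struct Sqe {
    uint8_t  opcode;   // 0x02 = Read, 0x01 = Write
    uint8_t  flags;    // FUSE / PSDT; 0 selects PRPs
    uint16_t cid;      // command identifier, echoed back in the CQE
    uint32_t nsid;     // namespace ID
    uint64_t rsvd;
    uint64_t mptr;     // metadata pointer
    uint64_t prp1;     // first Physical Region Page entry
    uint64_t prp2;     // second PRP entry or PRP list pointer
    uint32_t cdw10;    // starting LBA, low 32 bits
    uint32_t cdw11;    // starting LBA, high 32 bits
    uint32_t cdw12;    // bits 15:0 = number of logical blocks, 0's based
    uint32_t cdw13, cdw14, cdw15;
};

// 16-byte NVMe Completion Queue Entry.
struct Cqe {
    uint32_t dw0, dw1;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cid;
    uint16_t status;   // bit 0 = phase tag, bits 15:1 = status field
};

static_assert(sizeof(Sqe) == 64, "SQE must be 64 bytes");
static_assert(sizeof(Cqe) == 16, "CQE must be 16 bytes");

// Hypothetical per-queue state. On real hardware the doorbell pointers would
// map BAR0 + 0x1000 + (2*qid) * (4 << CAP.DSTRD) for the SQ tail and the
// next stride slot for the CQ head.
struct QueuePair {
    Sqe* sq = nullptr;
    volatile Cqe* cq = nullptr;
    volatile uint32_t* sq_tail_db = nullptr;
    volatile uint32_t* cq_head_db = nullptr;
    uint16_t depth = 0;
    uint16_t sq_tail = 0;
    uint16_t cq_head = 0;
    uint16_t phase = 1;   // expected phase tag; flips on each CQ wrap
};

// Build a Read SQE at the tail slot, then ring the tail doorbell.
// Assumes nblocks >= 1 and a transfer covered by a single PRP entry.
void submit_read(QueuePair& q, uint32_t nsid, uint64_t lba,
                 uint16_t nblocks, uint64_t prp1, uint16_t cid) {
    Sqe cmd{};
    cmd.opcode = 0x02;
    cmd.cid = cid;
    cmd.nsid = nsid;
    cmd.prp1 = prp1;
    cmd.cdw10 = static_cast<uint32_t>(lba);
    cmd.cdw11 = static_cast<uint32_t>(lba >> 32);
    cmd.cdw12 = static_cast<uint32_t>(nblocks - 1);
    q.sq[q.sq_tail] = cmd;
    q.sq_tail = static_cast<uint16_t>((q.sq_tail + 1) % q.depth);
    // The SQE must be visible before the doorbell write.
    std::atomic_thread_fence(std::memory_order_release);
    *q.sq_tail_db = q.sq_tail;
}

// Check the head CQE; a matching phase tag means the entry is new.
bool poll_completion(QueuePair& q, uint16_t& cid, uint16_t& status) {
    volatile Cqe& e = q.cq[q.cq_head];
    uint16_t raw = e.status;
    if ((raw & 1u) != q.phase) return false;
    std::atomic_thread_fence(std::memory_order_acquire);
    cid = e.cid;
    status = static_cast<uint16_t>(raw >> 1);
    if (++q.cq_head == q.depth) { q.cq_head = 0; q.phase ^= 1; }
    *q.cq_head_db = q.cq_head;   // tell the controller the slot is free
    return true;
}

// Stand-in for the NVMe controller: completes the command in `slot`
// with status 0 (success) and the given phase tag.
void fake_controller_step(const Sqe* sq, Cqe* cq, uint16_t slot, uint16_t phase) {
    cq[slot].cid = sq[slot].cid;
    cq[slot].sq_head = static_cast<uint16_t>(slot + 1);
    cq[slot].status = phase;
}

// One full round trip; returns the completed command ID or a negative error.
int simulate_roundtrip() {
    Sqe sq[4] = {};
    Cqe cq[4] = {};
    uint32_t sq_db = 0, cq_db = 0;
    QueuePair q;
    q.sq = sq; q.cq = cq; q.sq_tail_db = &sq_db; q.cq_head_db = &cq_db; q.depth = 4;
    uint16_t cid = 0, status = 0xFFFF;
    if (poll_completion(q, cid, status)) return -1;   // nothing posted yet
    submit_read(q, /*nsid=*/1, /*lba=*/0x1000, /*nblocks=*/8, /*prp1=*/0xABCD0000, /*cid=*/7);
    fake_controller_step(sq, cq, 0, 1);
    if (!poll_completion(q, cid, status)) return -2;
    return (status == 0 && sq_db == 1 && cq_db == 1) ? cid : -3;
}
```

In device code the same steps would run inside a GPU kernel, with the queues placed in GPU-resident memory and the doorbell pointers mapped onto the controller's BAR. The fences here only mark where ordering matters and stand in for whatever PCIe-visible memory ordering the GPU provides.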

Beyond NVMe, the framework generalizes GPU-initiated I/O through a unified endpoint abstraction that extends to RDMA NICs and hardware DMA engines, giving developers a consistent API whether moving data to storage, across the network, or between accelerators. We cover the practical challenges -- fine-grained memory coherence across PCIe, wavefront-cooperative command batching for throughput, and address translation for GPU-resident buffers -- along with solutions validated on datacenter hardware.
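One way to picture a pluggable endpoint model is a small submit/poll contract that every backend satisfies. The sketch below is illustrative only and assumes nothing about the framework's real interface: `IoRequest`, `FakeNvmeEndpoint`, `FakeRdmaEndpoint`, `transfer_blocking`, and `demo_endpoints` are hypothetical names, and both endpoints are toy stand-ins that move no data. It uses compile-time (template) dispatch rather than virtual functions, on the assumption that static dispatch is the more natural fit for code meant to run inside GPU kernels.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical transfer descriptor shared by all endpoints.
struct IoRequest {
    uint64_t local_addr;    // GPU-resident buffer
    uint64_t remote_addr;   // LBA, remote virtual address, or peer address
    uint32_t length;        // bytes
    bool     is_write;
};

// Toy storage endpoint: every request is complete by the first poll.
struct FakeNvmeEndpoint {
    uint32_t next_ticket = 0;
    uint64_t bytes_moved = 0;
    uint32_t submit(const IoRequest& r) { bytes_moved += r.length; return next_ticket++; }
    bool poll(uint32_t ticket) { return ticket < next_ticket; }
};

// Toy network endpoint: each request needs two polls before it completes,
// so the caller's polling loop is exercised.
struct FakeRdmaEndpoint {
    uint32_t next_ticket = 0;
    uint64_t bytes_moved = 0;
    int polls_left = 0;
    uint32_t submit(const IoRequest& r) {
        bytes_moved += r.length;
        polls_left = 2;
        return next_ticket++;
    }
    bool poll(uint32_t) { return --polls_left <= 0; }
};

// Caller-facing helper: identical code path for any endpoint type that
// provides submit() and poll().
template <typename Ep>
uint32_t transfer_blocking(Ep& ep, const IoRequest& req) {
    uint32_t ticket = ep.submit(req);
    while (!ep.poll(ticket)) { /* spin; device code would poll a CQ here */ }
    return ticket;
}

// Drive both endpoints through the same helper; returns total bytes requested.
int demo_endpoints() {
    FakeNvmeEndpoint nvme;
    FakeRdmaEndpoint rdma;
    IoRequest read_req{0x1000, 0, 4096, false};
    IoRequest send_req{0x2000, 0x9000, 8192, true};
    transfer_blocking(nvme, read_req);
    transfer_blocking(rdma, send_req);
    return static_cast<int>(nvme.bytes_moved + rdma.bytes_moved);
}
```

The caller never branches on the endpoint type; adding a DMA-engine backend in this sketch would mean adding one more struct with the same two member functions.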

Attendees will leave understanding how to architect accelerator-initiated storage I/O, the trade-offs between GPU-resident and host-resident queues, and how a pluggable endpoint model can unify heterogeneous I/O paths under a single developer-facing API.