SNIA Developer Conference September 15-17, 2025 | Santa Clara, CA
The exploration of computation near flash storage has been prompted by the advent of network-attached flash-based storage enclosures operating at tens of gigabytes per second, server memory bandwidths struggling to keep up with network and aggregate I/O bandwidths, and the ever-growing need for massive data storage, management, manipulation, and analysis. Multiple tasks, from distributed analytical and indexing functions to data management tasks such as compression, erasure encoding, and deduplication, are all potentially more performant, efficient, and economical when performed near storage devices. The emerging NVMe Computational Storage standard requires real-world computational storage offload demonstrations to ensure the standard evolves in a useful direction that enables appropriate task offloads for end-user sites. Demonstrating the enablement of a standards-based ecosystem for offloading computation to near-data storage is a valued contribution to the computing, networking, and storage communities.
The goal of the joint Accelerated Box of Flash (ABOF) project (a collaboration between Eideticom, Nvidia, Aeon, SK hynix, and LANL) was to produce a first version of a network-attached computational storage system that allows a host application to directly leverage distributed and programmable computational elements in the ABOF without hiding any of the computation behind a block storage interface. Applications of interest include user-based computation, such as analytics acceleration for popular data formats, and kernel-based computation, including functions common to file systems. Both use cases leverage distributed computational offloads near storage. The team chose to first accelerate a commonly deployed kernel-based file system, ZFS, to appeal to the large ZFS community while also making it easy for vendors to deploy accelerated ZFS appliances and creating interesting business opportunities.
The technical details of how this solution works, the useful artifacts produced, and the lessons learned from developing and testing this ABOF will be presented by the partners. Some background knowledge of NVMe, computational storage, and disaggregated storage would be beneficial to the audience.
Large-scale simulations at Los Alamos can produce petabytes of data per timestep, yet the scientific focus often lies in narrow regions of interest—like a wildfire’s leading edge. Traditional HPC tools read entire datasets to extract these key features, resulting in significant inefficiencies in time, energy, and resource usage. To address this, Los Alamos—in collaboration with Hammerspace and SK hynix—is leveraging computational storage to process data closer to its source, enabling selective access to high-value information. This talk introduces a new pushdown architecture for pNFS, built on open-source tools such as Presto, DuckDB, Substrait, and Apache Arrow. It distributes complex query execution across application and storage layers, significantly reducing data movement, easing the load on downstream analytics systems, and allowing large-scale analysis on modest platforms—even laptops—while accelerating time to insight.
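A minimal sketch of the selective access described above, using DuckDB's Python API: the query touches only the columns and row groups it needs and returns results as Apache Arrow. The Parquet file and column names are hypothetical, and in the full architecture the query plan would be exchanged between layers as a Substrait plan rather than raw SQL.

```python
# Sketch: a narrow query over a large Parquet timestep. DuckDB's Parquet
# reader prunes columns and row groups, so only the region of interest is
# read; results come back as an Apache Arrow table.
import duckdb

con = duckdb.connect()
hot_cells = con.execute(
    """
    SELECT x, y, temperature
    FROM read_parquet('timestep_0042.parquet')
    WHERE temperature > 600.0  -- e.g., a fire front, not the whole domain
    """
).arrow()

print(f"{hot_cells.num_rows} rows of interest from a much larger timestep")
```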
An important enabler of this design is the ability for pNFS clients to identify which pNFS data server holds a given file, allowing queries to be routed correctly. Complementing this is a recent Linux kernel enhancement that transparently localizes remote pNFS reads when a file is detected to be local. Together, these capabilities enable efficient query offload without exposing application code to internal filesystem structures—preserving abstraction boundaries and enforcing standard POSIX permissions. We demonstrate this architecture in a real-world scientific visualization pipeline, modified to leverage pushdown to query large-scale simulation data stored in Parquet, a popular columnar format the lab is adopting for its future workloads. We conclude with key performance results and future opportunities.
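A hypothetical sketch of the routing step described above: resolve which pNFS data server holds a file, then ship the query to that server so only results cross the network. Both locate_data_server() and the /query endpoint are illustrative stand-ins for site-specific machinery, not a real pNFS or Hammerspace interface.

```python
# Hypothetical routing sketch: map file -> data server, then push the
# query down to that server. Only query results travel back, not the data.
import json
import urllib.request

def locate_data_server(path: str) -> str:
    """Placeholder for the pNFS layout lookup that maps a file to the
    data server storing it (e.g., via a control-plane service)."""
    raise NotImplementedError("site-specific layout lookup goes here")

def pushdown_query(path: str, sql: str) -> dict:
    server = locate_data_server(path)  # e.g., "http://ds03:8080"
    req = urllib.request.Request(
        f"{server}/query",             # hypothetical pushdown endpoint
        data=json.dumps({"file": path, "sql": sql}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)         # results, not raw file data
```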
While a host has been able to address NVMe device memory using the Controller Memory Buffer (CMB) and Persistent Memory Region (PMR), that memory has never been addressable by NVMe commands. NVMe introduced the Subsystem Local Memory IO Command Set (SLM), which allows NVMe device memory to be addressed by NVMe commands; however, this memory cannot be addressed by the host using host memory addresses. A new technical proposal being developed by NVM Express would allow SLM to be assigned to a host memory address range. We will describe the architecture of this new NVMe feature and discuss the benefits and use cases that host-addressable SLM enables.
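Because the technical proposal is still in development, no standard host API for this exists yet. The sketch below only illustrates what host-addressable device memory could look like, assuming the assigned SLM range were exposed to the host the way a PCIe BAR is today (as CMB and PMR are); the PCI address, BAR index, and size are hypothetical.

```python
# Illustrative sketch only, not a ratified mechanism: assume the assigned
# SLM range appears to the host as a mappable PCIe BAR resource.
import mmap
import os

SLM_WINDOW = 1 << 20  # assume a 1 MiB host-addressable SLM range

fd = os.open("/sys/bus/pci/devices/0000:3b:00.0/resource4", os.O_RDWR)
try:
    slm = mmap.mmap(fd, SLM_WINDOW)
    slm[0:4] = b"ABCD"          # host stores via plain memory access...
    assert slm[0:4] == b"ABCD"  # ...while NVMe SLM commands can address
                                # the same bytes on the device side
    slm.close()
finally:
    os.close(fd)
```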
Storage systems leverage data transformations such as compression, checksums, and erasure coding. These transformations are necessary to save capacity and protect against data loss; however, they are both memory-bandwidth and CPU intensive. This leads to a large disparity between the performance of the storage software layers and that of the storage devices backing the data, a disparity that only continues to grow as NVMe devices provide increasing bandwidth with each new PCIe generation. Computational storage devices (accelerators) provide a path forward by offloading these resource-intensive transformations to hardware designed to accelerate them.
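As a rough illustration of that software cost, the sketch below pushes a buffer through compression and checksumming on a single core; the throughput it reports is typically far below what a single modern NVMe device can stream.

```python
# Rough illustration only: software compression plus checksumming of one
# buffer on one core. The reported rate is typically well under 1 GB/s,
# while a single PCIe Gen5 NVMe device can stream far more, which is the
# disparity accelerators aim to close.
import hashlib
import os
import time
import zlib

buf = os.urandom(64 * 1024 * 1024)  # 64 MiB of incompressible test data

start = time.perf_counter()
compressed = zlib.compress(buf, 6)              # software compression
checksum = hashlib.sha256(compressed).digest()  # software checksum
elapsed = time.perf_counter() - start

print(f"compress+checksum: {len(buf) / elapsed / 1e9:.2f} GB/s on one core")
```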
However, integrating these devices into storage system software stacks has been a challenge: each accelerator has its own custom API that must be integrated directly into the storage software, which makes supporting different accelerators and maintaining custom code for each difficult. This challenge has been solved by the Data Processing Unit Services Module (DPUSM), a kernel module that provides a uniform API for storage software stacks to communicate with any accelerator. The storage software layers program against the DPUSM API, and accelerator vendors write device-specific code behind it. This separation allows accelerators to integrate seamlessly with storage system software. This talk will highlight how the DPUSM is being leveraged with the Zettabyte File System (ZFS) through the ZFS Interface for Accelerators (Z.I.A.). ZFS can now use different accelerators for data transformations, which can lead to a 16x speedup in performance.
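The sketch below transposes the DPUSM pattern into Python for brevity (the real DPUSM is a Linux kernel module with a C interface): storage code targets one uniform interface, and each vendor supplies a provider behind it. The interface and method names are illustrative, not the actual DPUSM API.

```python
# Python sketch of the DPUSM idea, not the DPUSM API itself: the storage
# stack codes against one uniform interface, and each accelerator vendor
# implements a provider behind it, so no vendor-specific code leaks into
# the storage software.
import zlib
from abc import ABC, abstractmethod

class Provider(ABC):
    """Uniform interface the storage software programs against."""

    @abstractmethod
    def compress(self, data: bytes) -> bytes: ...

    @abstractmethod
    def checksum(self, data: bytes) -> int: ...

class SoftwareProvider(Provider):
    """Fallback provider; an accelerator-backed provider would implement
    the same methods by driving its device instead."""

    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def checksum(self, data: bytes) -> int:
        return zlib.crc32(data)

def write_block(provider: Provider, data: bytes) -> bytes:
    # The storage layer is oblivious to which accelerator is underneath.
    payload = provider.compress(data)
    provider.checksum(payload)
    return payload

write_block(SoftwareProvider(), b"example block" * 1024)
```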