SNIA Developer Conference September 15-17, 2025 | Santa Clara, CA
High-performance computing data centers supporting large-scale simulation applications routinely generate enormous volumes of data. To minimize time-to-result, it is crucial that this data be promptly absorbed, processed, and potentially even multidimensionally indexed so that it can be efficiently retrieved when scientists need it for insights. Despite the transition from HDDs to all-flash storage systems in hot storage tiers, which many recently deployed systems have made to boost raw storage bandwidth, bottlenecks remain due to legacy software, severe server CPU and memory bandwidth limitations for certain data-intensive operations, and excessive data movement. Computational storage, with its ability to map and distribute storage functions to various computing units along the data processing path, offers opportunities to overcome these bottlenecks and vastly improve both performance and cost. In this talk, we will discuss various computational storage efforts carried out at Los Alamos National Laboratory in collaboration with partners including Aeon Computing, Eideticom, NVIDIA, SK hynix, and Seagate. We will explore topics such as transparent ZFS I/O pipeline offloads, analytics acceleration with flash-based key-value storage devices, and in-drive SQL-like query processing in an erasure-coded data lake tier. We will conclude by discussing lessons learned, next steps, and the need for an open, standards-based approach to computational storage in the form of an object storage system to ease development, adoption, and innovation.
Large-scale simulations at Los Alamos can produce petabytes of data per timestep, yet the scientific focus often lies in narrow regions of interest—like a wildfire’s leading edge. Traditional HPC tools read entire datasets to extract these key features, resulting in significant inefficiencies in time, energy, and resource usage. To address this, Los Alamos—in collaboration with Hammerspace and SK hynix—is leveraging computational storage to process data closer to its source, enabling selective access to high-value information. This talk introduces a new pushdown architecture for pNFS, built on open-source tools such as Presto, DuckDB, Substrait, and Apache Arrow. It distributes complex query execution across application and storage layers, significantly reducing data movement, easing the load on downstream analytics systems, and allowing large-scale analysis on modest platforms—even laptops—while accelerating time to insight.
An important enabler of this design is the ability for pNFS clients to identify which pNFS data server holds a given file, allowing queries to be routed correctly. Complementing this is a recent Linux kernel enhancement that transparently localizes remote pNFS reads when a file is detected to be local. Together, these capabilities enable efficient query offload without exposing application code to internal filesystem structures—preserving abstraction boundaries and enforcing standard POSIX permissions. We demonstrate this architecture in a real-world scientific visualization pipeline, modified to leverage pushdown to query large-scale simulation data stored in Parquet, a popular columnar format the lab is adopting for its future workloads. We conclude with key performance results and future opportunities.
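To make the pushdown idea concrete, the sketch below shows how an engine such as DuckDB exploits Parquet column projection and predicate pushdown so that only the relevant columns and row groups of a large timestep are read. It is a minimal, standalone Python illustration, not the talk's pNFS-integrated implementation; the file name, column names, and filter threshold are hypothetical.

    # Minimal pushdown illustration (hypothetical file and column names).
    # DuckDB reads only the referenced Parquet columns and uses row-group
    # statistics to skip data that cannot match the predicate, so most of
    # the dataset never has to move to the analysis host.
    import duckdb

    con = duckdb.connect()  # in-memory analytics engine

    # Pull a narrow region of interest (e.g., cells hotter than an
    # ignition threshold) out of one simulation timestep.
    cells = con.execute(
        """
        SELECT x, y, temperature
        FROM read_parquet('timestep_0042.parquet')
        WHERE temperature > 600.0
        """
    ).arrow()  # materialize as an Apache Arrow table for downstream tools

    print(cells.num_rows, "cells of interest returned")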
While a host has long been able to address NVMe device memory through the Controller Memory Buffer (CMB) and Persistent Memory Region (PMR), that memory has never been addressable by NVMe commands. NVM Express later introduced the Subsystem Local Memory I/O Command Set (SLM), which makes NVMe device memory addressable by NVMe commands; however, that memory cannot be addressed by the host using host memory addresses. A new technical proposal being developed by NVM Express would allow SLM to be assigned to a host memory address range. We will describe the architecture of this new NVMe feature and discuss the benefits and use cases that host-addressable SLM enables.
Storage systems rely on data transformations such as compression, checksums, and erasure coding to save capacity and protect against data loss. These transformations, however, are both memory-bandwidth and CPU intensive, creating a large disparity between the performance of the storage software layers and that of the storage devices backing the data. This disparity only continues to grow as NVMe devices deliver more bandwidth with each new PCIe generation. Computational storage devices (accelerators) provide a path forward by offloading these resource-intensive transformations to hardware designed to accelerate them.
However, integrating these devices into storage software stacks has been a challenge: each accelerator has its own custom API that must be wired directly into the storage software, making it difficult to support different accelerators and to maintain custom code for each. The Data Processing Unit Services Module (DPUSM), a kernel module, addresses this by providing a uniform API through which storage software stacks can communicate with any accelerator. The storage software layers call the DPUSM API, and accelerator vendors implement device-specific code behind it. This separation allows accelerators to integrate seamlessly with storage system software. This talk will highlight how the DPUSM is being leveraged by the Zettabyte File System (ZFS) through the ZFS Interface for Accelerators (Z.I.A.), which lets ZFS use different accelerators for data transformations and can deliver a 16x performance speedup.
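The separation the DPUSM enforces can be sketched in miniature: storage software is written against one narrow interface, and each vendor supplies a provider behind it. The Python sketch below only illustrates that pattern; the real DPUSM is a Linux kernel module with a C interface, and all class and method names here are hypothetical.

    # Conceptual illustration of the "one uniform API, many providers"
    # pattern; names are hypothetical, not the DPUSM kernel interface.
    from abc import ABC, abstractmethod
    import zlib

    class OffloadProvider(ABC):
        """Uniform interface the storage software codes against."""
        @abstractmethod
        def compress(self, data: bytes) -> bytes: ...
        @abstractmethod
        def checksum(self, data: bytes) -> int: ...

    class SoftwareProvider(OffloadProvider):
        """Fallback provider: runs the transformations on the host CPU."""
        def compress(self, data: bytes) -> bytes:
            return zlib.compress(data)
        def checksum(self, data: bytes) -> int:
            return zlib.crc32(data)

    def write_block(provider: OffloadProvider, block: bytes):
        # The storage layer never knows which accelerator (if any)
        # sits behind the provider interface.
        return provider.compress(block), provider.checksum(block)

    compressed, crc = write_block(SoftwareProvider(), b"simulation output" * 1024)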