SNIA Developer Conference September 15-17, 2025 | Santa Clara, CA
TBD
The Grand Unified File Index (GUFI) is a toolset built upon established technologies that helps manage massive amounts of filesystem metadata, a task that grows more important as the volume of data under management grows. GUFI provides powerful query capabilities via SQL that far surpass those of standard POSIX tools such as find, ls, and stat, and it returns results much faster than those tools by running in parallel rather than single threaded. GUFI allows both filesystem administrators and users to query the same index without loss of security. All of this capability is available for any filesystem accessible via the POSIX filesystem interface (or that can be mapped to it), so users do not need to be experts in filesystem-specific tools. In addition to general improvements to the codebase, recent developments have introduced a series of features that, when combined, make GUFI a Swiss Army knife for data (in addition to metadata) processing while maintaining need-to-know: exposing GUFI trees as single tables via the SQLite virtual table mechanism, which allows SQLAlchemy to access GUFI trees; the ability to run arbitrary commands whose results are returned as either individual values or whole tables; and the addition of SQLite-native vector embedding generation and tables, enabling nearest-neighbor searches of user data. These additions have turned GUFI from a command-line-only tool into a full-stack tool whose features are also usable from fully featured front-end GUIs. LA-UR-25-24971
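To make the SQL query capability concrete, the minimal sketch below reads a single GUFI per-directory index database using the stock SQLite C API. It assumes GUFI's layout of one SQLite database per indexed directory containing an entries table with name and size columns; the file name and exact schema here are illustrative assumptions rather than a definitive description of GUFI, and a real query via gufi_query walks the whole tree in parallel rather than one database at a time.

/*
 * Sketch: query one GUFI-style per-directory index with the SQLite C API.
 * Assumed (not confirmed by the abstract): a per-directory database file
 * with an "entries" table holding "name" and "size" columns.
 * Build: cc gufi_sketch.c -lsqlite3
 */
#include <stdio.h>
#include <stdlib.h>
#include <sqlite3.h>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <index-db> <min-size>\n", argv[0]);
        return EXIT_FAILURE;
    }

    sqlite3 *db = NULL;
    if (sqlite3_open_v2(argv[1], &db, SQLITE_OPEN_READONLY, NULL) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return EXIT_FAILURE;
    }

    /* Parameterized query: files larger than the given size, biggest first. */
    const char *sql = "SELECT name, size FROM entries "
                      "WHERE size > ? ORDER BY size DESC;";
    sqlite3_stmt *stmt = NULL;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) {
        fprintf(stderr, "prepare failed: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return EXIT_FAILURE;
    }
    sqlite3_bind_int64(stmt, 1, strtoll(argv[2], NULL, 10));

    /* Print each matching row as "name<TAB>size". */
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        printf("%s\t%lld\n",
               (const char *)sqlite3_column_text(stmt, 0),
               (long long)sqlite3_column_int64(stmt, 1));
    }

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return EXIT_SUCCESS;
}

Because the per-directory databases are ordinary SQLite files, the same query can be issued from any SQLite client, which is what makes fronting the index with virtual tables and SQLAlchemy a natural extension.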
As hardware layers in storage systems (such as network and storage devices) continue to increase in performance, it is vital that the IO software stack not fall behind and become the bottleneck. Leveraging computational storage devices, such as data processing units (DPUs), allows the IO software stack to accelerate CPU- and memory-bandwidth-constrained operations and thereby take full advantage of the storage hardware in the system. DPUs can make storage systems more efficient, but it has proven difficult to integrate them into IO software stacks to produce production-quality accelerated storage systems. Los Alamos National Laboratory (LANL) has worked on streamlining the integration of DPUs in order to improve file system performance and reduce the storage footprint on storage devices. The Data Processing Unit Services Module (DPU-SVC) was created to provide a standardized interface for DPUs to plug into, as well as a standardized interface for DPU consumers to call. The initial DPU consumer targeted was the open-source Zettabyte File System (ZFS). ZFS is a commonly used backend for LANL's parallel filesystems due to its rich set of data transformations (compression, deduplication) and data integrity functions (checksums, erasure coding). The ZFS Interface for Accelerators (Z.I.A.), which uses the consumer-facing standard interface of the DPU-SVC, was added to ZFS to allow data to be moved out of ZFS and into DPUs. These changes allow for transparent acceleration of ZFS operations: users do not have to modify their applications to enjoy the benefits provided by DPUs in the storage system. Using DPUs in coordination with the ZFS code base can increase overall write throughput to LANL storage systems by a factor of 10 to 30 over current storage system performance. This increase is achieved by moving ZFS operations originally implemented in software, such as compression, checksumming, and erasure coding, to hardware-accelerated implementations, which in turn frees up CPU and memory bandwidth for user applications. This talk will present technical details of how the DPU-SVC and Z.I.A. accelerate the ZFS IO stack through attached DPUs. Results will also be presented showing write performance improvements from using these layers with LANL scientific data sets and storage systems. Some background knowledge of computational storage, ZFS, and kernel modules would be beneficial to the audience.
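Since the DPU-SVC API itself is not spelled out in the abstract, the sketch below is a hypothetical illustration of the pattern it describes: a provider-facing table that a DPU driver registers, a consumer-facing call (as Z.I.A. would make from inside ZFS), and a transparent software fallback so the consumer behaves identically whether or not hardware is present. Every name here (dpu_provider, dpu_register_provider, dpu_checksum) is invented for this sketch and is not the actual DPU-SVC interface.

/*
 * Hypothetical DPU-SVC-style offload pattern: register a provider,
 * dispatch to it from the consumer path, fall back to software on
 * absence or failure.  Illustration only; not the real DPU-SVC API.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Provider-facing interface: a DPU driver fills in and registers these hooks. */
struct dpu_provider {
    const char *name;
    /* Returns 0 on success; nonzero means "fall back to software". */
    int (*checksum)(const void *buf, size_t len, uint64_t *out);
};

static const struct dpu_provider *active_provider = NULL;

void dpu_register_provider(const struct dpu_provider *p)
{
    active_provider = p;
}

/* Placeholder software path (a trivial additive checksum, standing in for
 * the real in-kernel implementation such as ZFS's fletcher checksums). */
static int sw_checksum(const void *buf, size_t len, uint64_t *out)
{
    const uint8_t *b = buf;
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += b[i];
    *out = sum;
    return 0;
}

/* Consumer-facing call: offload if a provider is registered and succeeds,
 * otherwise run the software path, so callers never see a difference. */
int dpu_checksum(const void *buf, size_t len, uint64_t *out)
{
    if (active_provider && active_provider->checksum &&
        active_provider->checksum(buf, len, out) == 0)
        return 0;
    return sw_checksum(buf, len, out);
}

int main(void)
{
    const char data[] = "example record";
    uint64_t csum = 0;

    /* No provider registered: transparently uses the software fallback. */
    dpu_checksum(data, sizeof data, &csum);
    printf("checksum (software fallback): %llu\n", (unsigned long long)csum);
    return 0;
}

The key design point this pattern captures is the transparency claim in the abstract: because the consumer-facing call is the same with or without hardware, applications above ZFS need no changes to benefit when a DPU is present.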