Large-scale simulations at Los Alamos can produce petabytes of data per timestep, yet the scientific focus often lies in narrow regions of interest—like a wildfire’s leading edge. Traditional HPC tools read entire datasets to extract these key features, resulting in significant inefficiencies in time, energy, and resource usage. To address this, Los Alamos—in collaboration with Hammerspace and SK hynix—is leveraging computational storage to process data closer to its source, enabling selective access to high-value information.

This talk introduces a new pushdown architecture for pNFS, built on open-source tools such as Presto, DuckDB, Substrait, and Apache Arrow. It distributes complex query execution across application and storage layers, significantly reducing data movement, easing the load on downstream analytics systems, and allowing large-scale analysis on modest platforms—even laptops—while accelerating time to insight.
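
As a rough illustration of the pushdown idea at the query layer, the sketch below uses DuckDB's Python API to read a Parquet file with a projection and a filter that the engine folds into the scan itself. The file and column names are invented for the example, and in the architecture described here the scan would execute on the storage side rather than in the client process:

    import duckdb

    # Invented file and column names, for illustration only. In the
    # pushdown architecture the Parquet scan would run on the pNFS data
    # server that holds the file, not in the analysis process.
    con = duckdb.connect()
    query = """
        SELECT x, y, temperature
        FROM read_parquet('timestep_0042.parquet')
        WHERE temperature > 600.0
    """
    # EXPLAIN shows the projection and filter folded into the Parquet
    # scan, so only the needed columns and row groups are read.
    print(con.sql("EXPLAIN " + query))
    result = con.sql(query).arrow()  # results come back as Apache Arrow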

An important enabler of this design is the ability for pNFS clients to identify which pNFS data server holds a given file, allowing queries to be routed correctly. Complementing this is a recent Linux kernel enhancement that transparently localizes remote pNFS reads when a file is detected to be local. Together, these capabilities enable efficient query offload without exposing application code to internal filesystem structures—preserving abstraction boundaries and enforcing standard POSIX permissions.
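
A minimal sketch of the routing step follows. The helper names and the file-to-server map are hypothetical stand-ins: a real pNFS client learns which data server holds a file from the layout granted by the metadata server, and the RPC to the storage-side executor is whatever transport that executor exposes.

    from dataclasses import dataclass

    @dataclass
    class DataServer:
        host: str
        port: int

    # Hypothetical file-to-server map; a real client derives this from
    # the pNFS layout it obtains from the metadata server.
    LAYOUT = {"/proj/sim/timestep_0042.parquet": DataServer("ds3.example", 7070)}

    def locate_data_server(path: str) -> DataServer:
        # Stand-in for the pNFS layout lookup described above.
        return LAYOUT[path]

    def push_down(path: str, plan: bytes) -> None:
        # Route a serialized query plan (e.g. Substrait) to the data
        # server holding the file, so the scan runs next to the data and
        # only the filtered result crosses the network.
        server = locate_data_server(path)
        print(f"routing plan ({len(plan)} bytes) to {server.host}:{server.port}")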

We demonstrate this architecture in a real-world scientific visualization pipeline, modified to push queries down to large-scale simulation data stored in Parquet, a popular columnar format the lab is adopting for its future workloads. We conclude with key performance results and future opportunities.
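
For a sense of why Parquet suits this access pattern even before any pushdown machinery is involved, a reader can select columns and prune row groups client-side. The sketch below uses pyarrow, again with invented file and column names:

    import pyarrow.parquet as pq

    # Invented names: read only the columns needed for the region of
    # interest, and let Parquet's row-group statistics skip data that
    # cannot match the predicate.
    table = pq.read_table(
        "timestep_0042.parquet",
        columns=["x", "y", "temperature"],      # projection: skip other columns
        filters=[("temperature", ">", 600.0)],  # predicate: prune row groups
    )
    print(table.num_rows)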
