Scientific Data Powered by User-Defined Functions | SNIA

Abstract

Scientific data is commonly represented by static datasets. Such data comes from a myriad of sources: snapshots of the Earth captured by satellites, sonars, and laser scanners; the output of simulation models; aggregations of existing datasets; and more. Consumers of that data face several problems: data transformation (e.g., harmonizing data layout and format), inefficient data transfer over the network (for instance, creating derivative data leads to even larger datasets), expensive data processing pipelines (which often preprocess data in advance, hoping that the output will be consumed by applications at some point), high CPU utilization, and difficulty tracking data lineage (especially when data processing scripts are lost for good).

This talk presents a pluggable extension to HDF5, a file format used extensively by the scientific community, that enables the attachment of user-defined functions to help solve many of these problems. Dubbed HDF5-UDF, the project allows scripts written in Lua, C/C++, and Python to be attached to existing HDF5 files. Such scripts are disguised as regular datasets: once read, they execute the associated source code and produce data in a format that existing applications already know how to consume.

HDF5-UDF has been designed from scratch to enable seamless integration with hardware accelerators. Its memory allocation model allows input datasets (i.e., datasets that a user-defined function depends on) to be allocated in device memory (such as that of a SmartSSD), and its backend architecture enables the incorporation of different programming languages and compiler suites (such as Xilinx's Vitis compiler or Nvidia's nvcc). The talk will present current and future work in the context of computational storage. To date, HDF5-UDF is tied to HDF5, but it is feasible to port this work to other formats such as TIFF (and GeoTIFF); some guidance will be provided for those wishing to experiment with different data formats.

Last but not least, HDF5-UDF is open source. All examples shown in the presentation are readily available for download, modification, and testing. Please visit https://github.com/lucasvr/hdf5-udf and https://hdf5-udf.readthedocs.io for more details on the project covered by this talk.
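
The idea of a script disguised as a regular dataset can be illustrated with a small, purely conceptual sketch. It does not use HDF5 or the HDF5-UDF API: the `ToyFile` class, its methods, the dataset names, and the `dynamic_dataset(read)` entry-point signature are all hypothetical and exist only for this illustration. The sketch assumes a file that stores either values or source text under a name, and runs the source text only when that name is read.

```python
from typing import Callable, Dict, List, Optional


class ToyFile:
    """Hypothetical in-memory stand-in for a data file (not the HDF5-UDF API)."""

    def __init__(self) -> None:
        # Regular datasets: the values themselves are stored.
        self._stored: Dict[str, List[int]] = {}
        # Virtual datasets: only the source text of a function is stored.
        self._udf_source: Dict[str, str] = {}

    def create_dataset(self, name: str, values: List[int]) -> None:
        self._stored[name] = list(values)

    def attach_udf(self, name: str, source: str) -> None:
        self._udf_source[name] = source

    def read(self, name: str) -> List[int]:
        """Readers call this the same way for stored and virtual datasets."""
        if name in self._stored:
            return list(self._stored[name])
        # Virtual dataset: execute the attached source on demand.
        namespace: dict = {}
        exec(self._udf_source[name], namespace)
        entry_point: Callable = namespace["dynamic_dataset"]
        return entry_point(self.read)

    def lineage(self, name: str) -> Optional[str]:
        """Return the source that produces a virtual dataset, if any."""
        return self._udf_source.get(name)


# Source text kept alongside the data; the entry-point name is this sketch's convention.
BAND_SUM_UDF = '''
def dynamic_dataset(read):
    # Input datasets are fetched by name and combined element by element.
    red, nir = read("red"), read("nir")
    return [r + n for r, n in zip(red, nir)]
'''

if __name__ == "__main__":
    f = ToyFile()
    f.create_dataset("red", [1, 2, 3])
    f.create_dataset("nir", [4, 5, 6])
    f.attach_udf("band_sum", BAND_SUM_UDF)
    print(f.read("band_sum"))  # prints [5, 7, 9]
```

In this sketch the derived values are never stored or transferred ahead of time, and the code that produces them travels with the file, which is the property the abstract connects to smaller datasets and recoverable data lineage.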

Learn about challenges in data processing pipelines
Present data virtualization techniques
Show how software stacks are embracing computational storage