SkyhookDM: An Arrow-Native Storage System | SNIA

As dataset sizes keep growing, file formats such as Parquet, ORC, and Avro have been developed to store data efficiently and to save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 requests per second, the CPU has become the bottleneck as it tries to keep up with moving data in and out of these fast devices. As a result, data access libraries executed on single clients are often CPU-bound and cannot exploit the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires an extensive understanding of their internals. SkyhookDM introduces a new design paradigm for building computational storage systems by extending existing storage systems with plugins. Our design extends programmable object storage systems by embedding existing, widely used data processing frameworks and access libraries into the storage layer with minimal modifications. In this approach, data processing frameworks and access libraries can evolve independently of storage systems while leveraging the scale-out, availability, and failure recovery properties of distributed storage systems. SkyhookDM is a data management system that allows data processing tasks to be offloaded to the storage layer, reducing client-side resource usage in terms of CPU, memory, and network traffic for increased scalability and reduced latency. On the storage side, SkyhookDM uses the existing Ceph object class mechanism to embed Apache Arrow libraries in the Ceph OSDs and uses C++ methods to perform data processing within the storage nodes.
On the client side, the Arrow Dataset API is extended with a new file format that bypasses the Ceph filesystem layer and invokes storage-side Ceph object class methods on the objects that make up a file in the filesystem layer. SkyhookDM currently supports Parquet as its object storage format, but support for other file formats can be added easily because it builds on the Arrow access libraries.
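
The offloading idea described above can be illustrated with a small, self-contained sketch. This is a toy model under stated assumptions, not SkyhookDM, Ceph, or Arrow code: `ToyStorageNode`, `scan`, `rows_sent`, and the object name `part0` are all hypothetical names invented for illustration. It only shows why running a filter and projection next to the data can reduce what crosses the network compared with reading the whole object and filtering on the client.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


def scan(rows: List[Row], columns: List[str], min_value: int) -> List[Row]:
    """Keep rows with value >= min_value and project the requested columns."""
    return [{c: r[c] for c in columns} for r in rows if r["value"] >= min_value]


@dataclass
class ToyStorageNode:
    """Hypothetical stand-in for a storage node that can run registered
    methods on its own objects (loosely analogous to an object class)."""
    objects: Dict[str, List[Row]] = field(default_factory=dict)
    methods: Dict[str, Callable[..., List[Row]]] = field(default_factory=dict)
    rows_sent: int = 0  # crude proxy for network traffic

    def read(self, oid: str) -> List[Row]:
        # Plain read: every row of the object is sent to the client.
        rows = list(self.objects[oid])
        self.rows_sent += len(rows)
        return rows

    def call(self, oid: str, method: str, **kwargs) -> List[Row]:
        # Offloaded call: the method runs on the node, only the result is sent.
        result = self.methods[method](self.objects[oid], **kwargs)
        self.rows_sent += len(result)
        return result


node = ToyStorageNode()
node.objects["part0"] = [
    {"id": i, "value": i * 10, "payload": "x" * 100} for i in range(1000)
]
node.methods["scan"] = scan

# Client-side processing: fetch everything, then filter and project locally.
client_side = scan(node.read("part0"), ["id", "value"], 9900)
rows_for_client_scan = node.rows_sent

# Storage-side processing: ask the node to filter and project before replying.
node.rows_sent = 0
offloaded = node.call("part0", "scan", columns=["id", "value"], min_value=9900)
rows_for_offloaded_scan = node.rows_sent

print(rows_for_client_scan, rows_for_offloaded_scan)
```

Both paths return the same rows; the difference is how many rows leave the storage node. In this sketch the query is described by plain parameters (`columns`, `min_value`) rather than by client code shipped to the node, which is one simple way to keep the storage-side method self-contained.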

2021-09-29

PDF Presentation

SNIA-SDC21-Chakraborty-Maltzahn-Skyhookdm-An-Arrow-Native-Storage-System.pdf (1.39 MB)

Presentation Type

Presentation

Learning Objectives

Scale out data processing with compute offloading
Bypass filesystem layer to access and manipulate underlying objects in object stores
Extend storage systems with object class mechanisms
Embed and use access libraries like Arrow for data processing tasks
File layout for storing big data file formats in distributed object stores

Start Date/Time

Wed, 09/29/2021 - 12:00

YouTube Video ID

pQRK_U7HHyw

Main Speaker / Moderator

Jayjeet Chakraborty

Track

Computational Storage
