SkyhookDM: An Arrow-Native Storage System

Publish Date: Wednesday, September 29, 2021

As dataset sizes keep growing, file formats such as Parquet, ORC, and Avro have been developed to store data efficiently and to save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 requests per second, the CPU has become the bottleneck as it tries to keep up with feeding data in and out of these fast devices. The result is that data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires an extensive understanding of their internals.

SkyhookDM introduces a new design paradigm for building computational storage systems by extending existing storage systems with plugins. Our design allows extending programmable object storage systems by embedding existing, widely used data processing frameworks and access libraries into the storage layer with minimal modifications. In this approach, data processing frameworks and access libraries can evolve independently from storage systems while leveraging the scale-out, availability, and failure recovery properties of distributed storage systems.

SkyhookDM is a data management system that allows offloading data processing tasks to the storage layer, reducing client-side CPU, memory, and network usage for increased scalability and reduced latency. On the storage side, SkyhookDM uses the existing Ceph object class mechanism to embed Apache Arrow libraries in the Ceph OSDs and uses C++ methods to perform data processing within the storage nodes. On the client side, the Arrow Dataset API is extended with a new file format that bypasses the Ceph filesystem layer and invokes storage-side Ceph object class methods on the objects that make up a file in the filesystem layer. SkyhookDM currently supports Parquet as its object storage format, but support for other file formats can be added easily because it relies on the Arrow access libraries.
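
The effect of offloading filtering and projection to the storage layer can be illustrated with a small, self-contained simulation. This is only a toy sketch and not SkyhookDM code: the `StorageObject` class, its `scan` method, the `cells` helper, and the sample table are all hypothetical stand-ins, and no Ceph or Arrow API is used. It compares fetching every row and filtering on the client against letting each object filter and project its own rows before returning them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class StorageObject:
    """Hypothetical object in an object store, holding a slice of a table."""
    rows: List[Row]

    def read_all(self) -> List[Row]:
        # Plain read: every row and every column is shipped to the client.
        return list(self.rows)

    def scan(self, columns: List[str], predicate: Callable[[Row], bool]) -> List[Row]:
        # Offloaded scan: filter and project next to the data,
        # so only the reduced result is shipped to the client.
        return [{c: r[c] for c in columns} for r in self.rows if predicate(r)]


def cells(rows: List[Row]) -> int:
    """Crude stand-in for bytes transferred: the number of values shipped."""
    return sum(len(r) for r in rows)


# A made-up table of 1,000 rows striped across 4 objects.
table = [{"id": i, "temp": i % 50, "site": i % 7} for i in range(1000)]
objects = [StorageObject(table[i::4]) for i in range(4)]

wanted = ["id", "temp"]


def hot(row: Row) -> bool:
    return row["temp"] >= 45


# Client-side processing: fetch everything, then filter and project locally.
fetched = [r for o in objects for r in o.read_all()]
client_result = [{c: r[c] for c in wanted} for r in fetched if hot(r)]

# Storage-side processing: each object filters and projects its own rows.
offloaded_result = [r for o in objects for r in o.scan(wanted, hot)]


def by_id(row: Row) -> int:
    return row["id"]


same = sorted(client_result, key=by_id) == sorted(offloaded_result, key=by_id)
print(same, cells(fetched), cells(offloaded_result))
```

Both paths produce the same rows, but the offloaded path ships only the matching rows and requested columns, and the filtering work is spread across the objects instead of being concentrated on one client. In SkyhookDM, according to the abstract, the storage-side role sketched here is played by Ceph object class methods that embed the Arrow libraries.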

  • Scale out data processing with compute offloading
  • Bypass filesystem layer to access and manipulate underlying objects in object stores
  • Extend storage systems with object class mechanisms
  • Embed and use access libraries like Arrow for data processing tasks
  • File layout for storing big data file formats in distributed object stores
