Rethinking Distributed Storage System Architecture for Fast Storage Devices

Library Content Type:
Publish Date: 
Tuesday, September 22, 2020
Event Name: 
Event Track:

Storage devices have been drastically evolved for the last decade. However, the advent of fast storage technology poses unprecedented challenges in the software stack; the performance bottleneck is shifted from storage devices to software. For instance, a modern large-scale storage system usually consists of a number of storage nodes connected via network and they communicate with each other all the time for cluster-level consistency and availability. Furthermore, it is common for a storage server to have multiple NVMe SSDs, which consequently increases the amount of work for I/O processing by multiple times. Because of all these reasons, storage nodes tend to lack CPU resources especially when handling small random I/Os. In this talk, we propose a new design of distributed storage system for fast storage devices that focuses on minimizing CPU consumption while achieving both higher IOPS and lower latency. Our design is based on the following three ideas: Lightweight data store: Backend data store should be as lightweight as possible. We should rethink the trend of accelerating I/O at the cost of burdening the host’s CPU. For example, LSM-tree-based key-value stores sequentialize I/Os for better random write performance and for efficient device-level GC. However, this requires costly compaction process, which is known to consume non-negligible CPU power. To alleviate the burden on the host side, we have prototyped in-place update based data store. It is also partitioned so that they can be accessed in parallel without synchronization. Thread control: RTC model is a well-known technique to lower I/O latency by mitigating context switching overhead and inefficient cache operation. However, without an efficient thread control and careful partitioning of the lock space, a latency critical task would be blocked by a slow non-critical task. To avoid this problem, we propose a priority-based run-to-completion model. It runs latency-critical tasks on dedicated CPU cores, while others on remaining shared cores. Mitigating replication overhead: We propose a replication method which relies on our NVMeoF-based storage solution. Our storage solution has enough computation power to process more works than the conventional storage while providing more reliability by internal redundancy mechanism. With our storage solution, we present a way that offloads replication works to the NVMeoF-based storage solution without losing fault tolerance while reducing CPU consumption: decoupling fault domain between compute and storage node adding new mapping to existing storage system. For performance evaluation, we have implemented our design based on Ceph. Compared to the existing approach, our prototype system delivers significant performance improvement for small random write I/Os.

Learning Objectives

Distirubted storage system,Performance

Watch video: