Abstract
RAID has been the standard approach to fault tolerance across multiple disk drives for decades. However, new hardware trends, including hard disk drives (HDDs) with very large capacity and the wide adoption of solid state drives (SSDs) with fast I/O, have created new opportunities to optimize fault tolerance schemes. Windows now introduces a new fault tolerance scheme in its Storage Spaces technology. The scheme is based on a novel erasure coding technique called Local Reconstruction Code (LRC). Compared to RAID under the same durability metric, LRC significantly reduces rebuild time while keeping storage overhead very low. In addition, LRC offers much more flexibility in balancing rebuild time against storage overhead. The presentation will provide an overview of the Windows Storage Spaces technology, cover the design of its fault tolerance mechanism, discuss the implementation of LRC in detail, and share lessons learned from real-world workloads.
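As a rough illustration of the rebuild-time versus storage-overhead trade-off, the following minimal Python sketch compares one wide XOR parity with per-group local XOR parities. It is not the Storage Spaces implementation: the fragment count, the group size, and the helper names (`xor_bytes`, `local_parities`, `rebuild`) are assumptions chosen for the example, and the global parities that a real LRC adds for durability are omitted.

```python
from functools import reduce

def xor_bytes(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def local_parities(data, group_size):
    """Split the fragments into groups and compute one XOR parity per group."""
    groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
    return [xor_bytes(g) for g in groups]

def rebuild(data, parities, group_size, lost):
    """Rebuild fragment `lost` from its own group only.

    Returns the rebuilt fragment and the number of fragments read.
    """
    g = lost // group_size
    start = g * group_size
    end = min(start + group_size, len(data))
    survivors = [data[i] for i in range(start, end) if i != lost]
    reads = survivors + [parities[g]]
    return xor_bytes(reads), len(reads)

# Six hypothetical 4-byte data fragments (illustrative values only).
data = [bytes([i] * 4) for i in range(1, 7)]
lost = 4

# Two local groups of three fragments: two parities, small rebuild reads.
rebuilt_local, reads_local = rebuild(data, local_parities(data, 3), 3, lost)

# One group covering all six fragments: one parity, wide rebuild reads.
rebuilt_wide, reads_wide = rebuild(data, local_parities(data, 6), 6, lost)

assert rebuilt_local == rebuilt_wide == data[lost]
print(reads_local, reads_wide)  # prints "3 6"
```

In this toy layout the local scheme stores one extra parity fragment but halves the number of fragments read to repair a single failure. Changing the group size moves the balance between the two costs.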
Learning Objectives
Refresh my knowledge of erasure coding. (Some knowledge is assumed – this is *not* a tutorial on erasure coding. For the basics, refer to the USENIX FAST 2013 tutorial “Erasure Coding for Storage Applications” by Plank and Huang.)
Get an overview of Windows Storage Spaces technology and its fault tolerance mechanism.
Understand the implementation of LRC and its benefits in clustered storage systems.