Resilience at Scale in the Distributed Storage Cloud (2013)

Publish Date: Tuesday, September 17, 2013

The cloud is a diffuse and dynamic place to store both data and applications, unbounded by data centers and traditional IT constraints. However, adequately protecting this information still requires attention to fault domains, failure rates, and repair times, all of which are rooted in the same data centers and hardware we attempt to meld into the cloud. This talk addresses the key challenges of building a truly global data store, using examples from the Atmos cloud-optimized object store. We discuss how flexible replication and coding allow data objects to be distributed, and where automatic decisions are necessary. Automatic placement of data and redundancy across a distributed storage cloud must ensure resiliency at multiple levels, from a single node to an entire site. System expansion must occur seamlessly, without affecting data reliability or availability. Together, these features ensure data protection while fully exploiting the geographic dispersion and platform adaptability promised by the cloud.
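
As a rough illustration of placement that respects fault domains at both the node and the site level, the following Python sketch spreads copies of an object across sites before reusing any site. It is not Atmos's actual placement policy; the function names `place_replicas` and `survives_site_loss`, the `(site, node)` tuples, and the round-robin strategy are all assumptions made for this example.

```python
from collections import defaultdict


def place_replicas(nodes, copies):
    """Choose `copies` distinct nodes, spreading across sites first.

    `nodes` is a list of (site, node_id) tuples (a hypothetical inventory).
    No site receives a second copy until every site that still has an
    unused node has received one.
    """
    by_site = defaultdict(list)
    for site, node_id in nodes:
        by_site[site].append(node_id)

    sites = sorted(by_site)  # deterministic order for the example
    placement = []
    depth = 0  # index of the next unused node within each site
    while len(placement) < copies:
        progressed = False
        for site in sites:
            if depth < len(by_site[site]):
                placement.append((site, by_site[site][depth]))
                progressed = True
                if len(placement) == copies:
                    break
        if not progressed:
            raise ValueError("not enough nodes for the requested copies")
        depth += 1
    return placement


def survives_site_loss(placement):
    """True if at least one copy remains after losing any single site."""
    return len({site for site, _ in placement}) >= 2


inventory = [("nyc", "n1"), ("nyc", "n2"), ("lon", "n3"), ("tok", "n4")]
chosen = place_replicas(inventory, 3)
print(chosen, survives_site_loss(chosen))
```

A real system would also weigh capacity, failure rates, and repair times when choosing nodes; this sketch only captures the idea that copies should not share a fault domain when an alternative exists.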

Learning Objectives

Learn how to build truly large distributed storage systems
Understand fault domains and failure considerations
Understand how to reason about data resilience at large scale