Resilience at Scale in the Distributed Storage Cloud (2011) | SNIA

Abstract

The cloud is a diffuse and dynamic place to store both data and applications, unbounded by data centers and traditional IT constraints. However, adequate protection of all this information still requires consideration of fault domains, failure rates, and repair times that are rooted in the same data centers and hardware we attempt to meld into the cloud. This talk will address the key challenges to a truly global data store, using examples from the Atmos cloud-optimized object store. We discuss how flexible replication and coding allow data objects to be distributed and where automatic decisions are necessary to ensure resiliency at multiple levels. We will discuss the impact of using a virtualized infrastructure inside the cloud, noting where this does and does not change the resiliency characteristics of the complete system and how it changes the design reasoning compared to purely physical hardware. Automatic placement of data and redundancy across a distributed storage cloud must ensure resiliency at multiple levels, i.e., from a single node to an entire site. System expansion must occur seamlessly without affecting data reliability and availability. Together, these features ensure data protection while fully exploiting the geographic dispersion and platform adaptability promised by the cloud.

Learning Objectives

Understand replication and coding as used in the cloud
Understand the basics of fault domains and resilient design
Separate the promise from the reality of cloud storage
Understand resilience of virtual vs. physical infrastructure
Understand how automated, policy-driven storage increases data resiliency
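
The idea of placing redundancy across fault domains, from a single node up to an entire site, can be illustrated with a minimal sketch. It does not represent how Atmos or any particular product places data. The inventory format (a list of `(site, node_id)` pairs), the function names `place_replicas` and `survives_site_loss`, and the site names are all hypothetical, chosen only for illustration. The sketch spreads replicas across sites first and reuses a site only once every site already holds a copy.

```python
from collections import defaultdict


def place_replicas(nodes, replica_count):
    """Choose replica_count nodes, spreading across sites before reusing one.

    nodes is a list of unique (site, node_id) tuples. This inventory format
    is an assumption made for the example, not a real product's data model.
    """
    if replica_count > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")

    # Group nodes by site, the largest fault domain in this sketch.
    by_site = defaultdict(list)
    for site, node_id in nodes:
        by_site[site].append(node_id)

    placement = []
    sites = sorted(by_site)  # sorted so the result is deterministic
    depth = 0
    # Each pass takes at most one more node from every site, so a second
    # copy lands in a site only after all sites have received one.
    while len(placement) < replica_count:
        for site in sites:
            if depth < len(by_site[site]) and len(placement) < replica_count:
                placement.append((site, by_site[site][depth]))
        depth += 1
    return placement


def survives_site_loss(placement, failed_site):
    """Return True if at least one replica lives outside failed_site."""
    return any(site != failed_site for site, _ in placement)


if __name__ == "__main__":
    inventory = [
        ("boston", "n1"), ("boston", "n2"),
        ("london", "n1"), ("tokyo", "n1"),
    ]
    chosen = place_replicas(inventory, 3)
    print(chosen)
    print(all(survives_site_loss(chosen, s) for s in ("boston", "london", "tokyo")))
```

A real policy engine would also weigh capacity, latency, failure rates, and repair times. The sketch isolates only the fault-domain constraint.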