Abstract
Implementing zero-downtime upgrades of live cloud storage systems is a surprisingly complex problem that has proven difficult to fully automate. Beyond preventing availability outages, the upgrade process must proactively detect and repair errors, keep cascading failures from causing data loss, tolerate transient network communication errors, and gracefully handle disk failures that occur during device upgrades. At the scale of today’s deployments, occasional human intervention to help the process along is tolerable. With single-system, multi-exabyte deployments comprising hundreds of thousands of devices on the horizon, however, fully automated solutions are required. Please join us as we discuss the challenges inherent to upgrading cloud storage systems and how those challenges may be overcome at scale.
Learning Objectives
The challenges involved in cloud storage upgrades
Techniques to address those challenges
Ramifications at scale