Distributed Storage on Limping Hardware | SNIA

Abstract

It is easy to design storage systems that assume nothing bad ever happens. It is marginally harder to design systems that assume nodes are either available or not. What is difficult is designing storage systems that handle the way nodes actually fail in the real world. Such "limping nodes" may respond slowly, intermittently, or unpredictably; they are neither entirely failed nor entirely healthy. This presentation covers the mechanisms we developed for dealing with limping nodes in a distributed storage system. These techniques allow limping nodes to be tolerated with negligible impact on performance, latency, or reliability. We introduce some of the intelligent writing techniques we created for this purpose, including write thresholds, impatient writes, optimistic writes, real-time writes, and lock-stealing writes.

Learning Objectives

How nodes fail in the real world
What can happen if a distributed storage system doesn't handle limping nodes well
Techniques we have developed for better handling of limping nodes, and the results we have obtained
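
The abstract names write thresholds without defining them. As a rough illustration only, the sketch below assumes a write threshold means "send the write to every node, but report success as soon as a chosen number of nodes acknowledge," so that a single limping node does not hold up the operation. The names write_with_threshold and make_node, the timeout value, and the simulated delays are all hypothetical and are not taken from the presentation.

```python
import concurrent.futures as cf
import time


def make_node(delay, ok=True):
    """Hypothetical stand-in for a storage node: acks after `delay` seconds."""
    def node(payload):
        time.sleep(delay)  # simulate network and disk latency
        return ok          # True means the node acknowledged the write
    return node


def write_with_threshold(nodes, payload, threshold, timeout=1.0):
    """Send `payload` to every node; succeed once `threshold` nodes ack.

    Assumed semantics, not the presenters' implementation: slow nodes
    keep working in the background, but the caller does not wait for them.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(nodes))
    futures = [pool.submit(node, payload) for node in nodes]
    acks = 0
    try:
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.exception() is None and fut.result():
                acks += 1
            if acks >= threshold:
                return True  # enough acks; stop waiting for stragglers
    except cf.TimeoutError:
        pass  # remaining nodes were too slow to count
    finally:
        pool.shutdown(wait=False)  # do not block on limping nodes
    return acks >= threshold


# Three healthy nodes and one limping node that takes 0.5 s to respond.
healthy = [make_node(0.01) for _ in range(3)]
limping = make_node(0.5)

start = time.monotonic()
result = write_with_threshold(healthy + [limping], b"data", threshold=3)
elapsed = time.monotonic() - start
```

In this simulation the call returns once the three healthy nodes have acknowledged, so the caller's latency tracks the fast nodes, not the limping one. A real system would also need to decide what happens to the straggler's copy afterwards, which this sketch does not address.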