Abstract
Modeling the risk of disk failure in storage arrays is important for planning around and remediating disk problems. In this talk, we continue the work presented in “Characterizing the Evolution of Disk Attributes via Absorbing Markov Chains” by Rachel Traylor at SDC 2016. We review stochastic modeling of single-disk failure risk and extend it to RAID-group-level and system-level risk. We then demonstrate how our models, trained on a large collection of historical disk-level behavioral metrics, can be applied to estimate the risk of disk, RAID group, and system failures in production systems. Attendees should have a basic understanding of probability, disks, and storage arrays.