Drive Health Monitor (DHM) for Drives On-Prem (or core data center) and Cloud

Abstract

This paper reviews the analysis performed on the drive wear-out issue across E-series array systems running at different customer sites, and uses that data to give more specific guidance on thresholds for preemptive drive removal. Motivation for DHM: customers with old-vintage or refurbished replacement drives may experience a data loss event and continue to see a high drive failure rate, above 5% AFR (Annual Failure Rate). Storage vendors expect high fallout rates across the hard drive population as it ages. High utilization and drive age are likely to persist, sustaining and possibly increasing the drive failure rate.

These failures can result in:

Outages: Loss of access
Performance impact
Possible data loss
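
For readers unfamiliar with the AFR figure cited above, the following is a minimal sketch of how an annualized failure rate is commonly computed: failures divided by accumulated drive-years, expressed as a percentage. The function name and the fleet numbers are hypothetical and are not taken from the DHM data set; only the 5% threshold comes from the abstract.

```python
def annual_failure_rate(failures: int, drive_days: float) -> float:
    """Return AFR as a percentage: failures per drive-year of operation."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


# Hypothetical fleet (illustrative numbers only):
# 1,000 drives in service for a full year, 62 of which failed.
afr = annual_failure_rate(failures=62, drive_days=1000 * 365)
print(f"AFR = {afr:.1f}%")

# Compare against the 5% AFR level mentioned in the abstract.
print("above 5% threshold" if afr > 5.0 else "within 5% threshold")
```

Counting drive-days rather than drives keeps the rate meaningful when drives enter or leave the population partway through the observation window.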

Learning Objectives

Show how DHM deploys a proactive (manual and automatic) drive removal process to identify bad drives early and reduce unexpected failures
Show how the advanced drive monitoring algorithm, based on drive error metrics and scoring (offline monitoring), works
Explain the three-pronged approach DHM uses to analyze drive health (drive statistics, errors observed by the controller, and SPFA thresholds) and why it is used
Describe in detail the design of DHM via a block diagram, and how the thresholds are determined so that DHM can predict bad drives early and reduce unexpected failures
Present plots and scoring tables of historical data for different drives monitored by DHM, and share the expected lifetime for drives recommended for removal
