Debugging of Flash Issues Observed in Hyperscale Environment at Scale

Salon VI

Mon Sep 12 | 11:35am

Abstract

A deep dive into the methodology and tooling that we use at Meta to improve the debuggability of failures in our datacenters, especially failures of components like SSDs, where privacy requirements might prohibit us from sending components back for failure analysis (FA) or adding custom instrumentation in our datacenters. In particular, we will talk about how the tracewatch tool, coupled with the Latency Monitoring log page, helps us trigger trace collection on failures using BPF-based triggers. We will present the retrace tool, which can then be used to analyze the captures in a variety of formats, convert between the different formats, and filter down to the stack of a single I/O from the application layer down to the drive. We will also present dialog, our collection mechanism for file-system-based logging, and the sanitization process. Finally, we will talk about ways in which we’re collaborating with the industry to design efficient logging built into flash drives.

Learning Objectives

Flash reliability at scale in a hyperscale environment
Types of Flash issues we see in a production environment
Methods to efficiently debug some of the most challenging application-level issues seen in production
Design of better logging for efficient debugging

Vineet Parekh

Hardware Systems Engineer
