A deep dive into the methodology and tooling we use at Meta to improve the debuggability of failures in our datacenters, especially failures in components such as SSDs, where privacy requirements may prohibit us from sending the components back for failure analysis (FA) or adding custom instrumentation in the datacenter. In particular, we will talk about how the tracewatch tool, coupled with the Latency Monitoring log page, helps us trigger trace collection on failures using BPF-based triggers. We will present the retrace tool, which can then be used to analyze the captures in a variety of formats, convert between formats, and filter down to the stack of a single I/O from the application layer down to the drive. We will also present dialog, our collection mechanism for file-system-based logging, and the associated sanitization process. Finally, we will talk about the ways in which we’re collaborating with the industry to design efficient logging built into flash drives.

Presentation Type
Presentation
Learning Objectives
  • Flash reliability at scale in a hyperscale environment
  • Types of Flash issues we see in a production environment
  • Methods to efficiently debug some of the most challenging application-level issues seen in production
  • Design of better logging for efficient debugging
YouTube Video ID
gSfVhBc4g-E
Room Location
Salon VI
Salon VII