Sorry, you need to enable JavaScript to visit this website.

Take a Break: A Deep Dive into Checkpointing

Abstract

Checkpointing is a critical component of large-scale AI training, enabling resilience, efficiency, and flexibility in modern machine learning workflows. We will explore the issues that come with the scaling up of training model sizes and take a deep dive into the workload pattern of checkpointing and how the right storage can reduce the downtime of AI training.


 

DOWNLOAD PDF