
Take a Break: A Deep Dive into Checkpointing

Grand Mesa A-B

Wed Apr 29 | 4:40pm

Abstract

Checkpointing is a critical component of large-scale AI training, enabling resilience, efficiency, and flexibility in modern machine learning workflows. We will explore the issues that arise as model sizes scale up, then take a deep dive into the checkpointing workload pattern and show how the right storage can reduce AI training downtime.
