Checkpointing is a critical component of large-scale AI training, enabling resilience, efficiency, and flexibility in modern machine learning workflows. In this article, we will explore the challenges that arise as model sizes scale up, take a deep dive into the workload pattern of checkpointing, and show how the right storage can reduce AI training downtime.