SNIA Developer Conference, September 15-17, 2025 | Santa Clara, CA

Towards building flexible, efficient, and resilient training with adaptive checkpointing on AMD GPU platforms

Abstract

Generative AI training is rapidly scaling in model size, data volume, and sequence length, requiring many GPU instances to accommodate larger models. Distributed and parallel training strategies partition the training state across GPUs to support large-scale model training. As models and datasets grow, scaling the infrastructure becomes a critical challenge; as that infrastructure scales, the Mean Time Between Failures (MTBF) decreases, leading to more frequent job failures. Efficient failure recovery, i.e., resuming training quickly from a checkpoint, is therefore crucial. Because distributed checkpoints are tightly coupled to the parallelism configuration under which they were saved, existing systems offer limited support for reconfiguring parallelism mid-training, slowing recovery after hardware failures or GPU re-allocation.
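To make that coupling concrete, the sketch below illustrates the basic idea of checkpoint resharding: a weight matrix saved as per-rank tensor-parallel shards is merged into its full form and re-split for a different parallelism degree. The sharding scheme, dictionary layout, and function names are illustrative assumptions for this abstract, not UCP's actual on-disk format.

    # Illustrative sketch of checkpoint resharding (not UCP's actual format).
    import torch

    def save_sharded(weight: torch.Tensor, tp_degree: int) -> list:
        # Each "rank" keeps only its own column shard, as distributed
        # checkpointing schemes typically do.
        shards = torch.chunk(weight, tp_degree, dim=1)
        return [{"rank": r, "tp_degree": tp_degree, "weight": s.clone()}
                for r, s in enumerate(shards)]

    def reshard(ckpt_shards: list, new_tp_degree: int) -> list:
        # "Universal"-style step: merge rank-local shards into the full
        # tensor, then re-split for the new parallelism degree.
        ordered = sorted(ckpt_shards, key=lambda s: s["rank"])
        full = torch.cat([s["weight"] for s in ordered], dim=1)
        return list(torch.chunk(full, new_tp_degree, dim=1))

    weight = torch.randn(8, 16)
    ckpt = save_sharded(weight, tp_degree=4)      # trained on 4 GPUs
    new_shards = reshard(ckpt, new_tp_degree=2)   # resume on 2 GPUs
    assert torch.equal(torch.cat(new_shards, dim=1), weight)

Without such a merge-and-resplit step, a checkpoint written for one parallelism degree cannot be loaded under another, which is exactly what makes mid-training reconfiguration difficult.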

In this talk, we present Universal Checkpointing (UCP), a novel system enabling flexible and efficient generative AI training with reconfigurable parallelism on large-scale AMD GPU clusters. UCP addresses storage and memory performance challenges through careful hardware and software architecture considerations. Through an under-the-hood analysis of the PyTorch GPU-storage data path, UCP achieves high checkpoint I/O performance between AMD GPU clusters and high-performance remote storage systems. Through our optimizations, UCP enables reconfiguration across a broad set of popular parallelism strategies and across GenAI models of varying sizes and types, with minimal reconfiguration cost, enhancing flexibility and resilience. These findings apply broadly across the AI data pipeline, for instance reducing cold-start overhead during inference or when loading checkpoints into downstream post-training tasks such as Supervised Fine-Tuning or Reinforcement Learning.
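As one example of the GPU-storage data-path concerns mentioned above, the following sketch stages GPU state into pinned host buffers and writes the checkpoint on a background thread so training can continue while the I/O completes. The file name, buffer handling, and threading scheme are assumptions for illustration, not UCP's implementation; on AMD GPUs the torch.cuda APIs map to ROCm/HIP.

    # Illustrative sketch of overlapping checkpoint I/O with training.
    import threading
    import torch

    def snapshot_to_pinned(state: dict) -> dict:
        # Pinned host memory lets the device-to-host copies run
        # asynchronously on the current stream.
        use_gpu = torch.cuda.is_available()
        staged = {}
        for name, t in state.items():
            buf = torch.empty(t.shape, dtype=t.dtype, device="cpu",
                              pin_memory=use_gpu)
            buf.copy_(t, non_blocking=True)
            staged[name] = buf
        if use_gpu:
            torch.cuda.synchronize()  # ensure copies landed before writing
        return staged

    def async_save(state: dict, path: str) -> threading.Thread:
        # Write the staged snapshot on a background thread so the training
        # loop can continue immediately after the device-to-host copy.
        staged = snapshot_to_pinned(state)
        writer = threading.Thread(target=torch.save, args=(staged, path))
        writer.start()
        return writer

    # Usage: kick off the write, keep training, join before the next save.
    model_state = {"w": torch.randn(1024, 1024)}
    t = async_save(model_state, "ckpt_step_100.pt")  # hypothetical path
    # ... training continues here ...
    t.join()

The same staging-and-overlap idea applies in reverse when loading checkpoints for inference or post-training, which is where the cold-start savings mentioned above come from.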

This is a joint effort between UIUC and AMD.

Learning Objectives