When your AI jobs fail at scale, where does the blame really lie—the model, the hardware, or the layers underneath that move and manage data? This session dives into how modern storage and execution infrastructure can turn fragile AI pipelines into resilient, self-healing systems.
AI Checkpointing and Storage Hierarchy
AI training state spans terabytes across thousands of devices—far beyond traditional workloads. Effective checkpointing should ideally exploit the storage hierarchy (DRAM, NVMe, parallel, object storage), with an intelligent layer automating tiering, async flushing, and ensure consistency across application ranks.
Checkpointing Alone Isn't Enough
At scale, coarse checkpoint intervals risk losing hours of compute; fine intervals create continuous I/O pressure. Fine-grained retry isolates failures to a shard or pipeline stage, avoiding full restarts—but requires the execution framework and storage system to jointly maintain recovery state.
Shared Responsibility
Resilience spans the entire AI pipeline—from data ingestion to model serving. Storage, execution frameworks, communication libraries, and applications must each provide well-defined guarantees. We'll outline what an ideal resilient AI storage and execution stack should look like.