Resiliency for AI workloads | SNIA

Abstract

When your AI jobs fail at scale, where does the blame really lie—the model, the hardware, or the layers underneath that move and manage data? This session dives into how modern storage and execution infrastructure can turn fragile AI pipelines into resilient, self-healing systems.

AI Checkpointing and Storage Hierarchy

AI training state spans terabytes across thousands of devices, far beyond traditional workloads. Effective checkpointing should exploit the storage hierarchy (DRAM, NVMe, parallel file systems, object storage), with an intelligent layer that automates tiering and asynchronous flushing and ensures consistency across application ranks.
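
As a rough illustration of tiering with asynchronous flushing, the sketch below writes each rank's checkpoint to a fast tier first and copies it to a slower tier in a background thread. It is a minimal sketch, not the session's design: the class name TieredCheckpointer, the file naming scheme, and the use of two local directories to stand in for NVMe and parallel or object storage are all assumptions made for illustration.

```python
import os
import shutil
import threading


class TieredCheckpointer:
    """Hypothetical helper: save to a fast tier, flush to a slow tier in the background."""

    def __init__(self, fast_dir, slow_dir):
        # Two directories stand in for the fast and slow storage tiers (assumption).
        self.fast_dir = fast_dir
        self.slow_dir = slow_dir
        os.makedirs(fast_dir, exist_ok=True)
        os.makedirs(slow_dir, exist_ok=True)
        self._pending = []

    def _atomic_write(self, path, data):
        # Write under a temporary name, then rename, so a reader never sees a partial file.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)

    def _flush(self, name):
        # Copy from the fast tier to the slow tier, again publishing via rename.
        src = os.path.join(self.fast_dir, name)
        dst = os.path.join(self.slow_dir, name)
        tmp = dst + ".tmp"
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)

    def save(self, step, rank, data):
        # The caller blocks only for the fast-tier write; the flush runs asynchronously.
        name = f"step{step:08d}_rank{rank:05d}.ckpt"
        self._atomic_write(os.path.join(self.fast_dir, name), data)
        thread = threading.Thread(target=self._flush, args=(name,))
        thread.start()
        self._pending.append(thread)
        return name

    def wait(self):
        # Block until every outstanding flush has reached the slow tier.
        for thread in self._pending:
            thread.join()
        self._pending.clear()
```

Training would call save() at each checkpoint step and wait() before treating a step as durable on the slow tier. Cross-rank consistency (agreeing that every rank has finished a given step) is not handled here and would need a separate commit marker or coordination step.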

Checkpointing Alone Isn't Enough

At scale, coarse checkpoint intervals risk losing hours of compute; fine intervals create continuous I/O pressure. Fine-grained retry isolates failures to a shard or pipeline stage, avoiding full restarts—but requires the execution framework and storage system to jointly maintain recovery state.
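
The idea of fine-grained retry can be sketched as follows: work is split into shards, a failed shard is retried on its own, and a record of completed shards lets a restart skip work that already succeeded. This is only an illustrative sketch; the function name run_with_shard_retry, the in-memory completed dictionary used as recovery state, and the max_attempts parameter are assumptions, and a real system would persist that state jointly in the execution framework and the storage layer.

```python
def run_with_shard_retry(shards, work, completed=None, max_attempts=3):
    """Run work(shard) for each shard, retrying only the shards that fail."""
    # `completed` is the recovery state: shard -> result of a successful run.
    completed = {} if completed is None else completed
    for shard in shards:
        if shard in completed:
            continue  # finished in an earlier run; no need to redo it
        last_error = None
        for _attempt in range(max_attempts):
            try:
                completed[shard] = work(shard)
                break  # this shard succeeded; move on to the next one
            except Exception as exc:
                last_error = exc  # retry this shard only, not the whole job
        else:
            raise RuntimeError(
                f"shard {shard!r} failed after {max_attempts} attempts"
            ) from last_error
    return completed
```

Passing the same completed dictionary into a second call resumes from where the first call stopped, which is the property that avoids a full restart.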

Shared Responsibility

Resilience spans the entire AI pipeline—from data ingestion to model serving. Storage, execution frameworks, communication libraries, and applications must each provide well-defined guarantees. We'll outline what an ideal resilient AI storage and execution stack should look like.

Clarete Crasta

Principal Engineer

HPE