Beyond Throughput: Benchmarking Storage for the Complex I/O Patterns of AI with MLPerf Storage and DLIO
Training state-of-the-art AI models, including LLMs, creates unprecedented demands on storage systems that go far beyond simple throughput. The I/O patterns in these workloads, characterized by heavy metadata operations, multi-threaded asynchronous I/O, random access, and complex data formats, create bottlenecks that traditional benchmarks fail to capture. This disconnect leads to inefficient storage design and procurement decisions for critical AI infrastructure.
To address this challenge, the MLPerf Storage Working Group has been developing a comprehensive benchmark suite that realistically models these complex I/O behaviors. In this session, we present a deep dive into this effort, focusing on the DLIO benchmark. We will detail the technical lessons learned from our benchmark development and previous submission cycles, including the critical I/O access patterns we identified in training-pipeline stages such as data loading and model checkpointing.
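To make those two access patterns concrete, the sketch below is a minimal, hypothetical PyTorch training loop (it is not DLIO code) showing the I/O behavior DLIO is designed to emulate: many concurrent, randomized per-sample reads during data loading, and large bursty sequential writes at checkpoint time. All class names, paths, and sizes are illustrative placeholders.

```python
# Illustrative sketch only: the kind of data-loading and checkpointing I/O
# that a real training job issues and that an I/O benchmark must reproduce.
import os
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

class FilePerSampleDataset(Dataset):
    """Assumes each sample lives in its own .pt file containing {'x', 'y'}.
    Every batch therefore issues many small, effectively random reads plus
    the metadata operations (open/stat) that accompany them."""
    def __init__(self, root):
        self.files = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample = torch.load(self.files[idx])  # one file open + read per sample
        return sample["x"], sample["y"]

def train(data_dir, ckpt_dir, epochs=3):
    loader = DataLoader(
        FilePerSampleDataset(data_dir),
        batch_size=32,
        shuffle=True,       # randomizes which files are touched each epoch
        num_workers=8,      # several processes read from storage concurrently
        prefetch_factor=2,  # asynchronous read-ahead overlaps I/O with compute
    )
    model = nn.Linear(1024, 10)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = nn.functional.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
        # Checkpointing is the opposite pattern: a large, bursty sequential
        # write that periodically stalls training until it completes.
        torch.save(
            {"model": model.state_dict(), "optimizer": opt.state_dict()},
            os.path.join(ckpt_dir, f"ckpt_epoch{epoch}.pt"),
        )
```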
Attendees will leave with actionable insights on how to better design and configure storage hardware and software stacks to support AI workloads. We will share our analysis of I/O behavior that directly informs system architecture and demonstrate how to leverage our open-source tools to identify and resolve storage bottlenecks in your own AI environments.
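As a simple, tool-agnostic starting point (this is not one of the MLPerf Storage or DLIO tools), the sketch below shows one way to check whether the input pipeline, and often the storage behind it, is the limiting factor in your own training loop: time how long each step waits on the data loader versus how long it computes. The `loader` and `train_step` arguments are placeholders for your own objects.

```python
# Hypothetical helper: if data-wait time dominates compute time, the input
# pipeline (and frequently the storage serving it) is the bottleneck.
import time

def profile_input_stalls(loader, train_step, max_steps=100):
    wait_total, compute_total = 0.0, 0.0
    it = iter(loader)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)       # blocks while workers fetch data from storage
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)          # forward/backward/optimizer on this batch
        t2 = time.perf_counter()
        wait_total += t1 - t0
        compute_total += t2 - t1
    frac = wait_total / max(wait_total + compute_total, 1e-9)
    print(f"data wait: {wait_total:.2f}s  compute: {compute_total:.2f}s "
          f"({frac:.0%} of step time spent waiting on input)")
```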