SNIA Developer Conference September 15-17, 2025 | Santa Clara, CA

Beyond Throughput: Benchmarking Storage for the Complex I/O Patterns of AI with MLPerf Storage and DLIO

Abstract

Training state-of-the-art AI models, including LLMs, creates unprecedented demands on storage systems that go far beyond simple throughput. The I/O patterns in these workloads—characterized by heavy metadata operations, multi-threaded asynchronous I/O, random access, and complex data formats—present a significant bottleneck that traditional benchmarks fail to capture. This disconnect leads to inefficient storage design and procurement for critical AI infrastructure. To address this challenge, the MLPerf Storage Working Group has been developing a comprehensive benchmark suite to realistically model these complex I/O behaviors. In this session, we present a deep dive into this effort, focusing on the DLIO benchmark. We will detail the technical lessons learned from our benchmark development and previous submission cycles, including the critical I/O access patterns we identified in training pipeline stages such as data loading and model checkpointing. Attendees will leave with actionable insights on how to better design and configure storage hardware and software stacks to support AI workloads. We will share our analysis of I/O behavior that directly informs system architecture and demonstrate how to leverage our open-source tools to identify and resolve storage bottlenecks in your own AI environments.
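To make the access pattern described above concrete, the following is a minimal, illustrative Python sketch (not part of the MLPerf Storage or DLIO code; the directory names, file counts, sizes, and thread count are arbitrary assumptions) of the kind of I/O a training pipeline generates: a metadata-heavy dataset of many small files, shuffled multi-threaded reads that mimic a data loader, and a periodic large sequential checkpoint write.

```python
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

DATA_DIR = "dataset"          # hypothetical dataset directory
CKPT_DIR = "checkpoints"      # hypothetical checkpoint directory
NUM_FILES = 1000              # arbitrary: many small sample files
SAMPLE_SIZE = 256 * 1024      # arbitrary: 256 KiB per sample
CKPT_SIZE = 64 * 1024 * 1024  # arbitrary: 64 MiB checkpoint
NUM_WORKERS = 8               # concurrent reader threads, like a data loader

def generate_dataset():
    """Create many small files -- the metadata-heavy part of the workload."""
    os.makedirs(DATA_DIR, exist_ok=True)
    payload = os.urandom(SAMPLE_SIZE)
    for i in range(NUM_FILES):
        with open(os.path.join(DATA_DIR, f"sample_{i:06d}.bin"), "wb") as f:
            f.write(payload)

def read_sample(path):
    """One shuffled-order sample read: stat, open, read, close."""
    size = os.stat(path).st_size          # metadata operation
    with open(path, "rb") as f:           # another metadata operation
        return len(f.read(size))          # data transfer

def train_epoch(epoch):
    """Random-order, multi-threaded reads followed by a checkpoint write."""
    paths = [os.path.join(DATA_DIR, p) for p in os.listdir(DATA_DIR)]
    random.shuffle(paths)                 # shuffled access defeats readahead
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        total = sum(pool.map(read_sample, paths))
    read_s = time.time() - t0

    os.makedirs(CKPT_DIR, exist_ok=True)
    t0 = time.time()
    with open(os.path.join(CKPT_DIR, f"ckpt_{epoch}.bin"), "wb") as f:
        f.write(os.urandom(CKPT_SIZE))    # large sequential write burst
        f.flush()
        os.fsync(f.fileno())              # checkpoints are typically durable
    ckpt_s = time.time() - t0
    print(f"epoch {epoch}: read {total} B in {read_s:.2f}s, "
          f"checkpoint in {ckpt_s:.2f}s")

if __name__ == "__main__":
    generate_dataset()
    for epoch in range(3):
        train_epoch(epoch)
```

Even this toy version shows why throughput alone is a poor proxy: the read phase is dominated by per-file metadata operations and small random reads, while the checkpoint phase is a bursty sequential write, and a storage system can do well on one while failing the other.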

Learning Objectives

Identify I/O Bottlenecks: Attendees will be able to identify the specific I/O bottlenecks in AI training workloads that traditional storage benchmarks overlook, such as metadata contention and asynchronous I/O patterns (a minimal profiling sketch follows this list).

Evaluate Storage for AI: Attendees will learn how to use the DLIO benchmark to realistically evaluate and compare storage system performance for AI and LLM workloads.

Optimize System Design: Attendees will understand how to apply insights from MLPerf Storage results to better design and configure storage hardware and software stacks for AI training clusters.
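As a rough illustration of the first objective, one simple way to surface metadata contention, independent of DLIO itself, is to time metadata operations separately from data reads. This is a simplified sketch, not the working group's methodology, and it assumes a directory of sample files such as the one generated above.

```python
import os
import time

def profile_io(data_dir):
    """Split wall time between metadata operations (stat/open) and data
    reads so a metadata-bound storage system stands out clearly."""
    meta_s = 0.0
    read_s = 0.0
    total_bytes = 0
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)

        t0 = time.perf_counter()
        size = os.stat(path).st_size       # metadata: stat
        f = open(path, "rb")               # metadata: open
        meta_s += time.perf_counter() - t0

        t0 = time.perf_counter()
        total_bytes += len(f.read(size))   # data transfer
        read_s += time.perf_counter() - t0
        f.close()

    mib = total_bytes / (1024 * 1024)
    print(f"metadata time: {meta_s:.3f}s  read time: {read_s:.3f}s  "
          f"data throughput: {mib / read_s:.1f} MiB/s over {mib:.1f} MiB")

if __name__ == "__main__":
    profile_io("dataset")  # hypothetical directory of small sample files
```

If metadata time rivals or exceeds read time, raw bandwidth numbers from a traditional benchmark will not predict training performance on that system; this is the class of behavior the DLIO benchmark is designed to model at realistic scale.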