The growing complexity and extended context lengths of AI inference workloads have made these initiatives costly to implement, increasing the I/O required to push data back and forth across the network. As a result, organizations need higher-performing storage and faster networks to feed their compute clusters and to get better utilization of their infrastructure.
Storage for AI is changing rapidly: checkpointing becomes more important as clusters scale to more accelerators; managing large KV caches from LLM queries shifts inference bottlenecks to storage; accessing relevant data via vector-database similarity searches drives small I/Os on nearly every query; and future applications may require wildly different storage architectures. The MLPerf Storage v2.0 benchmark results were just released, and the v2.5 suite is under active development.
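To make the small-I/O pattern concrete, here is a minimal sketch of a retrieval-style lookup: a brute-force similarity search over an in-memory embedding index, followed by one small random read per hit to fetch the matching chunk from a hypothetical on-disk store. The paths, sizes, and on-disk layout are assumptions for illustration and are unrelated to the MLPerf Storage suite.

```python
# Illustrative sketch: a retrieval-style lookup where each similarity
# search fans out into many small reads. Paths, shapes, and the on-disk
# layout below are assumptions, not part of any benchmark.
import numpy as np

EMBED_DIM = 768      # assumed embedding width
CHUNK_BYTES = 4096   # assumed size of one stored text chunk

def top_k_neighbors(query: np.ndarray, index: np.ndarray, k: int = 8) -> np.ndarray:
    """Brute-force cosine similarity over an in-memory embedding index."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(scores)[::-1][:k]

def fetch_chunks(ids: np.ndarray, store_path: str = "chunks.bin") -> list[bytes]:
    """Each hit becomes one small, scattered read -- the I/O pattern that
    pushes inference bottlenecks toward the storage layer."""
    out = []
    with open(store_path, "rb") as f:
        for i in ids:
            f.seek(int(i) * CHUNK_BYTES)   # small random 4 KiB reads
            out.append(f.read(CHUNK_BYTES))
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    index = rng.standard_normal((100_000, EMBED_DIM)).astype(np.float32)
    query = rng.standard_normal(EMBED_DIM).astype(np.float32)
    hits = top_k_neighbors(query, index)
    # fetch_chunks(hits) would then issue one ~4 KiB read per hit, per query.
    print("chunk ids to read:", hits)
```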
Cloud object storage systems were built to satisfy simple storage workloads, sacrificing traditional POSIX semantics for simplicity and scalability. As AI and analytics workloads migrate toward hyperscale cloud computing, object storage users are increasingly requesting file-oriented access to their data.
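As a rough illustration of what file-oriented access over an object store can look like, the sketch below assumes the fsspec/s3fs Python packages and a hypothetical bucket and key; the adapter presents a POSIX-like open/seek/read interface and serves partial reads with ranged requests instead of whole-object GETs.

```python
# Minimal sketch of file-oriented access layered over object storage,
# assuming the fsspec/s3fs packages and a hypothetical bucket and key.
import fsspec

BUCKET = "example-training-data"        # hypothetical bucket
KEY = "shards/shard-00001.tfrecord"     # hypothetical object key

# Open the object as if it were a local file; the object store itself
# only understands PUT/GET on whole keys, so the adapter maps partial
# reads onto ranged requests.
with fsspec.open(f"s3://{BUCKET}/{KEY}", mode="rb") as f:
    header = f.read(16)       # small ranged read
    f.seek(1_048_576)         # "seek" has no native object-store analogue
    block = f.read(65_536)    # another ranged read under the hood

print(len(header), len(block))
```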
Generative AI training is scaling rapidly in model size, data volume, and sequence length, with larger models requiring multiple instances. Distributed and parallel training strategies partition the training state across GPUs to support large-scale model training. As models and datasets grow, scaling the infrastructure becomes a critical challenge; and as the infrastructure scales out, the Mean Time Between Failures (MTBF) decreases, leading to more frequent job failures. Efficient recovery from failures, particularly the ability to resume AI training, is therefore crucial.
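A minimal sketch of the checkpoint-and-resume pattern that such recovery typically relies on is shown below with PyTorch; the file name, checkpoint interval, and toy model are assumptions, and a real distributed job would write sharded state per rank to shared storage rather than a single local file.

```python
# Minimal checkpoint/resume sketch: save the training state periodically,
# and after a failure reload the latest checkpoint instead of restarting
# from scratch. File name, interval, and model are assumptions.
import os
import torch
import torch.nn as nn

CKPT_PATH = "ckpt.pt"  # in practice, a shard per rank on shared storage

model = nn.Linear(512, 512)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step: int) -> None:
    # The full training state must land on durable storage; at scale this
    # write burst is what makes checkpoint bandwidth a sizing concern.
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optim": optim.state_dict()},
        CKPT_PATH,
    )

def resume() -> int:
    # After a failure, workers reload the latest checkpoint and continue
    # from the saved step.
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optim.load_state_dict(state["optim"])
    return state["step"] + 1

start = resume()
for step in range(start, start + 100):
    loss = model(torch.randn(8, 512)).pow(2).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    if step % 50 == 0:
        save_checkpoint(step)
```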
CDMI 3.0 is the third major revision of the Cloud Data Management Interface, which provides a standard for discovery and declarative data management of any URI-accessible data resource, such as LUNs, files, objects, tables, streams, and graphs. Version 3 reorganizes the specification around data resource protocol "exports" and data resource declarative "metadata", and adds new support for "rels", which describe graph relationships between data resources.
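Purely as a hypothetical illustration of those three concepts, the sketch below issues an HTTP GET against an imaginary CDMI endpoint and reads back "exports", "metadata", and "rels"; the URL, headers, and JSON handling are assumptions rather than the normative CDMI 3.0 request/response format, for which the specification itself is the reference.

```python
# Hypothetical sketch of querying a CDMI-managed data resource over HTTP.
# The endpoint URL, headers, and response handling are assumptions for
# illustration; only the "exports", "metadata", and "rels" concepts come
# from the description above. Consult the CDMI 3.0 specification for the
# normative formats.
import json
import urllib.request

CDMI_ENDPOINT = "https://cdmi.example.com/resource/dataset42"  # hypothetical

req = urllib.request.Request(
    CDMI_ENDPOINT,
    headers={"Accept": "application/json"},  # assumed content negotiation
)
with urllib.request.urlopen(req) as resp:
    resource = json.load(resp)

# A CDMI 3.0 description of a data resource is organized around:
print(resource.get("exports"))   # protocol exports: how the data can be accessed
print(resource.get("metadata"))  # declarative metadata: what should hold for it
print(resource.get("rels"))      # graph relationships to other data resources
```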