Discussion and Analysis of the MLPerf Storage Benchmark Suite and AI Storage Workloads

Storage for AI is rapidly changing: checkpointing becomes more important as clusters scale to more accelerators; managing large KV caches for LLM queries shifts inference bottlenecks to storage; accessing relevant data via VectorDB similarity searches drives small IOs on nearly every query; and future applications may require wildly different storage architectures.
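
To make the KV-cache point concrete, here is a minimal sizing sketch (the model shape, 8K-token context, and FP16 cache entries are hypothetical illustration values, not figures from the talk):

    def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
        # Two tensors (K and V) are cached per layer for every token in the context.
        return 2 * layers * kv_heads * head_dim * bytes_per_value

    # Hypothetical dense model: 80 layers, 8 KV heads, head_dim 128, FP16 entries.
    per_token = kv_cache_bytes_per_token(80, 8, 128)   # 320 KiB per token
    per_request = per_token * 8192                     # an 8K-token context
    print(f"{per_request / 2**30:.1f} GiB of KV cache for one request")

At that scale, holding caches for many concurrent or paused sessions quickly exceeds GPU memory, which is what pushes KV-cache management toward the storage tier.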

Near Data Compute for AI

With the growth of AI, compute near data is becoming increasingly critical. Compute now encompasses in-memory processing, near-memory processing, DPUs, and computational storage. Work is underway in SNIA and OCP to define architectures, frameworks, and APIs to facilitate Near Data Compute (NDC). This presentation will walk through the developments in both organizations to help you understand how standardization is making NDC a reality for your system.

Towards Building Flexible, Efficient, and Resilient Training with Adaptive Checkpointing on AMD GPU Platforms

Generative AI training is rapidly scaling in model size, data volume, and sequence length, with larger models requiring multiple instances. Distributed and parallel training strategies partition the training state across GPUs to support large-scale model training. As models and datasets grow, scaling the infrastructure becomes a critical challenge; as AI infrastructure scales, the Mean Time Between Failures (MTBF) decreases, leading to more frequent job failures. Efficient recovery from failures, typically by resuming training from a recent checkpoint, is therefore crucial.
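
As a rough illustration of how MTBF and checkpoint cost interact, here is a minimal sketch using the Young/Daly first-order approximation for the optimal checkpoint interval (the checkpoint time and MTBF below are hypothetical, not measurements from the talk):

    import math

    def young_daly_interval(checkpoint_seconds, mtbf_seconds):
        # Young/Daly approximation: checkpoint roughly every sqrt(2 * C * MTBF) seconds.
        return math.sqrt(2 * checkpoint_seconds * mtbf_seconds)

    # Hypothetical: a 2-minute checkpoint write and a cluster-wide MTBF of 8 hours.
    interval = young_daly_interval(120, 8 * 3600)
    print(f"checkpoint about every {interval / 60:.0f} minutes")   # ~44 minutes

As the cluster grows and MTBF shrinks, the optimal interval shrinks with it, so checkpoint write bandwidth becomes a recurring cost rather than an occasional one.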

Storage for AI in Public Clouds: Case Study of Vela in IBM Cloud

Over the last few years, IBM has built several AI-HPC clusters to experiment with and train its Large Language Models (LLMs). One of these clusters, Cloud Vela, is especially notable because it explored less conventional approaches to building HPC clusters: Vela relies on public cloud infrastructure, runs in VMs, uses Ethernet, and relies on a container orchestrator (Kubernetes) to manage workloads and resources. Training jobs submitted by data scientists produce two types of I/O traffic: 1) reading training data and 2) writing large periodic checkpoints.
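
For a feel of the checkpoint traffic, here is a minimal sizing sketch (the 20B-parameter model and the per-parameter byte counts assume a generic mixed-precision Adam setup, not Vela’s actual configuration):

    def checkpoint_size_gib(params, bytes_per_param=14):
        # fp16 weights (2 B) + fp32 master weights (4 B) + Adam moments (4 B + 4 B) ~= 14 B/param.
        return params * bytes_per_param / 2**30

    # Hypothetical 20B-parameter model: roughly a 260 GiB checkpoint per save.
    print(f"{checkpoint_size_gib(20e9):.0f} GiB")

Writing a file of that size on a periodic schedule is a very different I/O pattern from the mostly-read traffic of training data ingestion.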

Pushdown Analytics on pNFS: Enabling Efficient Scientific Insight with Open Tools

Large-scale simulations at Los Alamos can produce petabytes of data per timestep, yet the scientific focus often lies in narrow regions of interest—like a wildfire’s leading edge. Traditional HPC tools read entire datasets to extract these key features, resulting in significant inefficiencies in time, energy, and resource usage. To address this, Los Alamos—in collaboration with Hammerspace and SK hynix—is leveraging computational storage to process data closer to its source, enabling selective access to high-value information.
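
As an illustration of the pushdown idea (not the actual LANL, Hammerspace, and SK hynix pNFS stack), here is a sketch using Parquet predicate pushdown via PyArrow; the dataset path, column names, and threshold are hypothetical:

    import pyarrow.dataset as ds

    # Hypothetical layout: one Parquet dataset per simulation timestep.
    timestep = ds.dataset("/sim/timestep_0420", format="parquet")

    # The filter is pushed down into the scan: row groups whose column statistics
    # cannot satisfy the predicate are skipped instead of being read and discarded.
    hot_cells = timestep.to_table(
        columns=["x", "y", "temperature"],
        filter=ds.field("temperature") > 600.0,
    )

Pushing the predicate toward the data, whether into a file format’s scanner or all the way into computational storage, means only the high-value region of interest crosses the network.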

CDMI 3.0: Standardized Management of any URI-accessible Resource

"CDMI 3.0 is the third major revision of the Cloud Data Management Interface, which provides a standard for discovery and declarative data management of any URI-accessible data resource, such as LUNs, files, objects, tables, streams, and graphs. Version 3 of the standard reorganizes the specification around data resource protocol ""exports"", data resource declarative ""metadata"", and adds new support for ""rels"", which describe graph relationships between data resources.

Chiplets, UCIe, Persistent Memory, and Heterogeneous Integration: The Processor Chip of the Future!

Chiplets have become a near-overnight success with today’s rapid-fire data center conversion to AI. But today’s integration of HBM DRAM with multiple SoC chiplets is only the beginning of a larger trend in which otherwise incompatible technologies will adopt heterogeneous integration, connecting new memory technologies with advanced logic chips to provide both significant energy savings and vastly improved performance at a reduced price point.

Storage Devices for the AI Data Center

The transformational launch of GPT-4 has accelerated the race to build AI data centers for large-scale training and inference. While GPUs and high-bandwidth memory are well-known critical components, the essential role of storage devices in AI infrastructure is often overlooked. This presentation will explore the AI processing pipeline within data centers, emphasizing the crucial role of storage devices such as SSDs in compute and storage nodes. We will examine the characteristics of AI workloads to derive specific requirements for flash storage devices and controllers.
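
As one example of deriving a requirement from workload characteristics, here is a minimal sketch of the sustained read bandwidth needed to keep training accelerators fed (the GPU count, ingest rate, and sample size are hypothetical illustration values):

    def required_read_gbps(samples_per_sec, avg_sample_mib):
        # Sustained read bandwidth the data pipeline must deliver to avoid starving the GPUs.
        return samples_per_sec * avg_sample_mib * 2**20 / 1e9

    # Hypothetical image-training node: 8 GPUs x 2,500 images/s each, ~0.15 MiB per sample.
    print(f"{required_read_gbps(8 * 2500, 0.15):.1f} GB/s of sustained reads")

Similar back-of-the-envelope estimates for checkpoint writes and inference lookups are one way to turn workload characteristics into concrete device and controller requirements.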

CXL Memory in Windows

In this presentation, we will describe the architecture of CXL memory in Windows and the support that will be available, including the possible usages of CXL memory, the RAS workflows, and the developer interfaces for using CXL memory.
