Activating Untapped Tier 0 Storage Within Your GPU- and CPU-based Compute Clusters

The growing complexity and extended context lengths of inference workloads have made AI projects costlier to implement, increasing the I/O required to push data back and forth across the network. As a result, organizations need higher-performing storage and faster networks to feed their compute clusters and achieve better utilization of their infrastructure.

Samba 2025: Enterprise-Ready, Cloud-Optimized

Samba is evolving to meet the demands of modern enterprise IT. The latest advancements bring critical SMB3 capabilities that boost scalability, reliability, and cloud readiness. With features like SMB over QUIC, Transparent Failover, and SMB3 Directory Leases now arriving, Samba is positioning itself as a robust solution for secure, high-performance file services across data centers and hybrid cloud environments. Learn how these enhancements can future-proof your infrastructure - without vendor lock-in.

Maximizing the Benefits of QLC Flash with Pure's Hyperscale Solution for Applications Ranging from AI/ML to HDD Displacement

As data growth accelerates in the age of AI, hyperscalers demand higher-capacity storage solutions that deliver a balanced combination of performance, power, and cost-effectiveness. In this session, we will present how QLC NAND-based Direct Flash Modules (DFMs), in conjunction with Pure’s advanced software architecture, allow hyperscalers to take advantage of high-density, higher-performance, and reliable SSD storage compared to HDDs.

Discussion and Analysis of the MLPerf Storage Benchmark Suite and AI Storage Workloads

Storage for AI is rapidly changing: checkpointing becomes more important as clusters scale to more accelerators; managing large KV-Caches from LLM queries shifts inference bottlenecks to storage; accessing relevant data via VectorDB similarity searches drives small IOs for nearly every query; and future applications may require wildly different storage architectures. The MLPerf Storage v2.0 Benchmark Results were just released, and the v2.5 suite is under active development.
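
As a rough illustration of the small-IO pattern mentioned above (this is not part of the MLPerf Storage suite itself), the following Python sketch runs a brute-force similarity search over a memory-mapped vector index; the file name, index shape, and candidate count are hypothetical choices for demonstration.

```python
# Illustrative sketch only: brute-force similarity search over a
# memory-mapped vector index. On a cold cache, every query touches many
# small, scattered regions of the index file, which is why retrieval-side
# inference tends to generate small random reads against storage.
import numpy as np

DIM, N_VECTORS = 128, 100_000                    # hypothetical index shape
rng = np.random.default_rng(0)

# Build a small on-disk index so the sketch runs end to end; in practice
# this file would be the vector database's persisted index.
index = np.memmap("toy_index.f32", dtype=np.float32, mode="w+",
                  shape=(N_VECTORS, DIM))
index[:] = rng.standard_normal((N_VECTORS, DIM), dtype=np.float32)
index.flush()

def top_k(query, k=10, candidates=4096):
    """Score a sampled candidate subset. Each row fetched from the
    memory-mapped file is a small (DIM * 4 bytes) read, so nearly every
    query turns into many small storage reads when the cache is cold."""
    rows = rng.choice(N_VECTORS, candidates, replace=False)
    scores = index[rows] @ query                 # small random reads + dot products
    return rows[np.argsort(scores)[-k:][::-1]]

print(top_k(rng.standard_normal(DIM, dtype=np.float32)))
```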

Blending Objects and Files in Google Cloud Storage

Cloud object storage systems have been built to satisfy simple storage workloads where traditional POSIX semantics are sacrificed for simplicity and scalability. With AI and analytics workloads migrating towards hyperscale cloud computing, object storage users are increasingly requesting file-oriented access to their data.
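
As a sketch of the two access paths involved, the same bytes can be read through the object API or, when the bucket is exposed through a FUSE-style file mount such as Cloud Storage FUSE, through ordinary POSIX calls. The bucket name, object name, and mount point below are hypothetical examples.

```python
# Sketch: two ways to read the same data held in Google Cloud Storage.
# Names below (bucket, object, mount point) are hypothetical examples.

# 1) Object access via the Cloud Storage client library.
from google.cloud import storage

client = storage.Client()
blob = client.bucket("example-training-data").blob("corpus/part-0000.parquet")
object_bytes = blob.download_as_bytes()          # whole-object GET

# 2) File-oriented access via a FUSE-style mount of the same bucket
#    (e.g. Cloud Storage FUSE mounted at /mnt/gcs), using plain POSIX I/O.
with open("/mnt/gcs/corpus/part-0000.parquet", "rb") as f:
    f.seek(4096)                                 # POSIX semantics: seek and partial read
    file_bytes = f.read(65536)
```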

Towards Building Flexible, Efficient and Resilient Training with Adaptive Checkpointing on AMD GPU Platforms

Generative AI training is rapidly scaling in model size, data volume, and sequence length, requiring larger models to be trained across multiple instances. Distributed and parallel training strategies partition the training state across GPUs to support large-scale model training. As models and datasets grow, scaling infrastructure becomes a critical challenge, and as AI infrastructure scales, the Mean Time Between Failures (MTBF) decreases, leading to more frequent job failures. Efficient recovery from such failures is crucial for resuming AI training.
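
For context, a minimal sketch of the periodic checkpoint-and-resume pattern this kind of work builds on is shown below (PyTorch-style, single process). The interval, paths, and tiny model are placeholders, and it deliberately omits the distributed sharding and adaptivity the talk addresses.

```python
# Minimal periodic-checkpoint sketch (single process). Real distributed
# training would shard this state across GPUs and coordinate the writes;
# the interval, path, and tiny model here are placeholders.
import os, glob, torch

model = torch.nn.Linear(1024, 1024)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
CKPT_DIR, CKPT_EVERY = "ckpts", 100
os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(step):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optim.state_dict()},
               os.path.join(CKPT_DIR, f"step_{step:08d}.pt"))

def resume():
    """Load the newest checkpoint, if any, so a failed job can restart
    from its last saved step instead of from scratch."""
    ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")))
    if not ckpts:
        return 0
    state = torch.load(ckpts[-1])
    model.load_state_dict(state["model"])
    optim.load_state_dict(state["optim"])
    return state["step"] + 1

start = resume()
for step in range(start, start + 1000):
    # ... forward/backward/optimizer step would go here ...
    if step % CKPT_EVERY == 0:
        save_checkpoint(step)
```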

Storage for AI in Public Clouds: Case Study of Vela in IBM Cloud

Over the last few years, IBM has built several AI-HPC clusters to experiment with and train its Large Language Models (LLMs). One of these clusters, Cloud Vela, is especially notable because it explored less conventional approaches to building HPC clusters. Vela relies on public cloud infrastructure, runs in VMs, uses Ethernet, and relies on a container orchestrator (Kubernetes) to manage workloads and resources. Training jobs submitted by data scientists produce two types of I/O traffic: 1) reading training data and 2) writing large periodic checkpoints.

CDMI 3.0: Standardized Management of any URI-accessible Resource

CDMI 3.0 is the third major revision of the Cloud Data Management Interface, which provides a standard for discovery and declarative data management of any URI-accessible data resource, such as LUNs, files, objects, tables, streams, and graphs. Version 3 of the standard reorganizes the specification around data resource protocol "exports" and declarative data resource "metadata", and adds new support for "rels", which describe graph relationships between data resources.
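
As a rough sketch of what reading a CDMI resource's management view over HTTP could look like: the endpoint URL below is hypothetical, the "metadata" and "exports" fields follow earlier CDMI revisions, and the "rels" field name simply mirrors the abstract; exact 3.0 wire details should be checked against the published specification.

```python
# Rough, hypothetical sketch of fetching a CDMI container's management view.
# Endpoint is invented; field layouts (especially the new "rels"
# relationships) should be verified against the CDMI 3.0 specification.
import requests

resp = requests.get(
    "https://cdmi.example.com/cdmi/datasets/training_corpus/",
    headers={
        "Accept": "application/cdmi-container",
        "X-CDMI-Specification-Version": "3.0",
    },
)
resp.raise_for_status()
container = resp.json()

print(container.get("metadata", {}))   # declarative metadata on the resource
print(container.get("exports", {}))    # protocol exports (e.g. NFS, SMB, S3)
print(container.get("rels", []))       # graph relationships to other resources
```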

Chiplets, UCIe, Persistent Memory, and Heterogeneous Integration: The Processor Chip of the Future!

Chiplets have become a near-overnight success with today’s rapid-fire data center conversion to AI. But today’s integration of HBM DRAM with multiple SoC chiplets is only the very beginning of a larger trend in which multiple incompatible technologies will adopt heterogeneous integration, connecting new memory technologies with advanced logic chips to provide both significant energy savings and vastly improved performance at a reduced price point.

Storage Devices for the AI Data Center

The transformational launch of GPT-4 has accelerated the race to build AI data centers for large-scale training and inference. While GPUs and high-bandwidth memory are well-known critical components, the essential role of storage devices in AI infrastructure is often overlooked. This presentation will explore the AI processing pipeline within data centers, emphasizing the crucial role of storage devices such as SSDs in compute and storage nodes. We will examine the characteristics of AI workloads to derive specific requirements for flash storage devices and controllers.