Activating Untapped Tier 0 Storage Within Your GPU- and CPU-based Compute Clusters

The growing complexity and extended context lengths of inference workloads have made AI projects costlier to implement, increasing the I/O required to push data back and forth across the network. As a result, organizations need higher-performing storage and faster networks to feed their compute clusters and achieve better utilization of their infrastructure.

Samba 2025: Enterprise-Ready, Cloud-Optimized

Samba is evolving to meet the demands of modern enterprise IT. The latest advancements bring critical SMB3 capabilities that boost scalability, reliability, and cloud readiness. With features like SMB over QUIC, Transparent Failover, and SMB3 Directory Leases now arriving, Samba is positioning itself as a robust solution for secure, high-performance file services across data centers and hybrid cloud environments. Learn how these enhancements can future-proof your infrastructure - without vendor lock-in.

Maximizing the Benefits of QLC Flash with Pure's Hyperscale Solution for Applications Ranging from AI/ML to HDD Displacement

As data growth accelerates in the age of AI, hyperscalers demand higher-capacity storage solutions that deliver a balanced combination of performance, power, and cost-effectiveness. In this session, we will present how QLC NAND-based Direct Flash Modules (DFMs), in conjunction with Pure’s advanced software architecture, allow hyperscalers to take advantage of high-density, higher-performance, and reliable SSD storage compared to HDDs.

Discussion and Analysis of the MLPerf Storage Benchmark Suite and AI Storage Workloads

Storage for AI is rapidly changing: checkpointing becomes more important as clusters scale to more accelerators; managing large KV-Caches from LLM queries shifts inference bottlenecks to storage; accessing relevant data via VectorDB similarity searches drives small IOs for nearly every query; and future applications may require wildly different storage architectures. The MLPerf Storage v2.0 Benchmark Results were just released, and the v2.5 suite is under active development.
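
As a rough illustration of the small-IO pattern mentioned above (this is not part of the MLPerf Storage suite itself), the following Python sketch runs a brute-force similarity search over a memory-mapped vector index; the file name, index shape, and candidate count are hypothetical choices for demonstration.

```python
# Illustrative sketch only: brute-force similarity search over a
# memory-mapped vector index. On a cold cache, every query touches many
# small, scattered regions of the index file, which is why retrieval-side
# inference tends to generate small random reads against storage.
import numpy as np

DIM, N_VECTORS = 128, 100_000                    # hypothetical index shape
rng = np.random.default_rng(0)

# Build a small on-disk index so the sketch runs end to end; in practice
# this file would be the vector database's persisted index.
index = np.memmap("toy_index.f32", dtype=np.float32, mode="w+",
                  shape=(N_VECTORS, DIM))
index[:] = rng.standard_normal((N_VECTORS, DIM), dtype=np.float32)
index.flush()

def top_k(query, k=10, candidates=4096):
    """Score a sampled candidate subset. Each row fetched from the
    memory-mapped file is a small (DIM * 4 bytes) read, so nearly every
    query turns into many small storage reads when the cache is cold."""
    rows = rng.choice(N_VECTORS, candidates, replace=False)
    scores = index[rows] @ query                 # small random reads + dot products
    return rows[np.argsort(scores)[-k:][::-1]]

print(top_k(rng.standard_normal(DIM, dtype=np.float32)))
```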

Blending Objects and Files in Google Cloud Storage

Cloud object storage systems have been built to satisfy simple storage workloads where traditional POSIX semantics are sacrificed for simplicity and scalability. With AI and analytics workloads migrating towards hyperscale cloud computing, object storage users are increasingly requesting file-oriented access to their data.
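
As a sketch of the two access paths involved, the same bytes can be read through the object API or, when the bucket is exposed through a FUSE-style file mount such as Cloud Storage FUSE, through ordinary POSIX calls. The bucket name, object name, and mount point below are hypothetical examples.

```python
# Sketch: two ways to read the same data held in Google Cloud Storage.
# Names below (bucket, object, mount point) are hypothetical examples.

# 1) Object access via the Cloud Storage client library.
from google.cloud import storage

client = storage.Client()
blob = client.bucket("example-training-data").blob("corpus/part-0000.parquet")
object_bytes = blob.download_as_bytes()          # whole-object GET

# 2) File-oriented access via a FUSE-style mount of the same bucket
#    (e.g. Cloud Storage FUSE mounted at /mnt/gcs), using plain POSIX I/O.
with open("/mnt/gcs/corpus/part-0000.parquet", "rb") as f:
    f.seek(4096)                                 # POSIX semantics: seek and partial read
    file_bytes = f.read(65536)
```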

Towards Building Flexible, Efficient and Resilient Training with Adaptive Checkpointing on AMD GPU Platforms

Generative AI training is rapidly scaling in model size, data volume, and sequence length, requiring larger models to be trained across multiple instances. Distributed and parallel training strategies partition the training state across GPUs to support large-scale model training. As models and datasets grow, scaling infrastructure becomes a critical challenge, and as AI infrastructure scales, the Mean Time Between Failures (MTBF) decreases, leading to more frequent job failures. Efficient recovery from such failures is crucial for resuming AI training.
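
For context, a minimal sketch of the periodic checkpoint-and-resume pattern this kind of work builds on is shown below (PyTorch-style, single process). The interval, paths, and tiny model are placeholders, and it deliberately omits the distributed sharding and adaptivity the talk addresses.

```python
# Minimal periodic-checkpoint sketch (single process). Real distributed
# training would shard this state across GPUs and coordinate the writes;
# the interval, path, and tiny model here are placeholders.
import os, glob, torch

model = torch.nn.Linear(1024, 1024)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
CKPT_DIR, CKPT_EVERY = "ckpts", 100
os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(step):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optim.state_dict()},
               os.path.join(CKPT_DIR, f"step_{step:08d}.pt"))

def resume():
    """Load the newest checkpoint, if any, so a failed job can restart
    from its last saved step instead of from scratch."""
    ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")))
    if not ckpts:
        return 0
    state = torch.load(ckpts[-1])
    model.load_state_dict(state["model"])
    optim.load_state_dict(state["optim"])
    return state["step"] + 1

start = resume()
for step in range(start, start + 1000):
    # ... forward/backward/optimizer step would go here ...
    if step % CKPT_EVERY == 0:
        save_checkpoint(step)
```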

Storage for AI in Public Clouds: Case Study of Vela in IBM Cloud

Over the last few years, IBM has built several AI-HPC clusters to experiment with and train its Large Language Models (LLMs). One of these clusters, Cloud Vela, is especially notable because it explored less conventional approaches to building HPC clusters. Vela relies on public cloud infrastructure, runs in VMs, uses Ethernet, and relies on a container orchestrator (Kubernetes) to manage workloads and resources. Training jobs submitted by data scientists produce two types of I/O traffic: 1) reading training data and 2) writing large periodic checkpoints.

CDMI 3.0: Standardized Management of any URI-accessible Resource

CDMI 3.0 is the third major revision of the Cloud Data Management Interface, which provides a standard for discovery and declarative data management of any URI-accessible data resource, such as LUNs, files, objects, tables, streams, and graphs. Version 3 of the standard reorganizes the specification around data resource protocol "exports" and declarative data resource "metadata", and adds new support for "rels", which describe graph relationships between data resources.
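
As a rough sketch of what reading a CDMI resource's management view over HTTP could look like: the endpoint URL below is hypothetical, the "metadata" and "exports" fields follow earlier CDMI revisions, and the "rels" field name simply mirrors the abstract; exact 3.0 wire details should be checked against the published specification.

```python
# Rough, hypothetical sketch of fetching a CDMI container's management view.
# Endpoint is invented; field layouts (especially the new "rels"
# relationships) should be verified against the CDMI 3.0 specification.
import requests

resp = requests.get(
    "https://cdmi.example.com/cdmi/datasets/training_corpus/",
    headers={
        "Accept": "application/cdmi-container",
        "X-CDMI-Specification-Version": "3.0",
    },
)
resp.raise_for_status()
container = resp.json()

print(container.get("metadata", {}))   # declarative metadata on the resource
print(container.get("exports", {}))    # protocol exports (e.g. NFS, SMB, S3)
print(container.get("rels", []))       # graph relationships to other resources
```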

Chiplets, UCIe, Persistent Memory, and Heterogeneous Integration: The Processor Chip of the Future!

Chiplets have become a near-overnight success with today’s rapid-fire data center conversion to AI. But today’s integration of HBM DRAM with multiple SoC chiplets is only the very beginning of a larger trend in which multiple incompatible technologies will adopt heterogeneous integration, connecting new memory technologies with advanced logic chips to provide both significant energy savings and vastly improved performance at a reduced price point.

Storage Devices for the AI Data Center

The transformational launch of GPT-4 has accelerated the race to build AI data centers for large-scale training and inference. While GPUs and high-bandwidth memory are well-known critical components, the essential role of storage devices in AI infrastructure is often overlooked. This presentation will explore the AI processing pipeline within data centers, emphasizing the crucial role of storage devices such as SSDs in compute and storage nodes. We will examine the characteristics of AI workloads to derive specific requirements for flash storage devices and controllers.