Highly Scalable, Masterless, Distributed Filesystem at Rubrik

Avishek Ganguli

Related Sessions

Why S3FS Fails in AI/ML — and How to Achieve Scalable POSIX Access Anyway

Mounting S3-compatible storage via S3FS seems like an easy way to enable POSIX-like access in Kubernetes. But in real AI/ML workloads—e.g., training with PyTorch or TensorFlow—we hit major issues: crashes from incomplete writes, vanished checkpoints, inconsistent metadata, and unpredictable I/O latency.

In this session, we’ll share how we overcame these challenges by designing a scalable, POSIX-compliant distributed file system that still leverages the cost-effectiveness of object storage. Instead of abandoning object storage, we rebuilt the access layer for better consistency, performance, and observability in large-scale environments.

Attendees will gain insight into architectural trade-offs, POSIX compliance in user space, Kubernetes integration via CSI and Operators, and observability benchmarks collected from real production AI training clusters.

Ideal for platform engineers, MLOps, and K8s architects seeking reliable, scalable storage for data-heavy workloads.

This is an intermediate session; attendees should be comfortable with object storage, file storage, and the basic concepts of the Kubernetes CSI driver.

Rui Su

Co-Founder,

Juicedata

Highly Scalable, Masterless, Distributed Filesystem at Rubrik

Abhinav Agarwal

Avishek Ganguli

Scalable Metadata in Distributed File Systems: Revisiting the GoogleFS Design for Exabyte-Scale Namespaces

GoogleFS introduced the architectural separation of metadata and data, but its reliance on a single active master imposed fundamental limitations on scalability, redundancy, and availability. This talk presents a modern metadata architecture, exemplified by SaunaFS, that eliminates the single-leader model by distributing metadata across multiple concurrent, multi-threaded servers. Metadata is stored in a sharded, ACID-compliant transactional database (e.g., FoundationDB), enabling horizontal scalability, fault tolerance through redundant metadata replicas, reduced memory footprint, and consistent performance under load. The result is a distributed file system architecture capable of exabyte-scale operation in a single namespace while preserving POSIX semantics and supporting workloads with billions of small files.

Piotr Modrzyk

Principal Architect,

Leil Storage

Open Flash Platform: An Initiative for Open, Highly Efficient AI Storage

Enterprise IT infrastructures face soaring AI and analytics demands, driving the need for storage that leverages existing networks, cuts power-hungry server counts, and frees CAPEX for AI. Yet current solutions create isolated silos: proprietary, server-based systems that waste power, lack cloud connectivity, and force large teams to manage multiple silo technologies—locking data behind vendor walls and hampering AI goals. Modeled on the Open Compute Project, the Open Flash Platform (OFP) liberates high-capacity flash through an open architecture built on standard pNFS which is included in every Linux distribution. Each OFP unit contains a DPU-based Linux instance and network port, so it connects directly as a peer—no additional servers. By removing surplus hardware and proprietary software, OFP lets enterprises use dense flash efficiently, halving TCO and increasing storage density 10×. Early configurations deliver up to 48 PB in 2U and scale to 1 EB per rack, yielding a 10× reduction in rack space, power, and OPEX and a 33 % longer service life. This session explains the vision and engineering that make OFP possible, showing how open, standards-based architecture can simplify, scale, and free enterprise data.

David Flynn

CEO, Co-Founder,

Hammerspace

Choosing Your AI Storage Protocol: A Deep Dive into SMB and NFS Performance, Tuning, and Overhead

The performance of network file protocols is a critical factor in the efficiency of the AI and Machine Learning pipeline. This presentation provides a detailed comparative analysis of the two leading protocols, Server Message Block (SMB) and Network File System (NFS), specifically for demanding AI workloads. We evaluate the advanced capabilities of both protocols, comparing SMB3 with SMB Direct and Multichannel against NFS with RDMA and multistream TCP configurations. The industry-standard MLPerf Storage benchmark is used to simulate realistic AI data access patterns, providing a robust foundation for our comparison. The core of this research focuses on quantifying the performance differences and identifying the operational and configuration overhead associated with each technology.

Sergei Platonov

VP of Strategy,

Xinnor