SNIA Developer Conference September 15-17, 2025 | Santa Clara, CA

Why S3FS Fails in AI/ML — and How to Achieve Scalable POSIX Access Anyway

Lafayette

Wed Sep 17 | 2:30pm

Abstract

Mounting S3-compatible storage via S3FS seems like an easy way to enable POSIX-like access in Kubernetes. But in real AI/ML workloads—e.g., training with PyTorch or TensorFlow—we hit major issues: crashes from incomplete writes, vanished checkpoints, inconsistent metadata, and unpredictable I/O latency.

In this session, we’ll share how we overcame these challenges by designing a scalable, POSIX-compliant distributed file system that still leverages the cost-effectiveness of object storage. Instead of abandoning object storage, we rebuilt the access layer for better consistency, performance, and observability in large-scale environments.

Attendees will gain insight into architectural trade-offs, POSIX compliance in user space, Kubernetes integration via CSI and Operators, and observability benchmarks collected from real production AI training clusters.

Ideal for platform engineers, MLOps practitioners, and Kubernetes architects seeking reliable, scalable storage for data-heavy workloads.

This is an intermediate session; attendees should be comfortable with object storage, file storage, and the basic concepts of Kubernetes CSI drivers.