Over the last few years, IBM has built several AI-HPC clusters to experiment with and train its Large Language Models (LLMs). One of the clusters – Cloud Vela – is especially notable because it explored less conventional approaches to building HPC clusters. Vela relies on public cloud infrastructure, runs in VMs, uses Ethernet, and depends on a container orchestrator (Kubernetes) to manage workloads and resources. Training jobs submitted by data scientists produce two types of I/O traffic: 1) reading training data and 2) writing large periodic checkpoints. To satisfy Vela’s I/O requirements, we architected a tiered storage system that runs a distributed file system (GPFS) on top of cloud object storage.
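As a rough illustration of these two traffic patterns, the sketch below shows a training loop that reads batches of training data and periodically writes a full model checkpoint. It is a minimal sketch only: the paths, the checkpoint interval, and the use of PyTorch are assumptions for illustration, not details of Vela’s actual training stack.

    import os
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Hypothetical mount point of a GPFS file system backed by object storage.
    DATA_DIR = "/gpfs/vela/datasets/example"
    CKPT_DIR = "/gpfs/vela/checkpoints/example"
    CKPT_EVERY = 1000  # steps between checkpoints; illustrative value

    model = nn.Linear(1024, 1024)  # stand-in for a real LLM
    opt = torch.optim.AdamW(model.parameters())

    # I/O pattern 1: steady reads of training data (stand-in tensors here;
    # on a real cluster these would stream from files under DATA_DIR).
    dataset = TensorDataset(torch.randn(10_000, 1024))
    loader = DataLoader(dataset, batch_size=32, num_workers=4)

    for step, (batch,) in enumerate(loader):
        loss = model(batch).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

        # I/O pattern 2: large, periodic checkpoint writes.
        if step % CKPT_EVERY == 0:
            os.makedirs(CKPT_DIR, exist_ok=True)
            torch.save({"model": model.state_dict(), "opt": opt.state_dict()},
                       os.path.join(CKPT_DIR, f"step-{step}.pt"))

The reads are spread over the whole job, while the checkpoint writes arrive in large periodic bursts, which is why the two patterns stress the storage system so differently.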
In my presentation I will start with an overview of Vela’s design, how data scientists interact with it, and what workloads they run. I will then dive into the details of the storage system architecture and implementation for Vela. I will describe the performance and semantic challenges of the native cloud storage options and why we resorted to running a distributed file system over object storage. I will cover the design of our data mover and the solution’s integration with Kubernetes, which makes it easy for end users to provision file system volumes backed by object buckets without administrator intervention. I will also talk about our challenges related to observability and cache sizing.
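To give a concrete flavor of that self-service provisioning flow, here is a minimal sketch using the official Kubernetes Python client to request a volume through a storage class. The class name gpfs-object and the bucket annotation are hypothetical placeholders, not the actual interface exposed on Vela; the point is only that a user can create such a claim on their own.

    from kubernetes import client, config

    config.load_kube_config()  # use config.load_incluster_config() inside a pod

    # Hypothetical StorageClass whose provisioner creates a file system volume
    # backed by an object bucket; all names below are illustrative only.
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(
            name="training-data",
            annotations={"example.com/bucket": "my-training-bucket"},  # hypothetical
        ),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],
            storage_class_name="gpfs-object",  # hypothetical class name
            resources=client.V1ResourceRequirements(requests={"storage": "1Ti"}),
        ),
    )

    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="default", body=pvc
    )

Once the claim binds, pods reference it like any other volume, which is what removes the administrator from the provisioning path.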