At our SNIA Cloud Storage Technologies webinar, “Ceph Storage in a World of AI/ML Workloads,” our experts, Kyle Bader from IBM and Philip Williams from Canonical, explained how open source solutions like Ceph can provide almost limitless scaling, in both performance and capacity, for AI/ML workloads. If you missed the presentation, it’s available in the SNIA Educational Library along with the slides.
The live webinar audience was highly engaged with this timely topic, and they asked several questions. Our presenters have generously taken the time to answer them here.
Q: What does checkpoint mean?
A: Checkpointing is storing the state of the model (weights and optimizer states) to storage so that if there is an issue with the training cluster, training can resume from the last checkpoint instead of starting from scratch.
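As a rough illustration of what this looks like in practice, here is a minimal PyTorch sketch (the function names and file path are placeholders, not something from the webinar): the weights and optimizer state are written out periodically, and training resumes from the latest checkpoint after a failure.

```python
import torch

# Minimal checkpointing sketch (illustrative, assumed PyTorch workflow):
# persist model weights and optimizer state so training can resume from
# this point instead of starting over after a failure.
def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]  # caller resumes training from the next epoch
```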
Q: Is Ceph a containerized solution?
A: One of the ways to deploy Ceph is as a set of containers. These can be hosted directly on conventional Linux hosts and coordinated with cephadm / systemd / podman, or in a Kubernetes cluster using an operator like Rook.
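For the Kubernetes path, the sketch below shows roughly what driving a Rook-managed cluster definition from Python could look like. It assumes the Rook operator is already installed and kubectl is configured; the image tag, mon count, and device selection are illustrative assumptions rather than recommendations.

```python
import subprocess
import textwrap

# Illustrative Rook CephCluster manifest (field names follow the Rook
# CephCluster CRD; the values here are assumptions, not recommendations).
manifest = textwrap.dedent("""\
    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        image: quay.io/ceph/ceph:v18   # example release tag
      mon:
        count: 3
      storage:
        useAllNodes: true
        useAllDevices: true
""")

# Apply the manifest; assumes the Rook operator and CRDs are already in place.
subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, text=True, check=True)
```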
Q: Any advantages or disadvantages of using Hardware RAID to help with data protection and redundancy along with Ceph?
A: The main downside to using RAID under Ceph is the additional overhead. You can adjust the way Ceph does data protection to compensate, but generally that reduces the availability of the pools. That said, you could, in theory, build larger storage systems in terms of total PBs if you had an OSD per RAID aggregate instead of one per physical disk.
Q: What changes are we seeing in storage hardware upgrades for AI? Is it only GPUs, or is there other specific hardware that needs to be upgraded?
A: No GPU upgrades are required for the storage hardware. If you’re running training or inference co-resident on the storage hosts, then you could include GPUs, but for standalone storage serving AI workloads there is no need for GPUs in the storage systems themselves. Ceph on off-the-shelf servers configured for high throughput is all that’s needed.
Q: Does Ceph provide AI features?
A: In the context of AI, the biggest thing that is needed, and that Ceph provides, is scalability in multiple dimensions (capacity, bandwidth, IOPS, etc.). Ceph also has some capability to predict device failures using a model.
Q: How do you see a storage admin career in the AI industry, and what are the key skills to learn?
A: Understanding how to use scale-out storage technologies like Ceph. Understanding the hardware: differences in types of SSDs, networking, basically feeds-and-speeds type knowledge. It's also essential to learn as much as possible about what AI practitioners are doing, so that you can "meet them in their world" and have constructive conversations.
Q: Any efforts to move the processing into the Ceph infrastructure so the data doesn't have to move?
A: Yes! At the low level, RADOS (Reliable Autonomic Distributed Object Store) has always had classes that can be executed on objects; they tend to be used to provide the semantics needed for different protocols. So, at its core, Ceph has always been a computational storage technology. More recently, as an example, we’ve seen S3 Select added to the object protocol, which allows pushdown of filtering and aggregation: think pNFS, but for tabular data, with storage-side filtering and aggregation.
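From a client’s point of view, an S3 Select request against a Ceph RGW endpoint can be made with an ordinary S3 SDK. The sketch below uses boto3; the endpoint, credentials, bucket, object key, and CSV schema are illustrative assumptions, not details from the webinar.

```python
import boto3

# Hypothetical RGW endpoint and credentials, for illustration only.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# The filter runs on the storage side; only matching rows cross the network.
resp = s3.select_object_content(
    Bucket="training-data",
    Key="events.csv",
    ExpressionType="SQL",
    Expression="SELECT s.user_id, s.score FROM S3Object s "
               "WHERE CAST(s.score AS FLOAT) > 0.9",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```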
Q: What is the realistic checkpoint frequency?
A: The best thing to do is to checkpoint every round, but that might not be viable depending on the bandwidth of the storage system, the size of the checkpoint, the amount of data parallelization in the training pipeline, and whether or not asynchronous checkpointing is being used. The more frequent, the better. As the GPU cluster gets bigger, the need to checkpoint more frequently goes up, because there are more components that can fail in the training environment.
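To make the bandwidth trade-off concrete, here is a back-of-envelope sketch; the model size, bytes per parameter, and storage bandwidth are illustrative assumptions, not figures from the webinar.

```python
# Rough estimate of how long a synchronous checkpoint write takes.
# All figures below are illustrative assumptions.
params = 70e9            # model parameters
bytes_per_param = 14     # e.g. bf16 weights plus fp32 Adam moments (~2 + 12 bytes)
checkpoint_bytes = params * bytes_per_param

storage_gb_per_s = 100   # aggregate write bandwidth of the storage system, GB/s
write_seconds = checkpoint_bytes / (storage_gb_per_s * 1e9)

print(f"Checkpoint size: {checkpoint_bytes / 1e12:.1f} TB")
print(f"Synchronous write time at {storage_gb_per_s} GB/s: {write_seconds:.0f} s")
# The longer each write stalls the GPUs, the less frequently you can afford
# to checkpoint; asynchronous checkpointing hides much of this cost.
```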
Q: Why train with Ceph storage instead of direct-attached NVMe storage? That would speed up the training by orders of magnitude.
A: When you’re working with modest data set sizes and looking for ways to do significant levels of data parallelization, yes, you could copy those data sets onto locally attached NVMe storage, and you would get faster results, simply because that’s how the physics works.
However, for larger recommendation systems, you may be dealing with much larger training data sets, and you might not be able to fit all of the necessary data onto the local NVMe storage of the system. In this case, there are a number of trade-offs people make that favor the use of external Ceph storage, including the size of your GPU system, the need for more flexibility, and the need to experiment with various ways of accomplishing data-level parallelism and data pipelining. All of this is intended to maximize the use of the GPUs, and it changes how you partition, pre-load, and use data on local NVMe versus external Ceph storage. Flexibility and experimentation are important, and there are always trade-offs.
Q: How can we integrate Ceph into an HCI environment using VMware?
A: It’s possible to use NVMe-oF for that.
Q: Is there a QAT analogue (compression) on EPYCs?
A: Not today; you could use one of the legacy PCIe add-in cards, though.