In a little over a month, more than 1,500 people have viewed the SNIA Cloud Storage Technologies Initiative (CSTI) live webinar, “Ceph: The Linux of Storage Today,” with SNIA experts Vincent Hsu and Tushar Gohad. If you missed it, you can watch it on-demand at the SNIA Educational Library. The live audience was extremely engaged with our presenters, asking several interesting questions. As promised, Vincent and Tushar have answered them here.
Given the high level of interest in this topic, the CSTI is planning additional sessions on Ceph. Please follow us @sniacloud_com or at SNIA LinkedIn for dates.
Q: How many snapshots can Ceph support per cluster?
Q: Does Ceph provide deduplication? If so, is it across object, file, and block storage?
A: There is no per-cluster limit. In the Ceph filesystem (cephfs), snapshots are created on a per-path basis, and the default limit, which is configurable, is 100 snapshots per path. Ceph block storage (rbd) does not impose a limit on the number of snapshots per image; however, the native Linux kernel rbd client is limited to 510 snapshots per image.
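For readers who want to try this, here is a minimal sketch of taking an rbd snapshot with the Ceph Python bindings (python3-rados and python3-rbd). The pool, image, and snapshot names are placeholders, not values from the webinar:

```python
# Minimal sketch: creating an RBD snapshot with the Ceph Python bindings
# (python3-rados / python3-rbd). Pool, image, and snapshot names below
# are placeholders.
import rados
import rbd

# Connect to the cluster using the default client configuration.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')          # assumed pool name
    with rbd.Image(ioctx, 'my-image') as img:  # assumed image name
        img.create_snap('before-upgrade')      # take a point-in-time snapshot
        print([s['name'] for s in img.list_snaps()])
    ioctx.close()
finally:
    cluster.shutdown()

# For CephFS, a snapshot of a mounted path is taken by creating a directory
# under its hidden ".snap" directory, e.g.:
#   os.mkdir('/mnt/cephfs/projects/.snap/snap-2024-01-01')
```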
There is a Ceph project to support data deduplication, though it is not available yet.
Q: How easy is the installation setup? I heard Ceph is hard to set up.
A: Ceph used to be difficult to install; however, the Ceph deployment process has undergone many changes and improvements. In recent years, the experience has become very streamlined. The cephadm tool was created to bootstrap and manage Ceph clusters, and Ceph can now also be deployed and managed via a dashboard.
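To give a feel for the cephadm flow, here is a minimal sketch driven from Python via subprocess. The monitor IP and host names are placeholders, and in practice these commands are usually run directly from a shell on the first cluster node:

```python
# Minimal sketch of a cephadm-based deployment, driven from Python.
# Monitor IP and host names are placeholders.
import subprocess

def run(cmd):
    print('+', ' '.join(cmd))
    subprocess.run(cmd, check=True)

# Bootstrap the first node: creates the initial monitor and manager
# daemons and brings up the dashboard.
run(['cephadm', 'bootstrap', '--mon-ip', '10.0.0.11'])

# Add additional hosts and let the orchestrator deploy OSDs on all
# available (unused) devices. Each host must already have the cluster's
# SSH public key installed.
for host in ['ceph-node2', 'ceph-node3']:
    run(['ceph', 'orch', 'host', 'add', host])
run(['ceph', 'orch', 'apply', 'osd', '--all-available-devices'])
```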
Q: Does Ceph provide a good user interface to monitor usage, performance, and other details when it is used to provide object storage as a service across multiple tenants?
A: Currently the Ceph dashboard allows monitoring usage and performance at the cluster level and on a per-pool basis. This question falls under consumability, an area where many people contribute to the community. You will see more of these management capabilities added over time, giving a better profile of utilization efficiency, multi-tenancy, and quality of service.
The more that Ceph becomes the substrate for cloud-native, on-premises storage, the more these capabilities will show up in the community. The Ceph dashboard has come a long way.
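Usage data can also be pulled programmatically. Here is a minimal sketch using the rados Python binding's mon_command to fetch the same kind of cluster-wide and per-pool usage information shown in the dashboard; it assumes a default client configuration, and the JSON field names match recent Ceph releases but may vary slightly between versions:

```python
# Minimal sketch: pulling cluster-wide and per-pool usage programmatically
# (equivalent to "ceph df --format json"). Field names may vary slightly
# between Ceph releases.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({'prefix': 'df', 'format': 'json'}), b'')
    if ret != 0:
        raise RuntimeError(errs)
    df = json.loads(outbuf)
    total = df['stats']['total_bytes']
    used = df['stats']['total_used_bytes']
    print(f'cluster: {used / total:.1%} used of {total / 2**40:.1f} TiB')
    for pool in df['pools']:
        print(pool['name'], pool['stats']['bytes_used'], 'bytes used')
finally:
    cluster.shutdown()
```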
Q: A slide mentioned support for tiered storage. Is tiered meant in the sense of caching (automatically managing performance/locality) or for storing data with explicitly different lifetimes/access patterns?
A: The slide mentioned the future support in Crimson for device tiering. That feature, for example, will allow storing data with different access patterns (and indeed lifetimes) on different devices. Access the full webinar presentation here.
Q: Can you discuss any performance benchmarks or case studies demonstrating the benefits of using Ceph as the underlying storage infrastructure for AI workloads?
A: AI workloads have multiple requirements that Ceph is well suited for:
- Performance: Ceph can meet the high performance demands of AI workloads. As a software-defined storage (SDS) solution, it can be deployed on different hardware to provide the necessary performance characteristics. It can scale out, providing more parallelism to adapt to increasing performance demands. A recent post by a Ceph community member showed a Ceph cluster performing at 1 TiB/s.
- Scale-out: Ceph was built from the bottom up as a scale-out solution. As training and inference data grows, the cluster can be grown to provide more capacity and more performance. Ceph can scale to thousands of nodes.
- Durability: Training data sets can become very large, and it is important that the storage system itself takes care of data durability, as transferring the data in and out of the storage system can be prohibitive. Ceph employs techniques such as data replication and erasure coding, as well as automatic healing and data re-distribution, to ensure data durability (see the capacity-overhead sketch after this list).
- Reliability: It is important that the storage system operates continuously, even as failures happen during training and inference. In a large system with thousands of storage devices, failures are the norm. Ceph was built from the ground up to avoid a single point of failure, and it can continue to operate and automatically recover when failures happen.
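As a small illustration of the durability point above, the snippet below compares the raw capacity needed to store 1 PiB of data with 3x replication versus a 4+2 erasure-coded profile. The k/m values are examples for illustration, not recommendations:

```python
# Capacity overhead of 3x replication vs. a 4+2 erasure-coded profile
# for 1 PiB of user data. The k/m values are illustrative examples.
PIB = 2**50
data = 1 * PIB

replicas = 3
raw_replicated = data * replicas        # 3 PiB raw, tolerates loss of 2 copies

k, m = 4, 2                             # 4 data chunks + 2 coding chunks
raw_ec = data * (k + m) / k             # 1.5 PiB raw, tolerates loss of 2 chunks

print(f'3x replication: {raw_replicated / PIB:.1f} PiB raw ({replicas - 1} failures tolerated)')
print(f'EC {k}+{m}:        {raw_ec / PIB:.1f} PiB raw ({m} failures tolerated)')
```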