Rethinking Ceph Architecture for Disaggregation Using NVMe-over-Fabrics

Monday, September 24, 2018
Ceph protects data by making 2-3 copies of the same data, but that means 2-3x more storage servers and related costs. It also means higher write latency, as data hops between OSD nodes. Customers are now starting to deploy Ceph on SSDs for high-performance workloads and for data lakes supporting real-time analytics. We describe a novel approach that eliminates the added server cost by creating containerized, stateless OSDs and leveraging NVMe-over-Fabrics to replicate data in server-less storage nodes. We propose redefining the boundaries of separation within SDS architectures to address disaggregation overheads. Specifically, we decouple control-plane and data-plane operations and transfer block ownership to execute on remote storage targets. This approach also dramatically reduces write latency, enabling Ceph to be used for databases and speeding up large file writes. As part of the solution, we also describe how OSD node failover is preserved via a novel mechanism using standby stateless OSD nodes.

Learning Objectives:
1. Storage disaggregation
2. NVMe over Fabrics
3. Ceph architecture