SNIA Developer Conference September 15-17, 2025 | Santa Clara, CA
Large-scale cluster storage systems contain hundreds of thousands of hard disk drives in their primary storage tier. Since clusters are not built all at once, there is significant heterogeneity among the disks in terms of capacity, make/model, firmware, etc. Redundancy settings for data reliability are generally configured in a “one-scheme-fits-all” manner, assuming that this heterogeneous disk population has homogeneous reliability characteristics. In reality, we observe that different disk groups fail differently, giving clusters significant disk-reliability heterogeneity. This research paves the way for exploiting disk-reliability heterogeneity to tailor redundancy settings to different disk groups, for cost-effective and arguably safer redundancy in large-scale cluster storage systems.

Our first contribution is an in-depth, data-driven analysis of the reliability of over 5.3 million disks spanning over 60 makes/models in three large production environments (Google, NetApp, and Backblaze). We observe that the strongest disks can be over 100x more reliable than the weakest disks in the same storage cluster, which makes today’s static redundancy-scheme selection insufficient, wasteful, or both. We quantify the opportunity to lower storage cost while increasing data protection by means of disk-adaptive redundancy.

Our next contribution is the design of three disk-adaptive redundancy systems: HeART, Pacemaker, and Tiger. By processing disk failure data over time, HeART identifies the life-phase boundaries and steady-state failure rate of each deployed disk group (by make/model) and suggests the most space-efficient redundancy option that will achieve the specified target data reliability. Simulated on a large production cluster with over 100K disks, HeART could meet target data reliability levels with 11–16% fewer disks than erasure codes like 10-of-14 or 6-of-9, and up to 33% fewer than 3-replication. But while HeART promises substantial space-savings, the IO load of transitions between redundancy schemes overwhelms the storage infrastructure (a problem we term transition overload), rendering HeART impractical.

Building on insights drawn from our data-driven analysis, Pacemaker, the next contribution of this work, is a low-overhead disk-adaptive redundancy orchestrator that realizes HeART’s promise in practice. Pacemaker mitigates transition overload by (1) proactively organizing data layouts to make future transitions efficient, (2) initiating transitions proactively, in a manner that avoids urgency without compromising space-savings, and (3) using more IO-efficient redundancy-transition mechanisms. Evaluation of Pacemaker with traces from four large (110K–450K disks) production clusters shows that transitions never need more than 5% of cluster IO bandwidth (only 0.2–0.4% on average). Pacemaker achieves this while providing overall space-savings of 14–20% (compared to a static 6-of-9 scheme) and never leaving data under-protected.

Tiger improves on Pacemaker by removing the placement constraint that stripes be placed on disks of similar reliability, via a new striping primitive called eclectic stripes. Eclectic stripes provide more placement flexibility, better reliability, and higher risk-diversity without compromising the space-savings offered by Pacemaker. Finally, we describe prototypes of Pacemaker and Tiger in HDFS, built by repurposing existing components.
This exercise serves as a guideline for future systems that wish to support disk-adaptive redundancy.
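To make the scheme-selection idea concrete, here is a minimal sketch of reliability-driven redundancy selection in Python. It assumes a deliberately crude loss model (independent disk failures within a fixed repair window) and an illustrative candidate list; it is not the actual model used by HeART, Pacemaker, or Tiger.

    # Pick the most space-efficient k-of-n scheme that meets a target
    # stripe-loss probability for a disk group's observed annual failure
    # rate (AFR). Schemes, repair window, and loss model are illustrative.
    from math import comb

    def stripe_loss_prob(n, k, afr, repair_days):
        """Crude probability that a k-of-n stripe loses data: more than
        n - k of its n disks fail within one repair window, assuming
        independent failures."""
        p = afr * repair_days / 365.0      # per-disk failure prob in window
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(n - k + 1, n + 1))

    def pick_scheme(afr, target, schemes, repair_days=3):
        """Return the lowest-overhead (n/k) scheme meeting the target,
        or None if no candidate qualifies."""
        ok = [s for s in schemes
              if stripe_loss_prob(s[1], s[0], afr, repair_days) <= target]
        return min(ok, key=lambda s: s[1] / s[0]) if ok else None

    schemes = [(6, 9), (10, 14), (12, 15), (30, 33)]    # (k, n) candidates
    print(pick_scheme(0.04, 1e-11, schemes))    # weak group  -> (10, 14)
    print(pick_scheme(0.004, 1e-11, schemes))   # strong group -> (30, 33)

The point of the example is only that the same reliability target maps to different, leaner schemes as the observed AFR of a disk group drops.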
DMTF’s Redfish® is a standard API designed to deliver simple and secure management for converged, hybrid IT and the Software Defined Data Center (SDDC). This presentation will provide an overview of DMTF’s Redfish standard. It will also provide an overview of HPE’s implementation of Redfish, focusing on its storage implementation and needs. HPE will provide insights into the benefits and challenges of the Redfish Storage model, including areas where functionality added to SNIA Swordfish™ is of interest for future releases.
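As a flavor of what the API looks like in practice, the following is a minimal sketch of walking a Redfish service from the standard service root down to storage resources using plain HTTPS and JSON. The host, credentials, and the presence of a Storage link are assumptions; real clients discover paths by following @odata.id links.

    # Walk a Redfish service from /redfish/v1/ down to storage resources.
    # Host and credentials below are placeholders.
    import requests

    BASE = "https://bmc.example.com"      # hypothetical management endpoint
    AUTH = ("admin", "password")          # hypothetical credentials

    def get(path):
        r = requests.get(BASE + path, auth=AUTH, verify=False)
        r.raise_for_status()
        return r.json()

    root = get("/redfish/v1/")                      # standard service root
    systems = get(root["Systems"]["@odata.id"])     # systems collection
    for member in systems["Members"]:
        system = get(member["@odata.id"])
        if "Storage" in system:                     # link to storage subsystems
            for s in get(system["Storage"]["@odata.id"])["Members"]:
                print(get(s["@odata.id"]).get("Name"))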
Hadoop has become imperative for processing large and complex data sets. However, architecture scalability often poses an unnecessary roadblock in this process. OpenStack, an open-source platform, provides the operational flexibility needed for a scale-out Hadoop architecture. This session will describe how we helped a client install Hadoop on VMs using OpenStack. We will discuss in depth the challenges of manual operations and how we overcame them with Sahara. The audience will also learn why we virtualized the hardware on OpenStack’s compute nodes and deployed VMs.
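For a sense of what Sahara consumes, here is a minimal sketch of the node-group and cluster templates involved, written as Python dictionaries mirroring Sahara’s JSON bodies. The plugin version, flavor, process lists, and counts are illustrative placeholders; in practice these are created via Sahara’s REST API, the openstack CLI, or Horizon.

    # Illustrative Sahara-style templates for a small Hadoop cluster on VMs.
    master = {
        "name": "hadoop-master",
        "plugin_name": "vanilla",            # Sahara's vanilla Hadoop plugin
        "hadoop_version": "2.7.1",           # illustrative version
        "flavor_id": "m1.large",             # illustrative flavor
        "node_processes": ["namenode", "resourcemanager"],
        "auto_security_group": True,
    }
    worker = {
        "name": "hadoop-worker",
        "plugin_name": "vanilla",
        "hadoop_version": "2.7.1",
        "flavor_id": "m1.large",
        "node_processes": ["datanode", "nodemanager"],
        "auto_security_group": True,
    }
    cluster_template = {
        "name": "hadoop-on-vms",
        "plugin_name": "vanilla",
        "hadoop_version": "2.7.1",
        "node_groups": [
            {"name": "master", "node_group_template_id": "<master-id>", "count": 1},
            {"name": "worker", "node_group_template_id": "<worker-id>", "count": 8},
        ],
    }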
The SNIA Swordfish™ ecosystem is supported by open source tools, available in open repositories managed by the SNIA Scalable Storage Management Technical Working Group and the DMTF Redfish Forum, both on GitHub. This session will walk through the tools you can use to go from zero to a working SNIA Swordfish implementation: generating, validating, and using static mockups; using the emulator to make your mockups “come alive”; and then verifying that your Swordfish service outputs match your expectations using open source validation tools, the same tools that feed into the Swordfish Conformance Test Program.
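As a quick illustration of the “come alive” step, here is a minimal sketch of sanity-checking an emulator-hosted mockup before pointing the validators at it. The local port is a common emulator default, and a populated StorageServices entry is an assumption about your mockup.

    # Sanity-check a locally running Redfish/Swordfish emulator.
    import requests

    BASE = "http://localhost:5000"           # assumed emulator default port
    root = requests.get(BASE + "/redfish/v1/").json()
    print(root.get("RedfishVersion"))        # does the service root respond?

    svc = root.get("StorageServices", {}).get("@odata.id")
    if svc:                                  # Swordfish top-level entry point
        print(requests.get(BASE + svc).json().get("Members"))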
The SSM TWG and OFA OFMFWG are working together to bring to life an open source Open Fabric Management Framework, with a Redfish/Swordfish management model and interface. This presentation will provide an overview of the status of this work, and a demo of the current state of the proof of concept, built leveraging the Redfish and Swordfish-based open source emulator.
The SNIA Swordfish specification has expanded to include full NVMe and NVMe-oF enablement, with alignment across DMTF, NVMe, and SNIA for NVMe and NVMe-oF use cases. This presentation will provide an overview of the most recent work adding detailed implementation requirements for specific configurations, ensuring NVMe and NVMe-oF environments can be represented entirely in Swordfish and Redfish.
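As one illustrative fragment of that model, the sketch below shows how an NVMe namespace can surface as a Swordfish Volume beneath a Storage resource. The property names follow the published Redfish/Swordfish NVMe model, but the IDs, schema version, and values here are illustrative.

    # Illustrative Swordfish Volume representing an NVMe namespace.
    namespace_as_volume = {
        "@odata.id": "/redfish/v1/Storage/NVMeSubsys1/Volumes/Namespace1",
        "@odata.type": "#Volume.v1_6_0.Volume",   # illustrative schema version
        "Id": "Namespace1",
        "Name": "NVMe Namespace 1",
        "CapacityBytes": 960_000_000_000,
        "NVMeNamespaceProperties": {              # Swordfish NVMe extension
            "NamespaceId": "0x1",
            "IsShareable": False,
        },
    }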
NVMe-oF drives can support NVMe over Ethernet, but how do you manage them? This presentation will show how Swordfish has developed a standard model for NVMe Ethernet-attached drives, providing detailed profiles as implementation guidance, including required and recommended properties. The profiles are now part of the Swordfish CTP program; Ethernet-attached drives can validate conformance to the specifications by participating.
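To show the shape such profiles take, here is a minimal sketch of a Redfish interoperability-profile fragment marking properties on a Drive resource as required or recommended. The profile name and property choices are illustrative, not the contents of the published Swordfish drive profiles.

    # Illustrative interoperability-profile fragment for a Drive resource.
    drive_profile_fragment = {
        "SchemaDefinition": "RedfishInteroperabilityProfile.v1_0_0",
        "ProfileName": "IllustrativeEthernetDrive",
        "ProfileVersion": "1.0.0",
        "Resources": {
            "Drive": {
                "PropertyRequirements": {
                    "SerialNumber": {"ReadRequirement": "Mandatory"},
                    "CapacityBytes": {"ReadRequirement": "Mandatory"},
                    "Status": {"ReadRequirement": "Recommended"},
                },
            },
        },
    }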