2020 Storage Developer Conference Abstracts

Break Out Sessions and Agenda Tracks Include:

Note: This agenda is a work in progress. Check back for updates on additional sessions as well as the agenda schedule.

Blockchain

Blockchain in Storage: Why Use It?

Olga Buchonina, CEO, ActionSpot

Abstract

We will describe and showcase Burst Coin and IPFS technology that use blockchain. In this presentation, we will explore how blockchain can be used in storage and why using blockchain can help improve and address latency, security, and data integrity.

Using NVMe and NVMe-oF can improve not only performance and latency but also address the scalability issues of blockchain while meeting market needs.


Off-chain Storage for Blockchain

Ramya Krishnamurthy, QA Architect [Test Expert], HPE

Ajay Kumar, Senior Test Specialist, HPE

Abstract

Storage infrastructure scalability for blockchain can be provided with off-chain storage: off-chain data can be any data that is too large to be stored in the blockchain efficiently or that requires the ability to be changed or deleted.

Off-chain data is classified as any structured or unstructured data that cannot be stored in the blockchain, for example media and document files like JPEGs and text files.

The proposed talk will also cover aspects of the storage infrastructure needed to provide off-chain storage, since storage infrastructure is a critical element in a blockchain environment:

Flash/NVMe technology for performance - Off-chain storage unburdens the blockchain of storing large datasets, but this can impact performance since physical disks are slow. Storage technologies such as NVMe reduce processor cycle requirements for storage and help achieve performance using NVMe SSDs

Ability to easily scale up/out to petabytes

  • We can easily scale up by adding additional disks/memory and scale out by adding additional nodes to the storage cluster, thereby enabling storage of petabytes of data

Backup-and-recovery via snapshot or continuous data synchronization with storage product functions

  • To ensure that off-chain data is not lost, we recommend backing up the data using virtual copies [snapshots] for easy recovery. Continuous data synchronization can be achieved with a replication solution [either synchronous or asynchronous replication]

Inherent ability for data reduction with de-duplication/compression

  • We can achieve space savings using deduplication and compression technology. This allows the total storage of the system to be used more efficiently

Example of a vendor-specific solution

Cloud

Decentralized Platforms Push Edge Networks Closer to the Edge

Ben Golub, CEO, Storj Labs

Abstract

Content delivery networks, or edge networks, deliver much of the most-consumed content and data online. Files on these services are spread across a large number of devices in different geographical locations to quickly serve up popular content to users in the same region.

With the growth of online streaming, faster bandwidth speeds, higher bandwidth caps and more services moving online, these edge networks continue to grow in popularity to ensure data consumers are not kept waiting.

The downside of edge networks is that they are expensive to maintain. However, decentralization can change that while also accelerating performance by pushing data even closer to where it is consumed.

By leveraging underutilized hardware spread around the globe, decentralized cloud storage networks that are architected similar to CDNs can push data into the very neighborhoods where it is consumed, instead of the larger region. Rather than stream a movie on Netflix from a CDN hundreds of miles away, networks built on decentralized cloud storage architectures might stream it from a device in the neighborhood or a data center down the street.

This session will look at how decentralized cloud storage services can augment edge networks, deliver stellar performance for data streaming and keep data secure and private.


Deep dive on architecting storage applications for the public cloud economy

Josh Salomon, Senior Principal Software Engineer, Red Hat

Orit Wasserman, Senior Principal Software Engineer, Red Hat

Abstract

The public cloud presents a new economic model for computing - a "pay as you go" model in which you pay only for what you consume, but you pay for everything: compute, storage, networking, QoS, and more. This model suggests new considerations for application architectures that minimize public cloud cost. The presentation discusses the need for storage applications in the public cloud and new architecture considerations (applicable to all types of applications), and presents alternatives for reducing the cost of storage applications in the public cloud. This presentation is a sequel to the presentation at SDC EMEA 2020 and includes a more in-depth discussion of instance storage usage as well as ways to use spot instances for storage.


Decoupling SDS architectures for agility

Arun Raghunath, Research Scientist, Intel Corp

Yi Zou, Research scientist, Intel Corp

Abstract

Cloud-native distributed storage services (object, DB, KV, streaming log) typically provide capacity scale-out, availability, durability guarantees via software. But the high performance of new media is lost under software layers. Advances in storage media/protocols present a heterogeneous storage environment needing innovative integration approaches to mixed media types. Also, storage services must scale for emerging elastic applications that are dynamic and demand short-duration performance boosts for subsets of data, without over-provisioning or incurring rebalance overheads while scaling out.

We propose to decouple cluster level tasks from the I/O path in SDS architectures. This enables containerized SDS modules with new scale-out vectors in compute and caching. They can be scaled based on client load or to reduce performance impact of SDS tasks like scrubbing & recovery. With no rebalance overheads, they can also be affinitized to applications, for better performance & match application mobility by cache pre-fetching. We share our experience in decoupling SDS layers over NVMe-oF using Ceph as a case study. We demonstrate issues like handling remote asynchronous transactions between decoupled components, and provide PoC performance data. We discuss challenges in locating data when SDS components are dynamic; as well as NVMe-oF advances needed to support distributed storage services. We hope this leads to a community discussion on open questions that remain in this space.

Computational Storage

Computational Storage, from edge to cloud

Jerome Gaysse, Senior Technology and Market Analyst, Silinnov Consulting

Abstract

There are at least 3 main trends for computational storage architectures: SSD with embedded computing, SSD attached to computing acceleration card, SSD attached to smart NIC or smart HBA.

The challenge is to understand the real value of each solution and identify the use cases where it provides the best ROI, as such technologies may lead to major hardware and software changes in system design.

This talk presents an analysis of computational storage application examples, from edge to cloud, highlighting the system benefits in terms of power savings, performance increases, and TCO reduction.


Deploying Computational Storage at the Edge

Scott Shadley, VP Marketing, NGD Systems

Abstract

With the growth of data generation at the Edge and need to get value from that data quickly, the market has run into a hurdle on how to get enough compute and processing with the available space, power and budget. The ability to deploy compute resources within the storage devices with Computational Storage is key to the growth of this market.

This presentation will discuss the deployment of small form factor, ASIC-based solutions that bring value to end customers and platform developers, including a specific use case to be showcased.


The True Value of Storage Drives with Built-in Transparent Compression: Far Beyond Lower Storage Cost

Tong Zhang, Chief Scientist, ScaleFlux

Abstract

This talk will reveal that, beyond reducing data storage cost, emerging solid-state drives with built-in transparent compression bring exciting but largely unexplored opportunities to innovate the data storage management software stack (e.g., relational databases, key-value stores, and filesystems). The simple idea of integrating lossless data compression into storage drives is certainly not new and traces back decades. However, high-performance PCIe solid-state drives with built-in transparent compression have remained elusive on the commercial market until recently. In addition to the straightforward storage cost saving, this new breed of storage drives decouples logical storage space utilization efficiency from physical flash storage space utilization efficiency. As a result, it allows data management software to purposely “waste” logical storage space in return for employing much simpler data structures and algorithms, without sacrificing physical storage cost. Naturally, simpler data structures and algorithms come with higher performance and/or lower CPU/memory usage. This creates a large but unexplored space for re-thinking the design of the data management software stack.

This talk will present our recent work on exploring this new territory in the context of relational databases and key-value stores. In particular, the talk will (1) introduce the basics of storage drives with built-in transparent compression and their implementation challenges, (2) discuss how one could configure or even slightly modify MySQL and PostgreSQL (the two most popular relational databases) in order to significantly benefit from such storage drives in terms of both performance and cost, and (3) present a new open-source key-value store created from scratch to take full advantage of such storage drives, which can achieve higher performance and efficiency than existing key-value store solutions. As storage drives with built-in transparent compression are quickly entering the commercial market, it is our hope that this talk will inspire the data storage community to develop many more elegant ideas that make the future data management stack fully embrace and benefit from such storage drives.


SmartNICs and SmartSSDs, the Future of Smart Acceleration

Scott Schweitzer, Systems Architect, Xilinx Inc.

Abstract

Since the advent of the Smart Phone over a decade ago, we've seen several new "Smart" technologies, but few have had a significant impact on the data center until now. SmartNICs and SmartSSDs will change the landscape of the data center, but what comes next? This talk will summarize the state of the SmartNIC market by classifying and discussing the technologies behind the leading products in the space. Then it will dive into the emerging technology of SmartSSDs and how they will change the face of storage and solutions. Finally, we'll dive headfirst into the impact of PCIe 5 and Compute Express Link (CXL) on the future of Smart Acceleration on solution delivery.


Deep Compression at Inline Speed for All-Flash Array

Chris Mao, Principal Engineer, Pure Storage

Abstract

The rapid improvement of overall $/Gbyte has driven the high performance All-Flash Array to be increasingly adopted in both enterprises and cloud datacenters. Besides the raw NAND density scaling with continued semiconductor process improvement, data reduction techniques have and will play a crucial role in further reducing the overall effective cost of All-Flash Array.

One of the key data reduction techniques is compression. Compression can be performed both inline and offline. In fact, the best All-Flash Arrays often do both: fast inline compression at a lower compression ratio, and slower, opportunistic offline deep compression at significantly higher compression ratio. However, with the rapid growth of both capacity and sustained throughput due to the consolidation of workloads on a shared All-Flash Array platform, a growing percentage of the data never gets the opportunity for deep compression.

There is a deceptively simple solution: Inline Deep Compression with the additional benefits of reduced flash wear and networking load. The challenge, however, is the prohibitive amount of CPU cycles required. Deep compression often requires 10x or more CPU cycles than typical fast inline compression. Even worse, the challenge will continue to grow: CPU performance scaling has slowed down significantly (breakdown of Dennard scaling), but the performance of All-Flash Array has been growing at a far greater pace.
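To make the cycle-cost gap concrete, here is a small sketch (using Python's stdlib codecs as stand-ins, not the compressors used in the system described here) that times a fast, low-effort codec against a heavier one on the same synthetic buffer; exact ratios depend entirely on the data, but the order-of-magnitude difference in CPU time is representative.

    import os, time, zlib, lzma

    # Synthetic, moderately compressible data: random bytes padded with zeros.
    data = b"".join(os.urandom(64) + bytes(192) for _ in range(4096))

    for name, compress in [
        ("fast inline-style (zlib level 1)", lambda d: zlib.compress(d, 1)),
        ("deep offline-style (lzma preset 6)", lambda d: lzma.compress(d, preset=6)),
    ]:
        t0 = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - t0
        print(f"{name}: ratio {len(data) / len(out):.2f}x in {elapsed * 1000:.1f} ms")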

In this talk, I will explain how we can meet this challenge with a domain-specific hardware design. The hardware platform is an FPGA-based PCIe card that is programmable. It can sustain 5+ GByte/s of deep compression throughput with low latency, even for small data block sizes, by exploiting the TByte/s of bandwidth, <10ns latency, and almost unlimited parallelism available on a modern mid-range FPGA device. The hardware compression algorithm is trained with the vast amount of data available to our systems. Our benchmarks show it can match or outperform some of the best software compressors available in the market without taxing the CPU.


Next Generation Datacenters require composable architecture enablers and deterministic programmable intelligence

Jean-Francois Marie, Chief Solution Architect, Kalray

Abstract

For the past few years, flash drives have been pushing performance boundaries. Storage OSes based on x86 architectures, even with more and more cores, have a hard time scaling. Very few architectures can sustain the coming multi-million-IOPS workloads expected from next-generation flash drives and memories. Only a multi-dimension scalable architecture can offer an alternative.

At the heart of it, determinism, parallel programming, and ease of programming are required. In this talk we will explain why this is important, what the key components are, and how you could achieve such performance. We will use Kalray’s Many Core processor and our SDK as an example to offload storage services such as NVMe-oF.


Emerging Data-Centric Storage Architectures

Pankaj Mehra, VP of Storage Pathfinding, Samsung Electronics

Abstract

Today’s datacenter and data lake sizes call for re-imagining traditional storage systems to increase efficiency and scalability. New, data-centric approaches avoid redundant address translations, minimize protocol terminations, and act as intelligent storage subsystems. We’ll focus on the latest trends in SSDs that will dramatically impact storage deployments in 2021 and beyond. Examples include a discussion of Computational Storage and its use cases for speeding up data analytics, cybersecurity search, video transcoding, and AI/ML, as well as Ethernet-attached SSDs and Zoned Namespace SSDs.


SkyhookDM: storage and management of tabular data in Ceph.

Jeff LeFevre, Adjunct Professor of Computer Science & Engineering, University of California, Santa Cruz

Carlos Maltzahn, Adjunct Professor of Computer Science & Engineering, University of California, Santa Cruz

Abstract

The Skyhook Data Management project (skyhookdm.com) at UC Santa Cruz brings together two very successful open source projects, the Ceph object storage system, and the Apache Arrow cross-language development platform for in-memory analytics. It introduces a new class of storage objects to provide an Apache Arrow-native data management and storage system for columnar data, inheriting the scale-out and availability properties of Ceph. SkyhookDM enables single-process applications to push relational processing methods into Ceph and thereby scale out across all nodes of a Ceph cluster in terms of both IO and CPU. To highlight the benefits, we will present performance for various physical layouts and query workloads over example tables of 1 billion rows, as we scale out the number of nodes in a Ceph cluster.

In this talk, we first describe how we partition Apache Arrow columnar data into Ceph objects. We consider both horizontal and vertical partitioning (rows vs. columns) of tables. In contrast to objects storing opaque byte streams where the meaning of the data must be interpreted by a higher level application, Apache Arrow data can be partitioned along semantic boundaries such as columns and rows so that relational operators like selection and projection can be performed in objects storing semantically complete data partitions.

Next we introduce our SkyhookDM extensions that utilize Ceph’s “CLS” plugin infrastructure to execute our methods directly on objects, within the local OSD context. These access methods use the Apache Arrow access library to operate on Arrow data within the context of an individual object and implement relational processing methods, physical data layout changes, and localized indexing of data. Relational processing methods include SELECT, PROJECT, ORDER BY, and GROUP BY with partial aggregations (e.g., local min, max, sum, count, etc.). Physical data layout operations currently supported include transforming objects between row and column layouts, which we plan to extend to co-group columns on objects. Localized indexing is performed as a new object write method and supports index lookups that are beneficial to both point queries and range queries.

SkyhookDM is accessed via a user-level C++ library on top of librados. The SkyhookDM library comes with Python bindings and is used in a PostgreSQL Foreign Data Wrapper. The source code is available at github.com/uccross/skyhookdm-ceph-cls under LGPLv2. SkyhookDM is an open source incubator project at the Center for Research in Open Source Software at UC Santa Cruz (cross.ucsc.edu). This work was in part supported by the National Science Foundation under Cooperative Agreement OAC-1836650.


Implementing Computational Storage Solutions

Neil Werdmuller, Director of Storage Solutions, Arm, Ltd.

Jason Molgaard, Principal Storage Solutions Architect, Arm Inc.

Abstract

Moving large amounts of data between storage and compute cannot scale given the ever-increasing storage capacities. A shift to computational storage that brings compute closer to the stored data provides the solution. Data-driven applications that benefit from database searches, data manipulation, and machine learning can perform better and be more scalable if developers add computation directly to storage. Flexibility is key in the architecture of a Computational Storage device's hardware and software implementation. Hardware flexibility minimizes development cost, controller cost, and controller power. Software flexibility leverages existing ecosystems and software stacks to simplify code development and facilitate workload deployment to the compute on the drive. Implementing hardware and software flexibility in a Computational Storage Drive requires forethought and deliberate consideration to achieve a successful solution. This presentation will show how to simplify computational storage architectures. Attendees will walk away knowing how to reduce the power, area, and complexity of their computational storage controller, and how to leverage Linux and its software ecosystem to facilitate software development and workload management by taking advantage of computational storage capabilities.

Container Storage

Bring Agility and Data Management to your Hybrid Multicloud DevOps strategy with Trident

Ron Feist, Subject Matter Expert Elite for Hybrid Cloud, NetApp

Abstract

Container orchestrators such as Kubernetes enable the automation of deployment, scaling, and management of applications in your cloud of choice. How do you move data on demand between your private and public clouds? Trident is an open-source storage orchestrator for containers maintained by NetApp that makes it trivial to connect the available cloud and on-premises storage options to your containers. This session will outline using Trident in several deployment scenarios for cloud-based containerized applications as well as on-premises storage management using the Container Storage Interface. We will also demonstrate how Trident allows you to move workloads between clouds, and how Astra will help you with automation and cataloging.


Leveraging Modern Network to Deliver Faster Storage to Database Workloads in Kubernetes

Amarjit Singh, Director, DevOps Solutions, Kioxia America

Abstract

Achieving optimal performance for Stateful database workloads has typically required the use of local storage drives that limited orchestration frameworks from scaling these workloads across a networked infrastructure.  With the advancement of storage protocols, such as NVMe and NVMe-oF and networking technologies such as RDMA, RoCE, SmartNICs, it is now possible to achieve DAS-like performance over a network.  Through resource disaggregation, compute and storage can scale independently of each other with the ability to dynamically allocate storage capacity and performance to applications as needed. In this presentation, we will discuss advancements in networking and storage technologies and how they are blending together to address the demands of modern workloads and scheduling frameworks (such as Kubernetes).  In the hands-on session, we will demonstrate how to leverage RDMA/RoCEv2 and TCP to integrate & provision faster storage on Kubernetes platforms.


One CSI Plugin for All? Experimenting with Heterogeneous Storage Using a Single CSI Plugin for Kubernetes

Sushantha Kumar, System Engineer, Huawei Technologies India Private Limited

Mohammad Asif Siddiqui, System Engineer, Huawei Technologies

Abstract

Kubernetes is a popular and mature container orchestration system available today. With the growing number of application deployments, persistent volumes for applications have become a key area. The Container Storage Interface (CSI) is the conventional way by which different backend systems provision storage for Kubernetes applications, so there are numerous CSI drivers available to glue particular storage systems to containerized applications. In an ecosystem with heterogeneous storage backends, managing multiple CSI drivers for different storage systems is both a challenge and an opportunity.

If we can manage heterogeneous storage drivers through a single CSI plugin with Kubernetes, it will ease overall management and configuration. We have done some experiments with one CSI plugin that handles multiple existing CSI drivers.

In this session, we will showcase how this is achieved, with a demo and a discussion of the advantages. We will also show the next step of adding multiple data management features (like snapshot, replication, and more) from SODA Foundation (under the Linux Foundation) open source projects alongside CSI with this integration model.


SPDK-CSI: Bring SPDK to Kubernetes Storage

Yibo Cai, Principal Software Engineer, Arm

Abstract

Kubernetes is the most popular container orchestration system for automating deployment, scaling, and management of containerized applications. The Container Storage Interface (CSI) defines an industry standard that enables storage vendors to develop storage plugins that work with container orchestration systems.

Kubernetes has solid CSI support. Legacy in-tree storage drivers are deprecated, and storage vendors are encouraged to develop their own CSI-compatible storage plugins to integrate with Kubernetes.

The Storage Performance Development Kit (SPDK) provides a full block stack in user space with much higher performance than the traditional kernel stack. SPDK also provides NVMe-oF and iSCSI servers that are capable of serving disks over the network, which is a perfect fit for cloud environments.

We started the open source project SPDK-CSI (https://github.com/spdk/spdk-csi) to enable SPDK as a storage provider for Kubernetes. It provisions SPDK-backed volumes dynamically and enables Pods to access SPDK storage through NVMe-oF or iSCSI transparently.

In this talk, we will share our work on developing the SPDK-CSI plugin and introduce CSI internals, the detailed SPDK-CSI design and coding practices, CSI deployment and validation, SPDK JSON-RPC, and more. This knowledge can be useful for those who are interested in Kubernetes CSI plugin development, or who want to leverage SPDK to provide a high-performance containerized storage solution in Kubernetes.
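As a taste of the SPDK JSON-RPC interface mentioned above, here is a minimal sketch of a client that talks to a locally running SPDK application over its default Unix socket and lists the configured bdevs; the socket path and method name follow SPDK's documented defaults, and error handling is kept minimal.

    import json
    import socket

    def spdk_rpc(method, params=None, sock_path="/var/tmp/spdk.sock"):
        """Send one JSON-RPC 2.0 request to the SPDK app and return the reply."""
        request = {"jsonrpc": "2.0", "id": 1, "method": method}
        if params:
            request["params"] = params
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(sock_path)
            sock.sendall(json.dumps(request).encode())
            buf = b""
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    raise ConnectionError("socket closed before a full reply")
                buf += chunk
                try:
                    return json.loads(buf.decode())   # complete reply received
                except json.JSONDecodeError:
                    continue                          # keep reading

    # List the block devices SPDK currently exposes (e.g. NVMe or malloc bdevs).
    print(spdk_rpc("bdev_get_bdevs"))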


Dynamic Provisioning in Kubernetes of Persistent Volume (PV) and PV Claim (PVC)

Divya Vijayakumar, Senior Software Engineer, MSys Technologies

Arun Kandasamy, Lead Programmer, MSys Technologies

Abstract

This presentation focuses on provisioning storage dynamically for a large framework orchestrated using Kubernetes. Generally, K8s Pods are mortal in nature, so when a pod dies, the data created inside it is lost forever. We propose a solution called Dynamic Provisioning to overcome this data loss and to avoid the manual creation of Persistent Volumes (PVs) and PV Claims (PVCs). The audience will learn to use the Container Storage Interface (CSI) volume methodology to create storage volumes on demand at run time.

The session will help attendees understand how the Kubernetes CSI driver enables a Container Storage Provider (CSP) to perform data management operations on storage resources. The session will explain how CSI creates and deploys plugins to expose new storage systems, with load balancing, in Kubernetes without tweaking the core code. We will also learn how the storage class and central storage provided by a Nimble Storage Array can be leveraged to create and manage CSI volumes.
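As a minimal sketch of the workflow described above, the snippet below uses the official Kubernetes Python client to create a PersistentVolumeClaim against a CSI-backed StorageClass; the class name "csi-block" and the namespace are placeholders, and the CSI external-provisioner is assumed to create the PV and backing volume on demand.

    from kubernetes import client, config

    config.load_kube_config()                      # or load_incluster_config() inside a Pod
    core = client.CoreV1Api()

    # Claim 10 GiB from a CSI-backed StorageClass; no PV is created by hand.
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="demo-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name="csi-block",        # placeholder StorageClass name
            resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
        ),
    )
    core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)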

Data Protection and Data Security

Data Sovereign Collaboration Platform for Autonomous Vehicles

Radha Krishna Singuru, DMTS - Senior Member, Wipro Technologies

Akhil Gokhale, Managing Consultant, Wipro Technologies

Abstract

In today’s digital world, data sovereignty has become a big area of concern for various countries. Many countries are concerned about data that is generated locally being processed and persisted beyond their geographical control. This is particularly relevant in the age of cloud computing, with global development centers processing and persisting data.

Governments across the globe are enacting new and stringent laws for data sovereignty. This is pushing various industries, specifically the autonomous vehicle industry, to innovate more and create new business models for data sovereignty compliance. There is a need to build an innovative Rapid Collaboration Platform (RCP) that can address data sovereignty concerns about where data will be stored and how it complies with local laws, and ensure data privacy and data security while enabling business-as-usual activities to be performed more efficiently.

AV RCP is a collaboration platform that enables data-driven development of autonomous driving, built on the principles of distributed architecture to solve various challenges faced by autonomous vehicle development teams. It enables seamless access to vehicle test data for various users such as software engineers, algorithm developers, applied scientists, ML engineers, etc. It provides a global search engine, a native compute engine, and various data management tools so that test data is available and accessible to teams spread across various regions. The distributed architecture ensures data remains at the place where it is collected. This not only ensures data sovereignty but also optimizes network bandwidth usage and cost, and helps improve the productivity of the overall AV development team. It supports hybrid cloud deployment options that can build on existing on-premises infrastructure investments as well as leverage the latest technological capabilities from public clouds.

The same platform can be extended for other industry segments like Health, Auto, Energy, Utilities etc.


Data Preservation & Retention 101

Thomas Rivera, Strategic Success Manager, VMware Carbon Black

Abstract

There are many instances in which the terms "retention" and "preservation" are used interchangeably and incorrectly. This can result in different and conflicting requirements that govern how the same information is maintained, how long it must be kept, and whether and how it is protected and secured. This session highlights the differences between retention and preservation.


OS Level Encryption for Superior Data Protection

Peter Scott, Senior Engineer, Thales, Inc

Rajesh Gupta, Senior Engineer, Thales, Inc

Abstract

While protecting data at rest, or live data, using a hardware-based approach is efficient and fast, it does not allow the flexibility of per-file access control and data protection. The approach we have taken at Thales allows for per-file access control and transparent data protection while providing the flexibility to rotate keys within a distributed key management system without affecting access. This solution covers a wide range of platforms, but this talk will be limited to the Windows implementation, which leverages a Layered File System to achieve transparency. Some of the features that will be discussed include:

  • How to support per file access control in a distributed system
  • Managing access to files undergoing a transformation or key rotation in both local and network environments
  • Allowing for access to encrypted content while providing clear text access to files simultaneously

Diving into each of these topics, with sidebars, we will give the audience a clear picture of the complexities involved. For example, in a distributed environment, how does one ensure that during key rotation all clients are using the correct key for data encryption for various ranges of the file without falling back to single-use access?
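As a generic illustration of the key-rotation property being described (this is not the Thales design or its API), the sketch below uses the Python cryptography package: ciphertext written under an old per-file key stays readable from a keyring that holds both keys, and can be re-encrypted under the new key without interrupting access.

    from cryptography.fernet import Fernet, MultiFernet

    old_key = Fernet(Fernet.generate_key())
    ciphertext = old_key.encrypt(b"per-file data encrypted last year")

    new_key = Fernet(Fernet.generate_key())
    keyring = MultiFernet([new_key, old_key])     # newest key first

    # Old ciphertext remains readable during the rotation window...
    assert keyring.decrypt(ciphertext) == b"per-file data encrypted last year"

    # ...and can be re-encrypted under the new key when convenient.
    rotated = keyring.rotate(ciphertext)
    assert new_key.decrypt(rotated) == b"per-file data encrypted last year"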

Integration with Windows subsystems such as the Cache Manager and Memory Manager will be covered, to ensure the subtleties of supporting concurrent multi-data-form access are not lost, as well as where to draw the line in terms of allowing the native file system to maintain some metadata without losing robustness and flexibility in the design. We’ll answer this and more while covering the details of the design to achieve live data protection.

An understanding of the Windows layered driver model, particularly in the area of file systems and file system filters, will help in understanding the topics discussed.


Re-Imagining the 3-2-1 Backup Rule for Cloud Native Applications Running on Kubernetes

Jagadish Mukku, Technical Director, Robin.io

Abstract

US-CERT, in its 2012 publication Data Backup Options, recommended the 3-2-1 backup rule: keep 3 copies of your data, 2 copies on different media, and 1 offsite. As simple as it seems, it applies to applications running on Kubernetes to protect against node and disk failures, manual errors, and natural disasters.

In this presentation, we will describe the storage architecture and the various Kubernetes building blocks used to implement this practice. The details will include how to keep two copies of application data on two different disks (app-replicated or CSI-replicated storage). We will cover the intricacies and challenges of capturing a stateful app snapshot, including Kubernetes objects and persistent volume data, so that the backup copy can be transferred to an offsite location. In the context of the offsite copy, we will look into the storage architecture for using various cloud storage media, including object stores, to enable use cases like disaster recovery, analytics, and test and dev using cloud compute. This will be followed by a 3-2-1 rule demo of a complex distributed stateful app like MongoDB/Cassandra.
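A minimal sketch of the first leg of that workflow, assuming a CSI driver with snapshot support and a VolumeSnapshotClass named "csi-snapclass" (both placeholders): the Kubernetes Python client creates a CSI VolumeSnapshot of the application's PVC, which backup tooling can then export to an offsite object store.

    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    snapshot = {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": "mongodb-data-snap-1"},
        "spec": {
            "volumeSnapshotClassName": "csi-snapclass",               # placeholder class
            "source": {"persistentVolumeClaimName": "mongodb-data"},  # the app's PVC
        },
    }

    # VolumeSnapshot is a CRD, so it goes through the custom objects API.
    api.create_namespaced_custom_object(
        group="snapshot.storage.k8s.io", version="v1",
        namespace="default", plural="volumesnapshots", body=snapshot,
    )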


Data Protection in a Kubernetes-Native World

Niraj Tolia, CEO, Kasten

Abstract

What is Kubernetes-native backup, and does one really need backup with Kubernetes? Will Kubernetes environments stay stateless forever? Why don’t legacy VM-backup systems work with containers? This talk gets to the bottom of these questions and more!

In particular, we will cover seven critical considerations for Kubernetes-native backup and show their importance in implementing a cloud-native backup strategy that will protect your business-critical data in a developer-focused platform:

  • Kubernetes Deployment Patterns
  • DevOps and "Shift Left"
  • Kubernetes Operator Challenges
  • Application Scale
  • Protection Gaps
  • Security
  • Ecosystem Integration

We will also cover the pitfalls of trying to retrofit legacy backup architectures into a cloud-native ecosystem but, more importantly, focus on the benefits of deploying a truly cloud-native backup solution.


FC-Encryption at wirespeed

Hannes Reinecke, Kernel Storage Architect, SUSE Software Solutions

Nishant Lodha, Director of Technologies, Marvell

Abstract

FC SANs are deployed in over 90% of Fortune 1000 customer data centers that run mission-critical storage workloads. Ever-increasing threat vectors and tightening regulation are driving customers in healthcare, banking, and defense to better secure their storage networks. While FC SAN encryption as defined in FC-SP-2 is well defined and stable, an implementation has long been missing due to the complexity and the encryption performance required, making hardware encryption offload the best option.

But a hardware implementation requires complex infrastructure on the OS side to allow for the necessary key handling and negotiation.

These mutual dependencies have long prevented any usable implementation.

In this talk Marvell and SUSE will present a combined solution that offloads encryption into the hardware, while having the infrastructure in place for key handling via a strongSwan adaptation in the SUSE Linux Enterprise OS.

With this setup we are able to achieve near line speed with encrypted FC traffic, with all the key management functionality mandated by the specification.


TLS for Storage Systems

Eric Hibbard, Chair, SNIA Security TWG, Managing Director, PrivSec Consulting LLC

Abstract

Transport Layer Security (TLS), sometimes referred to as SSL (deprecated predecessor), is an important mechanism for preventing eavesdropping, tampering, and message forgery of network-based communications between clients and servers. The stream-oriented TLS is designed to run on top of a reliable transport protocol (e.g., TCP); however, the Datagram Transport Layer Security (DTLS) provides similar security guarantees for datagram-based applications. To fully exploit the security protections of TLS and DTLS, care must be exercised in selecting certain options and features (e.g., cipher suites) as well as correctly handling operational details (e.g., certificate validation and management). As with many aspects of security, TLS/DTLS must be adjusted to respond to changes in the threat landscape, so these adjustments need to be factored into TLS/DTLS implementations and use.

TLS and DTLS, to a lesser degree, are important security protocols used with many storage systems, which increasingly use RESTful APIs and Web-based management interfaces (e.g., SMI-S, CDMI, and Swordfish). This session highlights important TLS/DTLS details that are relevant to storage systems. In addition, information will be provided on recent changes and anticipated changes that could have an impact on storage infrastructures.
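As a small illustration of the option-selection care the session describes, the sketch below configures a client-side TLS context in Python for a storage REST endpoint (the URL is a placeholder): it pins a modern protocol floor, keeps hostname and certificate validation enabled, and restricts the TLS 1.2 cipher suites to forward-secret AEAD choices.

    import ssl
    import urllib.request

    ctx = ssl.create_default_context()              # CA bundle from the platform
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2    # refuse SSLv3 / TLS 1.0 / TLS 1.1
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")  # TLS 1.2 suites; TLS 1.3 uses its own list

    # Placeholder Swordfish/Redfish-style management endpoint.
    with urllib.request.urlopen("https://array.example.com/redfish/v1/Systems",
                                context=ctx) as resp:
        print(resp.status)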


Trustworthy Storage - Expectations and Realities

Eric Hibbard, Chair, SNIA Security TWG, Managing Director, PrivSec Consulting LLC

Abstract

As security capabilities are added to storage technologies, storage-based systems and solutions can serve as a last line of defense in an organization’s defense-in-depth strategy. While these storage security developments are important, the threat landscape continues to change in negative ways, so new responses are needed. Models such as zero trust and trustworthiness have emerged as potential approaches for dealing with the near-ubiquitous threats. In essence: trust nothing, verify everything, and design for failures. Easy enough to state, but the reality is fraught with many challenges.

This session highlights important storage security elements that can serve as building blocks for these models. In addition, the concepts behind zero trust and trustworthiness are explored with an eye to storage, both traditional and cloud based. Lastly, the drivers (e.g., regulations) for adopting these new models and the standards/specifications that outline what is necessary will be discussed.


Ransomware—Is it the Ultimate Malware?

Eric Hibbard, Chair, SNIA Security TWG, Managing Director, PrivSec Consulting LLC

Abstract

Malware, short for malicious software, is a blanket term for viruses, worms, trojans, and other harmful software that attackers use to damage, destroy, and gain access to sensitive information; software is identified as malware based on its intended use, rather than a particular technique or technology used to build it. Ransomware is a particularly nasty version of malware that typically encrypts a victim's files and then requires the victim to pay a ransom (usually in cryptocurrency) to the attacker to regain access to the data, with no guarantees. A more aggressive variant on this theme, which some call doxware or extortionware, goes further and threatens to release copies of private data to the public if payment is not made.

This session provides information about ransomware, including common vectors, as well as detailing some of the types of ransomware that are currently plaguing organizations. Current counter techniques are presented along with their limitations. Lastly, the storage layer is explored as a possible defensive mechanism (current and hypothetical).

File Systems

Marchive: Extending MarFS to a Long Term Archive

Garrett Ransom, Scientist, Los Alamos National Laboratory

Abstract

In response to the ever increasing bandwidth, capacity, and resiliency requirements of HPC data storage, Los Alamos National Laboratory developed MarFS, an open source filesystem providing a near-POSIX interface atop abstracted data and metadata storage implementations. For several years, the MarFS library has provided a high bandwidth, high resiliency storage tier for production data, known as Campaign Storage. The success of MarFS in this context, as well as the flexible nature of its underlying data and metadata storage, has spurred interest in extending the codebase to support the data archive needs of the Laboratory. Known as Marchive, this archive system will provide long term stability by storing parity protected data objects across magnetic tape media and expose a batch interface for efficient data ingest and retrieval.

This presentation will review the concept of MarFS, describe the extension of that concept to form a Marchive system, and relate some of the more interesting solutions to have emerged from this effort.


Tracing and visualizing file system internals with eBPF superpowers

Suchakrapani Sharma, Staff Scientist, ShiftLeft Inc

Hani Nemati, Software Engineer, Microsoft

Abstract

The Linux kernel storage stack consists of several interconnected layers, including the Virtual File System (VFS), the block layer, and device drivers. VFS provides the main interface to userspace applications and is where files and directories are handled. As we go deeper, many of the accesses are translated to actual I/O operations in the block layer of the kernel. Investigating storage performance issues requires full insight into all these layers.

In this talk, we begin by discussing the journey of a simple filesystem call from userspace all the way into the kernel. We explain how tools like Ftrace can be used to understand control flow inside the kernel. Once we understand the “points of interest” in the control flow of how the kernel handles the request from userspace, we then move on to discuss eBPF based approaches to compute meaningful storage performance/security metrics. We will showcase this with our small and nifty framework that includes a visualization system with different graphical views that represent the collected information about disk accesses in a convenient way. The goal of our talk is not just to show “yet another iotop like tool”, but to highlight the versatility of eBPF VM in the linux kernel that now allows developing targeted, plug and play tools to gather precise data about a system’s activity for security and performance debugging. To this end, we will explain in-depth what actually happens when such targeted eBPF based probing is used to extract meaningful data from the kernel. We explain the plumbing behind simple observability tools such as biolatency, vfsstat etc. [1] that have been built using eBPF and how to build a custom tool yourself.

[1] https://github.com/iovisor/bcc#tools
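In the spirit of the bcc tools cited in [1], here is a minimal bcc/eBPF sketch (run as root on a kernel with eBPF support) that attaches kprobes to vfs_read and prints a log2 latency histogram, the same pattern tools like biolatency use at the block layer.

    from bcc import BPF
    import time

    prog = """
    #include <uapi/linux/ptrace.h>

    BPF_HASH(start, u32, u64);
    BPF_HISTOGRAM(dist);

    int trace_entry(struct pt_regs *ctx) {
        u32 tid = bpf_get_current_pid_tgid();
        u64 ts = bpf_ktime_get_ns();
        start.update(&tid, &ts);
        return 0;
    }

    int trace_return(struct pt_regs *ctx) {
        u32 tid = bpf_get_current_pid_tgid();
        u64 *tsp = start.lookup(&tid);
        if (tsp == 0)
            return 0;
        dist.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
        start.delete(&tid);
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_kprobe(event="vfs_read", fn_name="trace_entry")
    b.attach_kretprobe(event="vfs_read", fn_name="trace_return")

    print("Tracing vfs_read latency for 10 seconds...")
    time.sleep(10)
    b["dist"].print_log2_hist("usecs")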


AuriStorFS, the next generation AFS, and Linux kernel AFS

Jeffrey Altman, CEO/CTO, AuriStor, Inc.

David Howells

Abstract

AuriStorFS is a next generation AFS-family file system that transformed the thirty-five year old AFS architecture into a secure global namespace backed by a location-independent object storage infrastructure. The Linux kernel AFS/AuriStorFS client implementation, packaged by several major distributions, addresses perhaps the greatest historical weakness of AFS-family filesystems compared to NFS and SMB: the lack of a native out-of-the-box client experience.

This talk will discuss:

  • The enhancements of AuriStorFS over IBM AFS version 3 client-to-object store RPCs and the RX network transport with a focus on the design motivations.
  • The Linux kernel implementations of the AFS/AuriStorFS filesystem, AF_RXRPC socket family, fscache, and keyrings.
  • The pros and cons of the native Linux AFS/AuriStorFS kernel implementation compared to out-of-tree file system implementations such as AuriStorFS and OpenAFS.
  • The benefits of separating authorization decisions from client local identities.
  • Zero-configuration global filesystem namespaces constructed by use of DNS SRV records, the AuriStorFS location service, and AFS mount point objects.
  • Future development directions in support of containerization and over-capacity compute work flows.

Global File System View Across All Hadoop-Compatible File Systems with Lightweight Client-Side In-Memory Mount Points.

Uma Maheswara Rao Gangumalla, Principal Software Engineer, Cloudera

Abstract

The Apache Hadoop File System layer has integrations with many popular file systems, including cloud storage such as S3 and Azure Data Lake Storage, along with the in-house Apache Hadoop Distributed File System. When users want to migrate between file systems, it is very difficult for them to update their metadata stores when they persist file system paths with schemes. For example, Apache Hive persists URI paths in its metastore.

In Apache Hadoop, we came up with a solution (HDFS-15289) to this problem: the View FileSystem Overload Scheme. In this talk, we will cover in detail how users can enable it and how easily they can migrate data between file systems without modifying their metastores. It is completely transparent to users with respect to file paths. We will present a use case with Apache Hive partitioning: the user can move one or more partitions' data to a remote file system and simply add a mount point on the default file system (e.g., HDFS) they were working with. Hive queries will then work transparently from the user's point of view even though the data resides in a remote storage cluster, e.g., Apache Hadoop Ozone or S3. This is very useful when users want to move certain kinds of data, e.g., cold partitions or small files, to remote clusters from a primary HDFS cluster without affecting applications.

The mount tables are maintained on a central server; all clients load the tables while initializing the file system and can refresh them when mount points are modified, so that all initializing clients stay in sync. This makes it much easier for users to migrate data between cloud and on-premises storage in a flexible way.


NoLoad Filesystem: A stacked filesystem for NVMe-Based Computational Storage

David Sloan, Principal Engineer, Eideticom

Logan Gunthorpe, Principal Engineer, Eideticom

Stephen Bates, Chief Technology Officer, Eideticom

Abstract

Computational Storage is an emerging technology that aims to make computer systems more efficient by moving certain parts of applications closer to the storage layer. In compute tasks such as compression, encryption, error detection, and error correction, the target application does not need to be made aware of any of the underlying details of the acceleration target. In these cases, inserting the acceleration between the application and the operating system is an ideal solution, as it removes the development effort required to integrate an accelerator into new software. Stacked filesystems are a good candidate for inserting new compute elements between applications and storage devices. They allow transparent access from the application's perspective and full architectural flexibility for system administrators, who can use existing filesystem and RAID configurations. In this presentation we will demonstrate a stacked filesystem approach to data compression/decompression which allows high-speed hardware acceleration to act as a transparent layer between applications and storage devices. Using this approach, all status reporting and acceleration control is achieved through existing Linux filesystem APIs.


Boosting OpenZFS Metadata Performance with Special Allocation Classes: A Case Study with TrueNAS Fusion Pools

Nick Principe, Platform and Performance Engineering Supervisor, iXsystems

Abstract

One of the many exciting features introduced in TrueNAS CORE, the new FreeNAS, is special allocation classes - or, as we’re calling them, Fusion Pools. This feature allows ZFS metadata, dedupe tables, and even small data blocks to be stored on a faster tier of storage, such as NVMe flash, than the bulk of the data in the storage pool, which could reside on hard drives or less expensive QLC flash. In this session, we’ll go over the basic operation of Fusion Pools in the OpenZFS data flow, the various options available for Fusion Pools, and finally show the performance advantages of using Fusion Pools in a few different scenarios.
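For readers who want to try the feature outside the TrueNAS UI, the sketch below drives the underlying OpenZFS commands from Python (device names are placeholders; run as root): it creates a pool whose metadata lives on an NVMe mirror assigned to the special allocation class, and also steers small data blocks there via the special_small_blocks property.

    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Bulk data on a raidz2 of hard drives, metadata on an NVMe mirror
    # assigned to the "special" allocation class.
    run(["zpool", "create", "tank",
         "raidz2", "da0", "da1", "da2", "da3", "da4", "da5",
         "special", "mirror", "nvd0", "nvd1"])

    # Also place data blocks of 32K or smaller on the special vdevs.
    run(["zfs", "set", "special_small_blocks=32K", "tank"])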


Amazon FSx For Lustre Deep Dive and its importance in Machine Learning

Suman Debnath, Principal Developer Advocate, Amazon Web Services

Abstract

Amazon FSx for Lustre is a fully managed service that makes it easy and cost-effective for AWS customers to launch and run a Lustre high-performance file system for their data-intensive applications. In this talk I will introduce the features and benefits of the service, such as its massively scalable performance, seamless integration with Amazon S3 (object storage), and compatibility with customer applications. Among several use cases, we will see how this file system can accelerate and simplify training machine learning models.
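As a hedged sketch of how such a file system is typically provisioned for an ML training workflow, the boto3 call below creates a scratch FSx for Lustre file system linked to an S3 bucket (the subnet ID, bucket names, and sizing are placeholders); the import/export paths let the file system lazily hydrate objects as files and write results back to S3.

    import boto3

    fsx = boto3.client("fsx")   # region/credentials come from the environment

    response = fsx.create_file_system(
        FileSystemType="LUSTRE",
        StorageCapacity=1200,                              # GiB
        SubnetIds=["subnet-0123456789abcdef0"],            # placeholder subnet
        LustreConfiguration={
            "DeploymentType": "SCRATCH_2",
            "ImportPath": "s3://my-training-data",         # lazy-load objects as files
            "ExportPath": "s3://my-training-data/results", # write results back to S3
        },
    )
    print(response["FileSystem"]["DNSName"])               # mount target for clients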

Keynotes

Analog Memory-based techniques for Accelerating Deep Neural Networks

Sidney Tsai, Research Staff Member, Manager, IBM

Abstract

Deep neural networks (DNNs) are the fundamental building blocks that allowed explosive growth in machine learning sub-fields, such as computer vision and natural language processing. Von Neumann-style information processing systems are the basis of modern computer architectures. With Moore's Law slowing and Dennard scaling ended, data communication between memory and compute, i.e., the “Von Neumann bottleneck,” now dominates considerations of system throughput and energy consumption, especially for DNN workloads. Non-Von Neumann architectures, such as those that move computation to the edge of memory crossbar arrays, can significantly reduce the cost of data communication.

Crossbar arrays of resistive non-volatile memories (NVM) offer a novel solution for deep learning tasks by computing matrix-vector multiplication in analog memory arrays. The highly parallel structure and computation at the location of the data enables fast and energy-efficient multiply-accumulate computations, which are the workhorse operations within most deep learning algorithms. In this presentation, we will discuss our Phase-Change Memory (PCM) based analog accelerator implementations for training and inference. In both cases, DNN weights are stored within large device arrays as analog conductances. Software-equivalent accuracy on various datasets has been achieved in a mixed software-hardware demonstration despite the considerable imperfections of existing NVM devices, such as noise and variability. We will discuss the device, circuit and system needs, as well as performance outlook for further technology development.
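A toy numerical sketch of the idea (NumPy here stands in for the analog array): the layer's weights are stored as conductances with some write noise, and the crossbar's column currents compute the matrix-vector product, i.e. the multiply-accumulate step, in one shot.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 784))            # trained layer weights
    x = rng.standard_normal(784)                   # input activations ("voltages")

    ideal = W @ x                                  # exact digital matrix-vector product

    G = W + rng.normal(scale=0.05, size=W.shape)   # conductances with device noise
    analog = G @ x                                 # what the crossbar columns sum up

    err = np.linalg.norm(analog - ideal) / np.linalg.norm(ideal)
    print(f"relative error from device noise: {err:.3f}")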


Introducing SDXI(Smart Data Acceleration Interface): A new SNIA TWG to standardize a memory to memory data movement and acceleration interface

Shyamkumar Iyer, Distinguished Member of Technical Staff, Dell Office of CTO, Dell

Richard Brunner, Principal Engineer and CTO of Server Platform Technologies, VMware

Abstract

Software-based memory-to-memory data movement is common, but takes valuable cycles away from application performance. At the same time, offload DMA engines are vendor-specific and may lack capabilities around virtualization and user-space access.

This talk will focus on how SDXI (Smart Data Acceleration Interface), a newly formed SNIA TWG, is working to bring about an extensible, virtualizable, forward-compatible, memory-to-memory data movement and acceleration interface specification.

As new memory technologies get adopted and memory fabrics expand the use of tiered memory, data mover acceleration and its uses will increase. This TWG will encourage adoption and extensions to this data mover interface.


Unlocking the Potential of Flash with the New Open Source Software-Enabled Flash™ API

Eric Ries, SVP, Memory and Storage Strategy Division, KIOXIA America, Inc.

Abstract

Hyperscale applications are constantly redefining storage requirements for greater efficiency at cloud scale. Flash as a digital medium opens new opportunities to hyperscale architects and developers for fine-grained control over performance, latency, guaranteed quality of service, and the ability to define data placement and isolation of workloads from the host or the application. KIOXIA (formerly Toshiba Memory) has released technology to the open source community that redefines digital storage, combining software flexibility with purpose-built hardware into a new flash-native API that maximizes the value of flash memory for cloud providers. This combination of technologies fundamentally redefines the relationship between the host and solid-state storage devices, bypasses legacy storage paradigms, unlocks host control, and enables the use of flash natively at maximum speed.

What this means for cloud developers, storage architects, and innovators is that a new set of tools is available for defining and deploying flash rapidly and efficiently. The API abstracts the low-level media management, presenting a higher-level, application-centric view of flash with code compatibility across different flash technologies, vendors, and generations. This means shorter development times for new services, reuse of code, and rapid time to market for each new generation of flash. KIOXIA will present this new flash-native technology, Software-Enabled Flash™, and show how it provides an open source framework for developers to solve storage challenges at hyperscale.


Storage Next: Disruption, Opportunities and Funding Insights

Parag Kulkarni, Chief Operating Officer, Calsoft

Vipin Shankar, VP of Engineering, Calsoft

Abstract

Storage technology is gradually pivoting towards an all-flash, software-defined, AI-driven, and disruptive model to better manage the critical dimensions of data, i.e., storage, protection, and analytics. This disruption in storage technology has propelled organizations to achieve the next level of cost savings, efficiency, and an environment that is predictive and supports future data growth and scalability needs. While it is tough to predict the future, this keynote attempts to throw ample light on the overall health of the storage industry, the prevailing micro trends, ongoing disruptions, venture funding and sector analysis, and the emerging market opportunities to focus on.


Amazon Elastic File System: Building Blocks for a Cloud-Native File System

Jacob Strauss, Principal Engineer, Amazon

Abstract

Amazon Elastic File System is a scalable, elastic, and cloud-native NFS file system. This talk will describe the block layer within EFS and why we incorporated concurrency primitives into the interface that upper layers use to construct files and directories. The block layer presents a logical namespace that lets callers use a simple transactional interface that hides topics such as consensus, replication, recovery, and most types of coordination. Customer applications require a wide variety of latency and throughput combinations. Creating one fully distributed and one concentrated implementation of the same EFS block interface allows us to support two different performance profiles on the same underlying storage.


What’s going on with NVMe? An Examination of New Technology Adoption

Mike Scriber, Sr. Director, Server Solution Management, Supermicro

Abstract

I will discuss the adoption of NVMe in our industry and what is happening with NVMe. How are CPUs going to change NVMe? Where is EDSFF going? Why do we need NVMe-oF?


Caching on PMEM: an Iterative Approach

Yao Yue, Sr. Staff Software Engineer, Twitter

Juncheng Yang, Carnegie Mellon University & Twitter

Abstract

With PMEM boasting a much higher density and DRAM-like performance, applying it to in-memory caching such as memcached seems like an obvious thing to try. Nonetheless, there are questions when it comes to new technology. Would it work for our use cases, in our environment? How much effort does it take to find out if it works? How do we capture the most value with reasonable investment of resource? How can we continue to find a path forward as we make discoveries? At Twitter, we took an iterative approach to explore cache on PMEM. With significant early help from Intel, we started with simple tests in memory mode in a lab environment, and moved on to app_direct mode with modifications to Pelikan (pelikan.io), a modular open-source cache backend developed by Twitter. With positive results from the lab runs, we moved the evaluation to platforms that more closely represent Twitter’s production environment, and uncovered interesting differences. With better understanding of how Twitter’s cache workload behaves on the new hardware, and our insight into Twitter’s cache workload in general, we are proposing a new cache storage design called Segcache that, among other things, offers flexibility with storage media and in particular is designed with PMEM in mind. As a result, it achieves superior performance and effectiveness when running either on DRAM or PMEM. The whole exploration was made easier by the modular architecture of Pelikan, and we added a benchmark framework to support the evaluation of storage modules in isolation, which also greatly facilitated our exploration and development.


Introduction to Virtual SDC

Michael Oros, Executive Director, SNIA

Abstract

Introduction to Virtual SDC


Introduction to SNIA

Dr. J Metz, Chairman, SNIA

Abstract

Introduction to SNIA

Key Value

Key Value Standardized

William Martin, SSD I/O Standards, Samsung

Abstract

The NVMe Key Value (NVMe-KV) Command Set has been standardized as one of the new I/O Command Sets that NVMe supports. Additionally, SNIA has standardized a Key Value API that works with NVMe Key Value and allows access to data on a storage device using a key rather than a block address. The NVMe-KV Command Set uses the key to store a corresponding value on non-volatile media, then retrieves that value from the media when the corresponding key is specified. Key Value allows users to access key-value data without the costly and time-consuming overhead of additional translation tables between keys and logical blocks. This presentation will discuss the benefits of Key Value storage, present the major features of the NVMe-KV Command Set and how it interacts with the NVMe standards, and present open source work that is available to take advantage of Key Value storage.


A low latency and scalable key value store from modern off the shelf components

Daniel Pollack, CTO, Data Storage Science LLC

Abstract

The current state of off-the-shelf software-defined storage systems and high-performance hardware is enabling simplified high-performance systems that used to be complex and difficult to scale. By combining simple mainstream technologies like object storage systems with NVMe-oF storage hardware, it is possible to implement a very low-cost, very low-latency, and very scalable key-value store. The benefits include ease of deployment and management, use of open source software, up-to-date OS integration, and commodity component costs. Awareness of common use cases and past development efforts has resulted in the development of simpler and more attainable solutions for performance and scale using these components. I have built a demonstration based on MinIO and NVMe-oF/TCP that has latency performance comparable to in-memory key-value stores, with the cost and scale associated with disaggregated flash storage systems.
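A minimal sketch of the key-value interface side of such a system, using the MinIO Python SDK against a placeholder endpoint and credentials; each PUT/GET is addressed purely by key, with the object store running over disaggregated NVMe-oF/TCP storage handling placement and scale underneath.

    import io
    from minio import Minio

    kv = Minio("minio.example.com:9000",            # placeholder endpoint
               access_key="minioadmin", secret_key="minioadmin", secure=False)

    if not kv.bucket_exists("kvstore"):
        kv.make_bucket("kvstore")

    value = b'{"user": 42, "score": 7}'
    kv.put_object("kvstore", "user:42", io.BytesIO(value), length=len(value))  # PUT by key
    print(kv.get_object("kvstore", "user:42").read())                          # GET by key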

NVMe

pynvme: an open, fast and extensible NVMe SSD test tool

He Chu, Engineer, Geng Yun Technology Pte. Ltd.

Abstract

SSDs are becoming ubiquitous in both Client and Data Center markets. The requirements on function, performance and reliability are refreshed frequently. As a result, SSD designs, and especially the firmware, have kept being upgraded and restructured over the past decade.

Testing keeps this change under control. However, firmware testing is not as mature as software testing. We have well-developed methodologies, processes and tools for software, but the embedded platform where the firmware executes provides only limited computation and memory resources, so it is difficult to run full tests in the native embedded environment. In practice, SSD vendors run system tests with third-party software, consuming huge resources. Existing tools lack the flexibility to test efficiently against a vendor's own features and flaws. SSD developers need an infrastructure to implement their test scripts or programs at low cost. pynvme is our answer.

Pynvme is open. It is not only an open-source project, but also a testing solution that utilizes the open software ecosystem, so we can reuse mature testing software from the cloud era for SSD testing. Pynvme is very fast, even faster than FIO. It is based on a user-space driver which accesses NVMe drives directly and bypasses the overhead of the whole storage software stack in the Linux kernel. Pynvme is extensible. We can access any PCIe configuration and BAR space to implement our own test-dedicated NVMe driver behavior in Python scripts. Based on pynvme, test developers can write and deploy test scripts efficiently with a lower software and hardware budget.


libnvme: An open source library for NVM Express

Keith Busch, Technologist, Western Digital Corporation

Abstract

The NVM Express workgroup is introducing new features frequently, and the Linux kernel support for these devices evolves with it. These ever-moving targets create challenges for tool developers when new interfaces are created or older ones change.

This talk will provide information on some of these recent features and enhancements, and introduce the open source 'libnvme' project, a library available in public git repositories that provides access to all NVM Express features through convenient abstractions over the kernel interfaces used to interact with your devices.

The session will demonstrate integrating the library with other programs, and also provide an opportunity for the audience to share what additional features they would like to see out of this common library in the future.


Future Proof Your Data Center with NVMe™ Technology

John Kim, Director of Storage Marketing, Networking Business Unit, NVIDIA

Abstract

As data centers evolve to contain time-sensitive, high-value storage, it is increasingly important that data center SSDs and all-NAND flash arrays (AFAs) are designed with future-proof technology. Enter the NVM Express specifications.

Since its first release in 2011, the NVMe™ specification has evolved into the industry standard for PCIe SSDs. The NVMe-oF™ specification, first released in 2016, allowed NVMe technology to extend to additional transports beyond PCIe architecture, such as Ethernet, Fibre Channel, InfiniBand, RoCE, and TCP/IP. This breakthrough allowed further proliferation of NVMe technology’s low latency, scalable, flash storage technology.

In this presentation, attendees will receive a technical dive into the key emerging NVMe and NVMe-oF technology features that are crucial for future-proofing the data center.

The latest NVMe 1.4 specification provides improved quality of service (QoS), faster performance, high availability improvements, and scalability upgrades. Attendees will learn how to take advantage of new features such as Zoned Namespaces (ZNS), Endurance Groups, Sanitize enhancements, Rebuild Assist, Persistent Event Log, NVM Sets, IO Determinism, Multipathing, Asymmetric Namespace Access (ANA) and more, which cater to the needs of the evolving data center. Attendees will also learn about the new features in the NVMe-oF 1.1 specification and how they will have a significant impact on ROI and overall performance for networked storage.


Copy offload meets NVMe

Chaitanya Kulkarni, Principal Engineer, Western Digital Corporation

Abstract

Copy offload is a feature that instructs a storage device to copy sectors internally, without transferring data to and from the host and without the host CPU issuing the multiple read and write I/O requests that such a copy would otherwise require.

Single-thread performance is limited by the end of Dennard (MOSFET) scaling, and multi-thread performance gains are slowing due to Moore's law limitations. That means every CPU cycle counts now, and copy offload can help reduce CPU overhead for I/O intensive applications.

With the rise of the SNIA Computational Storage Technical Work Group (TWG), offloading computations to a local device or to a remote device over a fabric is becoming popular, and several solutions are already available. Copy offload is a simple but high value offload operation standardized for several command protocols (e.g. SCSI) and available in different products. The NVMe technical working group (TWG) also recently ratified the Simple Copy Command interface.

This talk focuses on this latest incarnation of copy offload ratified by the NVMe TWG. The presentation will first go over some basic aspects of Linux I/O stack and NVMe subsystem. The simple copy command user interface and tools will then be discussed to provide the audience with a clear overview of the potential advantages for applications and how future applications can use new design patterns with copy offload in mind. Performance results for micro benchmarks of data copy on a direct attached NVMe device as well as for a fabric attached device will show the performance gains that can be expected from copy offload.
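
The read/write loop below is a sketch of the conventional host-mediated copy that copy offload eliminates: every chunk crosses the bus twice and consumes host CPU cycles, whereas a Simple Copy command names source ranges and a destination and keeps the data inside the device. The device path, chunk size and alignment are illustrative assumptions.

```c
/* Host-mediated copy (what copy offload replaces). Offsets and length are
 * assumed 4 KiB-aligned because of O_DIRECT. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (128 * 1024)

int main(int argc, char **argv)
{
    if (argc != 5) {
        fprintf(stderr, "usage: %s <dev> <src_off> <dst_off> <len>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    off_t src = atoll(argv[2]), dst = atoll(argv[3]);
    long long remaining = atoll(argv[4]);
    void *buf;
    if (posix_memalign(&buf, 4096, CHUNK)) { close(fd); return 1; }

    while (remaining > 0) {
        size_t chunk = remaining > CHUNK ? CHUNK : (size_t)remaining;
        /* Data moves device -> host memory ... */
        if (pread(fd, buf, chunk, src) != (ssize_t)chunk) { perror("pread"); break; }
        /* ... and host memory -> device. A copy offload command removes both
         * transfers and the CPU cycles spent issuing them. */
        if (pwrite(fd, buf, chunk, dst) != (ssize_t)chunk) { perror("pwrite"); break; }
        src += chunk; dst += chunk; remaining -= chunk;
    }
    free(buf);
    close(fd);
    return 0;
}
```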


NVMe 2.0 Specification Preview

William Martin, SSD I/O Standards, Samsung

Jonmichael Hands, Senior Strategic Planner and NVM Express Marketing Work Group Chair, Intel

Abstract

NVMe is the fastest growing storage technology of the last decade and has succeeded in unifying client, hyperscale and enterprise applications into a common storage framework. NVMe has evolved from being a disruptive technology to becoming a core element in storage architectures. In this session, we will talk about the NVMe transition to a merged base specification inclusive of both NVMe and NVMe-oF architectures. We will provide an overview of the latest NVMe technologies, summarize the NVMe standards roadmap and describe the latest NVMe standardization initiatives. NVMe technology will present a number of areas of innovation that preserve our simple, fast, scalable paradigm while extending the broad appeal of NVMe architecture. These continued innovations will ready the NVMe technology ecosystem for yet another period of growth and expansion.


NVMe-oF

High-performance RoCE/TCP solutions for end-to-end NVMe-oF communication

Jean-Francois Marie, Chief Solution Architect, Kalray

Abstract

Exploiting the full SSD performance in scalable disaggregated architectures is a continuous challenge. NVMe/TCP, released in 2018, enables a broader sharing of distributed storage resources. It complements NVMe-oF over RDMA, avoiding performance degradation over distant links and simplifying deployment. However, this comes at the cost of a heavier networking stack and requires the latest Linux kernels. In this talk, we will analyze the differences between RoCE and TCP, and show how to eliminate bottlenecks, achieving best-in-class performance for both protocols in an end-to-end NVMe-oF communication. We will also demonstrate how this solution can be OS agnostic, ensuring a seamless integration of NVMe-oF in today's data centers.


NVMe over Fabrics in the Enterprise

Rupin Mohan, Director R&D, CTO SAN, HPE

Abstract

This session will discuss application and use case examples leveraging the NVMe 1.4 and NVMe-oF 1.1 specifications. Get a unique perspective on how NVMe technology and NVMe over Fabrics is evolving to redefine next generation SAN and the key fabric requirements to enable this new frontier in the next generation enterprise data centers. This session will cover:

  • Shift of NVMe drives from inside servers to outside the servers – disaggregated storage
  • A second-order effect of this: it would be like it's 1999 again, with NVMe over Fabrics today roughly where Fibre Channel was circa 1996-1999
  • Introduce the idea of a centralized discovery controller and how the industry, led by HPE, is coming together on the need for centralized name services and the concept of a ‘fabric’ which is missing in Ethernet right now
  • The opportunity to drive the same technology across on-prem, hybrid and cloud networks in terms of storage networking
  • Lastly, the concept of NVMe over Fabrics connected drives, and how the new storage architectures will need an even bigger focus on and reliance on the storage fabric; the fabric will be ubiquitous and will need to be a single fabric, and there will be synergies across the front end, the back end and inside the storage controllers

Optimizing user space NVMe-oF TCP transport solution with both software and hardware methodologies

Ziye Yang, Staff Cloud Software Engineer, Intel

Yadong Li, Lead software architect, Intel

Abstract

In this talk, we would like to update the development status of the SPDK user space NVMe/TCP transport and the performance optimizations of the NVMe/TCP transport in both software and hardware. Over the past year, there have been great efforts to optimize NVMe-oF transport performance in software, especially with the kernel TCP/IP stack, such as: (1) trading off memory copy cost to reduce system calls and achieve optimal performance of the NVMe/TCP transport on top of the kernel TCP/IP stack; (2) using asynchronous writev to improve IOPS; (3) using libaio/liburing to implement group-based I/O submission for write operations. We also spent some effort investigating user space TCP/IP stacks (e.g., Seastar) to explore performance optimization opportunities. In this talk, we also share Intel’s latest effort to optimize the NVMe/TCP transport in SPDK using Application Device Queue (ADQ) technology from Intel 100G NICs, which improves NVMe/TCP transport performance significantly. We will talk about how SPDK can export the ADQ feature provided by Intel's new NIC into our common Sock layer library to accelerate NVMe-oF TCP performance, and share performance data with Intel's latest 100Gb NIC (i.e., E810). ADQ significantly improves the performance of the NVMe/TCP transport in SPDK, including reduced average latency, a significant reduction in long tail latency, and much higher IOPS.
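
The sketch below is not SPDK code; it only shows the general shape of the "batch into one writev" idea the abstract mentions: several queued PDU buffers are gathered and handed to the kernel TCP/IP stack with a single system call. Standard output stands in for a connected socket.

```c
/* Illustrative sketch (not SPDK code): coalesce several queued PDU buffers
 * into one writev() so a single system call pushes the whole batch. */
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH 4

int main(void)
{
    const char *pdu[BATCH] = {
        "hdr0|payload0\n", "hdr1|payload1\n", "hdr2|payload2\n", "hdr3|payload3\n",
    };
    struct iovec iov[BATCH];

    /* Gather the queued PDUs instead of sending them one syscall at a time. */
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = (void *)pdu[i];
        iov[i].iov_len = strlen(pdu[i]);
    }

    /* One syscall for the batch; a short write would need the remainder to be
     * resubmitted, which a real transport handles asynchronously. */
    ssize_t n = writev(STDOUT_FILENO, iov, BATCH);
    if (n < 0)
        perror("writev");
    else
        fprintf(stderr, "flushed %zd bytes in one writev\n", n);
    return 0;
}
```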


NVMe-oF on RDMA performance challenges and solutions in commodity servers

Yamin Friedman, Architecture engineer, Nvidia

Rob Davis, VP of storage technology, Nvidia

Abstract

NVMe-oF has become a very successful new networked storage protocol in both the Cloud and Enterprise storage markets. But in order to provide the highest performance NVMe-oF solutions, it is necessary to ensure that all parts of the system are working together as effectively as possible. From the network perspective, industry data has shown that NVMe-oF on RDMA can provide the highest bandwidth and lowest consistent latencies. The goal of this talk is to point out some limitations in commodity servers that can in certain circumstances prevent today's NICs from achieving even higher performance. We will also present the solutions to these limitations that we have incorporated into the NVMe-oF Linux kernel driver. These solutions include Dynamic Interrupt Moderation and shared Completion Queues. This work demonstrates the importance of continued optimizations to achieve the maximum performance from the commodity server and NIC hardware currently available.


Adaptive Distributed NVMe-oF Namespaces

Scott Peterson, Senior Software Engineer, Intel

Abstract

While NVMe-oF excels at delivering IO to a single remote target in both performance and efficiency, this model of storage devices is inconsistent with large scale storage systems.

Storage systems providing rich volume services from distributed pools of storage devices tend to spread logical volumes across many devices in many storage nodes. Host IO must then be delivered to logical volume extents in many storage nodes. Common solutions include bespoke storage clients installed in every host, or dedicated storage gateways to adapt standard storage protocols to the distributed storage service.

Intel's Adaptive Distributed NVMe-oF Namespaces (ADNN) is an NVMe-oF extension that enables distributed volumes to be accessed via NVMe-oF in a single fabric hop. Gateways and bespoke storage clients are eliminated.

We'll show how ADNN enables hosts to learn which of several equivalent NVMe-oF targets contains each region of a namespace, and adapt to placement changes at runtime.

We’ll describe how the ADNN architecture can be used in a single storage node for NUMA optimized placement, and in large scale-out cloud storage systems like Ceph.

We’ll show our experimental results using ADNN to access Ceph RBD images. You’ll see how the ADNN hash hint results in host IO delivered directly to the correct OSD node via NVMe-oF, and how the CPU utilization in the host is reduced by eliminating the Ceph block client. We’ll show the results of our hash hint latency POC, which demonstrates that applying the hash hint adds very little latency. We’ll explain how the ADNN hash hint splits the Ceph CRUSH process leaving only the actual hash function to be performed in the NVMe-oF host data path (and the rest performed via Ceph CLI tools only when the cluster map changes).
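
A hedged sketch of the general idea behind such a hash hint (not the ADNN or Ceph CRUSH definitions): the host keeps a small placement table that is rebuilt only when the cluster map changes, and the I/O path runs just a cheap hash over the object or extent name to pick which NVMe-oF target receives the command. The hash, bucket count and table layout below are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BUCKETS 64

/* Placement table: bucket -> target id, rebuilt outside the I/O path. */
static uint8_t bucket_to_target[BUCKETS];

/* FNV-1a: a cheap, well-known hash used here purely for illustration. */
static uint64_t fnv1a(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

static int pick_target(const char *object_name)
{
    /* Only this hash runs per I/O; everything else was precomputed. */
    return bucket_to_target[fnv1a(object_name, strlen(object_name)) % BUCKETS];
}

int main(void)
{
    /* Pretend the control plane spread 64 buckets across 4 targets. */
    for (int b = 0; b < BUCKETS; b++)
        bucket_to_target[b] = (uint8_t)(b % 4);

    const char *objs[] = { "rbd.img1.chunk.0001", "rbd.img1.chunk.0002" };
    for (int i = 0; i < 2; i++)
        printf("%s -> target %d\n", objs[i], pick_target(objs[i]));
    return 0;
}
```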

Finally we’ll outline our plans for upstreaming the ADNN reference implementation to SPDK, its current status, and how you can try out the SPDK ADNN building blocks yourself.


Improving NVMe/TCP Performance by Enhancing Software and Hardware

Sagi Grimberg, CTO, lightbits

Anil Vasudevan, Architect, Intel

Abstract

Since ratification, NVMe/TCP has proven to be a viable fabric for NVMe storage disaggregation. The combination of the high performance and low latency of NVMe with the scale and simplicity of TCP makes it a great fit for large scale data center storage technology and common Ethernet network practices.

In this session, we will cover recent improvements to the NVMe/TCP Linux drivers as well as the adoption of Application Device Queues (ADQ) technology that results in even higher performance, lower latency and reduction in CPU utilization.

In addition, we will cover a real-life example of the advantages of NVMe/TCP with ADQ implemented in LightOS v2.0, a scaleout SDS solution that offers high-availability, intelligent flash management and advanced data-services as well as industry leading performance.


Tuning and Optimizing Ethernet-based NVMe over Fabric transport Protocols

Dave Minturn, Principal Engineer, Intel

Anil Vasudevan, Architect, Intel

Abstract

NVMe-oF can be accomplished using three Ethernet-based storage transport protocols: iWARP RDMA, RoCEv2 RDMA and NVMe/TCP. This session will describe target use cases and what it takes to optimize performance across these three transports. In-depth performance data, based on the Linux kernel and Storage Performance Development Kit (SPDK), will be shared and analysed. Additional industry technology like Application Device Queues (ADQ), which can further improve NVMe/TCP performance to within the same range as the RDMA protocols, will be discussed as well to demonstrate additional enhancements for NVMe/TCP implementations.


Next-Gen NVMe-oF Reference System: From Media to Network

Duckho Bae, Principal Engineer, Samsung Electronics

Jungsoo Kim, Staff Engineer, Samsung Electronics

Abstract

NVMe-oF, as part of a composable disaggregated infrastructure (CDI) storage solution, is highly promising as it enables an efficient storage architecture by pooling and sharing storage resources while offering high performance and low latency IO at scale. The advent of next-generation architectures which require more power, higher density, higher throughput, and finer QoS control brings stringent requirements for storage solutions. To overcome those challenges, end-to-end optimization from storage media to network is inevitable. In this upcoming talk, we would like to introduce an open-source-based NVMe-oF reference solution which consists of a shared storage server and a user-space software stack. We will discuss a novel server architecture which utilizes the latest PCIe Gen4 based EDSFF SSD form factor. We especially want to discuss why and how the adoption of the new SSD form factor can optimize the NVMe-oF solution. Moreover, we will introduce the user-space software stack for the NVMe-oF solution that enables high throughput IO and stable QoS control. Finally, we will share our strategies to deliver media-level optimization for the NVMe-oF solution.

Orchestration

SODA - One Data Framework, Infinite Possibilities

Rakesh Jain, Senior Technical Staff Member, IBM Research, IBM

Anjaneya Chagam, Cloud Architect, Intel Corporation

Abstract

SODA Foundation is an open source project within the Linux Foundation that aims to foster an ecosystem of open source data management and storage software for data autonomy. It aims to include multiple open source projects from different communities which are related to data management. The SODA data framework is part of this effort to enable data mobility, data protection, data security, data lifecycle, and more for cloud native, virtualization, and other environments. In this talk, we will introduce the SODA Foundation, its history, the use cases it addresses, its roadmap, and the mechanism for other open source projects to become part of the community.


Unified heterogeneous storage monitoring: Is delfin a way forward?

Najmudheen CT, Architect Huawei, Maintainer SODA Foundation, Huawei Technologies India Pvt Ltd

Masanori Itoh, Principal Researcher, Toyota Motors / TSC Member SODA foundation, Toyota

Abstract

Storage administrators face challenges in monitoring data centres with rapid data growth combined with heterogeneous devices and increased service level expectations.

Every vendor has monitoring solutions with different views of storage infrastructure.

Some of the unified information which helps administrators plan their infrastructure includes:

  • Capacity by trend (Used for Block, File, Object)
  • Capacity by purpose (Primary, Local Replica, Remote Replica)
  • Capacity by service level
  • SLA bottlenecks in the data path

There are vendor-specific solutions for such needs, with some support for other vendors as well (specific and custom, based on the original vendor!). However, this is not unified, collaborative or open! A unified open solution can help the storage ecosystem collaborate to make it a standard going forward. Let us explore the SODA Foundation delfin architecture and roadmap to see whether it is in line with this goal and how much it can help in this regard.

This session discusses how unified data models can be built for different concepts and KPIs across multiple vendors. It covers the architectural tenets of delfin, provides technical insights, and raises the open points and challenges. It opens up a discussion on the topic which can help gather technical views for future collaboration.


Predictive analysis of storage health and performance for heterogeneous environment

Najmudheen CT, Architect Huawei, Maintainer SODA Foundation, Huawei Technologies India Pvt Ltd

Xulin, Architect Huawei, Maintainer, SODA Foundation

Abstract

The modern data centre is an incredibly complex system comprising heterogeneous devices. Keeping that complex system up and running is difficult enough without accounting for potential problems that might arise in the future. Fortunately, the last several years have seen significant improvements in predictive analytics. But the challenges are around monitoring, managing and predicting events in an IT infrastructure.

This session gives a case study of how "telemetry" and "anomaly detection" can be put together to solve data centre predictive analysis problems using available open source solutions, especially for a heterogeneous environment with storage from different vendors or different application clients. The data and discussion are based on actual development and experiments on open source projects under the SODA Foundation. We explore the various requirements on heterogeneous storage health and monitoring (like different clients, different predictive algorithms, different storages, different visualizations and more).
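
An illustrative sketch (not a SODA/delfin implementation) of the smallest possible "telemetry plus anomaly detection" loop: score each incoming sample against a rolling mean and standard deviation and flag outliers. Real deployments plug in per-device models, but the collect-score-alert shape is the same. Build with -lm; the sample values are made up.

```c
#include <math.h>
#include <stdio.h>

#define WINDOW    8
#define THRESHOLD 3.0   /* flag samples more than 3 sigma from the rolling mean */

int main(void)
{
    /* Pretend these are latency samples (ms) scraped from a storage array. */
    double samples[] = { 2.1, 2.0, 2.3, 2.2, 2.1, 2.4, 2.2, 2.3, 9.8, 2.2, 2.1 };
    int n = sizeof(samples) / sizeof(samples[0]);
    double window[WINDOW];
    int filled = 0, next = 0;

    for (int i = 0; i < n; i++) {
        if (filled == WINDOW) {
            double sum = 0.0, sq = 0.0;
            for (int j = 0; j < WINDOW; j++) { sum += window[j]; sq += window[j] * window[j]; }
            double mean = sum / WINDOW;
            double var = sq / WINDOW - mean * mean;
            double sd = var > 0.0 ? sqrt(var) : 1e-9;
            double z = (samples[i] - mean) / sd;
            if (fabs(z) > THRESHOLD)
                printf("sample %d (%.1f ms) is anomalous, z=%.1f\n", i, samples[i], z);
        }
        /* Slide the window forward. */
        window[next] = samples[i];
        next = (next + 1) % WINDOW;
        if (filled < WINDOW) filled++;
    }
    return 0;
}
```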


Resilient Workflow Automation in a Hybrid Cloud environment

Ashit Kumar, Architect, SODA Foundation/Huawei

Joseph Vazhappilly, Senior System Architect, SODA Foundation/Huawei

Abstract

Every service in on-prem, cloud or hybrid environments is a sequence of specific operations to realize a specific use case. Hence, building custom workflows for each use case is in demand, and doing so with a framework that is orchestration-engine agnostic is really compelling!

Currently, different orchestration engines (Stackstorm, Argo, etc.) are used depending on the vendors, products, and environments. Each orchestration engine demands specific syntactic and semantic rules for workflow development. This makes it harder to build and manage services in heterogeneous environments.

In this session, the authors present a unified orchestration and automation framework which is completely open-source and which can support heterogeneous orchestration engines in a pluggable extension way. Moreover, it provides custom workflow development and deployment agnostic to those orchestration backends.

The project has been developed under SODA Foundation, and it provides the capability to:

  1. Create a service catalog, i.e., a workflow definition, and dynamically connect to the desired orchestration engine
  2. Create multiple ‘service instances’ for a particular catalog; when the orchestration engine changes, the service instances are unaffected
  3. Scale to different orchestration engines for different service instances of a particular service catalog. Ex: WF definition is to “spawn instances in cloud VPC with a particular startup script” in three different cloud providers using three different Orchestration engines
  4. Custom workflows and configurable automation

Autonomous Data Management at Edge: Challenges and possibilities.

Sanil Kumar D, Chief Architect, TOC, Head SODA India, SODA Foundation / Huawei, Huawei Technologies India Pvt Ltd

Vinod Eswaraprasad, Chief Architect, Global Head of Cloud & Platform Practice, Wipro

Abstract

Edge cloud computing discussions usually focus on compute, memory, and latency. However, for large deployments of edge cloud, the data at the edge is critical for edge computing to be possible. Any typical edge cloud could contain heterogeneous hardware, platforms, and of course distributed heterogeneous storage.

Sanil and Vinod from their experience working in Edge Computing, Cloud Native, Data Management, Storage, and CSI, provide methods and architecture to bring the data autonomy to edge computing. To illustrate the architecture and proposal, they use heterogeneous storage management solution models from SODA Foundation projects. The session discusses the demanding data management requirements at the edge, challenges, and opportunities. They think this session can trigger more technical thoughts to build an open and autonomous data management framework for Edge.

Persistent Memory

Mortimer: A high performance scale out storage for persistent memory and NVMe SSDs

Anjaneya Chagam, Cloud Architect, Intel Corporation

Abstract

Mortimer is open source software designed from the ground up to take advantage of byte-addressable persistent memory to deliver high performance, low latency storage. Mortimer uses persistent memory for metadata lookups and fast write buffering. Buffered writes in persistent memory are flushed to NVMe SSDs in the background. Highly optimized lock-less algorithms are used to exploit DRAM bandwidth while taking advantage of byte-addressable persistent storage for metadata durability. The data path is optimized using NVMe-oF with distributed control plane and data plane extensions to deliver seamless application integration. A poll-mode, lockless design pattern is adopted for the entire data path to achieve optimum usage of compute resources. Mortimer is built on top of existing open source development kits (SPDK, PMDK, etc.) and adapts proven open source techniques such as consistent hashing and consensus protocols to deliver distributed storage semantics. This session covers the Mortimer architecture, algorithms, roadmap and benchmarking data to demonstrate how persistent memory is used to deliver low latency scale-out storage.
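
A conceptual sketch of the write path described above, not Mortimer code: an incoming write is appended to a persistent-memory buffer (made durable with PMDK's libpmem), and a background step later drains it to a slower block device or file. The paths and sizes are assumptions for illustration; build with -lpmem and point PMEM_PATH at a DAX-capable filesystem.

```c
#include <fcntl.h>
#include <libpmem.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PMEM_PATH "/mnt/pmem/writebuf"   /* assumption: DAX-mounted pmem fs */
#define SSD_PATH  "/tmp/backing.dat"     /* stand-in for the NVMe SSD tier  */
#define BUF_SIZE  (4 * 1024 * 1024)

int main(void)
{
    size_t mapped;
    int is_pmem;
    char *buf = pmem_map_file(PMEM_PATH, BUF_SIZE, PMEM_FILE_CREATE, 0644,
                              &mapped, &is_pmem);
    if (!buf) { perror("pmem_map_file"); return 1; }

    /* Fast path: the write is durable as soon as the persistent copy lands. */
    const char *record = "key=1234,value=hello";
    size_t used = strlen(record) + 1;
    pmem_memcpy_persist(buf, record, used);

    /* Background path: drain buffered writes to the SSD tier, after which the
     * pmem space can be reclaimed (real systems batch this and track watermarks). */
    int fd = open(SSD_PATH, O_WRONLY | O_CREAT, 0644);
    if (fd >= 0) {
        if (write(fd, buf, used) != (ssize_t)used)
            perror("write");
        fsync(fd);
        close(fd);
    }

    pmem_unmap(buf, mapped);
    printf("buffered %zu bytes in %s memory, drained to %s\n",
           used, is_pmem ? "persistent" : "page-cache-backed", SSD_PATH);
    return 0;
}
```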


Persistent Memory + Enterprise-Class Data Services = Big Memory

Charles Fan, CEO and co-founder, MemVerge

Abstract

Data-centric applications such as AI/ML, IoT, Analytics and High Performance Computing (HPC) need to process petabytes of data with nanosecond latency. This is beyond the current capabilities of in-memory architectures because of DRAM’s high cost, limited capacity and lack of persistence. As a result, the growth of in-memory computing has been throttled, with DRAM relegated to only the most performance-critical workloads.

In response, a new category of Big Memory Computing has emerged to expand the market for memory-centric computing. Big Memory Computing is where the new normal is data-centric applications living in byte-addressable, and much lower cost, persistent memory. Big Memory consists of a foundation of DRAM and persistent memory media plus a memory virtualization layer. The virtualization layer allows memory to scale-out massively in a cluster to form memory lakes, and is protected by new memory data services that provide snapshots, replication and lightning fast recovery.

The market is poised to take off with IDC forecasting revenue for persistent memory to grow at an explosive compound annual growth rate of 248% from 2019 to 2023.


Update on the JEDEC DDR5 NVRAM Specification

Bill Gervasi, Principal Systems Architect, Nantero

Abstract

A generation of new non-volatile memories (NVMs) potentially capable of working with, or replacing, SDRAM are in design now. Memory controller designers will want to exploit the advantages of these new memories, such as zero-power standby and the elimination of content refresh. JEDEC is in the process of defining the DDR5 NVRAM specification, the first of the detailed definitions of these new NVMs, as an enablement for systems designers to learn about and be ready for the coming wave.


Challenges and Opportunities as Persistence Moves Up the Memory/Storage Hierarchy

Jim Handy, General Director, Objective Analysis

Thomas Coughlin, President, Coughlin Associates

Abstract

While the storage industry is wrestling with incorporating persistent memory into its system and software designs, even bigger challenges lurk around the corner. Systems include a lot of other memory besides DIMMs. The near future will bring persistent caches, and even persistent registers! As CPUs move to smaller semiconductor processes, MRAM and other emerging memory types will replace on-chip SRAM to completely change the nature of cache memory. From here, it will be a small step to make internal CPU registers persistent. With all of these changes, software can not only be designed to perform better as a result of the persistence, but will need to be designed in a way that prevents persistence from undermining security and reliability. This presentation, by the authors of an annual research report on emerging memories, will show how and why memory at all levels will become persistent and will reflect on problems that must be solved both to use it effectively, and to prevent persistence from causing trouble.


Exploring New Storage Paradigms and Opportunities with Persistent Memory Technology

Daniel Waddington, Principal Research Staff Member, IBM

Abstract

Emerging persistent memory technologies, such as sub-microsecond non-volatile DIMMs and in/near-memory compute, are creating new opportunities for application performance improvement by enabling memory-storage convergence. This convergence defines a new paradigm by blurring the boundaries between compute-data and stored-data. In turn, the need to transform and move data (locally or across the network) can be dramatically reduced, leading to orders of magnitude improvement in performance over existing methods.

In this talk, we will explore memory-technology trends and the associated system architecture challenges. We will highlight our current work at IBM Research, known as MCAS, to develop a new converged architecture. MCAS (Memory Centric Active Storage) evolves the conventional key-value paradigm to enable seamless data movement and arbitrary in-place operations on structured data in memory, while also providing traditional storage capabilities such as durability, versioning, replication and encryption.


Is Persistent Memory Persistent?

Terence Kelly

Haris Volos

Abstract

Preserving application data integrity is a paramount duty of computing systems. Failures such as power outages are major perils: A sudden crash during an update may corrupt data or effectively destroy it by corrupting metadata. Applications protect data integrity by using update mechanisms that are atomic with respect to failure; such mechanisms promise to restore data to a consistent state following a crash.

Unfortunately, the checkered history of failure-atomic update mechanisms precludes blind trust. Widely used relational databases and key-value stores often fail to uphold their transactionality guarantees [Zheng et al., OSDI '14]. Lower on the stack, durable storage devices may corrupt or destroy data when power is lost [Zheng et al., FAST '13]. Emerging non-volatile memory (NVM) hardware and corresponding failure-atomic update mechanisms strive to avoid repeating the mistakes of earlier technologies, as do software abstractions of persistent memory for conventional hardware [the topic of my SDC 2019 talk]. Healthy skepticism, however, demands firsthand evidence that such systems deliver on their integrity promises.

Prudent developers and operators follow the maxim, "train as you would fight." Software that must tolerate abrupt power failures should demonstrably survive such failures in pre-production tests or "Game Day" failure-injection testing on production systems. In the past, my colleagues and I extensively tested our crash-tolerance mechanisms against power failures, but we did not document the tribal knowledge required to practice this art.

This talk describes the design and implementation of a simple and cost-effective testbed for subjecting applications running on a complete hardware/software stack to repeated sudden whole-system power interruptions. The testbed is affordable, runs unattended indefinitely, and performs a full power-off/on test cycle in one minute. The talk will furthermore present my findings when I used such a testbed to evaluate a crash-tolerance mechanism for persistent memory by subjecting it to over 50,000 power failures. Any software developer can use this type of testbed to evaluate crash-tolerance software before releasing it for production use. Application operators can learn from this talk principles and techniques that they can apply to power-fail testing their production hardware and software.
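
The sketch below is not the testbed from the talk; it is just a minimal example of the kind of workload and checker one might run on such a testbed: continuously append records carrying a sequence number and checksum, and after each power cycle verify that the surviving prefix is intact. The file path and record format are assumptions, and a real checker would also truncate a torn tail before resuming.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct rec { uint64_t seq; uint64_t payload; uint64_t checksum; };

static uint64_t cksum(const struct rec *r) { return r->seq ^ r->payload ^ 0x5a5a5a5a5a5a5a5aULL; }

/* After power-on: verify every complete record; partial tail records are ignored. */
static int verify(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 0;                       /* nothing written yet */
    struct rec r;
    uint64_t expect = 0;
    while (fread(&r, sizeof(r), 1, f) == 1) {
        if (r.seq != expect || r.checksum != cksum(&r)) {
            fprintf(stderr, "corruption at record %llu\n", (unsigned long long)expect);
            fclose(f);
            return -1;
        }
        expect++;
    }
    fclose(f);
    printf("verified %llu records intact\n", (unsigned long long)expect);
    return (int)expect;
}

int main(void)
{
    const char *path = "/var/tmp/powerfail.log";   /* assumption: test file */
    int next = verify(path);
    if (next < 0) return 1;

    /* Then: write until the testbed cuts power, claiming durability only after fsync. */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    for (uint64_t seq = (uint64_t)next; ; seq++) {
        struct rec r = { .seq = seq, .payload = seq * 7 };
        r.checksum = cksum(&r);
        if (write(fd, &r, sizeof(r)) != sizeof(r)) { perror("write"); break; }
        fsync(fd);
    }
    close(fd);
    return 0;
}
```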

A peer-reviewed companion paper that covers all of the material in the talk and that provides additional detail will be published prior to the talk; attendees are invited but not required to read the paper before the talk.


How can persistent memory make database faster, and how could we go ahead?

Takashi Menjo, Researcher, Nippon Telegraph and Telephone Corporation

Abstract

Persistent memory (PMEM) is by itself a fast-to-access non-volatile memory device. To get the best out of it, however, we need to modify software designs to be PMEM-aware. In this presentation, I will talk about my case study to improve the transaction processing performance of PostgreSQL, an open-source database management system (DBMS). I redesigned a typical two-level transaction logging architecture, consisting of DRAM and a persistent disk such as an HDD or SSD, into a single-level one on PMEM.

Database researchers and engineers have optimized logging architectures, assuming that persistent disk has been slower than main memory and not good at random access. Therefore, a DBMS has buffered and serialized logs on DRAM then output them sequentially to disk. Such a two-level architecture has improved performance.

However, using PMEM instead of disk in the two-level architecture, I got worse performance than the single-level one due to overhead. This is because PMEM is as fast as DRAM and is better at random access than disk. To clarify that, I will present differences between the designs of the two logging architectures and their performance profiling results.
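
A hedged sketch (not PostgreSQL code) of the single-level idea: a transaction's log record is copied once, directly into a persistent-memory-mapped WAL segment, and is durable after the persist call; there is no DRAM staging buffer and no separate sequential write to disk. Build with PMDK's libpmem (-lpmem); the path and record framing are assumptions, and a real WAL would also add a CRC so torn tails can be detected at recovery.

```c
#include <libpmem.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_PATH "/mnt/pmem/wal.seg"     /* assumption: DAX-mounted pmem fs */
#define WAL_SIZE (16 * 1024 * 1024)

static char  *wal;
static size_t wal_off;

/* Append one record and return only when it is durable on media. */
static int wal_append(const void *rec, size_t len)
{
    if (wal_off + sizeof(uint32_t) + len > WAL_SIZE)
        return -1;                               /* segment full */
    uint32_t hdr = (uint32_t)len;                /* real WALs also store a CRC */
    memcpy(wal + wal_off, &hdr, sizeof(hdr));
    memcpy(wal + wal_off + sizeof(hdr), rec, len);
    /* Flush header and payload to the persistence domain in one go. */
    pmem_persist(wal + wal_off, sizeof(hdr) + len);
    wal_off += sizeof(hdr) + len;
    return 0;
}

int main(void)
{
    size_t mapped;
    int is_pmem;
    wal = pmem_map_file(WAL_PATH, WAL_SIZE, PMEM_FILE_CREATE, 0600,
                        &mapped, &is_pmem);
    if (!wal) { perror("pmem_map_file"); return 1; }

    const char *record = "INSERT INTO t VALUES (42)";
    if (wal_append(record, strlen(record)) == 0)
        printf("record durable at offset %zu (is_pmem=%d)\n", wal_off, is_pmem);

    pmem_unmap(wal, mapped);
    return 0;
}
```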

I also tried to redesign some other architectures but gave up on them because there seemed to be limited or little chance of improving performance by changing them. Those experiences will lead us to know what characteristics are suitable for PMEM and what components are worth changing to be PMEM-aware.


Accelerate Big Data Workloads with HDFS Persistent Memory Cache

Feilong He, Machine Learning Engineer, Intel Corporation

Jian Zhang, Software Engineer Manager, Intel Corporation

Abstract

The HDFS (Hadoop Distributed File System) cache feature has a centralized cache mechanism which lets end users simply specify a path to cache the corresponding HDFS data. The HDFS cache can provide significant performance benefits for queries and other workloads whose high volume of data is frequently accessed. However, as DRAM is used as the cache medium, the HDFS cache might cause performance regression for memory intensive workloads, so its usage is limited, especially in scenarios where memory capacity is insufficient.

To overcome the limitations of HDFS DRAM cache, we introduced persistent memory to serve as cache medium. Persistent memory represents a new class of memory storage technology that offers high performance, high capacity and data persistence at lower cost, which makes it suitable for big data workloads. In this session, the attendees can gain a lot of technical knowledge in HDFS cache and learn how to accelerate workloads by leveraging HDFS persistent memory cache. We will first introduce the architecture of HDFS persistent memory cache feature, then present the performance numbers of micro workloads like DFSIO and industry standard workloads like TPC-DS. We will showcase that HDFS persistent memory cache can bring 14x performance speedup compared with no HDFS cache case and 6x performance speedup compared with HDFS DRAM cache case. With data persistence characteristic, HDFS persistent memory cache can help users reduce cache warm-up time in cluster restart situations, which will also be demonstrated. Moreover, we will discuss our future work, such as potential optimizations and HDFS lazy-persistent write cache support with persistent memory.


RPMP: A Remote Persistent Memory Pool to accelerate data analytics and AI

Jian Zhang, Software Engineer Manager, Intel Corporation

Abstract

Persistent memory represents a new class of memory storage technology that offers high performance, high capacity and data persistence at lower cost to bridge the performance and cost gap between DRAM and SSDs. There is a broad range of usage scenarios for persistent memory in data analytics and AI workloads; however, remote access to persistent memory poses many challenges for persistent memory applications.

RDMA is an attractive technology for remote memory access. It leverages RDMA network cards to offload data movement from the CPU to each system’s network adapter, which improves application performance and utilization and enables applications to take full advantage of persistent memory devices. In this work, we propose an innovative distributed storage system that uses persistent memory as the storage medium with a key-value storage engine, an efficient RDMA-powered network messenger as the network layer, and a consistent hashing algorithm to provide configurable data availability and durability to upper-level applications. RPMP provides low level key-value APIs that make it suitable for performance critical applications, and it also implements several optimizations, including a circular buffer to improve write performance and a combined persistent memory RDMA memory region technique to improve read performance.
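
An illustrative consistent-hash placement sketch in the spirit described above (not RPMP's exact algorithm): each node contributes several virtual points on a hash ring, and a key is owned by the first point at or after its hash, so adding or removing a node only remaps a small share of keys. Node count, virtual-point count and the hash are assumptions.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NODES   4
#define VNODES  16                     /* virtual points per node */
#define POINTS  (NODES * VNODES)

struct point { uint64_t hash; int node; };
static struct point ring[POINTS];

static uint64_t fnv1a(const char *s)
{
    uint64_t h = 1469598103934665603ULL;
    for (; *s; s++) { h ^= (uint8_t)*s; h *= 1099511628211ULL; }
    return h;
}

static int cmp(const void *a, const void *b)
{
    const struct point *x = a, *y = b;
    return (x->hash > y->hash) - (x->hash < y->hash);
}

static int owner(const char *key)
{
    uint64_t h = fnv1a(key);
    /* Linear scan for clarity; a real implementation would binary-search. */
    for (int i = 0; i < POINTS; i++)
        if (ring[i].hash >= h)
            return ring[i].node;
    return ring[0].node;               /* wrap around the ring */
}

int main(void)
{
    char name[32];
    for (int n = 0; n < NODES; n++)
        for (int v = 0; v < VNODES; v++) {
            snprintf(name, sizeof(name), "node%d-vp%d", n, v);
            ring[n * VNODES + v] = (struct point){ fnv1a(name), n };
        }
    qsort(ring, POINTS, sizeof(ring[0]), cmp);

    const char *keys[] = { "block:0001", "block:0002", "shuffle:42:7" };
    for (int i = 0; i < 3; i++)
        printf("%s -> node %d\n", keys[i], owner(keys[i]));
    return 0;
}
```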

Experimental performance numbers will also be presented: we will present the micro-benchmark performance of the key-value store as well as decision support query performance of RPMP as a fully disaggregated shuffle solution in Spark-based data analytics.


Persistent Memory Programming Without All That Cache Flushing

Andy Rudoff, Persistent Memory SW Architect, Intel

Abstract

Persistent Memory programming has well known challenges around cache flushing, and the SNIA programming model includes the possibility of platforms where the CPU caches are considered persistent and need no flushing. On these platforms, operations like compare-and-swap and some non-blocking algorithms are easier to implement, but there are still some differences from volatile memory programming that the programmer must understand. In this talk, Andy will describe how such platforms are created, the programming considerations, and the performance improvements that result.
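
One way to express this difference with PMDK's libpmem, as a sketch under the assumption of a DAX-mapped file at the path below (build with -lpmem): on platforms whose CPU caches are in the persistence domain, the explicit cache-line flush can be skipped and only store ordering is needed.

```c
#include <libpmem.h>
#include <stdint.h>
#include <stdio.h>

#define PATH "/mnt/pmem/counter"        /* assumption: DAX-mounted pmem fs */

int main(void)
{
    size_t mapped;
    int is_pmem;
    uint64_t *p = pmem_map_file(PATH, 4096, PMEM_FILE_CREATE, 0600,
                                &mapped, &is_pmem);
    if (!p) { perror("pmem_map_file"); return 1; }

    *p += 1;                            /* the actual store */

    if (pmem_has_auto_flush()) {
        /* Persistent caches: no flush instruction, just order the store. */
        pmem_drain();
    } else {
        /* Conventional platform: flush the cache line, then fence. */
        pmem_flush(p, sizeof(*p));
        pmem_drain();
    }

    printf("counter=%llu (auto-flush platform: %s)\n",
           (unsigned long long)*p, pmem_has_auto_flush() ? "yes" : "no");
    pmem_unmap(p, mapped);
    return 0;
}
```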


Scaling PostgreSQL with Persistent Memory.

Naresh Kumar Inna, Co-founder and CEO, Memhive

Keshav Prasad, Co-founder and CTO, Memhive

Abstract

In this talk, we describe how Persistent Memory (PMEM) can be used to speed up a relational database like PostgreSQL. While persistent memory is traditionally used to store and speed up database transaction logs, a.k.a. Write-Ahead Logs (WAL) in PostgreSQL, we look at other interesting possibilities.

For example, we discuss using PMEM as a large persistent cache for the database, its impact on read transaction performance, and its challenges. We look at how the relation data files of the database itself can be on PMEM, and how we can provide redundancy in this scenario. We also look at some of the challenges of adapting traditional databases like PostgreSQL to PMDK. We will discuss test results of case studies we have carried out with these options. Finally, we will explore how various operating parameters of the database like size of the database, sizes of available DRAM and PMEM all play a role in determining what is the best use of PMEM.


SplitFS: Reducing Software Overhead in File Systems for Persistent Memory

Vijay Chidambaram, Assistant Professor, University of Texas at Austin

Abstract

I will present SplitFS, a file system for persistent memory (PM) that reduces software overhead significantly compared to state-of-the-art PM file systems. SplitFS presents a novel split of responsibilities between a user-space library file system and an existing kernel PM file system. The user-space library file system handles data operations by intercepting POSIX calls, memory-mapping the underlying file, and serving reads and overwrites using processor loads and stores. Metadata operations are handled by the kernel PM file system (ext4 DAX). SplitFS introduces a new primitive termed relink to efficiently support file appends and atomic data operations. SplitFS provides three consistency modes, which different applications can choose from, without interfering with each other. SplitFS reduces software overhead by up to 4× compared to the NOVA PM file system, and 17× compared to ext4 DAX. On a number of micro-benchmarks and applications such as the LevelDB key-value store running the YCSB benchmark, SplitFS increases application performance by up to 2× compared to ext4 DAX and NOVA while providing similar consistency guarantees. This work was presented at the Symposium on Operating Systems Principles (SOSP 2019).
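
The toy LD_PRELOAD interposer below illustrates the mechanism only, not SplitFS itself: read() on a regular file is intercepted in user space and served with a memcpy from an mmap of the file, i.e. with processor loads instead of a kernel data path. A real library file system caches mappings, handles writes and appends (SplitFS's relink), and keeps per-fd state.

```c
/* Build: gcc -shared -fPIC -o toyfs.so toyfs.c -ldl
 * Run:   LD_PRELOAD=./toyfs.so cat /etc/hostname */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static ssize_t (*real_read)(int, void *, size_t);

ssize_t read(int fd, void *buf, size_t count)
{
    if (!real_read)
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");

    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode))
        return real_read(fd, buf, count);      /* pipes, sockets, ttys, ... */

    off_t off = lseek(fd, 0, SEEK_CUR);
    if (off < 0 || off >= st.st_size)
        return real_read(fd, buf, count);

    /* Map the file and serve the data with loads; a production library would
     * cache this mapping instead of mapping per call. */
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return real_read(fd, buf, count);

    size_t n = (size_t)(st.st_size - off);
    if (n > count) n = count;
    memcpy(buf, (char *)map + off, n);
    munmap(map, (size_t)st.st_size);
    lseek(fd, off + (off_t)n, SEEK_SET);       /* keep POSIX offset semantics */
    return (ssize_t)n;
}
```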

SMB

SMB3 over QUIC - Files Without the VPN

Sudheer Dantuluri, Software Engineer, Microsoft

Thomas Salemy, Software Engineer, Microsoft

Abstract

The SMB3 protocol is broadly deployed in enterprise networks and contains strong protections to enable its use more broadly. However, port 445 has historically been blocked, and management of servers over TCP has been slow to emerge. SMB3 is now able to communicate over QUIC, a new internet standard transport which is being broadly adopted for web and other application access.

In this talk, we will provide updated details on the SMB3 over QUIC protocol and explore the necessary ecosystem, such as certificate provisioning, firewall and traffic management, and enhancements to SMB server and client configuration.


SMB3 POSIX Extensions Phase 2 ... Now that they are in what is next?

Steven French, Principal Software Engineer, Microsoft

Abstract

With another year of work on improving the SMB3.1.1 Protocol Extensions for Linux/POSIX and its implementation in servers and the Linux client ... where do we go from here?

For optimal interoperability between Linux clients and NAS appliances, Servers and the Cloud, what features should be added?

Examining workload requirements, new Linux syscalls and Linux functional test compliance (subtests within the standard "xfstests" suite that fail or are skipped due to missing features) has not only shown areas where the Linux VFS could be improved (e.g. in how temporary files are created and how swapfiles are locked) but also areas where the SMB3.1.1 protocol and its POSIX extensions can be extended. This is an exciting time, with SMB3.1.1 use in Linux increasing to access an ever wider world of storage targets securely and efficiently. This presentation will help explore how to make it even better.


The Future of Accessing Files remotely from Linux: SMB3.1.1 client status update

Steven French, Principal Software Engineer, Microsoft

Abstract

Improvements to the SMB3.1.1 client on Linux have continued at a rapid pace over the past year. These allow Linux to better access Samba servers, as well as the Cloud (Azure), NAS appliances, Windows systems, Macs and an ever increasing number of embedded Linux devices, including those using the new SMB3 kernel server for Linux (ksmbd). The SMB3.1.1 client for Linux (cifs.ko) continues to be one of the most actively developed file systems on Linux, and these improvements have made it possible to run additional workloads remotely.

The exciting recent addition of the new kernel server also allows more rapid development and testing of optimizations for Linux.

Over the past year ...

  • performance has dramatically improved with features like multichannel (allowing better parallelization of i/o and also utilization of multiple network devices simultaneously), with much faster encryption and signing, with better use of compounding and improved support for RDMA
  • security has improved and alternative security models are now possible with the addition of modefromsid and idsfromsid and also better integration with Kerberos security tooling
  • new features have been added, including the ability to swap over SMB3 and boot over SMB3
  • quality continues to improve with more work on 'xfstests' and test automation
  • tooling (cifs-utils) continues to be extended to make using SMB3.1.1 mounts easier

This presentation will describe and demonstrate the progress that has been made over the past year in the Linux kernel client in accessing servers using the SMB3.1.1 family of protocols. In addition recommendations on common configuration choices, and troubleshooting techniques will be discussed.


Samba locking architecture

Volker Lendecke, Developer, SerNet GmbH

Abstract

To implement share modes, leases and oplocks in a multi-process environment like Samba, the inter-process communication that coordinates locking information and state changes must be fast, reliable and understandable. From System V shared memory and UDP sockets, Samba has over time adopted modern forms of shared memory and mutexes. Samba's implementation of the locking state has several layers:

  • tdb is a raw key/value store
  • dbwrap is an abstraction on tdb, allowing alternative backend k/v stores
  • dbwrap_watch enables processes to monitor changes of values
  • g_lock implements per-record locking for exclusive operations

On top of that stack Samba implements something similar to the MS-SMB2 concept of GlobalOpenTable and other global data structures required to implement SMB. The goal of this talk is to deepen understanding and to allow possible improvements and alternative implementations. For example, right now the tables are implemented only for local access and for clustered access using ctdb. It should be possible to extend this to other implementations of distributed K/V stores. This talk will provide a foundation to extend Samba beyond ctdb.
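
A minimal standalone look at the bottom of that stack: tdb as a raw key/value store with per-record (chain) locking, which is roughly what dbwrap and g_lock build on. This is an illustration, not Samba's dbwrap code; build against the system tdb library (-ltdb), and note the key/value contents are made up.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tdb.h>

int main(void)
{
    struct tdb_context *tdb = tdb_open("locking_demo.tdb", 0, TDB_DEFAULT,
                                       O_RDWR | O_CREAT, 0600);
    if (!tdb) { perror("tdb_open"); return 1; }

    TDB_DATA key = { .dptr = (unsigned char *)"share_mode:inode42",
                     .dsize = strlen("share_mode:inode42") };
    TDB_DATA val = { .dptr = (unsigned char *)"opens=1",
                     .dsize = strlen("opens=1") };

    /* Per-record lock: only this record's hash chain is serialized, so other
     * processes touching other records are not blocked. */
    if (tdb_chainlock(tdb, key) != 0) { fprintf(stderr, "chainlock failed\n"); return 1; }
    tdb_store(tdb, key, val, TDB_REPLACE);      /* read-modify-write goes here */
    tdb_chainunlock(tdb, key);

    TDB_DATA out = tdb_fetch(tdb, key);
    if (out.dptr) {
        printf("fetched: %.*s\n", (int)out.dsize, out.dptr);
        free(out.dptr);                          /* tdb_fetch allocates a copy */
    }
    tdb_close(tdb);
    return 0;
}
```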


Microsoft FileServer Protocol Test Suites Overview and Updates

Helen Lu, Senior Software Engineering Manager, Microsoft

Huiren Jiang, Software Engineer, Microsoft

Abstract

In this talk, we’ll cover the latest updates of the Microsoft Protocol Test Suites for File Services. Microsoft Protocol Test Suites are a group of tools that were originally developed for in-house testing of the Microsoft Open Specifications. Microsoft Protocol Test Suites have been used extensively during Plugfests and Interoperability (IO) Labs to test against partner implementations.


Samba Multi-Channel/io_uring Status Update

Stefan Metzmacher, Developer, SerNet/Samba-Team

Abstract

Samba has had experimental support for multi-channel for quite a while. SMB3 has a few concepts to replay requests safely. Some of them are still missing in Samba, which could lead to misbehaving clients.

The talk will explain how the missing features will be implemented.

With the increasing amount of network throughput, we'll reach a point where data copies are too much for a single CPU core to handle.

This talk gives an overview of how the io_uring infrastructure of the Linux kernel could be used in order to avoid copying data, as well as to spread the load between CPU cores.
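
A minimal liburing example (generic, not Samba code) showing the shape of the interface: submissions and completions flow through shared rings, so many I/Os can be driven with few system calls, and features such as registered buffers or fixed files can reduce copies and per-call setup further. Build with -luring.

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        fprintf(stderr, "io_uring_queue_init failed\n");
        return 1;
    }

    static char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   /* queue one read */

    io_uring_submit(&ring);                             /* one syscall for the batch */

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        printf("read returned %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);                  /* mark completion consumed */
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```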


Securing SMB3 over RDMA

Wen Xin, Software Engineer, Microsoft

Abstract

SMB is the premier file sharing protocol in Windows and other environments. The SMB3 protocol supports advanced encryption and integrity protection, in addition to high-performance RDMA transports. RDMA is an increasingly popular choice, but encryption protection is not available in prior dialects of SMB over RDMA.

In this talk, we will discuss protocol changes to support full encryption and integrity over RDMA for the SMB3 protocol. We will discuss the environments in which these features are required, and dive into the implementation and performance.


Please kindly go somewhere else - methods and strategies to move smb clients' connections to other cluster nodes non-disruptively

Rafal Szczesniak, Principal Software Engineer, Dell EMC

Jeremy Hitt, Principal Software Engineer, Dell EMC

Abstract

When running a scale-out cluster it is sometimes necessary to run various tasks (like an upgrade) on nodes which, in effect, require rebooting them. That could potentially be disruptive for client connections, and many times the answer to the "when can we do that?" question is "not any time soon". Some measures do exist in the protocol stacks these days. However, the older stacks lack them, simply because when they were developed, clustered storage wasn't so common. Therefore, we need to resort to all sorts of tricks exploiting the clients' behaviour in order to get as close to a "seamless" experience as we possibly can. The talk describes both the preferred methods and the necessary tricks played to help avoid the problems. Both may be helpful for SMB client implementers to enhance their clients' flexibility.

Solid State Storage Solutions

Enabling Ethernet Drives

Mark Carlson, Principal Architect, Kioxia

Abstract

The SNIA has a new standard that enables SSDs to have an Ethernet interface. The Native NVMe-oF Drive specification defines pin-outs for common SSD connectors, which enables these drives to plug into common platforms such as eBOFs (Ethernet Bunches of Flash).

This talk will also discuss the latest Management standards for NVMe-oF drives. Developers will learn about how to program and use these new types of drives.


An SSD for Automotive Applications

Bill Gervasi, Principal Systems Architect, Nantero

Abstract

The next generation of cars will require significant improvements in data management. Information for drive management, sensors, network connectivity, high resolution displays, security, in-car entertainment, and other sources will require a rethinking of storage devices. This presentation details efforts under way in JEDEC to define a new Automotive SSD standard to address these new requirements.

Storage Architecture

Quantum computing

Brian Eccles, Principal Analyst, IBM

Mark Lantz, Principal Research Staff Member, IBM Research

Abstract

Quantum computing is arriving. With quantum computers we can tackle problems in entirely new ways. AI/ML, simulations, optimization and modelling physical processes are areas where quantum computing may show its earliest impact. In this session we'll cover what quantum computing is, why it is so different from classical computing, why it is important, applications, early usage, resources available to developers, and how developers can get started with access to real quantum systems, for free, by the end of the day.

We'll also look at the potential impact of quantum computing on today's encryption approaches, implications for long-term storage, and work already underway to safeguard long term data retention and future data storage in a post quantum world.


Real World Experiences With High Availability Storage Systems

Jody Glider, Principal Storage Architect, Cloud Architecture and Engineering, SAP

Abstract

A dominant architecture for storage arrays has been some variant of an HA-pair where a set of drives is connected to two separate data paths (aka controllers). Generally these systems have been designed to continue to provide service after any single hardware failure, yet experience shows that single faults have caused disruption in data service...and not for the reasons you might think! This talk describes analysis performed on a set of storage service disruptions over a period of two years, points out some common patterns, shares some thoughts about possible improvements, and most of all asks for help in contemplating what improvements will lead to even better reliability in storage service.


Why do customers need media path redundancy in a storage array with simple low-latency hardware paths?

Mahmoud Jibbe, Technical Director, NetApp

Joey Parnell, Sr Architect, NetApp

Abstract

Many emerging white box storage systems replace mature, robust drive networks with simple point-to-point drive port topologies. The presentation describes how fault tolerance is provided where each storage controller has a dedicated port to a drive, but no hardware path to the other port(s) of that drive. Redundant access to media is provided via Non-Transparent Bridging (NTB), Ethernet, or other networks to overcome the lack of redundancy in the topology.

Accelerate artificial intelligence IoT use cases with storage tiering and shared storage at the edge

Joey Parnell, Sr Architect, NetApp

Mahmoud Jibbe, Technical Director, NetApp

Abstract

Transmitting data from Internet-of-Things (IoT) edge devices to core data centers to perform resource intensive artificial intelligence (AI) use cases is costly in terms of network bandwidth and latency. Alternatively, placing hardware resources such as GPUs at the edge to perform those operations locally and reduce network congestion and latency can be prohibited by cost and power requirements.

Through use of shared storage presented by the IoT device, more compute-intensive iterative refinement training can be performed either in the core data center or in cloud analytics platforms, and updated AI inference data transmitted back to IoT devices to provide customized training to improve reliability and reduce false positive rates.

By adding flash storage and extending the shared storage between IoT devices, distributed applications can divvy up work to idle or under-utilized devices, store the results locally, and send metadata to the originating device about where to read the results. This allows an IoT device to coordinate and complete AI tasks that may exceed the computing capabilities or the latency requirements of the singular device and minimizes the data traveling to and from data centers and clouds.

Finally, critical IoT data that is transmitted to core data centers can be either regularly archived to cloud or protected by disaster recovery solutions in the cloud. Minimize risk without significantly increasing cost by selectively using cloud resources. Store data on premises and mount that data as a target from the cloud via a gateway to perform analytics and transmit back only the results to the data center, or archive data for long term retention.

  1. Satisfy real-time IoT AI use cases with higher fidelity and without significantly increasing cost by providing shared storage in the IoT device to allow other devices to perform work and remotely update the local inference models.
  2. Adding a layer of flash to IoT devices for data tiering provides the capability to perform AI tasks in a distributed fashion to utilize idle resources.
  3. Protect valuable IoT data at lower cost by selectively using cloud resources for archive and disaster recovery solutions.

Cloud Data Center Designs Will Become HyperDisaggregated

Dr. Jai Menon, Chief Scientist, Fungible

Abstract

The growing demand for real-time data insights enabled by machine learning and artificial intelligence is causing data centers to process more data than ever before. Current compute-centric data center designs are inadequate at meeting the demands of this data-centric world. Data Processing Units (DPUs) were designed to overcome the root causes of these inadequacies. Future data centers, enabled by the DPU, will allow all resources in a data center - storage, GPUs and, ultimately, memory - to be disaggregated from compute over a highly efficient, low-latency, scalable network fabric. This hyperdisaggregated data center will eliminate the need for resource overprovisioning and ensure that remote resources perform as well as local resources. The resultant data center will have >2X better TCO, footprint and power than today’s compute-centric data centers.


Smart Storage Adapter for Composable Architectures

Remy Gauguey, Sr Software Architect - Data Center Business Unit, Kalray

Abstract

The variety of architectures, use-cases and workloads to be managed by Data Center appliances is increasing. It is driving a need for storage and compute disaggregation, while at the same time forcing IT pros to simplify Data Center management and move to hyperconverged infrastructure. However, the HCI approach results in siloed storage, which leads to capacity waste and scalability issues.

This paper describes how Kalray’s fully programmable Smart Storage adapter leverages NVMe-oF technology to offload servers from heavy storage disaggregation tasks and paves the way toward a fully Composable Infrastructure.


Using Block Translation and Atomic Updates to Optimize All Flash Array Storage

Douglas Dumitru, CTO, EasyCo LLC

Abstract

Block translation is often used to linearize writes. This is especially useful for Flash SSDs as well as SMR hard disks. It likewise improves write performance in parity-based arrays.

Block translation has an additional capability that can be exploited to create whole new storage paradigms. Block translation lets you construct efficient atomic updates that let you use nearly 100% of your array’s write bandwidth to store actual, useful data. Atomic linearization opens an array of possibilities. Here are some examples that have actually been implemented:

  • Write a file system that lets you create hundreds of thousands of small files in less than a second, from a single thread, with greater than 90% space efficiency.
  • Export blocks to other file systems or a SAN with near array speed writes regardless of workload block size or pattern.
  • Reduce data with in-line compression and de-duplication without suffering a massive memory footprint or glacial performance.

The magic that makes such things possible is efficient, linear, generation-based atomic writes. Instead of scattering data, metadata, allocation bitmaps, and journal elements across the array, this solution writes elements together as part of a linear, atomic string.

  • Linear IO: Each atomic write maintains absolute linearity and optimal alignment at the SSD and the array level.
  • Inline: Each atomic write contains the actual, live, data. There are no additional copies or journals.
  • Merge-able: Each atomic write can be appended with new data. Thousands of transactions can be combined, eliminating intermediate updates for maximum bandwidth and space efficiency.
  • Flexible: Atomic updates can include any collection of data blocks or even the absence of blocks. You are not limited in what you can put into a transaction.
  • Variable Block Size: The blocks that you can store can contain from 16 bytes to 1 megabyte or more of payload for a single LBA. You can map your storage structure directly to blocks, even if the structure has variable sized elements.
  • Safe: All IO is validated end-to-end with CPU-assisted checksums.
  • Optimized for Bandwidth: The update structure allows for hundreds of megabytes in a single IO “write segment”. Big data can finally move at device speed.

This structure optimizes not only performance but also cost. SSDs are used with ideal write workloads, lowering both wear and cost. Linear writes and alignment mean that erasure codes out-perform mirroring and reach theoretical device speed. One user summed it up well: “We write faster than we read.”
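
To make the general pattern concrete, here is a minimal, illustrative sketch (not EasyCo's implementation; names and on-disk layout are hypothetical): dirty logical blocks and their map entries are batched into a single generation-tagged segment and committed with one linear, append-only write.

```python
# Toy block-translation layer: dirty logical blocks and their LBA map entries
# are written together as one linear, generation-tagged, atomic segment.
# Illustrative sketch only; the on-disk format here is hypothetical.
import os
import struct
import zlib

class LinearAtomicStore:
    def __init__(self, path):
        self.f = open(path, "ab")      # append-only backing store
        self.lba_map = {}              # logical block -> (offset, length)
        self.generation = 0

    def commit(self, dirty):
        """Persist all dirty blocks plus their map entries in one linear write."""
        self.f.seek(0, os.SEEK_END)
        self.generation += 1
        payload = b"".join(dirty.values())
        entries = b"".join(struct.pack("<QI", lba, len(data))
                           for lba, data in dirty.items())
        header = struct.pack("<QII", self.generation, len(dirty),
                             zlib.crc32(entries + payload))
        base = self.f.tell() + len(header) + len(entries)
        self.f.write(header + entries + payload)   # one linear, atomic segment
        self.f.flush()                 # a real implementation would fsync here
        # Only after the segment is durable does the in-memory map advance.
        offset = base
        for lba, data in dirty.items():
            self.lba_map[lba] = (offset, len(data))
            offset += len(data)

store = LinearAtomicStore("/tmp/toy_store.bin")
store.commit({0: b"a" * 4096, 7: b"tiny record"})  # variable-sized payloads
```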


SmartNIC composable framework for flexible system partition

Remy Gauguey, Sr Software Architect - Data Center Business Unit, Kalray

Abstract

The variety of architectures, use-cases and workloads to be managed by Data Center appliances is increasing. It is driving a need for more and more flexibility in how systems are partitioned. This paper describes the architecture of a modular framework relying on standard modules and APIs such as Virtio or the SPDK bdev layer, and leveraging the parallelism of manycore processors. Mixing networking, storage or RDMA services, and taking advantage of hardware features such as SR-IOV, it allows for building efficient and compact SmartNICs combining a 200GE, PCIe Gen4 fast path with many offloading and value-added services. This SmartNIC architecture is a key enabler for many applications including Bare Metal Cloud, Software-Defined Networking using Open vSwitch, or advanced storage I/O servers.


HDD Multi-channel Deployment Scenarios in Linux Environments

Arie van der Hoeven, Principal Product Manager, Seagate Technology

Curtis Stevens, Technologist, Seagate Technology

Abstract

Multi-channel HDDs (aka Dual Actuator) provide opportunities to increase throughput and performance in HDD deployments. In this presentation, the authors will review current dual-actuator architectures and implementation scenarios in Linux environments, including performance impacts and deployment considerations.


Unlocking the New Performance and QoS Capabilities of the Software Enabled Flash™ API

Rory Bolt, Principal Architect, Senior Fellow, KIOXIA America

Abstract

The Software-Enabled Flash API gives unprecedented control to application architects and developers to redefine the way they use flash for their hyperscale applications, by fundamentally redefining the relationship between the host and solid-state storage. Dive deep into new Software-Enabled Flash concepts such as virtual devices, Quality of Service (QoS) domains, Weighted Fair Queueing (WFQ), Nameless Writes and Copies, and controller offload mechanisms. This talk by KIOXIA (formerly Toshiba Memory) will include real-world examples of using the new API to define QoS and latency guarantees, isolate workloads, minimize write amplification through application-driven data placement, and achieve higher performance with customized flash translation layers (FTLs).


Breaking the Ten Million IOps Barrier for RAID

Sergei Platonov, VP of Strategy, RAIDIX

Abstract

Modern NVMe drives allow us to get dozens of millions of IOps in a single system. However, the current RAIDs and volume management software based on parities (aka RAID5 and beyond) limit total performance to two million IOps for reading and 0.2 million for writing.

We redeveloped the RAID engine for Linux using extreme parallelization of I/O handling based on a lock-less approach and reached the 10 million IOps barrier for read and write operations.

To break the barrier, we moved to a newer Linux kernel that supports polling modes for NVMe drives and the io_uring interface.


Tiered Storage Deployments with 24G SAS

Jeremiah Tussey, Board of Directors, STA; Alliances Manager, Microchip

Abstract

Serial Attached SCSI (SAS) is the only storage interface that embraces both high performance and high reliability, as well as providing native compatibility with low-cost per gigabyte SATA drives. This capability allows SAS to span a variety of storage environments, including tiered storage solutions.

Large-scale data infrastructures utilize tens of thousands of HDDs and SSDs. Hyperscale companies need to be able to carefully manage them from a global perspective in a cost-effective way.

During this presentation, the speaker will review the benefits of tiered storage and how the latest features standardized in 24G SAS storage interface technology are helping enterprises store and move data across a range of storage media with different characteristics, such as performance, cost and capacity.


Is Gaming changing the Storage Architecture Landscape?

Leah Schoeb, Sr Developer Relations Manager, AMD

Abstract

The gaming industry is rapidly growing, and so is the need for extremely high-performance, high-capacity storage. Gaming storage requirements are driving new storage architectures and technologies to their limits, from loading to playing to archiving. Recent advances in cloud technology have turned the idea of cloud gaming into a reality. Cloud gaming, in its simplest form, renders an interactive gaming application remotely in the cloud and streams the scenes as a video sequence back to the player over the Internet. This is an advantage for less powerful computational devices that are otherwise incapable of running high-quality games. This session will discuss real-world performance with different types of games, covering performance, interaction latency, and streaming quality, and will reveal critical challenges to the widespread deployment of cloud gaming and the storage requirements for the best user experience.


Novel Technique of High-Speed Magnetic Recording Based on Manipulating Pinning Layer in Magnetic Tunnel Junction-Based Memory by Using Terahertz Magnon Laser

Boris Tankhilevich, CEO, Magtera, Inc.

Abstract

An apparatus for a novel technique of high-speed magnetic recording, based on manipulating the pinning layer in magnetic tunnel junction-based memory by using a terahertz magnon laser, is provided. The apparatus comprises a terahertz writing head configured to generate a tunable terahertz writing signal and a memory cell including a spacer whose thickness is configured based on the Ruderman-Kittel-Kasuya-Yosida (RKKY) interaction. The memory cell comprises two separate memory states: a first binary state and a second binary state. The first binary memory state corresponds to a ferromagnetic sign of the RKKY interaction at a first thickness value of the spacer, and the second binary memory state corresponds to an antiferromagnetic sign of the RKKY interaction at a second thickness value of the spacer. The thickness of the spacer is manipulated by the tunable terahertz writing signal.


An NVMe-oF Storage Diode for classified data storage

Jean-Baptiste Riaux, Senior Field Application Engineer, Kalray

Abstract

Developing a “Storage Diode” by combining specific pieces of storage technologies such as HDF5, multipathing, ACL, user authentication (Kerberos, LDAP...) while leveraging NVMe-oF, is very useful for classified sites requiring remote and secure replication on NVMe SSDs.

The storage diode is a dedicated storage system with two isolated read and write paths and guaranteed data integrity. Leveraging dual-port NVMe drives and the parallelism of advanced processors, this paper reviews how to fully isolate channels at both the logical and physical levels and dedicate write-only and read-only paths to storage devices over an NVMe-oF fabric.

This technique allows a restricted/classified computing center to push (write) data to the storage diode while ensuring that the path to the outside world can only be accessed read-only.


Rethinking Distributed Storage System Architecture for Fast Storage Devices

Myoungwon Oh, Samsung Electronics

Abstract

Storage devices have evolved drastically over the last decade. However, the advent of fast storage technology poses unprecedented challenges in the software stack; the performance bottleneck has shifted from storage devices to software. For instance, a modern large-scale storage system usually consists of a number of storage nodes connected via a network, and they communicate with each other all the time for cluster-level consistency and availability. Furthermore, it is common for a storage server to have multiple NVMe SSDs, which consequently multiplies the amount of work for I/O processing. For all these reasons, storage nodes tend to lack CPU resources, especially when handling small random I/Os. In this talk, we propose a new design of a distributed storage system for fast storage devices that focuses on minimizing CPU consumption while achieving both higher IOPS and lower latency. Our design is based on the following three ideas:

1. Lightweight data store: The backend data store should be as lightweight as possible. We should rethink the trend of accelerating I/O at the cost of burdening the host’s CPU. For example, LSM-tree-based key-value stores sequentialize I/Os for better random write performance and for efficient device-level GC. However, this requires a costly compaction process, which is known to consume non-negligible CPU power. To alleviate the burden on the host side, we have prototyped an in-place-update-based data store. It is also partitioned so that partitions can be accessed in parallel without synchronization.

2. Thread control: The run-to-completion (RTC) model is a well-known technique to lower I/O latency by mitigating context-switching overhead and inefficient cache operation. However, without efficient thread control and careful partitioning of the lock space, a latency-critical task can be blocked by a slow non-critical task. To avoid this problem, we propose a priority-based run-to-completion model. It runs latency-critical tasks on dedicated CPU cores, while running others on the remaining shared cores.

3. Mitigating replication overhead: We propose a replication method that relies on our NVMe-oF-based storage solution. Our storage solution has enough computation power to process more work than conventional storage while providing higher reliability through an internal redundancy mechanism. With our storage solution, we present a way to offload replication work to the NVMe-oF-based storage solution without losing fault tolerance while reducing CPU consumption: (1) decoupling the fault domain between compute and storage nodes, and (2) adding a new mapping to the existing storage system.

For performance evaluation, we have implemented our design based on Ceph. Compared to the existing approach, our prototype system delivers significant performance improvement for small random write I/Os.

Storage Networking

A QUIC Introduction

Lars Eggert, Technical Director, NetApp

Abstract

QUIC is a new UDP-based transport protocol for the Internet, and specifically, the web. Originally designed and deployed by Google, it already makes up 35% of Google's egress traffic, which corresponds to about 7% of all Internet traffic. The strong interest by many other large Internet players in the ongoing IETF standardization of QUIC is likely to lead to an even greater deployment in the near future. This talk will highlight:

  • Unique design aspects of QUIC
  • Differences to the conventional HTTP/TLS/TCP web stack
  • Early performance numbers
  • Potential side effects of a broader deployment of QUIC

Smart Fabrics: Building Self-Healing Fibre Channel Networks

Brandon Hoff, Director, Product Management, Fibre Channel Industry Association, Broadcom Inc.

Rupin Mohan, Director R&D, CTO SAN, HPE

Abstract

IT administrators are faced with a surge in digital demands while at the same time being overloaded with issue isolation and troubleshooting of performance problems. Given their demanding workload, wasted time becomes a stumbling block for the digital businesses they support. These administrators are being judged by a new set of rules: accelerate IT delivery and increase focus on digital transformation. Fabric Notifications, a new solution from the INCITS T11 Committee, enables hosts and Fibre Channel Fabrics to collaborate to identify and remediate events that cause performance problems on storage area networks. Today, the lossless, low-latency, high-performance storage connectivity that Fibre Channel delivers makes it the trusted technology for enterprise customers and a majority of networked block storage. Fabric Notifications builds on the benefits of Fibre Channel by sharing information between the Fabric and Hosts, enabling them to collaborate on remediating performance problems. This session will discuss what Fabric Notifications are, why they are important, the benefits of freeing up an IT administrator’s time, and how developers can take advantage of Fabric Notifications in their products.


Understanding Compute Express Link: A Cache-coherent Interconnect

Debendra Das Sharma, Intel Fellow; Director of I/O Technology and Standards Group, Intel

Abstract

Compute Express Link™ (CXL™) is an industry-supported cache-coherent interconnect for processors, memory expansion, and accelerators.

Datacenter architectures are evolving to support the workloads of emerging applications in Artificial Intelligence and Machine Learning that require a high-speed, low latency, cache-coherent interconnect. The CXL specification delivers breakthrough performance, while leveraging PCI Express® technology to support rapid adoption. It addresses resource sharing and cache coherency to improve performance, reduce software stack complexity, and lower overall systems costs, allowing users to focus on target workloads.

Attendees will learn how CXL technology maintains a unified, coherent memory space between the CPU (host processor) and CXL devices allowing the device to expose its memory as coherent in the platform and allowing the device to directly cache coherent memory. This allows both the CPU and device to share resources for higher performance and reduced software stack complexity. In CXL, the CPU host is primarily responsible for coherency management abstracting peer device caches and CPU caches. The resulting simplified coherence model reduces the device cost, complexity and overhead traditionally associated with coherency across an I/O link.


Use cases for NVMe-oF for Deep Learning Workloads and HCI Pooling

Nishant Lodha, Director of Technologies, Marvell

Abstract

The efficiency, performance and choice in NVMe-oF are enabling some very unique and interesting use cases – from AI/ML to hyperconverged infrastructures. Artificial intelligence workloads process massive amounts of data from structured and unstructured sources. Today most deep learning architectures rely on local NVMe to serve up tagged and untagged datasets into map-reduce systems and neural networks for correlation. NVMe-oF for deep learning infrastructures enables a shared data model for ML/DL pipelines without sacrificing overall performance and training times. NVMe-oF is also enabling HCI deployments to scale without adding more compute, enabling end customers to reduce dark flash and reduce cost. The talk explores these and several innovative technologies driving the next storage connectivity revolution.


CXL 1.1 Protocol Extensions: Review of the cache and memory protocols in CXL.

Robert Blankenship, Principal Engineer, Intel Corporation

Abstract

The CXL interface adds both a memory and a caching protocol between a host CPU and a device. The memory protocol enables a device to expose a memory region to the host to be used as system memory. The caching protocol can be used to directly cache host memory, allowing devices to implement advanced flows like data prefetching and hardware atomics within a cache. Devices supporting both protocols can directly access the memory exposed to the host, enabling high-performance accelerator and computational storage use cases that are tightly coupled with the host CPU.

This presentation will review the cache hierarchy in a modern server CPU and review the memory and cache protocol flows used in CXL allowing a memory device and/or accelerator to directly participate in the cache hierarchy of the CPU.


Storage Performance / Workloads

Realistic Synthetic Data at scale: Influenced by, but not production data

Mehul Sheth, Principal Performance Engineer, Veritas Technologies LLC

Abstract

To have high confidence in a product, testing it against a data set that resembles production data is a must. The challenge is in generating test data that represents production. Production data is not predictable; it doesn’t follow a simple formula, and many variables characterize it. Broadly, test data can be divided into two categories: arbitrary, which is random and unstructured, and realistic, which follows patterns and is predictable and controlled. To generate realistic test data, the right patterns need to be captured by analyzing the existing production data. Access to production data can be regulated and not easy to obtain. One approach is to implement code that reads relevant data from production, without exposing the actual data, and updates models that are then used to generate test data, so that the generated test data represents production data in selected dimensions, as directed by the business needs of the product under test.

In this session Mehul Sheth will talk about Druva's journey in generating test data at scale that is highly influenced by production data and has the “genes” of production data, yet not a single byte is taken “as-is” from production. Although Druva's journey and the decisions taken may be unique and not directly applicable in all scenarios, the session will highlight the thought process, algorithms and decisions in a generic fashion, focusing on the ability to assess the model and tweak it to include edge conditions while remaining realistic, applicable at all times, versatile, repeatable and easily controllable.

Specifically, the session describes a process for modeling a directory tree of files and folders with various variables (such as file size, the number of files and folders in each folder at each depth, patterns in file and folder names, the ratio of different file types, and other variables) that may be important for the application under test. It then shows how to apply this model to generate file-sets of different sizes but completely random data, maintaining the relationships between the modeled variables. Datasets thus generated are random in raw format yet maintain the characteristics of the model, and can be used for performance/stress testing of anti-virus software, legal discovery software or backup software. Extending the concept further, the approach can be used to model any data and metadata, such as mailboxes or transactional databases.
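
As a rough illustration of such model-driven generation, here is a minimal Python sketch; the model parameters (per-depth fan-out, a lognormal file-size distribution, file-type ratios) are hypothetical and are not Druva's actual tooling.

```python
# Minimal sketch of model-driven synthetic file-set generation: the model
# captures per-depth fan-out, a file-size distribution, and file-type ratios,
# while the file contents themselves are random bytes (requires Python 3.9+).
import os
import random

MODEL = {
    "max_depth": 3,
    "dirs_per_dir": {0: 4, 1: 3, 2: 2},        # directory fan-out by depth
    "files_per_dir": {0: 2, 1: 5, 2: 8, 3: 10},
    "size_lognormal": (9.0, 1.5),              # mu, sigma of ln(bytes)
    "type_ratio": {".txt": 0.5, ".jpg": 0.3, ".pdf": 0.2},
}

def pick_extension(rng):
    # choose a file extension according to the modeled type ratios
    return rng.choices(list(MODEL["type_ratio"]),
                       weights=list(MODEL["type_ratio"].values()))[0]

def generate(root, depth=0, rng=random.Random(42)):
    os.makedirs(root, exist_ok=True)
    for i in range(MODEL["files_per_dir"].get(depth, 0)):
        size = int(rng.lognormvariate(*MODEL["size_lognormal"]))
        name = f"file_{depth}_{i}{pick_extension(rng)}"
        with open(os.path.join(root, name), "wb") as f:
            f.write(rng.randbytes(size))       # random payload, modeled size
    if depth < MODEL["max_depth"]:
        for i in range(MODEL["dirs_per_dir"].get(depth, 0)):
            generate(os.path.join(root, f"dir_{depth}_{i}"), depth + 1, rng)

generate("/tmp/synthetic_dataset")
```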


Capture, Monitor and Analysis of Real World Edge workloads for server and application optimization

Eden Kim, CEO, Calypso Systems, Inc.

Abstract

Real World Edge workloads, from IoT to servers, edge servers, nodes and datacenter storage servers, are highly effective for optimizing server storage and applications. See how edge workloads are monitored and captured for workload balancing, curation and test script creation. Edge workloads are also monitored in real time to balance server loads and provide real time alerts for Key Performance Indicators. Curated Edge workloads are also used to optimize storage and applications and to provide Training for AI Machine Learning Long Short Term Memory Recurrent Neural Networks (AI ML LSTM RNN).


Platform Performance Analysis for I/O-intensive Applications

Ilia Kurakin, Senior Software Engineer, Intel Corporation

Perry Taylor, Senior Performance Monitoring Engineer, Intel Corporation

Alexander Antonov, Senior Software Engineer, Intel Corporation

Denis Pravdin, Senior Software Engineer, Intel Corporation

Abstract

High performance storage applications running on Intel® Xeon® processors actively utilize the I/O capabilities and I/O-accelerating features of the platform by interfacing with NVMe devices. Such I/O-intensive applications may suffer from performance issues, which in the big picture can be categorized into three domains: (1) I/O device bound – performance is limited by device capabilities; (2) core bound – performance is limited by algorithmic or microarchitectural code issues; (3) uncore bound – performance is limited by non-optimal interactions between devices and the CPU. This talk focuses on the latter case.

In Intel architectures the term “core” covers the execution units and private caches, and all the rest of the processor is referred to as the “uncore”, which includes the on-die interconnect, shared cache, cross-socket links, integrated memory and I/O controllers, etc. Activities happening on the I/O path in the uncore cannot be monitored with traditional core-centric analyses; there are pitfalls that require an uncore-centric view. Intel servers provide such a view by incorporating thousands of uncore performance monitoring events that can be collected in the performance monitoring units (PMUs) associated with uncore IP blocks. However, using raw counters for performance analysis requires deep knowledge of the hardware and can be incredibly challenging.

In this talk we will discuss platform-level activities induced by I/O traffic on Intel® Xeon® Scalable processors and summarize practices for achieving the best performance of storage applications. We will overview the telemetry points along the I/O traffic path and present an evolving uncore-specific performance analysis methodology that reveals platform-level inefficiencies, including poor utilization of Intel® Data Direct I/O Technology (Intel® DDIO).


Array Level Steady State Detection for ZFS Storage Servers

Ryan McKenzie, Senior Platform and Performance Engineer, iXsystems

Abstract

In this talk we will present a case study on detecting storage array steady state using a ZFS storage server. Determining array level steady state is valuable because it prevents variability in results caused by measuring performance during non-steady time periods, it saves personnel and equipment test time by avoiding retesting, and it can give insight on how various load conditions impact the performance of your storage array. We will present a set of metrics to track on the ZFS storage server and apply the SNIA Emerald steady state calculations for SSDs to these metrics.
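
As a rough illustration of how such a check can be automated, here is a minimal Python sketch; the thresholds follow commonly cited PTS-style criteria (data excursion within 20% of the measurement-window average and slope excursion within 10%), and the sample values are made up.

```python
# Sketch of a PTS-style steady-state check over a sliding measurement window
# (assumed criteria and thresholds; not the verbatim SNIA Emerald text).
def steady_state(samples, window=5, data_tol=0.20, slope_tol=0.10):
    if len(samples) < window:
        return False
    y = samples[-window:]
    x = range(window)
    avg = sum(y) / window
    # data excursion: max minus min, relative to the window average
    if (max(y) - min(y)) > data_tol * avg:
        return False
    # least-squares slope of the metric across the window
    mean_x = sum(x) / window
    slope = (sum((xi - mean_x) * (yi - avg) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    # slope excursion across the window, relative to the window average
    return abs(slope * (window - 1)) <= slope_tol * avg

# e.g. per-round throughput (MB/s) reported by the ZFS storage server
rounds = [612, 598, 589, 585, 584, 583, 582]
print(steady_state(rounds))   # True once the last five rounds have settled
```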

Storage Resource Management

SNIA Swordfish™ Overview and Deep Dive

Richelle Ahlvers, Board of Directors, SNIA

Abstract

If you’ve heard about the SNIA Swordfish open industry storage management standard specification but are looking for a deeper understanding of its value and functionality, this presentation is for you. The speaker will provide a broad look at Swordfish and describe the RESTful methods and JSON schema variants developed by SNIA’s Scalable Storage Management Technical Work Group (SSM TWG) and the Redfish Forum.
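
For orientation, a minimal sketch of what interacting with a Swordfish service over its RESTful interface can look like is shown below; the endpoint, credentials, and certificate handling are hypothetical, and the /redfish/v1/Storage collection path follows the Redfish/Swordfish convention.

```python
# Minimal sketch of reading a Swordfish storage collection over the Redfish
# RESTful interface (hypothetical endpoint and credentials; a real client
# should verify TLS certificates instead of passing verify=False).
import requests

BASE = "https://192.0.2.10"           # hypothetical Swordfish service
auth = ("admin", "password")          # hypothetical credentials

root = requests.get(f"{BASE}/redfish/v1/", auth=auth, verify=False).json()
storage_path = root.get("Storage", {}).get("@odata.id", "/redfish/v1/Storage")

collection = requests.get(BASE + storage_path, auth=auth, verify=False).json()
for member in collection.get("Members", []):
    # each member references a storage resource described by the JSON schema
    resource = requests.get(BASE + member["@odata.id"], auth=auth,
                            verify=False).json()
    print(resource.get("Id"), resource.get("Status", {}).get("Health"))
```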


What’s New in SNIA Swordfish™

Richelle Ahlvers, Board of Directors, SNIA

Abstract

If you haven’t caught the new wave in storage management, it’s time to dive in and catch up on the latest developments of the SNIA Swordfish™ specification. These include:

  • Adding support to map NVMe and NVMe-oF to Redfish and Swordfish
  • A new document providing implementers with guidance for error reporting and status code usage
  • New mockups on swordfishmockups.com showing more possible deployment permutations
  • Development of Swordfish CTP
  • ISO Standardization
  • Schema enhancements and simplifications: Moving /Storage to the Service Root
  • Tools ecosystem enhancements: Learn about all the new tools to help with everything from mockup self-validation to protocol checking.

How to Increase Demand for Your Products with the Swordfish Conformance Test Program

Richelle Ahlvers, Board of Directors, SNIA

Abstract

New this year, the SNIA Swordfish Conformance Test Program allows manufacturers to test their products with a vendor-neutral test suite to validate conformance to the SNIA Swordfish specification.

Swordfish implementations that have passed CTP are posted on the SNIA website; this information is available to help ease integration concerns of storage developers and increase demand for available Swordfish products.

This session will provide an overview of the program, including what functionality and base requirements implementations need in order to pass the initial version of Swordfish CTP. It will also cover the program features, additional benefits and how to participate.


NVMe and NVMe-oF Configuration and Manageability with Swordfish and Redfish

Rajalaxmi Angadi, Senior Software Developer, Intel Corporation

Krishnakumar Gowravaram, Senior Technical Leader and Architect, Cisco

Abstract

The SNIA Swordfish specification is currently growing to include full NVMe and NVMe-oF enablement and alignment across DMTF, NVMe, and SNIA for NVMe and NVMe-oF use cases. This presentation will provide an overview of the work in progress to map these standards together to ensure NVMe and NVMe-oF environments can be represented entirely in Swordfish and Redfish environments.


Zero to Swordfish Implementation Using Open Source Tools

Don Deel, SMI GB Chair, SNIA; Senior Standards Technologist, NetApp

Chris Lionetti, Board of Directors, SNIA; Senior Technical Marketing Engineer, HPE

Abstract

SNIA’s Storage Management Initiative sponsored the initial development of open source software tools that can help developers start working with Swordfish. These tools are available in open repositories that are managed by the SNIA Scalable Storage Management Technical Working Group on GitHub.

This session will walk through the tools you can use to go from zero to a working SNIA Swordfish implementation: starting from generating, validating and using static mockups, then using the emulator to make your mockups “come alive,” and finally verifying that your Swordfish service outputs match your expectations using open source validation tools, the same tools that feed into the Swordfish Conformance Test Program.


Migrating OEM Extensions to Swordfish for Scalable Storage Management

Krishnakumar Gowravaram, Senior Technical Leader and Architect, Cisco

Abstract

Before the release of the SNIA Swordfish™ v1.1.0 specification, direct attach server vendors trying to accomplish scalable or complex storage management with the DMTF Redfish® standard had to use OEM extensions to extend the limited storage management functionality that Redfish provides. Redfish is designed to manage converged, hybrid IT and the software defined data center.

During this presentation, the speaker from Cisco will provide an overview of the company’s existing storage management solution using Redfish storage and OEM extensions. The speaker will also discuss Cisco’s implementation experience to-date that consists of planning the migration of its OEM storage management Redfish extensions to the standards-based schema in the v1.1.0 SNIA Swordfish specification.


Redfish Ecosystem for Storage

Jeff Hilland, President, DMTF; Distinguished Technologist, HPE

Scott Bunker, Server Storage Technologist, HPE

Abstract

DMTF’s Redfish® is a standard designed to deliver simple and secure management for converged, hybrid IT and the Software Defined Data Center (SDDC).

This presentation will provide an overview of DMTF’s Redfish standard. It will also provide an overview of HPE’s implementation of Redfish, focusing on their storage implementation and needs.

HPE will provide insights into the benefits and challenges of the Redfish Storage model, including areas where functionality added to SNIA Swordfish™ is of interest for future releases.

Zoned Storage

Zoned Namespaces (ZNS) SSDs: Disrupting the Storage Industry

Matias Bjørling, Director, Emerging System Architectures, Western Digital Corporation

Abstract

Zoned Namespaces (ZNS) SSDs implement a new Command Set in NVMe™ that exposes a zoned block storage interface between the host and the SSD, allowing the SSD to align data to its media perfectly. As a result, an SSD can now expose more storage capacity (+20%), reduce SSD write amplification (4-5x), and improve I/O access latencies.

This talk introduces the Zoned Namespaces Command Set, which defines a new type of namespace (Zoned Namespaces) and the associated Zone Storage Model, optimized for SSDs. We show specific use cases where ZNS applies, and how to take advantage of it and use it in your applications.


Reviving The QEMU NVMe Device (from Zero to ZNS)

Klaus Jensen, Staff Software Engineer, Samsung Electronics

Abstract

The QEMU NVMe device allows developers to test host software against an emulated and easily inspectable PCIe device implementing NVMe. Unfortunately, development and the addition of new features have mostly stagnated since its original inclusion in the QEMU project.

This talk will explore how development of the device is being revived by the addition of NVMe v1.3 and v1.4 mandatory support, as well as various optional features such as multiple namespaces, DULBE, end-to-end data protection and upcoming NVMe technical proposals.

We will discuss how the tracing and debugging features of the device can be used to validate host software and testing frameworks and how the extensibility of the device allows rapid prototyping of new NVMe features. Specifically we will explore a full implementation of Zoned Namespaces and how this support is used to develop and verify host software.


Zoned Block Device Support in Hadoop HDFS

Shin'ichiro Kawasaki, Principal Engineer, Western Digital Corporation

Abstract

Zoned storage devices are a class of block devices with an address space that is divided into zones which, unlike regular storage devices, can only be written sequentially. The most common form of zoned storage today are Shingled Magnetic Recording (SMR) HDDs. This type of disk allows higher capacities without a significant device manufacturing cost increase, thereby resulting in overall storage cost reductions.

Support for zoned block devices (ZBD) was introduced in Linux with kernel version 4.10. This support provides an interface for user applications to manipulate the zones of a zoned device and also guarantees that writes issued sequentially will be delivered in the same order to the disk, thereby meeting the device's sequential write constraint.

Hadoop HDFS is a well-known distributed file system with high scalability, making it an ideal choice for big data computing applications. HDFS is designed for large data sets written mostly sequentially with a streaming-like access pattern. This characteristic is ideal for zoned device support, facilitating direct access to the device from HDFS rather than relying on an underlying local file system with ZBD support, an approach that potentially has higher overhead due to file system garbage collection activity.

This talk introduces a candidate implementation of ZBD support in Hadoop HDFS based on the simple Linux zonefs file system. This file system exposes the zones of a zoned device as files. HDFS data blocks are themselves stored in zonefs files, and symbolic links reference the zonefs files from the HDFS block file directory structure. File I/Os unique to zonefs files are encapsulated with a new I/O provider. The presentation will give an overview of this implementation and discuss performance results, comparing the performance of unmodified HDFS on a ZBD-compliant local file system (btrfs) with the performance obtained with the direct-access zonefs approach. The benefits of the latter approach in terms of lower software complexity will also be addressed.
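
The on-disk arrangement described above can be pictured with a short sketch; the mount point, DataNode directory and block name below are hypothetical and are not taken from the actual HDFS patch.

```python
# Toy illustration of the layout described above: an HDFS block file in the
# DataNode directory is a symbolic link to the zonefs zone file that holds
# the block data (all paths and names are hypothetical examples).
import os

ZONEFS_SEQ = "/mnt/zonefs/seq"                   # zonefs sequential-zone files
HDFS_BLOCK_DIR = "/data/hdfs/current/finalized"  # hypothetical DataNode dir

os.makedirs(HDFS_BLOCK_DIR, exist_ok=True)
zone_file = os.path.join(ZONEFS_SEQ, "42")       # zone chosen for this block
block_link = os.path.join(HDFS_BLOCK_DIR, "blk_1073741825")
os.symlink(zone_file, block_link)                # HDFS reads the block via the link
```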


zonefs: Mapping POSIX File System Interface to Raw Zoned Block Device Accesses

Damien Le Moal, Director, Western Digital Corporation

Abstract

The zonefs file system is a simple file system that exposes the zones of a zoned block device (host-managed or host-aware SMR hard disks and NVMe Zoned Namespace SSDs) as files, hiding from the application most zoned block device zone management and access constraints. The zonefs file system is intended as a simple solution for use cases where raw block device access from the application has been considered a better solution.

This talk will present zonefs features, with a focus on how the rich POSIX file system call interface is used to seamlessly implement the execution of zoned block device specific operations. In particular, the talk will cover zonefs changes to seamlessly accommodate the new NVMe Zoned Namespaces (ZNS), such as the number of active zones, the time zones can remain in the active state, and the new Zone Append command. The talk will conclude with an example use of zonefs with the key-value store application LevelDB, showing the advantages in terms of code simplicity over raw block device file accesses. Performance results with LevelDB as well as with synthetic benchmarks are also shown.
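
To give a feel for the file-based interface, here is a minimal Python sketch of appending a record to a sequential zone file and then resetting the zone; the mount point and zone number are examples, and the sketch assumes zonefs's convention of direct, append-only writes to files under the "seq" directory with truncate-to-zero used for zone reset.

```python
# Sketch of appending to a zonefs sequential zone file and resetting the zone
# (example paths; assumes append-only, direct I/O writes on sequential files).
import mmap
import os

zone = "/mnt/zonefs/seq/0"                       # example sequential zone file

# zonefs expects direct I/O on sequential files, so use a page-aligned buffer
buf = mmap.mmap(-1, 4096)
buf.write(b"log record".ljust(4096, b"\0"))

fd = os.open(zone, os.O_WRONLY | os.O_APPEND | os.O_DIRECT)
os.write(fd, buf)                                # lands at the zone write pointer
os.close(fd)

print(os.stat(zone).st_size)                     # file size tracks the write pointer

# Truncating a sequential zone file to zero resets the underlying zone.
fd = os.open(zone, os.O_WRONLY)
os.ftruncate(fd, 0)
os.close(fd)
```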


High-performance SMR drives with dm-zoned caching

Hannes Reinecke, Kernel Storage Architect, SUSE Software Solutions

Abstract

SMR drives have a very demanding programming model, requiring the host software to format write requests within very strict limits. This typically imposes a performance penalty when writing to SMR drives, such that the nominal performance is hard to achieve.

The existing dm-zoned device-mapper target implements internal caching using random zones; while this allows unmodified host software to run on SMR drives, the performance impact is even more severe.

In this talk I will present an update to dm-zoned, which extends the current implementation to use additional drives, either as a cache device or as additional zoned devices. This makes it possible to saturate the SMR drives without having to modify the host software.

By using a fast cache device like NV-DIMM one can easily scale the dm-zoned device across several SMR drives, presenting tens of terabytes to the application with near-native NV-DIMM speeds.

I will present the design principles of this extension and provide a short demo showing the improvements.


Improve Distributed Storage System TCO with Host-Managed SMR HDDs

Albert Chen, Founder, KALISTA IO

Abstract

Host-managed shingled magnetic recording (HM-SMR) devices write data sequentially to the platter and overlap new tracks on parts of previously written tracks. This results in higher track density to enable higher capacity points. The increase in drive capacity leads to lower total cost of ownership through fewer devices and servers as well as reduced maintenance, power and cooling costs. HM-SMR devices also have performance advantages. Since the host takes responsibility and control of device state and data placement, it can optimize to reduce tail latency, increase throughput, and manage performance at scale.

However, these advantages come at a cost as HM-SMR devices have a more complex and restrictive usage model compared to conventional HDDs. They require hosts to write sequentially, align IOs to device zone boundaries, and actively monitor and set zone states. In addition, host system software and hardware must be able to recognize and support the newly defined HM-SMR device type and zone management commands.

Currently, there is a spectrum of tools available to enable HM-SMR in the storage stack, such as SG_IO, libzbc, f2fs and dm-zoned. However, they require users to modify their applications or to be on a specific kernel version with additional modules. In addition, they cannot be easily containerized/virtualized to fit into today’s software-defined environments. These dependencies and restrictions result in confusion, friction and disruption to the user experience that frustrates both storage vendors and users.

In this talk, we will share our experience in creating a device friendly storage system to enable applications to use HM-SMR devices without modification nor worrying about kernel dependencies. This independence allows for easy containerization to fit seamlessly into existing workflows and orchestration frameworks. We will demonstrate how a HM-SMR solution with Ceph, Hadoop and Minio can be enabled with just 2 commands in the command line interface (CLI).

This presentation will introduce a novel row/column architecture and log-structured data layout that minimize I/O contention and latency while preventing hot write areas. We will see real-life examples of how host software with HM-SMR can reduce long tail latency and increase performance consistency by eliminating device background work and expensive, unnecessary seeks to enable devices to perform at their best. Finally, we will discuss benchmark results comparing our solution on HM-SMR drives with legacy file systems (e.g. xfs and ext4) on CMR drives.


ZNS: Enabling in-place updates and transparent high queue-depths

Javier Gonzalez, Principal Software Engineer, Samsung Electronics

Kanchan Joshi, Staff Engineer, Samsung Electronics

Abstract

Zoned Namespaces represent the first step towards the standardization of Open-Channel SSD concepts in NVMe. Specifically, ZNS brings the ability to implement data placement policies in the host, thus providing a mechanism to (i) lower the write-amplification factor (WAF), (ii) lower NAND over-provisioning, and (iii) tighten tail latencies. Initial ZNS architectures envisioned large zones targeting archival use cases. This motivated the creation of the “Append Command” - a specialization of nameless writes that allows increasing the device I/O queue depth beyond the initial limitation imposed by the zone write pointer. While this is an elegant solution, backed by academic research, the changes required in file systems and applications are making adoption more difficult.

As an alternative, we have proposed exposing a per-zone random write window that allows out-of-order writes around the existing write pointer. This solution brings two benefits over the “Append Command”: First, it allows I/Os to arrive out-of-order without any host software changes. Second, it allows in-place updates within the window, which enables existing log-structured file systems and applications to retain their metadata model without incurring a WAF penalty.

In this talk, we will cover in detail the concept of the random write window, the use cases it addresses, and the changes we have done in the Linux stack to support it.


xNVMe: Programming Emerging Storage Interfaces for Productivity and Performance

Simon Lund, Staff Engineer, Samsung

Abstract

The popularity of NVMe has gone beyond the limits of the block device. Currently, NVMe is standardizing Key-Value (KV) and Zoned (ZNS) namespaces, and discussions on the standardization of computational storage namespaces have already started.

While modern I/O submission APIs are designed to support non-block submission (e.g., io_uring), these new interfaces impose an extra burden on applications, which now need to deal with memory constraints (e.g., barriers, DMA-able memory).

To address this problem, we have created xNVMe (pronounced cross-NVMe): a user-space library that provides a generic layer for memory allocations and I/O submission, and abstracts the underlying I/O engine (e.g., libaio, io_uring, SPDK, and NVMe driver IOCTLs).

In this talk, we (i) present the design and architecture of xNVMe, (ii) give examples of how applications can easily integrate with it and (iii) provide an evaluation of the overhead that it adds to the I/O path.


File System Native Support of Zoned Block Devices: Regular vs Append writes

Naohiro Aota, Staff engineer, Western Digital

Abstract

Most file systems in use today have been designed assuming the ability to execute random read and write operations on the underlying storage device. This design assumption prevents correct operation with zoned block devices lacking random write capabilities, such as SMR hard disks and NVMe ZNS SSDs. Some file systems must rely on special block layer drivers to ensure sequential writes (e.g., ext4 and the dm-zoned device mapper). File systems using a copy-on-write design are, however, good candidates for native zoned storage support.

This talk will first present different techniques for implementing support for zoned block devices in file systems. Linux f2fs support, available since kernel version 4.10, will be used as a first example. A more advanced support technique using the new NVMe ZNS Zone Append command will also be presented, and its application to the btrfs file system detailed. This is followed by a presentation of performance results obtained with btrfs using micro-benchmarks, comparing the use of regular write commands and zone append write commands on zoned devices with btrfs performance on regular disks.


Getting Started with NVMe ZNS on QEMU

Dmitry Fomichev, R&D Technologist, Western Digital Corporation

Abstract

The new Zoned Namespaces (ZNS) feature set was recently ratified by NVM Express. This extension to the NVMe standard opens new possibilities for reducing flash storage costs and for achieving better system performance. These design goals, which are very often contradictory, are met by ZNS by leveraging the existing zoned block device abstraction first introduced with SMR HDDs. There is a great interest in this technology from cloud providers, file system developers and the storage engineering community as a whole.

Developing new applications or adapting existing applications to NVMe ZNS constraints can be a challenging task depending on the design of the target application. To facilitate development activities, virtual devices can be of great help, enabling debugging and step-by-step execution of the entire system I/O stack. The recently published QEMU PCI NVMe emulation driver code extension provides comprehensive support for ZNS to meet this goal.

In this talk, an overview of NVMe ZNS protocol and features will be given and their implementation in QEMU ZNS device emulation driver explained. Examples of how to configure and use ZNS in QEMU will be given, first in its simplest form and then in more advanced configurations. Examples will be shown through a live demonstration using QEMU running on a laptop.


ZenFS, Zones and RocksDB - Who likes to take out the garbage anyway?

Hans Holmberg, Technologist, Western Digital Corporation

Abstract

Zoned Namespaces (ZNS) Command Set is an important and exciting new command set for NVMe™ devices. It exposes a zoned block storage interface between the host and the SSD, that allows the SSD to perfectly align the data to its media.

As a result, it allows applications to minimize SSD write amplification, improve throughput and latency, and extend the life of the SSD. However, to take full advantage of ZNS, host support is required (e.g., in file systems and database systems) so that application data structures map onto the characteristics of these zones.

This talk presents ZenFS, a new RocksDB storage back-end which seamlessly creates an end-to-end integration with ZNS devices. We will show how ZenFS works with the constraints and advantages of ZNS SSDs, and how its co-design significantly improves RocksDB throughput, access latencies and capacity usage.


End To End Data Placement For Zoned Block Devices

Marc Acosta, Research Fellow, Western Digital Corporation

Abstract

End to End (E2E) Data Placement or intelligent placement of data onto media requires coordination between Applications, File System, and Zoned Block devices (ZBDs). If done correctly, E2E Data Placement with ZBDs will significantly reduce storage costs and improve application performance.

The talk will walk through state of the art database systems and define their data placement characteristics with the associated storage cost. Next, we discuss how E2E data placement can use the concept of a file to determine data associativity and efficiently store the file as zones on ZBDs. We will cover crucial ZBD metrics and present examples of how applications and file systems can be modified to be ZBD friendly. Methods to estimate the gains in throughput and storage cost reduction using E2E data placement and Zone Block devices will also be shown.

The attendees should leave the talk understanding how E2E data placement changes the role of Zoned Block Devices from storing LBAs to storing files, and how, by strategically mapping files and their data to zones, one gains device capacity and reduces storage costs while improving both the throughput and latency of a storage solution.