2020 Storage Developer Conference Abstracts

Break Out Sessions and Agenda Tracks Include:

Note: This agenda is a work in progress. Check back for updates on additional sessions as well as the agenda schedule.

Blockchain

Blockchain in Storage: Why Use It?

Olga Buchonina, CEO, ActionSpot

Abstract

We will describe and showcase Burstcoin and IPFS technology built on blockchain. In this presentation, we will explore how blockchain can be used in storage, and why using blockchain can help improve and address latency, security, and data integrity.

Using NVMe and NVMe-oF can improve not only performance and latency, but also address the scalability issues in blockchain while meeting market needs.


Off-chain Storage for Blockchain

Ramya Krishnamurthy, QA Architect [Test Expert], HPE

Ajay Kumar, Senior Test Specialist, HPE

Abstract

Storage infrastructure scalability for blockchain can be provided with off-chain storage: off-chain data is any data that is too large to be stored efficiently in the blockchain, or that must be able to be changed or deleted.

Off-chain data is classified as any structured or unstructured data that cannot be stored in the blockchain, for example, media and documentation files such as JPEGs and text files.

The proposed talk will also cover the aspects of the storage infrastructure needed to provide off-chain storage, since storage infrastructure is a critical element in a blockchain environment:

Flash/NVMe technology for performance - Off-chain storage unburdens the blockchain from storing large datasets, but this can impact performance since physical disks are slow. Storage technologies such as NVMe reduce processor-cycle requirements for storage and help achieve performance by using NVMe SSD drives

Ability to easily scale up/out to petabytes

  • We can easily scale up by adding additional disks/memory and scale out by adding additional nodes to the storage cluster, thereby enabling petabytes of data to be stored

Backup-and-recovery via snapshot or continuous data synchronization with storage product functions

  • To ensure that off-chain data is not lost, we recommend backing up the data using virtual copies [snapshots] for easy recovery. Continuous data synchronization can be achieved with a replication solution [either synchronous or asynchronous replication]

Inherent ability for data reduction with de-duplication/compression

  • We can achieve space savings using deduplication and compression technology, allowing the system's total storage to be used wisely

An example of a vendor-specific solution will also be covered.

Cloud

Deep dive on architecting storage applications for the public cloud economy

Josh Salomon, Senior Principal Software Engineer, Red Hat

Orit Wasserman, Senior Principal Software Engineer, Red Hat

Abstract

The public cloud presents a new economic model for computing: a "pay as you go" model in which you pay only for what you consume, but you pay for everything: compute, storage, networking, QoS, and more. This model suggests new considerations for application architectures that minimize public cloud cost. The presentation discusses the need for storage applications in the public cloud, new architectural considerations (applicable to all types of applications), and alternatives for reducing the cost of storage applications in the public cloud. This presentation is a sequel to the SDC EMEA 2020 presentation and includes a more in-depth discussion of instance storage usage, as well as ways to use spot instances for storage.


Is Gaming changing the Storage Architecture Landscape?

Leah Schoeb, Sr. Developer Relations Manager, AMD

Abstract

The gaming industry is rapidly growing, and so is the need for extremely high-performance, high-capacity storage. Gaming storage requirements are driving new storage architectures and technologies to their limits, from loading to playing to archiving. Recent advances in cloud technology have turned the idea of cloud gaming into a reality. Cloud gaming, in its simplest form, renders an interactive gaming application remotely in the cloud and streams the scenes as a video sequence back to the player over the Internet. This is an advantage for less powerful computational devices that are otherwise incapable of running high-quality games. This session will discuss real-world performance with different types of games, covering performance, interaction latency, and streaming quality, and will reveal critical challenges to the widespread deployment of cloud gaming and its storage requirements for the best user experience.

Computational Storage

Computational Storage, from edge to cloud

Jerome Gaysse, Senior Technology and Market Analyst, Silinnov Consulting

Abstract

There are at least 3 main trends for computational storage architectures: SSD with embedded computing, SSD attached to computing acceleration card, SSD attached to smart NIC or smart HBA.

The challenge is to understand the real value of each solution and identify the use cases where it provides the best ROI, as such technologies may lead to major hardware and software changes in system design.

This talk presents an analysis of computational storage application examples, from edge to cloud, highlighting the system benefits in terms of power savings, performance increases, and TCO reduction.


Deploying Computational Storage at the Edge

Scott Shadley, VP Marketing, NGD Systems

Abstract

With the growth of data generation at the Edge and the need to get value from that data quickly, the market has run into a hurdle: how to get enough compute and processing within the available space, power, and budget. The ability to deploy compute resources within storage devices using Computational Storage is key to the growth of this market.

This presentation will discuss the deployment of small form factor, ASIC-based solutions that bring value to end customers and platform developers, including a specific use case to be showcased.


The True Value of Storage Drives with Built-in Transparent Compression: Far Beyond Lower Storage Cost

Tong Zhang, Chief Scientist, ScaleFlux

Abstract

This talk will reveal that, beyond reducing data storage cost, emerging solid-state drives with built-in transparent compression bring exciting but largely unexplored opportunities to innovate the data storage management software stack (e.g., relational databases, key-value stores, and filesystems). The simple idea of integrating lossless data compression into storage drives is certainly not new and traces back decades. However, high-performance PCIe solid-state drives with built-in transparent compression have remained elusive on the commercial market until recently. In addition to the straightforward storage cost savings, this new breed of storage drives decouples logical storage space utilization efficiency from physical flash storage space utilization efficiency. As a result, it allows data management software to purposely “waste” logical storage space in return for employing much simpler data structures and algorithms, without sacrificing physical storage cost. Naturally, simpler data structures and algorithms come with higher performance and/or lower CPU/memory usage. This creates a large but unexplored space for re-thinking data management software stack design.

This talk will present our recent work exploring this new territory in the context of relational databases and key-value stores. In particular, this talk will (1) introduce the basics of storage drives with built-in transparent compression and their implementation challenges, (2) discuss how one could configure or even slightly modify MySQL and PostgreSQL (the two most popular relational databases) to significantly benefit from such storage drives in terms of both performance and cost, and (3) present a new open-source key-value store created from scratch to take full advantage of such storage drives, which can achieve higher performance and efficiency than existing key-value store solutions.

As storage drives with built-in transparent compression quickly enter the commercial market, it is our hope that this talk will inspire the data storage community to develop many more elegant ideas that let future data management stacks fully embrace and benefit from such storage drives.
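The space-for-simplicity trade described above can be illustrated with a small, hypothetical sketch: a variable-length record is placed in a fixed-size logical slot, and the zero padding, which would be wasted space on an ordinary drive, compresses away on a drive with built-in transparent compression. Here zlib stands in for the drive's compression engine, and the slot size and record format are invented for illustration.

```python
import zlib

RECORD_SLOT = 4096  # fixed-size logical slot (assumption for illustration)

def pad_record(payload: bytes) -> bytes:
    """Place a variable-length record in a fixed-size slot, zero-padded."""
    assert len(payload) <= RECORD_SLOT
    return payload + b"\x00" * (RECORD_SLOT - len(payload))

payload = b"user:42|balance:100|" * 10   # ~200 bytes of real data
slot = pad_record(payload)

# The padded slot occupies 4096 logical bytes, but the zero padding
# compresses to almost nothing, so physical usage stays near the payload size.
compressed = zlib.compress(slot)
print(len(slot), len(compressed))
```

The fixed-size slot lets software index records by simple multiplication instead of maintaining a variable-length layout, which is the kind of simplification the talk describes.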

Container Storage

Bring Agility and Data Management to your Hybrid Multicloud DevOps strategy with Trident

Ron Feist, Subject Matter Expert Elite for Hybrid Cloud, NetApp

Abstract

Container orchestrators such as Kubernetes enable the automation of deployment, scaling, and management of applications in your cloud of choice. But how do you move data on demand between your private and public clouds? Trident is an open-source storage orchestrator for containers maintained by NetApp that makes it trivial to connect the available cloud and on-premises storage options to your containers. This session will outline using Trident in several deployment scenarios for cloud-based containerized applications, as well as on-premises storage management using the Container Storage Interface. We will also demonstrate how Trident allows you to move workloads between clouds, and how Astra will help you with automation and cataloging.


Leveraging Modern Network to Deliver Faster Storage to Database Workloads in Kubernetes

Amarjit Singh, Director, DevOps Solutions, Kioxia America

Abstract

Achieving optimal performance for stateful database workloads has typically required local storage drives, which limited orchestration frameworks from scaling these workloads across a networked infrastructure. With the advancement of storage protocols such as NVMe and NVMe-oF and networking technologies such as RDMA, RoCE, and SmartNICs, it is now possible to achieve DAS-like performance over a network. Through resource disaggregation, compute and storage can scale independently of each other, with the ability to dynamically allocate storage capacity and performance to applications as needed. In this presentation, we will discuss advancements in networking and storage technologies and how they are blending together to address the demands of modern workloads and scheduling frameworks (such as Kubernetes). In the hands-on session, we will demonstrate how to leverage RDMA/RoCEv2 and TCP to integrate and provision faster storage on Kubernetes platforms.


Data Protection in a Kubernetes-Native World

Niraj Tolia, CEO, Kasten

Abstract

What is Kubernetes-native backup, and does one really need backup with Kubernetes? Will Kubernetes environments stay stateless forever? Why don’t legacy VM-backup systems work with containers? This talk gets to the bottom of these questions and more!

In particular, we will cover seven critical considerations for Kubernetes-native backup and show their importance in implementing a cloud-native backup strategy that will protect your business-critical data in a developer-focused platform:

  • Kubernetes Deployment Patterns
  • DevOps and "Shift Left"
  • Kubernetes Operator Challenges
  • Application Scale
  • Protection Gaps
  • Security
  • Ecosystem Integration

We will also cover the pitfalls of trying to retrofit legacy backup architectures into a cloud-native ecosystem but, more importantly, focus on the benefits of deploying a truly cloud-native backup solution.

Data Management

Data Sovereign Collaboration Platform for Autonomous Vehicles

Radha Krishna Singuru, DMTS - Senior Member, Wipro Technologies

Abstract

In today’s digital world, data sovereignty has become a big area of concern for many countries. Many countries are concerned about data that is generated locally being processed and persisted beyond their geographical control. This is particularly relevant in the age of cloud computing, with global development centers processing and persisting data.

Governments across the globe are enacting new and stringent laws for data sovereignty. This is pushing various industries, specifically the autonomous vehicle industry, to innovate more and create new business models for data sovereignty compliance. There is a need to build an innovative Rapid Collaboration Platform (RCP) that can address data sovereignty concerns about where the data will be stored and how it complies with local laws, and ensure data privacy and data security, while enabling business-as-usual activities to be performed more efficiently.

AV RCP is a collaboration platform that enables data-driven development of autonomous driving, built on the principles of distributed architecture to solve various challenges faced by autonomous vehicle development teams. It enables seamless access to vehicle test data for users such as software engineers, algorithm developers, applied scientists, and ML engineers. It provides a global search engine, a native compute engine, and various data management tools so that test data is available and accessible to teams spread across regions. The distributed architecture ensures data remains at the place where it is collected. This not only ensures data sovereignty but also optimizes network bandwidth usage and cost, and helps improve the productivity of the overall AV development team. It supports hybrid cloud deployment options that build on existing on-premises infrastructure investments as well as leverage the latest technological capabilities of public clouds.

The same platform can be extended for other industry segments like Health, Auto, Energy, Utilities etc.

Data Protection and Data Security

Data Preservation & Retention 101

Thomas Rivera, Strategic Success Manager, VMware Carbon Black

Abstract

There are many instances in which the terms "retention" and "preservation" are used interchangeably and incorrectly. This can result in different and conflicting requirements that govern how the same information is maintained, how long it must be kept, and whether and how it is protected and secured. This session highlights the differences between retention and preservation.


OS Level Encryption for Superior Data Protection

Peter Scott, Senior Engineer, Thales, Inc

Abstract

While protecting data at rest, or live data, using a hardware-based approach is efficient and fast, it does not allow the flexibility of per-file access control and data protection. The approach we have taken at Thales allows for per-file access control and transparent data protection while providing the flexibility to rotate keys within a distributed key management system without affecting access. This solution covers a wide range of platforms, but this talk will be limited to the Windows implementation, which leverages a layered file system to achieve transparency. Some of the features that will be discussed include:

  • How to support per file access control in a distributed system
  • Managing access to files undergoing a transformation or key rotation in both local and network environments
  • Allowing for access to encrypted content while providing clear text access to files simultaneously

Diving into each of these topics, with sidebars, we will give the audience a clear picture of the complexities involved. For example, in a distributed environment, how does one ensure that during key rotations all clients are using the correct key for data encryption for various ranges of the file, without falling back to single-use access?

Integration with Windows subsystems such as the Cache and Memory Manager will be covered to ensure the subtleties of supporting concurrent multi-data-form access are not lost, as well as where to draw the line in terms of allowing the native file system to maintain some metadata without losing robustness and flexibility in the design. We’ll answer this and more while covering the details of the design to achieve live data protection.

An understanding of the Windows layered driver model, particularly in the area of file systems and file system filters will help in understanding the topics discussed.

/etc

Novel Technique of High-Speed Magnetic Recording Based on Manipulating Pinning Layer in Magnetic Tunnel Junction-Based Memory by Using Terahertz Magnon Laser

Boris Tankhilevich, CEO, Magtera, Inc.

Abstract

An apparatus for a novel technique of high-speed magnetic recording, based on manipulating the pinning layer in magnetic tunnel junction-based memory by using a terahertz magnon laser, is provided. The apparatus comprises a terahertz writing head configured to generate a tunable terahertz writing signal, and a memory cell including a spacer whose thickness is configured based on the Ruderman-Kittel-Kasuya-Yosida (RKKY) interaction. The memory cell comprises two separate memory states: a first binary state and a second binary state. The first binary memory state corresponds to a ferromagnetic sign of the RKKY interaction at a first thickness value of the spacer, and the second binary memory state corresponds to an antiferromagnetic sign of the RKKY interaction at a second thickness value of the spacer. The thickness of the spacer is manipulated by the tunable terahertz writing signal.

File Systems

Marchive: Extending MarFS to a Long Term Archive

Garrett Ransom, Scientist, Los Alamos National Laboratory

Abstract

In response to the ever increasing bandwidth, capacity, and resiliency requirements of HPC data storage, Los Alamos National Laboratory developed MarFS, an open source filesystem providing a near-POSIX interface atop abstracted data and metadata storage implementations. For several years, the MarFS library has provided a high bandwidth, high resiliency storage tier for production data, known as Campaign Storage. The success of MarFS in this context, as well as the flexible nature of its underlying data and metadata storage, has spurred interest in extending the codebase to support the data archive needs of the Laboratory. Known as Marchive, this archive system will provide long term stability by storing parity protected data objects across magnetic tape media and expose a batch interface for efficient data ingest and retrieval.

This presentation will review the concept of MarFS, describe the extension of that concept to form a Marchive system, and relate some of the more interesting solutions to have emerged from this effort.


Tracing and visualizing file system internals with eBPF superpowers

Suchakrapani Sharma, Staff Scientist, ShiftLeft Inc

Hani Nemati, Software Engineer, Microsoft

Abstract

The Linux kernel storage stack consists of several interconnected layers, including the Virtual File System (VFS), the block layer, and device drivers. VFS provides the main interface to userspace applications and is where files and directories are handled. As we go deeper, most accesses are translated to actual I/O operations in the kernel's block layer. Investigating storage performance issues requires full insight into all these layers.

In this talk, we begin by discussing the journey of a simple filesystem call from userspace all the way into the kernel. We explain how tools like Ftrace can be used to understand control flow inside the kernel. Once we understand the “points of interest” in the control flow of how the kernel handles the request from userspace, we then move on to discuss eBPF-based approaches to compute meaningful storage performance/security metrics. We will showcase this with our small and nifty framework, which includes a visualization system with different graphical views that present the collected information about disk accesses in a convenient way. The goal of our talk is not just to show “yet another iotop-like tool”, but to highlight the versatility of the eBPF VM in the Linux kernel, which now allows developing targeted, plug-and-play tools to gather precise data about a system’s activity for security and performance debugging. To this end, we will explain in depth what actually happens when such targeted eBPF-based probing is used to extract meaningful data from the kernel. We explain the plumbing behind simple observability tools such as biolatency, vfsstat, etc. [1] that have been built using eBPF, and how to build a custom tool yourself.

[1] https://github.com/iovisor/bcc#tools
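As a flavor of the plumbing behind such tools, the sketch below reproduces, in plain Python, the log2 latency bucketing that biolatency-style tools perform. In the real tool, per-I/O latencies are aggregated in an eBPF map inside the kernel; the sample latencies here are invented.

```python
# A biolatency-style log2 histogram aggregator, sketched in userspace only.

def log2_bucket(us: int) -> int:
    """Return the power-of-two bucket index for a latency value (microseconds)."""
    bucket = 0
    while (1 << (bucket + 1)) <= us:
        bucket += 1
    return bucket

def histogram(latencies):
    """Count latencies into power-of-two buckets, as the eBPF map would."""
    hist = {}
    for us in latencies:
        b = log2_bucket(max(us, 1))
        hist[b] = hist.get(b, 0) + 1
    return hist

samples = [3, 5, 9, 17, 130, 140, 1100]   # invented I/O latencies in usecs
hist = histogram(samples)
for bucket in sorted(hist):
    lo, hi = 1 << bucket, (1 << (bucket + 1)) - 1
    print(f"{lo:>6} -> {hi:<6} : {'*' * hist[bucket]}")
```

The kernel side does the same bucketing per event so that only the small histogram, not every latency sample, crosses into userspace, which is what keeps these tools low-overhead.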

Keynote Speakers/General Sessions

Analog Memory-based techniques for Accelerating Deep Neural Networks

Sidney Tsai, Research Staff Member, Manager, IBM

Abstract

Deep neural networks (DNNs) are the fundamental building blocks that allowed explosive growth in machine learning sub-fields such as computer vision and natural language processing. Von Neumann-style information processing systems are the basis of modern computer architectures. With Moore's Law slowing and Dennard scaling ended, data communication between memory and compute, i.e. the “Von Neumann bottleneck,” now dominates considerations of system throughput and energy consumption, especially for DNN workloads. Non-Von Neumann architectures, such as those that move computation to the edge of memory crossbar arrays, can significantly reduce the cost of data communication.

Crossbar arrays of resistive non-volatile memories (NVM) offer a novel solution for deep learning tasks by computing matrix-vector multiplication in analog memory arrays. The highly parallel structure and computation at the location of the data enables fast and energy-efficient multiply-accumulate computations, which are the workhorse operations within most deep learning algorithms. In this presentation, we will discuss our Phase-Change Memory (PCM) based analog accelerator implementations for training and inference. In both cases, DNN weights are stored within large device arrays as analog conductances. Software-equivalent accuracy on various datasets has been achieved in a mixed software-hardware demonstration despite the considerable imperfections of existing NVM devices, such as noise and variability. We will discuss the device, circuit and system needs, as well as performance outlook for further technology development.

Key Value

Key Value Standardized

William Martin, SSD I/O Standards, Samsung

Abstract

The NVMe Key Value (NVMe-KV) Command Set has been standardized as one of the new I/O Command Sets that NVMe supports. Additionally, SNIA has standardized a Key Value API that works with NVMe-KV to allow access to data on a storage device using a key rather than a block address. The NVMe-KV Command Set uses the key to store a corresponding value on non-volatile media, then retrieves that value from the media when the corresponding key is specified. Key Value allows users to access key-value data without the costly and time-consuming overhead of additional translation tables between keys and logical blocks. This presentation will discuss the benefits of Key Value storage, present the major features of the NVMe-KV Command Set and how it interacts with the NVMe standards, and present open-source work that is available to take advantage of Key Value storage.
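To make the key-addressed semantics concrete, here is a hypothetical, self-contained mock of Store/Retrieve/Exist/Delete commands. A Python dict stands in for the device media; the class and method names are illustrative, not part of the NVMe-KV or SNIA KV API specifications.

```python
# Hypothetical illustration of NVMe-KV command semantics: data is addressed
# by key, not by logical block address, so no key-to-LBA translation table
# is needed in host software.

class KVNamespace:
    MAX_KEY_LEN = 16  # NVMe-KV keys are short byte strings with a fixed maximum

    def __init__(self):
        self._media = {}   # stand-in for non-volatile media

    def store(self, key: bytes, value: bytes) -> None:
        if len(key) > self.MAX_KEY_LEN:
            raise ValueError("key too long")
        self._media[key] = value

    def retrieve(self, key: bytes) -> bytes:
        return self._media[key]

    def exist(self, key: bytes) -> bool:
        return key in self._media

    def delete(self, key: bytes) -> None:
        self._media.pop(key, None)

ns = KVNamespace()
ns.store(b"sensor-7", b"temperature=21.5")
print(ns.retrieve(b"sensor-7"))
```

An application using a real KV drive issues the same four operations against the device, with the drive's firmware managing placement internally.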

Machine Learning

Parallel Machine Learning Algorithms on Distributed Platforms

Shobha G, Parallel Machine Learning Algorithms on distributed Platform, R V College of Engineering

Abstract

Implementing machine learning algorithms involves computationally intensive operations on large data sets, and executing an algorithm on a single node takes a huge amount of time. The performance of ML algorithms can be improved if they are executed in parallel in a distributed environment. This paper explores the HPCC Systems distributed architecture, which is open-source software with the ability to perform operations using Massively Parallel Processing (MPP). This paper also investigates a novel implementation of a parallel and distributed DBSCAN algorithm on the HPCC Systems platform. It has been found that the parallelized algorithm performs eight times better for a higher number of data points and takes exponentially less time as the number of data points increases.
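For readers unfamiliar with the algorithm being parallelized, the following is a minimal serial DBSCAN sketch. The distributed HPCC Systems implementation described in the paper partitions points across nodes and merges clusters, which this sketch does not attempt; the sample points are invented.

```python
# Minimal serial DBSCAN: density-based clustering with noise detection.

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including itself)."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1          # not a core point: mark as noise for now
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2, min_pts=2))
```

The region query dominates the cost, which is why distributing points across nodes (each computing local neighborhoods, then merging clusters at the boundaries) pays off for large datasets.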

NVMe

pynvme: an open, fast and extensible NVMe SSD test tool

He Chu, Engineer, Geng Yun Technology Pte. Ltd.

Abstract

SSDs are becoming ubiquitous in both the Client and Data Center markets. Requirements on function, performance, and reliability are refreshed frequently. As a result, SSD designs, especially the firmware, have kept being upgraded and restructured over the past decade.

Testing keeps this change under control. However, firmware testing is not as mature as software testing. We have well-developed methodologies, processes, and tools for software, but the embedded platform where the firmware executes provides only limited computation and memory resources, so it is difficult to run full tests in the native embedded environment. In practice, SSD vendors run system tests with third-party software, consuming huge resources. Existing tools lack the flexibility to run efficient tests against a vendor's own features and flaws. SSD developers need an infrastructure for implementing their test scripts or programs at low cost. Our pynvme is just the answer.

Pynvme is open. It is not only an open-source project, but also a testing solution utilizing the open software ecosystem: we can use the mature testing software of the cloud era in SSD testing. Pynvme is very fast, even faster than FIO. It is based on a user-space driver that accesses NVMe drives directly, bypassing the overhead of the whole storage software stack in the Linux kernel. Pynvme is extensible. We can access any PCIe configuration and BAR space to implement our own test-dedicated NVMe driver in Python scripts. Based on pynvme, test developers can write and deploy test scripts efficiently with a lower software and hardware budget.


ZNS: Enabling in-place updates and transparent high queue-depths

Javier Gonzalez, Principal Software Engineer, Samsung Electronics

Abstract

Zoned Namespaces represent the first step towards the standardization of Open-Channel SSD concepts in NVMe. Specifically, ZNS brings the ability to implement data placement policies in the host, thus providing a mechanism to (i) lower the write-amplification factor (WAF), (ii) lower NAND over-provisioning, and (iii) tighten tail latencies. Initial ZNS architectures envisioned large zones targeting archival use cases. This motivated the creation of the “Append Command”, a specialization of nameless writes that allows the device I/O queue depth to be increased beyond the limitation imposed by the zone write pointer. While this is an elegant solution, backed by academic research, the changes required in file systems and applications are making adoption more difficult.

As an alternative, we have proposed exposing a per-zone random write window that allows out-of-order writes around the existing write pointer. This solution brings two benefits over the “Append Command”: First, it allows I/Os to arrive out-of-order without any host software changes. Second, it allows in-place updates within the window, which enables existing log-structured file systems and applications to retain their metadata model without incurring a WAF penalty.

In this talk, we will cover in detail the concept of the random write window, the use cases it addresses, and the changes we have done in the Linux stack to support it.
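A minimal model of the per-zone random write window might look like the following. Zone size, window size, and the API are illustrative assumptions, not the proposal's actual interface: writes may land out of order anywhere inside the window, and the write pointer advances only over contiguously written sectors.

```python
# Sketch of a zone with a random write window: out-of-order and in-place
# writes are accepted in [write_pointer, write_pointer + window).

class Zone:
    def __init__(self, size, window):
        self.size = size
        self.window = window
        self.write_pointer = 0
        self.written = set()

    def write(self, lba):
        lo, hi = self.write_pointer, self.write_pointer + self.window
        if not (lo <= lba < min(hi, self.size)):
            raise ValueError(f"LBA {lba} outside random write window [{lo}, {hi})")
        self.written.add(lba)   # in-place updates within the window are allowed
        while self.write_pointer in self.written:
            self.write_pointer += 1   # advance over contiguously written LBAs

zone = Zone(size=1024, window=8)
for lba in (2, 0, 1, 3):        # out-of-order arrival, as from a deep queue
    zone.write(lba)
print(zone.write_pointer)
```

Because out-of-order arrival is absorbed by the window, the host can keep many writes in flight without the software changes the Append Command requires.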


Use cases for NVMe-oF for Deep Learning Workloads and HCI Pooling

Nishant Lodha, Director of Technologies, Marvell

Abstract

The efficiency, performance, and choice in NVMe-oF are enabling some unique and interesting use cases, from AI/ML to hyperconverged infrastructures. Artificial intelligence workloads process massive amounts of data from structured and unstructured sources. Today, most deep learning architectures rely on local NVMe to serve tagged and untagged datasets into map-reduce systems and neural networks for correlation. NVMe-oF for deep learning infrastructures enables a shared data model for ML/DL pipelines without sacrificing overall performance and training times. NVMe-oF is also enabling HCI deployments to scale without adding more compute, enabling end customers to reduce dark flash and reduce cost. The talk explores these and several innovative technologies driving the next storage connectivity revolution.


High-performance RoCE/TCP solutions for end-to-end NVMe-oF communication

Jean-Francois Marie, Chief Solution Architect, Kalray

Abstract

Exploiting the full SSD performance in scalable disaggregated architectures is a continuous challenge. NVMe/TCP, released in 2018, enables a broader sharing of distributed storage resources. It complements NVMe-oF over RDMA, avoiding performance degradation over distant links and simplifying deployment. However, this comes at the cost of a heavier networking stack and requires the latest Linux kernels. In this talk, we will analyze the differences between RoCE and TCP and show how to eliminate bottlenecks, achieving best-in-class performance for both protocols in end-to-end NVMe-oF communication. We will also demonstrate how this solution can be OS-agnostic, ensuring seamless integration of NVMe-oF in today's datacenters.


NVMe over Fabrics in the Enterprise

Rupin Mohan, Director R&D, CTO SAN, HPE

Abstract

This session will discuss application and use case examples leveraging the NVMe 1.4 and NVMe-oF 1.1 specifications. Get a unique perspective on how NVMe technology and NVMe over Fabrics is evolving to redefine next generation SAN and the key fabric requirements to enable this new frontier in the next generation enterprise data centers. This session will cover:

  • Shift of NVMe drives from inside servers to outside the servers – disaggregated storage
  • A second-order effect of this would be like it’s 1999 again: where NVMe over Fabrics is today versus where Fibre Channel was circa 1996-1999
  • Introduce the idea of a centralized discovery controller (with NVMe-BOD permission, of course) and how the industry, led by HPE, is coming together on the need for centralized name services and the concept of a ‘fabric’, which is missing in Ethernet right now
  • The opportunity to drive the same technology across on-prem, hybrid and cloud networks in terms of storage networking
  • Lastly, the concept of NVMe over Fabrics-connected drives, and how the new storage architectures will need an even bigger focus on, and reliance on, the storage fabric; the fabric will be ubiquitous, will need to be a single fabric, and there will be synergies across the front end, the back end, and inside the storage controllers

Optimizing user space NVMe-oF TCP transport solution with both software and hardware methodologies

Ziye Yang, Staff Cloud Software Engineer, Intel

Abstract

In this talk, we would like to give an update on the development status of the SPDK user-space NVMe/TCP transport and the performance optimizations of the NVMe/TCP transport in both software and hardware. Over the past year, there have been great efforts to optimize NVMe-oF transport performance in software, especially with the kernel TCP/IP stack, such as: (1) trading off memory-copy cost to reduce system calls and achieve optimal NVMe/TCP transport performance on top of the kernel TCP/IP stack; (2) using asynchronous writev to improve IOPS; (3) using libaio/liburing to implement group-based I/O submission for write operations. We also spent some effort investigating user-space TCP/IP stacks (e.g., Seastar) to explore performance optimization opportunities. In this talk, we also share Intel’s latest effort to optimize the NVMe/TCP transport in SPDK using Application Device Queue (ADQ) technology from Intel 100G NICs, which improves NVMe/TCP transport performance significantly. We will talk about how SPDK can expose the ADQ feature provided by Intel's new NIC in our common Sock layer library to accelerate NVMe-oF TCP performance, and share performance data with Intel's latest 100Gb NIC (the E810). ADQ significantly improves the performance of the NVMe/TCP transport in SPDK, with reduced average latency, a significant reduction in long-tail latency, and much higher IOPS.


xNVMe: Programming Emerging Storage Interfaces for Productivity and Performance

Simon Lund, Staff Engineer, Samsung

Abstract

The popularity of NVMe has gone beyond the limits of the block device. Currently, NVMe is standardizing Key-Value (KV) and Zoned (ZNS) namespaces, and discussions on the standardization of computational storage namespaces have already started.

While modern I/O submission APIs are designed to support non-block submission (e.g., io_uring), these new interfaces place an extra burden on applications, which now need to deal with memory constraints (e.g., barriers, DMA-able memory).

To address this problem, we have created xNVMe (pronounced cross-NVMe): a user-space library that provides a generic layer for memory allocations and I/O submission, and abstracts the underlying I/O engine (e.g., libaio, io_uring, SPDK, and NVMe driver IOCTLs).

In this talk, we (i) present the design and architecture of xNVMe, (ii) give examples of how applications can easily integrate with it and (iii) provide an evaluation of the overhead that it adds to the I/O path.
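
The abstraction idea can be sketched in Python (a conceptual illustration only; the class and method names below are invented for this sketch and are not the real xNVMe API):

```python
import abc

# Applications code against one generic submission interface while the
# backend engine is swappable -- real engines would wrap libaio, io_uring,
# SPDK, or NVMe driver IOCTLs.

class IOEngine(abc.ABC):
    @abc.abstractmethod
    def submit(self, op, offset, data=None): ...

class InMemoryEngine(IOEngine):
    """Stand-in backend used here so the example is self-contained."""
    def __init__(self):
        self.store = {}
    def submit(self, op, offset, data=None):
        if op == "write":
            self.store[offset] = data
            return len(data)
        return self.store.get(offset)

class Device:
    """Engine-agnostic handle: the application never sees the backend."""
    def __init__(self, engine):
        self.engine = engine
    def write(self, offset, data):
        return self.engine.submit("write", offset, data)
    def read(self, offset):
        return self.engine.submit("read", offset)

dev = Device(InMemoryEngine())
dev.write(0, b"hello")
result = dev.read(0)
```

Swapping `InMemoryEngine` for another `IOEngine` subclass changes the I/O path without touching application code, which is the productivity argument the talk makes.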

Persistent Memory

Mortimer: A high-performance scale-out storage system for persistent memory and NVMe SSDs

Anjaneya Chagam, Cloud Architect, Intel Corporation

Abstract

Mortimer is open-source software designed from the ground up to take advantage of byte-addressable persistent memory and deliver high-performance, low-latency storage. Mortimer uses persistent memory for metadata lookups and fast write buffering; buffered writes are flushed from persistent memory to NVMe SSDs in the background. Highly optimized lock-less algorithms exploit DRAM bandwidth while relying on byte-addressable persistent storage for metadata durability. The data path is optimized using NVMe-oF, with distributed control-plane and data-plane extensions for seamless application integration, and a poll-mode, lockless design pattern is adopted throughout the data path for optimal use of compute resources. Mortimer is built on top of existing open-source development kits (SPDK, PMDK, etc.) and adopts proven open-source techniques such as consistent hashing and consensus protocols to deliver distributed storage semantics. This session covers Mortimer's architecture, algorithms, roadmap, and benchmarking data to demonstrate how persistent memory is used to deliver low-latency scale-out storage.
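
The write path described above might be sketched roughly as follows (my own simplification for illustration; `TieredStore` and its behavior are invented for this sketch, not Mortimer code):

```python
# Writes land in a fast persistent-memory buffer and are later flushed to
# the NVMe SSD tier; reads check the buffer first so recent writes are
# served at memory speed. Plain dicts stand in for the two media tiers.

class TieredStore:
    def __init__(self, flush_threshold=4):
        self.pmem_buf = {}   # stands in for the persistent-memory buffer
        self.ssd = {}        # stands in for the NVMe SSD tier
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.pmem_buf[key] = value          # fast, byte-addressable write
        if len(self.pmem_buf) >= self.flush_threshold:
            self.flush()                    # background task in the real design

    def read(self, key):
        if key in self.pmem_buf:            # recent writes served from PMEM
            return self.pmem_buf[key]
        return self.ssd.get(key)

    def flush(self):
        self.ssd.update(self.pmem_buf)      # drain buffer to the SSD tier
        self.pmem_buf.clear()

store = TieredStore()
for i in range(6):
    store.write("k%d" % i, i)
```

After six writes with a threshold of four, the first four records have been flushed to the SSD tier while the last two are still buffered, yet all six remain readable.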


Persistent Memory + Enterprise-Class Data Services = Big Memory

Charles Fan, CEO and co-founder, MemVerge

Abstract

Data-centric applications such as AI/ML, IoT, analytics and high-performance computing (HPC) need to process petabytes of data with nanosecond latency. This is beyond the current capabilities of in-memory architectures because of DRAM's high cost, limited capacity and lack of persistence. As a result, the growth of in-memory computing has been throttled, with DRAM relegated to only the most performance-critical workloads.

In response, a new category of Big Memory Computing has emerged to expand the market for memory-centric computing. Big Memory Computing is where the new normal is data-centric applications living in byte-addressable, and much lower cost, persistent memory. Big Memory consists of a foundation of DRAM and persistent memory media plus a memory virtualization layer. The virtualization layer allows memory to scale-out massively in a cluster to form memory lakes, and is protected by new memory data services that provide snapshots, replication and lightning fast recovery.

The market is poised to take off with IDC forecasting revenue for persistent memory to grow at an explosive compound annual growth rate of 248% from 2019 to 2023.


Update on the JEDEC DDR5 NVRAM Specification

Bill Gervasi, Principal Systems Architect, Nantero

Abstract

A generation of new non-volatile memories (NVMs) potentially capable of working with, or replacing, SDRAM are in design now. Memory controller designers will want to exploit the advantages of these new memories, such as zero-power standby and the elimination of content refresh. JEDEC is in the process of defining the DDR5 NVRAM specification, the first of the detailed definitions of these new NVMs, as an enablement for systems designers to learn about and be ready for the coming wave.


Challenges and Opportunities as Persistence Moves Up the Memory/Storage Hierarchy

Jim Handy, General Director, Objective Analysis

Thomas Coughlin, President, Coughlin Associates

Abstract

While the storage industry is wrestling with incorporating persistent memory into its system and software designs, even bigger challenges lurk around the corner. Systems include a lot of other memory besides DIMMs. The near future will bring persistent caches, and even persistent registers! As CPUs move to smaller semiconductor processes, MRAM and other emerging memory types will replace on-chip SRAM and completely change the nature of cache memory. From there, it will be a small step to make internal CPU registers persistent. With all of these changes, software can not only be designed to perform better as a result of the persistence, but will also need to be designed in a way that prevents persistence from undermining security and reliability. This presentation, by the authors of an annual research report on emerging memories, will show how and why memory at all levels will become persistent, and will reflect on the problems that must be solved both to use it effectively and to prevent persistence from causing trouble.


Exploring New Storage Paradigms and Opportunities with Persistent Memory Technology

Daniel Waddington, Research Staff Member, IBM

Abstract

Emerging persistent memory technologies, such as sub-microsecond non-volatile DIMMs and in/near-memory compute, are creating new opportunities for application performance improvement by enabling memory-storage convergence. This convergence defines a new paradigm by blurring the boundaries between compute-data and stored-data. In turn, the need to transform and move data (locally or across the network) can be dramatically reduced, leading to orders of magnitude improvement in performance over existing methods.

In this talk, we will explore memory-technology trends and the associated system architecture challenges. We will highlight our current work at IBM Research, known as MCAS, to develop a new converged architecture. MCAS (Memory Centric Active Storage) evolves the conventional key-value paradigm to enable seamless data movement and arbitrary in-place operations on structured data in memory, while also providing traditional storage capabilities such as durability, versioning, replication and encryption.


Is Persistent Memory Persistent?

Terence Kelly

Abstract

Preserving application data integrity is a paramount duty of computing systems. Failures such as power outages are major perils: A sudden crash during an update may corrupt data or effectively destroy it by corrupting metadata. Applications protect data integrity by using update mechanisms that are atomic with respect to failure; such mechanisms promise to restore data to a consistent state following a crash.

Unfortunately, the checkered history of failure-atomic update mechanisms precludes blind trust. Widely used relational databases and key-value stores often fail to uphold their transactionality guarantees [Zheng et al., OSDI '14]. Lower on the stack, durable storage devices may corrupt or destroy data when power is lost [Zheng et al., FAST '13]. Emerging non-volatile memory (NVM) hardware and corresponding failure-atomic update mechanisms strive to avoid repeating the mistakes of earlier technologies, as do software abstractions of persistent memory for conventional hardware [the topic of my SDC 2019 talk]. Healthy skepticism, however, demands firsthand evidence that such systems deliver on their integrity promises.

Prudent developers and operators follow the maxim, "train as you would fight." Software that must tolerate abrupt power failures should demonstrably survive such failures in pre-production tests or "Game Day" failure-injection testing on production systems. In the past, my colleagues and I extensively tested our crash-tolerance mechanisms against power failures, but we did not document the tribal knowledge required to practice this art.

This talk describes the design and implementation of a simple and cost-effective testbed for subjecting applications running on a complete hardware/software stack to repeated sudden whole-system power interruptions. The testbed is affordable, runs unattended indefinitely, and performs a full power-off/on test cycle in one minute. The talk will furthermore present my findings when I used such a testbed to evaluate a crash-tolerance mechanism for persistent memory by subjecting it to over 50,000 power failures. Any software developer can use this type of testbed to evaluate crash-tolerance software before releasing it for production use. Application operators can learn from this talk principles and techniques that they can apply to power-fail testing their production hardware and software.

A peer-reviewed companion paper that covers all of the material in the talk and that provides additional detail will be published prior to the talk; attendees are invited but not required to read the paper before the talk.
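
As a concrete point of reference, one classic failure-atomic update mechanism of the kind discussed above is the write-to-temp/fsync/rename pattern (a textbook technique shown here for illustration, not the specific mechanism the talk evaluates):

```python
import os
import tempfile

# Failure-atomic file update: write the new contents to a temp file, make
# them durable with fsync, then atomically rename over the target. A crash
# at any point leaves either the old contents or the new -- never a mix.

def atomic_write(path, data):
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)      # temp file on the same filesystem
    try:
        os.write(fd, data)
        os.fsync(fd)                       # data durable before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)                   # POSIX rename is atomic
    dirfd = os.open(d, os.O_RDONLY)
    os.fsync(dirfd)                        # make the directory entry durable
    os.close(dirfd)

target = os.path.join(tempfile.mkdtemp(), "state.bin")
atomic_write(target, b"v1")
atomic_write(target, b"v2")
```

A power-fail testbed like the one described can be pointed at exactly this kind of mechanism: after each interruption, the target file must contain either the previous version or the new one in full.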


How can persistent memory make databases faster, and where do we go from here?

Takashi Menjo, Researcher, Nippon Telegraph and Telephone Corporation

Abstract

Persistent memory (PMEM) is a non-volatile memory device that is fast to access by itself. To get the best out of it, however, we need to make software design PMEM-aware. In this presentation, I will talk about my case study on improving the transaction-processing performance of PostgreSQL, an open-source database management system (DBMS). I redesigned the typical two-level transaction-logging architecture, which had consisted of DRAM and a persistent disk such as an HDD or SSD, into a single-level one on PMEM.

Database researchers and engineers have optimized logging architectures on the assumption that persistent disk is slower than main memory and not good at random access. A DBMS has therefore buffered and serialized logs in DRAM and then written them out sequentially to disk. Such a two-level architecture has improved performance.

However, when I used PMEM instead of disk in the two-level architecture, I got worse performance than with the single-level one due to overhead. This is because PMEM is as fast as DRAM and is better at random access than disk. To clarify this, I will present the differences between the designs of the two logging architectures and their performance profiling results.

I also tried to redesign some other components but gave up because there seemed to be limited or little chance of improving performance by changing them. These experiences help show which characteristics are suitable for PMEM and which components are worth making PMEM-aware.
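
The contrast between the two logging designs can be sketched schematically (my simplification for illustration, not PostgreSQL code):

```python
# Two-level logging buffers records in DRAM and writes them out in a
# separate sequential step; single-level logging appends each record
# directly to byte-addressable persistent memory, skipping the extra copy.

class TwoLevelLog:
    def __init__(self):
        self.dram_buf = []     # serialize records here first ...
        self.disk = []         # ... then write them sequentially to disk
    def append(self, record):
        self.dram_buf.append(record)
    def commit(self):          # the extra copy step that hurts on PMEM
        self.disk.extend(self.dram_buf)
        self.dram_buf.clear()

class SingleLevelLog:
    def __init__(self):
        self.pmem = []         # byte-addressable: append in place
    def append(self, record):
        self.pmem.append(record)   # durable immediately, no second copy

two = TwoLevelLog()
one = SingleLevelLog()
for r in ("begin", "update", "commit"):
    two.append(r)
    one.append(r)
two.commit()
```

On slow disks the buffering step pays for itself by turning random writes into sequential ones; on PMEM, which handles random access well, the extra copy is pure overhead, which matches the finding above.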


Accelerate Big Data Workloads with HDFS Persistent Memory Cache

Feilong He, Machine Learning Engineer, Intel Corporation

Jian Zhang, Software Engineer Manager, Intel Corporation

Abstract

The HDFS (Hadoop Distributed File System) cache feature provides a centralized cache mechanism that lets end users simply specify a path in order to cache the corresponding HDFS data. The HDFS cache can provide significant performance benefits for queries and other workloads that frequently access high volumes of data. However, because DRAM is used as the cache medium, the HDFS cache can cause performance regressions for memory-intensive workloads, so its usage is limited, especially in scenarios where memory capacity is insufficient.

To overcome the limitations of the HDFS DRAM cache, we introduced persistent memory as the cache medium. Persistent memory represents a new class of memory-storage technology that offers high performance, high capacity and data persistence at lower cost, which makes it suitable for big data workloads. In this session, attendees will gain technical knowledge of the HDFS cache and learn how to accelerate workloads by leveraging the HDFS persistent memory cache. We will first introduce the architecture of the HDFS persistent memory cache feature, then present performance numbers for micro workloads like DFSIO and industry-standard workloads like TPC-DS. We will show that the HDFS persistent memory cache can bring a 14x performance speedup compared with no HDFS cache and a 6x speedup compared with the HDFS DRAM cache. Thanks to its data persistence, the HDFS persistent memory cache can also help users reduce cache warm-up time when a cluster restarts, which will also be demonstrated. Moreover, we will discuss our future work, such as potential optimizations and HDFS lazy-persistent write-cache support with persistent memory.


RPMP: A Remote Persistent Memory Pool to accelerate data analytics and AI

Jian Zhang, Software Engineer Manager, Intel Corporation

Abstract

Persistent memory represents a new class of memory-storage technology that offers high performance, high capacity and data persistence at lower cost, bridging the performance and cost gap between DRAM and SSDs. There is a broad range of usage scenarios for persistent memory in data analytics and AI workloads; however, remote access to persistent memory poses many challenges to persistent memory applications.

RDMA is an attractive technology for remote memory access. It leverages RDMA-capable network cards to offload data movement from the CPU to each system's network adapter, which improves application performance and utilization and enables applications to take full advantage of persistent memory devices. In this work, we propose an innovative distributed storage system, RPMP, that uses persistent memory as the storage medium with a key-value storage engine, an efficient RDMA-powered network messenger as the network layer, and a consistent hashing algorithm to provide configurable data availability and durability to upper-level applications. RPMP provides low-level key-value APIs that make it suitable for performance-critical applications, and it implements several optimizations, including a circular buffer to improve write performance and a combined persistent-memory/RDMA memory-region technique to improve read performance.

Experimental performance numbers will also be presented: the micro-benchmark performance of the key-value store, as well as decision-support query performance with RPMP as a fully disaggregated shuffle solution in Spark-based data analytics.
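
Consistent hashing, which the abstract mentions for data placement, can be illustrated with a minimal ring (illustrative only; the node names and virtual-node count are invented, and this is not the RPMP implementation):

```python
import bisect
import hashlib

# Each node is hashed onto a ring many times (virtual nodes); a key is
# owned by the first node clockwise from the key's hash. Adding or
# removing a node only remaps the keys adjacent to its positions.

class HashRing:
    def __init__(self, nodes, vnodes=50):
        self.ring = sorted(
            (self._h("%s#%d" % (n, i)), n)
            for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self.keys, self._h(key)) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(["pmem-node-a", "pmem-node-b", "pmem-node-c"])
owner = ring.node_for("object-42")
```

Placement is deterministic, so any client can compute which node owns a key without consulting a central directory.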


Persistent Memory on eADR Systems

Andy Rudoff, Persistent Memory SW Architect, Intel

Abstract

Persistent Memory programming has well known challenges around cache flushing and the SNIA programming model includes the possibility of platforms where the CPU caches are considered persistent and need no flushing.  For Intel platforms this capability is known as eADR.  Andy will describe how eADR works and what additional performance and other features are enabled on eADR systems.

SMB

SMB3 over QUIC - Files Without the VPN

Sudheer Dantuluri, Software Engineer, Microsoft

Abstract

The SMB3 protocol is broadly deployed in enterprise networks and contains strong protections that enable its use more broadly. Historically, however, port 445 has been blocked, and management of SMB servers over TCP across such networks has been slow to emerge. SMB3 is now able to communicate over QUIC, a new Internet-standard transport that is being broadly adopted for web and other application access.

In this talk, we will provide updated details on the SMB3 over QUIC protocol and explore the necessary ecosystem, such as certificate provisioning, firewall and traffic management, and enhancements to SMB server and client configuration.


SMB3 POSIX Extensions Phase 2 ... Now that they are in, what is next?

Steven French, Principal Software Engineer, Microsoft

Abstract

With another year of work on improving the SMB3.1.1 Protocol Extensions for Linux/POSIX and its implementation in servers and the Linux client ... where do we go from here?

For optimal interoperability between Linux clients and NAS appliances, Servers and the Cloud, what features should be added?

Examining workload requirements, new Linux syscalls and Linux functional-test compliance (subtests within the standard "xfstests" suite that fail or are skipped due to missing features) has not only shown areas where the Linux VFS could be improved (e.g., in how temporary files are created and how swapfiles are locked) but also areas where the SMB3.1.1 protocol and its POSIX extensions can be extended. This is an exciting time, with SMB3.1.1 use in Linux increasing to access an ever wider world of storage targets securely and efficiently. This presentation will help explore how to make it even better.


The Future of Accessing Files remotely from Linux: SMB3.1.1 client status update

Steven French, Principal Software Engineer, Microsoft

Abstract

Improvements to the SMB3.1.1 client on Linux have continued at a rapid pace over the past year. These allow Linux to better access Samba servers, as well as the cloud (Azure), NAS appliances, Windows systems, Macs and an ever increasing number of embedded Linux devices, including those using the new SMB3 kernel server for Linux (ksmbd). The SMB3.1.1 client for Linux (cifs.ko) continues to be one of the most actively developed file systems on Linux, and these improvements have made it possible to run additional workloads remotely.

The exciting recent addition of the new kernel server also allows more rapid development and testing of optimizations for Linux.

Over the past year ...

  • performance has dramatically improved with features like multichannel (allowing better parallelization of i/o and also utilization of multiple network devices simultaneously), with much faster encryption and signing, with better use of compounding and improved support for RDMA
  • security has improved and alternative security models are now possible with the addition of modefromsid and idsfromsid and also better integration with Kerberos security tooling
  • new features have been added, including the ability to swap over SMB3 and boot over SMB3
  • quality continues to improve with more work on 'xfstests' and test automation
  • tooling (cifs-utils) continues to be extended to make using SMB3.1.1 mounts easier

This presentation will describe and demonstrate the progress that has been made over the past year in the Linux kernel client in accessing servers using the SMB3.1.1 family of protocols. In addition, recommendations on common configuration choices and troubleshooting techniques will be discussed.


Samba locking architecture

Volker Lendecke, Developer, SerNet GmbH

Abstract

To implement share modes, leases and oplocks in a multi-process environment like Samba, the inter-process communication used to coordinate locking information and state changes must be fast, reliable and understandable. Over time, Samba has moved from System V shared memory and UDP sockets to modern forms of shared memory and mutexes. Samba's implementation of the locking state has several layers:

  • tdb is a raw key/value store
  • dbwrap is an abstraction on tdb, allowing alternative backend k/v stores
  • dbwrap_watch enables processes to monitor changes of values
  • g_lock implements per-record locking for exclusive operations

On top of that stack Samba implements something similar to the MS-SMB2 concept of GlobalOpenTable and other global data structures required to implement SMB. The goal of this talk is to deepen understanding and to allow possible improvements and alternative implementations. For example, right now the tables are implemented only for local access and for clustered access using ctdb. It should be possible to extend this to other implementations of distributed K/V stores. This talk will provide a foundation to extend Samba beyond ctdb.
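
The layering above can be sketched conceptually in Python (not Samba code; the key names, lock granularity and callback shape below are invented for this sketch):

```python
import threading
from collections import defaultdict

# A raw key/value store (tdb-like), per-record locks in the spirit of
# g_lock, and value-change watchers in the spirit of dbwrap_watch.

class Store:
    def __init__(self):
        self.data = {}                            # raw key/value layer
        self.locks = defaultdict(threading.Lock)  # per-record exclusivity
        self.watchers = defaultdict(list)         # change notifications

    def watch(self, key, callback):
        """Register interest in changes to one record's value."""
        self.watchers[key].append(callback)

    def store(self, key, value):
        with self.locks[key]:          # exclusive per-record operation
            self.data[key] = value
            for cb in self.watchers[key]:
                cb(key, value)         # wake processes waiting on this record

events = []
s = Store()
s.watch("share_mode:inode1", lambda k, v: events.append((k, v)))
s.store("share_mode:inode1", "lease-granted")
```

In Samba the equivalents are separate processes coordinating through shared memory rather than threads in one process, but the division of responsibilities between the layers is the same.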

Solid State Storage Solutions

Enabling Ethernet Drives

Mark Carlson, Principal Architect, Kioxia

Abstract

The SNIA has a new standard that enables SSDs to have an Ethernet interface. The Native NVMe-oF Drive specification defines pin-outs for common SSD connectors, enabling these drives to plug into common platforms such as eBOFs (Ethernet Bunches of Flash).

This talk will also discuss the latest Management standards for NVMe-oF drives. Developers will learn about how to program and use these new types of drives.


An SSD for Automotive Applications

Bill Gervasi, Principal Systems Architect, Nantero

Abstract

The next generation of cars will require significant improvements in data management. Information for drive management, sensors, network connectivity, high resolution displays, security, in-car entertainment, and other sources will require a rethinking of storage devices. This presentation details efforts under way in JEDEC to define a new Automotive SSD standard to address these new requirements.

Storage Architecture

Quantum computing

Brian Eccles, Principal Analyst, IBM

Abstract

Quantum computing is arriving. With quantum computers we can tackle problems in entirely new ways. AI/ML, simulations, optimization and modelling physical processes are areas where quantum computing may show its earliest impact. In this session we'll cover what quantum computing is, why it is so different from classical computing, why it is important, applications, early usage, resources available to developers, and how developers can get started with access to real quantum systems, for free, by the end of the day.

We'll also look at the potential impact of quantum computing on today's encryption approaches, implications for long-term storage, and work already underway to safeguard long term data retention and future data storage in a post quantum world.


When Highly Available storage systems are not: real life is not necessarily what you'd expect!

Jody Glider, Principal Storage Architect, Cloud Architecture and Engineering, SAP

Abstract

A dominant architecture for storage arrays has been some variant of an HA-pair where a set of drives is connected to two separate data paths (aka controllers). Generally these systems have been designed to continue to provide service after any single hardware failure, yet experience shows that single faults have caused disruption in data service...and not for the reasons you might think! This talk describes analysis performed on a set of storage service disruptions over a period of two years, points out some common patterns, shares some thoughts about possible improvements, and most of all asks for help in contemplating what improvements will lead to even better reliability in storage service.


Why do customers need media path redundancy in a storage array with simple low-latency hardware paths?

Mahmoud Jibbe, Technical Director, NetApp

Joey Parnell, Sr Architect, NetApp

Abstract

Many emerging white-box storage systems replace mature, robust drive networks with simple point-to-point drive-port topologies. Such a topology provides fault tolerance in that each storage controller has a dedicated port to a drive, but no hardware path to that drive's other port(s). Redundant access to media is instead provided via Non-Transparent Bridging (NTB), Ethernet, or other networks to overcome the lack of redundancy in the topology.


Deep Compression at Inline Speed for All-Flash Array

Chris Mao, Principal Engineer, Pure Storage

Abstract

The rapid improvement in overall $/GByte has driven high-performance All-Flash Arrays to be increasingly adopted in both enterprise and cloud datacenters. Besides raw NAND density scaling with continued semiconductor process improvement, data-reduction techniques have played and will continue to play a crucial role in further reducing the overall effective cost of All-Flash Arrays.

One of the key data-reduction techniques is compression. Compression can be performed both inline and offline; in fact, the best All-Flash Arrays often do both: fast inline compression at a lower compression ratio, and slower, opportunistic offline deep compression at a significantly higher compression ratio. However, with the rapid growth of both capacity and sustained throughput due to the consolidation of workloads on a shared All-Flash Array platform, a growing percentage of the data never gets the opportunity for deep compression.

There is a deceptively simple solution: Inline Deep Compression with the additional benefits of reduced flash wear and networking load. The challenge, however, is the prohibitive amount of CPU cycles required. Deep compression often requires 10x or more CPU cycles than typical fast inline compression. Even worse, the challenge will continue to grow: CPU performance scaling has slowed down significantly (breakdown of Dennard scaling), but the performance of All-Flash Array has been growing at a far greater pace.

In this talk, I will explain how we can meet this challenge with a domain-specific hardware design. The hardware platform is a programmable FPGA-based PCIe card. It can sustain 5+ GByte/s of deep-compression throughput with low latency even for small data-block sizes, by exploiting the TByte/s of internal bandwidth, sub-10 ns latencies, and almost unlimited parallelism available on a modern mid-range FPGA device. The hardware compression algorithm is trained on the vast amount of data available to our systems. Our benchmarks show it can match or outperform some of the best software compressors on the market without taxing the CPU.
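
The fast-versus-deep trade-off can be illustrated in software with zlib's compression levels standing in for fast inline (level 1) and deep (level 9) compression (the talk's FPGA design is, of course, not zlib):

```python
import zlib

# A highly redundant sample payload; real array data reduces less, but the
# shape of the trade-off is the same: deep compression spends far more CPU
# cycles for a better ratio.

data = b"storage developer conference " * 2000

fast = zlib.compress(data, 1)   # fewer CPU cycles, larger output
deep = zlib.compress(data, 9)   # many more cycles, better ratio

fast_ratio = len(data) / len(fast)
deep_ratio = len(data) / len(deep)
```

The argument above is that a hardware engine can deliver the level-9-style ratio at line rate, so no data has to settle for the fast path.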

Accelerate artificial intelligence IoT use cases with storage tiering and shared storage at the edge

Joey Parnell, Sr Architect, NetApp

Mahmoud Jibbe, Technical Director, NetApp

Abstract

Transmitting data from Internet-of-Things (IoT) edge devices to core data centers to run resource-intensive artificial intelligence (AI) use cases is costly in terms of network bandwidth and latency. Alternatively, placing hardware resources such as GPUs at the edge to perform those operations locally and reduce network congestion and latency can be prohibitive in cost and power requirements.

Through use of shared storage presented by the IoT device, more compute-intensive iterative refinement training can be performed either in the core data center or in cloud analytics platforms, and updated AI inference data transmitted back to IoT devices to provide customized training to improve reliability and reduce false positive rates.

By adding flash storage and extending the shared storage between IoT devices, distributed applications can divvy up work to idle or under-utilized devices, store the results locally, and send metadata to the originating device about where to read the results. This allows an IoT device to coordinate and complete AI tasks that may exceed the computing capabilities or the latency requirements of the singular device and minimizes the data traveling to and from data centers and clouds.

Finally, critical IoT data that is transmitted to core data centers can be either regularly archived to cloud or protected by disaster recovery solutions in the cloud. Minimize risk without significantly increasing cost by selectively using cloud resources. Store data on premises and mount that data as a target from the cloud via a gateway to perform analytics and transmit back only the results to the data center, or archive data for long term retention.

  1. Satisfy real-time IoT AI use cases with higher fidelity and without significantly increasing cost by providing shared storage in the IoT device to allow other devices to perform work and remotely update the local inference models.
  2. Adding a layer of flash to IoT devices for data tiering provides the capability to perform AI tasks in a distributed fashion and utilize idle resources.
  3. Protect valuable IoT data at lower cost by selectively using cloud resources for archive and disaster recovery solutions.

Data Centers Need To Become Data-Centric

Pradeep Sindhu, Co-Founder & CEO, Fungible

Abstract

In today’s hyper-connected world, data centers are processing more data than ever before. Technologies like machine learning and artificial intelligence have enabled us to gain deep insights from the data generated by online shopping, social media, location services, and more. As these technologies grow, so does the demand for these data insights in real time. With Moore’s Law reaching its limits, it is clear that the scale-out infrastructure powering our data centers isn’t keeping up with today’s demands. This presentation will discuss how the current, compute-centric approach to data centers is no longer viable and will dive into why data centers must pivot to a data-centric model. This model will supercharge data centers, allowing them to handle the increasing demands of the 21st century.


SmartNIC composable framework for flexible system partition

Remy Gauguey, Sr Software Architect - Data Center Business Unit, Kalray

Abstract

The variety of architectures, use cases and workloads to be managed by data center appliances is increasing, driving a need for more and more flexibility in how systems are partitioned. This paper describes the architecture of a modular framework relying on standard modules and APIs such as VirtIO or SPDK bdev, and leveraging the parallelism of many-core processors. Mixing networking, storage and RDMA services, and taking advantage of hardware features such as SR-IOV, it allows for building efficient and compact SmartNICs combining a 200GbE, PCIe Gen4 fast path with many offloading and value-added services. This SmartNIC architecture is a key enabler for many applications, including bare-metal cloud, software-defined networking using Open vSwitch, and advanced storage I/O servers.


Using Block Translation and Atomic Updates to Optimize All Flash Array Storage

Douglas Dumitru, CTO, EasyCo LLC

Abstract

Block translation is often used to linearize writes. This is especially useful for Flash SSDs as well as SMR hard disks. It likewise improves write performance in parity-based arrays.

Block translation has an additional capability that can be exploited to create whole new storage paradigms. Block translation lets you construct efficient atomic updates that let you use nearly 100% of your arrays’ write bandwidth to store actual, useful, data. Atomic linearization opens an array of possibilities. Here are some examples that have actually been implemented:

  • Write a file system that lets you create hundreds of thousands of small files in less than a second, from a single thread, with greater than 90% space efficiency.
  • Export blocks to other file systems or a SAN with near array speed writes regardless of workload block size or pattern.
  • Reduce data with in-line compression and de-duplication without suffering a massive memory footprint or glacial performance.

The magic which makes such things possible is efficient, linear, generation based, atomic writes. Instead of scattering data, metadata, allocation bitmaps, and journal elements across the array, this solution writes elements together as part of a linear, atomic, string.

  • Linear IO: Each atomic write maintains absolute linearity and optimal alignment at the SSD and the array level.
  • Inline: Each atomic write contains the actual, live, data. There are no additional copies or journals.
  • Merge-able: Each atomic write can be appended with new data. Thousands of transactions can be combined, eliminating intermediate updates for maximum bandwidth and space efficiency.
  • Flexible: Atomic updates can include any collection of data blocks or even the absence of blocks. You are not limited in what you can put into a transaction.
  • Variable Block Size: The blocks that you can store can contain from 16 bytes to 1 megabyte or more of payload for a single LBA. You can map your storage structure directly to blocks, even if the structure has variable sized elements.
  • Safe: All IO is validated end-to-end with CPU-assisted checksums.
  • Optimized for Bandwidth: The update structure allows for hundreds of megabytes in a single IO “write segment”. Big data can finally move at device speed.

This structure optimizes not only performance but also cost. SSDs are used with ideal write workloads, lowering both wear and cost. Linear writes and alignment mean that erasure codes out-perform mirroring and reach theoretical device speed. One user summed it up well: "We write faster than we read."
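The merging of many transactions into a single linear, checksummed write can be sketched as follows. This is a minimal illustration of the idea, assuming a hypothetical segment layout (generation counter, length, payload, CRC); it is not the vendor's actual on-disk format.

```python
import zlib

def build_write_segment(transactions, generation):
    """Merge many small transactions into one linear, checksummed write
    segment (illustrative layout, not the product's actual format)."""
    payload = b"".join(transactions)
    # Header: 8-byte generation number plus 8-byte payload length.
    header = generation.to_bytes(8, "little") + len(payload).to_bytes(8, "little")
    # End-to-end validation via a trailing checksum over header and payload.
    checksum = zlib.crc32(header + payload).to_bytes(4, "little")
    # One large sequential write replaces many scattered metadata/journal IOs.
    return header + payload + checksum

# Data, metadata, and allocation updates travel together in a single IO:
segment = build_write_segment([b"inode", b"bitmap", b"data-block"], generation=42)
```

Because each segment is self-describing and generation-tagged, the most recent complete segment found during recovery defines a consistent state, with no separate journal replay.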


SmartNIC composable framework for flexible system partition

Remy Gauguey, Sr Software Architect - Data Center Business Unit, Kalray

Abstract

The variety of architectures, use-cases and workloads to be managed by Data Center appliances is increasing. It is driving a need for more and more flexibility in the system partition. This paper describes the architecture of a modular framework relying on standard modules and APIs such as Virtio or the SPDK bdev layer, and leveraging the parallelism of manycore processors. Mixing networking, storage or RDMA services, and taking advantage of hardware features such as SR-IOV, it allows for building efficient and compact SmartNICs combining a 200GE, PCIe Gen4 fast path with many offloading and value-added services. This SmartNIC architecture is a key enabler for many applications including Bare Metal Cloud, Software-Defined Network using OpenVSwitch, or advanced storage I/O servers.


Storage Networking

A QUIC Introduction

Lars Eggert, Technical Director, NetApp

Abstract

QUIC is a new UDP-based transport protocol for the Internet, and specifically, the web. Originally designed and deployed by Google, it already makes up 35% of Google's egress traffic, which corresponds to about 7% of all Internet traffic. The strong interest by many other large Internet players in the ongoing IETF standardization of QUIC is likely to lead to an even greater deployment in the near future. This talk will highlight:

  • Unique design aspects of QUIC
  • Differences to the conventional HTTP/TLS/TCP web stack
  • Early performance numbers
  • Potential side effects of a broader deployment of QUIC

Smart Fabrics: Building Self-Healing Fibre Channel Networks

Brandon Hoff, Director, Product Management, Fibre Channel Industry Association, Broadcom Inc.

Rupin Mohan, Director R&D, CTO SAN, HPE

Abstract

IT administrators are faced with a surge in digital demands while at the same time being overloaded with issue isolation and troubleshooting performance problems. Given their demanding workload, wasted time becomes a stumbling block for the digital businesses they support. These administrators are being judged by a new set of rules: accelerate IT delivery and increase focus on digital transformation. Fabric Notifications, a new solution from the INCITS T11 Committee, enables hosts and Fibre Channel Fabrics to collaborate to identify and remediate events that cause performance problems on storage area networks. Today, the lossless, low-latency, high-performance storage connectivity that Fibre Channel delivers makes it the trusted technology for enterprise customers and a majority of networked block storage. Fabric Notifications builds on the benefits of Fibre Channel by sharing information between the Fabric and Hosts, enabling the Fabric and Hosts to collaborate and remediate performance problems. This session will discuss what Fabric Notifications are, why they are important, the benefits of freeing up an IT administrator’s time, and how developers can take advantage of Fabric Notifications in their products.


Understanding Compute Express Link: A Cache-coherent Interconnect

Debendra Das Sharma, Intel Fellow; Director of I/O Technology and Standards Group, Intel

Abstract

Compute Express Link™ (CXL™) is an industry-supported cache-coherent interconnect for processors, memory expansion, and accelerators.

Datacenter architectures are evolving to support the workloads of emerging applications in Artificial Intelligence and Machine Learning that require a high-speed, low latency, cache-coherent interconnect. The CXL specification delivers breakthrough performance, while leveraging PCI Express® technology to support rapid adoption. It addresses resource sharing and cache coherency to improve performance, reduce software stack complexity, and lower overall systems costs, allowing users to focus on target workloads.

Attendees will learn how CXL technology maintains a unified, coherent memory space between the CPU (host processor) and CXL devices allowing the device to expose its memory as coherent in the platform and allowing the device to directly cache coherent memory. This allows both the CPU and device to share resources for higher performance and reduced software stack complexity. In CXL, the CPU host is primarily responsible for coherency management abstracting peer device caches and CPU caches. The resulting simplified coherence model reduces the device cost, complexity and overhead traditionally associated with coherency across an I/O link.

Storage Performance / Workloads

Realistic Synthetic Data at scale: Influenced by, but not production data

Mehul Sheth, Senior Performance Engineer, Druva Data Solutions Pvt. Ltd.

Abstract

To have high confidence in a product, testing it against a data set that resembles production data is a must. The challenge is generating test data that represents production. Production data is not predictable; it does not follow a simple formula, and many variables characterize it. Broadly, test data can be divided into two categories: Arbitrary, which is random and unstructured, and Realistic, which follows patterns and is predictable and controlled. To generate Realistic test data, the right patterns need to be captured by analyzing existing production data. However, access to production data is regulated and not easy to obtain. The approach, therefore, is to implement code that reads relevant data from production without exposing the actual data, and that updates the models used to generate test data, so that the generated test data represents production data in selected dimensions, as directed by the business of the product under test.

In this session Mehul Sheth will talk about Druva's journey in generating test data at scale that is highly influenced by production data and carries the "genes" of production data, yet takes not a single byte "as-is" from production. Although Druva's journey and the decisions taken may be unique and not directly applicable in all scenarios, the session will present the thought process, algorithms and decisions in a generic fashion, focusing on the ability to assess the model and tweak it to include edge conditions while remaining realistic, applicable at all times, versatile, repeatable and easily controllable.

Specifically, the session describes a process for modeling a directory tree with files and folders using various variables (such as file size, the number of files and folders in each folder at each depth, patterns in file and folder names, the ratio of different file types, and other variables) that may be important for the application under test. It then shows how to apply this model to generate file-sets of different sizes with completely random data, while maintaining the relations between the modeled variables. Datasets thus generated are random in raw format but maintain the characteristics of the model, and can be used for performance / stress testing anti-virus software, legal discovery software or backup software. Extending the concept further, it can be used to model any data and metadata, such as mailboxes or transactional databases.
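The modeling process described above can be sketched in a few lines. This is a simplified illustration under assumed model parameters (the variable names and counts are hypothetical, not Druva's actual model): structure and sizes follow the model, while the payload bytes are purely random.

```python
import os
import random
import tempfile

# Hypothetical model: per-level folder/file counts, a file-size range,
# and an extension mix -- the kind of variables captured from production.
MODEL = {"depth": 2, "folders_per_level": 2, "files_per_folder": 3,
         "size_range": (512, 4096), "extensions": [".txt", ".jpg"]}

def generate_tree(root, model, depth=0, rng=None):
    """Emit a random file-set that preserves the modeled structure
    (counts, sizes, name patterns) without copying any production bytes."""
    rng = rng or random.Random(7)  # seeded for repeatability
    for i in range(model["files_per_folder"]):
        ext = rng.choice(model["extensions"])
        size = rng.randint(*model["size_range"])
        with open(os.path.join(root, f"file_{depth}_{i}{ext}"), "wb") as f:
            f.write(rng.randbytes(size))  # random payload, realistic size
    if depth < model["depth"]:
        for i in range(model["folders_per_level"]):
            sub = os.path.join(root, f"dir_{depth}_{i}")
            os.mkdir(sub)
            generate_tree(sub, model, depth + 1, rng)

root = tempfile.mkdtemp()
generate_tree(root, MODEL)
```

Because the generator is seeded and model-driven, the same model produces repeatable file-sets, and edge conditions can be injected by tweaking the model rather than the code.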


End To End Data Placement For Zoned Block Devices

Marc Acosta, Research Fellow, Western Digital Corporation

Abstract

End to End (E2E) Data Placement or intelligent placement of data onto media requires coordination between Applications, File System, and Zoned Block devices (ZBDs). If done correctly, E2E Data Placement with ZBDs will significantly reduce storage costs and improve application performance.

The talk will walk through state of the art database systems and define their data placement characteristics with the associated storage cost. Next, we discuss how E2E data placement can use the concept of a file to determine data associativity and efficiently store the file as zones on ZBDs. We will cover crucial ZBD metrics and present examples of how applications and file systems can be modified to be ZBD friendly. Methods to estimate the gains in throughput and storage cost reduction using E2E data placement and Zone Block devices will also be shown.

The attendees should leave the talk understanding how E2E data placement changes the role of Zoned Block Devices from storing LBAs to storing files, and how, by strategically mapping files and their data as zones, one gains device capacity and reduces storage costs while improving both the throughput and latency of a storage solution.

Storage Resource Management

SNIA Swordfish™ Overview and Deep Dive

Richelle Ahlvers, Board of Directors, SNIA; Storage Management Software Architect, Broadcom Inc.

Abstract

If you’ve heard about the SNIA Swordfish open industry storage management standard specification but are looking for a deeper understanding of its value and functionality, this presentation is for you. The speaker will provide a broad look at Swordfish and describe the RESTful methods and JSON schema variants developed by SNIA’s Scalable Storage Management Technical Work Group (SSM TWG) and the Redfish Forum.


What’s New in SNIA Swordfish™

Richelle Ahlvers, Board of Directors, SNIA; Storage Management Software Architect, Broadcom Inc.

Abstract

If you haven’t caught the new wave in storage management, it’s time to dive in and catch up on the latest developments of the SNIA Swordfish™ specification. These include:

  • Adding support to map NVMe and NVMe-oF to Redfish and Swordfish
  • A new document for implementers with guidance on error reporting and status code usage
  • New mockups on swordfishmockups.com showing more possible deployment permutations
  • Development of Swordfish CTP
  • ISO Standardization
  • Schema enhancements and simplifications: Moving /Storage to the Service Root
  • Tools ecosystem enhancements: Learn about all the new tools to help with everything from mockup self-validation to protocol checking.

How to Increase Demand for Your Products with the Swordfish Conformance Test Program

Richelle Ahlvers, Board of Directors, SNIA; Storage Management Software Architect, Broadcom Inc.

Abstract

New this year, the SNIA Swordfish Conformance Test Program allows manufacturers to test their products with a vendor-neutral test suite to validate conformance to the SNIA Swordfish specification.

Swordfish implementations that have passed CTP are posted on the SNIA website; this information is available to help ease integration concerns of storage developers and increase demand for available Swordfish products.

This session will provide an overview of the program, including the functionality and base requirements implementations need in order to pass the initial version of Swordfish CTP. It will also cover the program's features, additional benefits, and how to participate.


NVMe and NVMe-oF Configuration and Manageability with Swordfish and Redfish

Rajalaxmi Angadi, Senior Software Developer, Intel Corporation

Krishnakumar Gowravaram, Senior Technical Leader and Architect, Cisco

Abstract

The SNIA Swordfish specification is currently growing to include full NVMe and NVMe-oF enablement and alignment across DMTF, NVMe, and SNIA for NVMe and NVMe-oF use cases. This presentation will provide an overview of the work in progress to map these standards together to ensure NVMe and NVMe-oF environments can be represented entirely in Swordfish and Redfish environments.


Zero to Swordfish Implementation Using Open Source Tools

Don Deel, SMI GB Chair, SNIA; Senior Standards Technologist, NetApp

Chris Lionetti, Board of Directors, SNIA; Senior Technical Marketing Engineer, HPE

Abstract

SNIA’s Storage Management Initiative sponsored the initial development of open source software tools that can help developers start working with Swordfish. These tools are available in open repositories that are managed by the SNIA Scalable Storage Management Technical Working Group on GitHub.

This session will walk through the tools you can use to go from zero to working SNIA Swordfish implementations. Starting from generating, validating and using static mockups, using the emulator to make your mockups “come alive,” and then verifying your Swordfish service outputs match your expectations using open source validation tools; the same tools that feed into the Swordfish Conformance Test Program.
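A first step on that path, validating a static mockup, can be sketched in pure Python. The resource below is a hypothetical minimal example of the kind of mockup the tools consume, and the check covers only an illustrative subset of what the real validation tools verify.

```python
import json

# A minimal static mockup resource, of the kind the Swordfish tools consume.
mockup = json.loads("""{
  "@odata.id": "/redfish/v1/Storage/1",
  "@odata.type": "#Storage.v1_9_0.Storage",
  "Id": "1",
  "Name": "Local Storage"
}""")

def check_resource(resource):
    """Toy self-validation pass: report properties a Redfish/Swordfish
    resource is expected to carry (illustrative subset of the real checks)."""
    required = ("@odata.id", "@odata.type", "Id", "Name")
    return [k for k in required if k not in resource]

print(check_resource(mockup))  # an empty list means this toy check passed
```

The same mockup files can then be served by the emulator and compared against a live Swordfish service's output.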


Migrating OEM Extensions to Swordfish for Scalable Storage Management

Krishnakumar Gowravaram, Senior Technical Leader and Architect, Cisco

Abstract

Before the release of the SNIA Swordfish™ v1.1.0 specification, direct attach server vendors trying to accomplish scalable or complex storage management with the DMTF Redfish® standard had to use OEM extensions to extend the limited storage management functionality that Redfish provides. Redfish is designed to manage converged, hybrid IT and the software-defined data center.

During this presentation, the speaker from Cisco will provide an overview of the company’s existing storage management solution using Redfish storage and OEM extensions. The speaker will also discuss Cisco’s implementation experience to-date that consists of planning the migration of its OEM storage management Redfish extensions to the standards-based schema in the v1.1.0 SNIA Swordfish specification.


Redfish Ecosystem for Storage

Jeff Hilland, President, DMTF; Distinguished Technologist, HPE

Abstract

DMTF’s Redfish® is a standard API designed to deliver simple and secure management for converged, hybrid IT and the Software Defined Data Center (SDDC).

This presentation will provide an overview of DMTF’s Redfish standard. It will also provide an overview of HPE’s implementation of Redfish, focusing on its storage implementation and needs.

HPE will provide insights into the benefits and challenges of the Redfish Storage model, including areas where functionality added to SNIA™ Swordfish is of interest for future releases.

Zone

Zoned Namespaces (ZNS) SSDs: Disrupting the Storage Industry

Matias Bjørling, Director, Emerging System Architectures, Western Digital Corporation

Abstract

Zoned Namespaces (ZNS) SSDs implement a new Command Set in NVMe™ that exposes a zoned block storage interface between the host and the SSD, allowing the SSD to align data perfectly to its media. As a result, an SSD can expose more storage capacity (+20%), reduce SSD write amplification (4-5x), and improve I/O access latencies.

This talk introduces the Zoned Namespaces Command Set, which defines a new type of namespace (Zoned Namespaces) and the associated Zone Storage Model, optimized for SSDs. We show specific use cases where ZNS applies, and how to take advantage of it and use it in your applications.
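As a hedged illustration of the host-side interface, a ZNS namespace can be inspected and managed with a recent nvme-cli build. The device path is an example, and the commands assume a ZNS-capable drive:

```shell
# Identify the zoned namespace: zone size, max open/active zones, etc.
nvme zns id-ns /dev/nvme0n1 -H

# List all zones with their state and write pointers.
nvme zns report-zones /dev/nvme0n1

# Reset the zone starting at LBA 0, rewinding its write pointer.
nvme zns reset-zone /dev/nvme0n1 -s 0
```

Applications then write sequentially within each zone (or use Zone Append), which is what lets the SSD place data without a device-side translation layer.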


Reviving The QEMU NVMe Device (from Zero to ZNS)

Klaus Jensen, Staff Software Engineer, Samsung Electronics

Abstract

The QEMU NVMe device allows developers to test host software against an emulated and easily inspectable PCIe device implementing NVMe. Unfortunately development and addition of new features has mostly stagnated since its original inclusion in the QEMU project.

This talk will explore how development of the device is being revived by the addition of NVMe v1.3 and v1.4 mandatory support, as well as various optional features such as multiple namespaces, DULBE, end-to-end data protection and upcoming NVMe technical proposals.

We will discuss how the tracing and debugging features of the device can be used to validate host software and testing frameworks and how the extensibility of the device allows rapid prototyping of new NVMe features. Specifically we will explore a full implementation of Zoned Namespaces and how this support is used to develop and verify host software.
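For orientation, a guest with an emulated NVMe zoned namespace can be launched roughly as follows. This is a sketch assuming a QEMU version with the `nvme-ns` zoned parameters (5.2 or later); the image size and zone geometry are illustrative:

```shell
# Create a backing image for the emulated namespace.
qemu-img create -f raw zns.img 1G

# Attach an NVMe controller and a zoned namespace to the guest.
qemu-system-x86_64 \
  -drive file=zns.img,id=nvm,format=raw,if=none \
  -device nvme,serial=deadbeef,id=nvme0 \
  -device nvme-ns,drive=nvm,bus=nvme0,zoned=true,zoned.zone_size=64M,zoned.max_open=16
```

Inside the guest, the namespace then appears as a ZNS device against which host software and testing frameworks can be exercised, with QEMU tracing available on the host side.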


Zoned Block Device Support in Hadoop HDFS

Shin'ichiro Kawasaki, Principal Engineer, Western Digital Corporation

Abstract

Zoned storage devices are a class of block devices with an address space that is divided into zones which, unlike on regular storage devices, can only be written sequentially. The most common form of zoned storage today is the Shingled Magnetic Recording (SMR) HDD. This type of disk allows higher capacities without a significant increase in device manufacturing cost, thereby resulting in overall storage cost reductions.

Support for zoned block devices (ZBD) was introduced in Linux with kernel version 4.10. This support provides an interface for user applications to manipulate the zones of a zoned device and also guarantees that writes issued sequentially will be delivered to the disk in the same order, thereby meeting the device's sequential write constraint.
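That kernel interface is exposed to user space through the `blkzone` utility from util-linux, among other paths. The commands below are a sketch; the device name is an example and assumes a zoned (e.g. host-managed SMR) disk:

```shell
# Confirm the kernel sees the disk as zoned ("host-managed" or "host-aware").
cat /sys/block/sdb/queue/zoned

# List zones with their start offsets, write pointers and conditions.
blkzone report /dev/sdb

# Rewind the write pointer of the zone at sector offset 0.
blkzone reset --offset 0 /dev/sdb
```

These are the same zone-management primitives that a file system or application with native ZBD support, such as the HDFS work described here, builds upon.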

Hadoop HDFS is a well known distributed file system with high scalability properties, making it an ideal choice for big data computing applications. HDFS is designed for large data sets written mostly sequentially with a streaming-like access pattern. This characteristic is ideal for zoned device support, facilitating direct access to the device from HDFS rather than relying on an underlying local file system with ZBD support, an approach that potentially has higher overhead due to the file system garbage collection activity.

This talk introduces a candidate implementation of ZBD support in Hadoop HDFS based on the simple Linux zonefs file system. This file system exposes the zones of a zoned device as files. HDFS data blocks are themselves stored in zonefs files. Symbolic links reference the zonefs files in HDFS block file directory structure. File I/Os unique to zonefs files are encapsulated with a new I/O provider. The presentation will give an overview of this implementation and discuss performance results, comparing the performance of HDFS without any modification using a ZBD compliant local file systems (f2fs and btrfs) with the performance obtained with the direct access zonefs approach. The benefits in terms of lower software complexity of this latter approach will also be addressed.


zonefs: Mapping POSIX File System Interface to Raw Zoned Block Device Accesses

Damien Le Moal, Director, Western Digital Corporation

Abstract

The zonefs file system is a simple file system that exposes the zones of a zoned block device (host-managed or host-aware SMR hard-disks and NVMe Zoned Namespace SSDs) as files, hiding from the application most zoned block device zone management and access constraints. The zonefs file system is intended as a simple solution for use cases where raw block device accesses from the application have been considered a better solution.

This talk will present zonefs features, with a focus on how the rich POSIX file system-call interface is used to seamlessly implement the execution of zoned block device specific operations. In particular, the talk will cover zonefs changes to seamlessly accommodate the new NVMe Zoned Namespace (ZNS) specification, such as the number of active zones, the time zones can remain in the active state, and the new Zone Append command. The talk will conclude with an example use of zonefs with the key-value store application LevelDB, showing the advantages in terms of code simplicity over raw block device file accesses. Performance results with LevelDB as well as with synthetic benchmarks will also be shown.
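In outline, using zonefs looks as follows. This is a sketch assuming zonefs-tools and a kernel with zonefs support (5.6 or later); the device name and mount point are examples:

```shell
# Format the zoned device for zonefs (one file per zone will be exposed).
mkzonefs /dev/nvme0n2

# Mount it; conventional zones appear under cnv/, sequential zones under seq/.
mount -t zonefs /dev/nvme0n2 /mnt
ls /mnt/seq

# Sequential zone files accept only append writes, mirroring the device
# constraint; truncating a file to size 0 corresponds to a zone reset.
```

An application such as LevelDB can thus use ordinary file system calls while the sequential-write rules of the underlying zones are enforced through POSIX semantics.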


High-performance SMR drives with dm-zoned caching

Hannes Reinecke, Kernel Storage Architect, SUSE Software Solutions

Abstract

SMR drives have a very demanding programming model, requiring the host software to format write requests within very strict limits. This typically imposes a performance penalty when writing to SMR drives, such that the nominal performance is hard to achieve.

The existing dm-zoned device-mapper target implements internal caching using random zones; while this allows unmodified host software to run on SMR drives, the performance impact is even more severe.

In this talk I will present an update to dm-zoned that extends the current implementation to use additional drives, either as a cache device or as additional zoned devices. This makes it possible to saturate the SMR drives without having to modify the host software.

By using a fast cache device like NV-DIMM one can easily scale the dm-zoned device across several SMR drives, presenting tens of terabytes to the application with near-native NV-DIMM speeds.

I will present the design principles of this extension and provide a short demo showing the improvements.
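For context, setting up a multi-device dm-zoned target with the `dmzadm` tool from dm-zoned-tools looks roughly like this. This is a sketch only: it assumes the dm-zoned v2 multi-device support (kernel 5.8 or later), and the device names and ordering are illustrative, with the fast regular device listed as the cache:

```shell
# Format metadata across the cache device and the SMR drive.
dmzadm --format /dev/pmem0 /dev/sda

# Start the target: random writes land on pmem0, sequential data on sda,
# and a single conventional block device is exposed to the host.
dmzadm --start /dev/pmem0 /dev/sda
```

The resulting device can then be used by any unmodified file system or application, with the cache device absorbing the random-write workload that SMR zones cannot take directly.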


FC-Encryption at wirespeed

Hannes Reinecke, Kernel Storage Architect, SUSE Software Solutions

Abstract

FC SANs are deployed in over 90% of Fortune 1000 customer data centers that run mission-critical storage workloads. Ever-increasing threat vectors and tightening regulation are driving customers in healthcare, banking and defense to better secure their storage networks. While FC SAN encryption as defined in FC-SP-2 is well defined and stable, an implementation has long been missing due to the complexity and the required encryption performance, making hardware encryption offload the best option.

But a hardware implementation requires complex infrastructure on the OS side to allow for the necessary key handling and negotiation.

These mutual dependencies have long prevented any usable implementation.

In this talk Marvell and SUSE will present a combined solution that offloads encryption into the hardware, with the infrastructure in place for key handling via a strongSwan adaptation in the SUSE Linux Enterprise OS.

With this setup we are able to achieve near line speed with encrypted FC traffic, with all key management functionality mandated by the specification.