Abstract
One of the most common benchmarks in the storage industry is 4KiB random read I/Os per second. Over the years, the industry first saw the publication of 1M I/Ops on a single box, then 1M I/Ops on a single thread (by SPDK). More recently, there have been publications outlining 10M I/Ops on a single box using high-performance NVMe devices and more than 100 CPU cores. This talk will present a benchmark of SPDK performing more than 10 million random 4KiB read operations per second from a single thread to 20 NVMe devices, a large advance over the industry's state of the art. SPDK has developed a number of novel techniques to reach this level of performance, which will be outlined in detail here. These techniques include polling, advanced MMIO doorbell batching strategies, PCIe and DDIO considerations, careful management of the CPU cache, and the use of non-temporal CPU instructions. This will be a low-level talk with real examples of eliminating data-dependent loads, profiling last-level cache misses, prefetching, and more. Additionally, there remain a number of techniques that have not yet been employed and that warrant future research. These techniques often push devices outside of their originally intended operating mode while remaining within the bounds of the specification, and so often require collaboration between NVMe controller and device designers, the NVMe specification body, and software developers such as the SPDK team.
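
As a rough illustration of the doorbell batching idea named above, the following C sketch models a single submission queue in ordinary memory and counts how many doorbell writes are needed to publish a set of commands. It is a simplified model under stated assumptions, not SPDK's implementation: `sim_sq`, `sq_submit`, `sq_ring_doorbell`, `run_batched`, and `SQ_DEPTH` are hypothetical names, the "doorbell" is a plain variable rather than a real MMIO register, and completions are never consumed, so the number of commands must stay below the queue depth.

```c
#include <assert.h>
#include <stdint.h>

#define SQ_DEPTH 128u  /* hypothetical queue depth */

/* Hypothetical model of one submission queue; not SPDK's actual structures. */
struct sim_sq {
    uint32_t entries[SQ_DEPTH];  /* stand-in for 64-byte NVMe commands */
    uint32_t shadow_tail;        /* tail index tracked in ordinary memory */
    uint32_t last_rung_tail;     /* tail value last written to the doorbell */
    volatile uint32_t doorbell;  /* stand-in for the MMIO doorbell register */
    uint64_t doorbell_writes;    /* number of simulated MMIO writes */
};

/* Queue a command. This only touches ordinary memory, never the doorbell. */
static void sq_submit(struct sim_sq *sq, uint32_t cmd)
{
    sq->entries[sq->shadow_tail] = cmd;
    sq->shadow_tail = (sq->shadow_tail + 1u) % SQ_DEPTH;
}

/* Publish every queued command with one doorbell write, if any are pending. */
static void sq_ring_doorbell(struct sim_sq *sq)
{
    if (sq->shadow_tail == sq->last_rung_tail) {
        return;  /* nothing new to publish */
    }
    sq->doorbell = sq->shadow_tail;
    sq->last_rung_tail = sq->shadow_tail;
    sq->doorbell_writes++;
}

/*
 * Submit n_cmds commands, ringing the doorbell once per `batch` commands,
 * and return how many doorbell writes were issued in total.
 */
static uint64_t run_batched(uint32_t n_cmds, uint32_t batch)
{
    struct sim_sq sq = {0};

    /* The model never frees entries, so it must not wrap the ring. */
    assert(batch > 0 && n_cmds < SQ_DEPTH);

    for (uint32_t i = 0; i < n_cmds; i++) {
        sq_submit(&sq, i);
        if ((i + 1u) % batch == 0) {
            sq_ring_doorbell(&sq);
        }
    }
    sq_ring_doorbell(&sq);  /* flush any partial final batch */
    return sq.doorbell_writes;
}
```

In this model, calling `run_batched(64, 1)` issues one doorbell write per command, while `run_batched(64, 16)` publishes the same 64 commands with four writes. On real hardware each doorbell write is an uncached MMIO store across PCIe, which is why reducing their number can matter; the right batch size depends on the workload and the device.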