

# Architectural Principles for Networked Solid State Storage Access – Part 2





Doug Voigt, Chair, NVM Programming Model, SNIA Technical Council; Distinguished Technologist, HPE

J Metz, SNIA Board of Directors; R&D Engineer, Cisco













- The material contained in this presentation is copyrighted by the SNIA unless otherwise noted.
- Member companies and individual members may use this material in presentations and literature under the following conditions:
  - Any slide or slides used must be reproduced in their entirety without modification
  - The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
- This presentation is a project of the SNIA.
- Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.
- The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.

#### NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.



#### Technical and market dynamics are creating complexity

- Permutations of technologies and positioning
- Intricacies of new technology integration
- Foundational principles have not changed
  - Application views of memory and storage
  - The role of data access time (latency) in system architecture

#### Principles transcend details

- This presentation uses principles to guide detailed analysis
- This presentation does not report benchmark results
- On-demand webcast: Architectural Principles for Networked Solid State Storage Access – Part 1 <u>https://www.brighttalk.com/webcast/663/203821</u>

# **Times are changing**



#### Storage access times are shrinking

- Emerging persistent memory (PM) technologies
- Faster than flash

## Interconnects are getting faster

- Bandwidth
- Latency

## Creating challenges for software

- Software stacks are starting to dominate latency
- Trigger for a fundamental architecture shift



- Application views of Persistent Memory technology
- Why latency determines application view
- Latency budget analysis
- Latency and system scale





- IO: protocol used to access storage
- Poll: repeated reading of IO state to detect completion
- Context Switch: allows other processes to use a core
- Load/Store (Ld/St): CPU instructions that access memory
- Non-Uniform Memory Access (NUMA): describes a memory system that exhibits a significant range of latencies due to underlying technology or scale



# **Application View**



## IO

- Data is read or written using RAM buffers
- Software has control over how to wait (context switch or poll)
- Status is explicitly checked by software
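The two ways of waiting in the IO view can be combined in one helper. The sketch below is illustrative only: `wait_for_io`, its `poll_budget_s` parameter, and the use of a Python `Future` to stand in for IO completion state are assumptions, not part of the presentation.

```python
import time
from concurrent.futures import Future


def wait_for_io(completion: Future, poll_budget_s: float = 2e-6):
    """Poll the completion state briefly, then fall back to a blocking wait.

    Hypothetical helper: `completion` stands in for the IO state that
    software must check explicitly.
    """
    deadline = time.perf_counter() + poll_budget_s
    while time.perf_counter() < deadline:
        if completion.done():  # explicit status check by software
            return completion.result(), "polled"
    # Blocking lets the OS context switch to other work while we wait.
    return completion.result(), "blocked"


done = Future()
done.set_result(b"data")
print(wait_for_io(done, poll_budget_s=1.0))
```

The poll budget plays the role of the polling threshold: below it the thread spins, above it the core is handed back to the scheduler.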

## Ld/St

- Data is loaded into or stored from processor registers
- Software is forced by processor to wait for data during instruction
- No status checking; errors generate exceptions















- Does not require (long term) power to retain its contents
- Can be accessed using Ld/St/Mov instructions
- Without too much loss of processor throughput
  - When is it OK to force the CPU memory access pipeline to wait for a storage or memory access to complete?
    - May pause a thread in the middle of executing an instruction
    - May block reads and writes to all cores

# Pipeline Stall Wastes Processor Throughput



- Each core's instruction pipeline has a limited number of stages (5 shown here)
- If a memory access (column M, time t<sub>6</sub>) takes longer than the pipeline is designed for, it stalls (t<sub>7</sub>-t<sub>9</sub>) so processing power is wasted.
- If this happens a lot the processor can grind to a halt
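A rough way to see how quickly stalls erode throughput is to count useful versus wasted cycles. This is a minimal sketch under simplifying assumptions (an in-order pipeline that completes one instruction per cycle when not stalled, pipeline fill ignored); `pipeline_efficiency` and the sample numbers are hypothetical.

```python
def pipeline_efficiency(instructions: int, stalls: int, stall_cycles: int) -> float:
    """Fraction of cycles spent on useful work in an idealized pipeline."""
    useful = instructions            # one instruction completes per unstalled cycle
    wasted = stalls * stall_cycles   # cycles lost waiting on slow memory accesses
    return useful / (useful + wasted)


# One 3-cycle stall (like t7-t9 on the slide) per 100 instructions.
print(pipeline_efficiency(100, stalls=1, stall_cycles=3))
# Frequent, long stalls: 30 accesses per 100 instructions, 300 cycles each.
print(pipeline_efficiency(100, stalls=30, stall_cycles=300))
```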



#### NUMA Systems are designed to mitigate these effects for bounded memory latencies

#### Why latency determines application view

|                | Pro               | Con                      | Acceptable<br>Latency |
|----------------|-------------------|--------------------------|-----------------------|
| Ld/St          | Lowest overhead   | Stalls pipelines if slow | NUMA                  |
| Poll           | Moderate overhead | Consumes one thread      | < ~2 uS               |
| Context Switch | High overhead     | Free while blocked       | > ~2 uS               |

- The acceptable upper bound of NUMA latency depends on processor architecture and application instruction mix
- The acceptable upper bound for polling depends on processor specific context switch time
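The table can be read as a simple decision rule. The sketch below assumes the ~2 uS context switch threshold from the table; `numa_bound_us` is a hypothetical parameter because the presentation gives no number for the NUMA upper bound, and the value passed in the usage lines is only a placeholder.

```python
def application_view(latency_us: float, numa_bound_us: float,
                     context_switch_us: float = 2.0) -> str:
    """Pick the access method whose acceptable latency range contains latency_us."""
    if latency_us <= numa_bound_us:
        return "Ld/St"            # lowest overhead, but stalls the pipeline
    if latency_us < context_switch_us:
        return "Poll"             # moderate overhead, consumes one thread
    return "Context Switch"       # high overhead, core is free while blocked


# numa_bound_us=0.3 is an illustrative placeholder, not a figure from the slides.
for latency in (0.1, 1.0, 54.0):
    print(latency, application_view(latency, numa_bound_us=0.3))
```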







# **Latency Budgets**





- Interconnect hops
- Media
- Host
- Queueing throughput

# **Serial Interconnect Hop Latency**



#### Considerations

- Speed of Light: .3 m/nS. (e.g. 2m = 7 nS) + 10-100nS for SERDES pair
- Data Transfer: Xmit bit rate \* (Headers + Payload) \* Encoding Derating
- Port: SERDES + 0-100 nS
- Switching, Routing: 0-∞
- "Typical" switch latency examples
  - PCIe: >= 20 nS
  - IB: >= 90 nS
  - Ethernet: >= 300 nS

#### 1 Interconnect hop = 2 \* Ports (>= 10 nS per pair) + <#Switches> \* Switch (>= 20 nS) + distance / (.3 m/nS) (or more)
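The hop formula can be evaluated directly. This sketch reads the port term literally as 2 times a per-port figure, which is one possible interpretation; `hop_latency_ns` and its parameter names are hypothetical, and the defaults are the lower bounds quoted on this slide.

```python
def hop_latency_ns(distance_m: float, switches: int = 1,
                   port_ns: float = 10.0, switch_ns: float = 20.0,
                   m_per_ns: float = 0.3) -> float:
    """Lower-bound latency of one interconnect hop, in nS."""
    ports = 2 * port_ns               # SERDES/port cost at both ends
    switching = switches * switch_ns  # per-switch latency (20 nS is the PCIe example)
    flight = distance_m / m_per_ns    # propagation at .3 m/nS
    return ports + switching + flight


print(hop_latency_ns(2.0))                  # 2 m, one PCIe-class switch
print(hop_latency_ns(2.0, switch_ns=300.0)) # same hop through an Ethernet-class switch
```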

# **Media Latency**



#### Considerations

- Command/Response HW: 10-1000 nS
- Driver Software, Interrupt Response: 0-20 uS
- Translation/Virtualization: 0-100 uS
- Seek/Select/Enable: 10 nS (DRAM), 10+ uS (Flash), 1+ mS (HDD)
- Data Transfer: Media bit rate \* (Headers + Payload) \* Encoding

## "Typical" Examples (single threaded)

- 20 nS DRAM
- 70 uS Flash
- 1 mS HDD



#### Host Considerations

- PCIe: Treat as additional hops for modular RNIC/HBAs
- Driver Software: 1-50 uS
- Queue Considerations
  - Latency = 1/(Service Rate - Arrival Rate)
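The queue consideration can be made concrete by reading it as the standard single-server result, latency = 1/(service rate - arrival rate). This is a minimal sketch; `queue_latency_s` and the sample rates are hypothetical.

```python
def queue_latency_s(service_rate: float, arrival_rate: float) -> float:
    """Mean time in a single-server queue, in seconds (rates in operations/second)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


# A device that can service 100k IO/s, at increasing load.
for load in (0.5, 0.9, 0.99):
    latency_us = queue_latency_s(100_000, load * 100_000) * 1e6
    print(f"{load:.0%} load: {latency_us:.0f} uS")
```

Latency grows without bound as the arrival rate approaches the service rate, which is why the budgets that follow state their load assumption.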





# Latency budget templates



## Examples provided

- IO
- RDMA and Remote Persistence
- Scale out memory

## Numerical disclaimer:

- Template latency examples are from the building block slides above.
- The main purpose is to enable engineers to determine where their networked PM implementations fall in the latency disruption chart.
- Constant innovation and tuning in components and systems continues to drive latency down.

# Latency template for IO





1. Application sets up buffer, command
2. Application sends command to SSD RAM
3. SSD SW processes IO
4. SSD accesses media, RAM data buffer (read)
5. SSD sends data, response
6. Application receives response

# IO latency budget in uS\*



|        | Host      | Network   | Device (SSD) | Total |
|--------|-----------|-----------|--------------|-------|
| 1      | 50        | 1.2 - 1.6 |              |       |
| 2      |           | .1 - .3   |              |       |
| 3      |           |           | 100          |       |
| 4      |           | 1.1 - 1.3 |              |       |
| 5      |           |           | see above    |       |
| 6      | see above |           |              |       |
| Totals | 50        | 1.2 - 1.6 | 100          | 152   |

- 1K data at 1 GBy/s
- 1 Switch
- Moderate load
- All units in uS

© 2017 Storage Networking Industry Association. All Rights Reserved.

\*Rough Approximations
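The totals in the IO budget can be checked by adding ranges component by component. This sketch assumes the step 2 network entry reads .1 - .3 uS and uses the step 4 entry of 1.1 - 1.3 uS; `add_ranges` is a hypothetical helper.

```python
def add_ranges(*ranges):
    """Sum (low, high) latency ranges in uS; a fixed value n is written (n, n)."""
    low = sum(lo for lo, _ in ranges)
    high = sum(hi for _, hi in ranges)
    return low, high


host = (50.0, 50.0)                            # driver software
network = add_ranges((0.1, 0.3), (1.1, 1.3))   # command hop + 1K data hop
device = (100.0, 100.0)                        # SSD software and media
total = add_ranges(host, network, device)
print(network, total)
```

The sum lands just under the 152 uS shown in the table, which is consistent with the figures being rough approximations.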




1. Application establishes RDMA connection during mMap
2. Application executes 1 or more RDMA writes during flush
3. Application executes RDMA send to force remote flush
4. Application receives response from remote flush

For more on RDMA see: https://www.brighttalk.com/webcast/663/185909



|        | Host  | Network   | Device<br>(PM) | Device<br>(CPU) | Total |
|--------|-------|-----------|----------------|-----------------|-------|
| 1      | NA    | NA        | NA             |                 |       |
| 2      | 10    | 1.1 - 1.3 | 2              |                 |       |
| 3      | 20    | .1 - .3   | above          | 20              |       |
| 4      | above | .1 - .3   |                | above           |       |
| Totals | 30    | 1.3 - 1.9 | 2              | 20              | 54    |

- 1K data at 1 GBy/s
- 1 Switch
- Single RDMA write
- All units in uS
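One way to see where a network entry of 1.1 - 1.3 uS for the RDMA write could come from is to add the wire time for 1K of data to a small-message hop. This decomposition is an inference from the numbers, not something the presentation states; `network_time_us`, the 1 GByte/s reading of the data rate, and the .1 - .3 uS hop range are assumptions.

```python
def network_time_us(payload_bytes: int, rate_bytes_per_s: float = 1e9,
                    hop_us=(0.1, 0.3)):
    """(low, high) network time in uS: serialization plus one switched hop."""
    transfer_us = payload_bytes / rate_bytes_per_s * 1e6  # time on the wire
    return transfer_us + hop_us[0], transfer_us + hop_us[1]


low, high = network_time_us(1024)   # 1K RDMA write
print(round(low, 1), round(high, 1))
```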

# Gen-Z: A New Data Access Technology

#### High Bandwidth, Low Latency

- Memory semantics: simple Reads and Writes
- From tens to several hundred GB/s of bandwidth
- Sub-100 ns load-to-use memory latency

#### Advanced Workloads & Technologies

- Real time analytics
- Enables data centric and hybrid computing
- Scalable memory pools for in memory applications
- Abstracts media interface from SoC to unlock new media innovation

#### Secure, Compatible, Economical

- Provides end-to-end secure connectivity from node level to rack scale
- Supports unmodified OS for SW compatibility
- Graduated implementation from simple, low cost to highly capable and robust
- Leverages high-volume IEEE physical layers and broad, deep industry ecosystem



# Latency Template for Scale Out Memory (e.g. Gen-Z)



1. Application uses St instruction to write to remote memory

For more on RDMA see: https://www.brighttalk.com/webcast/663/185909



|              | Host | Network | NVDIMM | Total |
|--------------|------|---------|--------|-------|
| 1            | .01  | .2      | .1     |       |
| Per St Total | .01  | .2      | .1     |       |
| 16 Lines     | .16  | 3.2     | 1.6    | 5     |

- 64 By cache line per St at 1 GBy/s
- 1 Switch
- Consider using mov or put/get
- 16 St's for 1K
- All units in uS
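The 16-line row follows from the per-store figures if the stores are assumed to complete one after another. A minimal sketch under that assumption; `store_budget_us` is a hypothetical helper and the per-store values are the table's rough approximations.

```python
CACHE_LINE_BYTES = 64


def store_budget_us(payload_bytes: int, per_store_us=(0.01, 0.2, 0.1)):
    """Return (stores, per-component totals, overall total) for serialized stores.

    per_store_us is (host, network, NVDIMM) latency for one cache-line store.
    """
    stores = -(-payload_bytes // CACHE_LINE_BYTES)   # ceiling division
    per_component = tuple(stores * c for c in per_store_us)
    return stores, per_component, sum(per_component)


stores, components, total = store_budget_us(1024)
print(stores, components, round(total, 2))
```

The overall total comes out just under the 5 uS shown in the table, with the network term dominating.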

# Latency contributors due to system scale

#### Transmission Distance

- Realistically, communication between adjacent racks requires about 5 meters of optical cable.
- At .3 m/nS this takes about 17 nS.

#### Switch/Router hops

- Switches appear at blade, chassis and rack levels. Each contributes 100-300 nS of latency, totaling .5-1.5 uS.
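Combining the two scale contributors shows which one dominates. In this sketch the count of five switch traversals is an inference from the .5-1.5 uS total at 100-300 nS each, not a figure stated in the presentation; `scale_latency_ns` and its parameters are hypothetical.

```python
def scale_latency_ns(cable_m: float = 5.0, switch_hops: int = 5,
                     per_switch_ns=(100.0, 300.0), m_per_ns: float = 0.3):
    """(low, high) added latency in nS from cable length and switch traversals."""
    flight = cable_m / m_per_ns   # about 17 nS for 5 m
    low = flight + switch_hops * per_switch_ns[0]
    high = flight + switch_hops * per_switch_ns[1]
    return low, high


low, high = scale_latency_ns()
print(f"{low / 1000:.2f} - {high / 1000:.2f} uS")
```

Under these assumptions the cable contributes only a few percent of the added latency; the switch traversals account for nearly all of it.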



# Adding it all up

#### Ld/St Inhibitors

- Networks
- Greater than rack scale memory fabric
- Media technology mismatch

#### High Ld/St latency pain depends on workload

- PM access size
- PM access mix
- Flush





# **Other Helpful Resources**



#### On-demand Webcasts:

- Architectural Principles for Networked Solid State Storage Access – Part 1 <u>https://www.brighttalk.com/webcast/663/203821</u>
- Everything You Wanted to Know about Storage But Were too Proud to Ask: Part Teal – Buffers, Queues & Caches <u>https://www.brighttalk.com/webcast/663/241275</u>
- Storage Performance Benchmarking: Solution Under Test <u>https://www.brighttalk.com/webcast/663/164335</u>
- SNIA NVM Programming Model: <u>http://www.snia.org/sites/default/files/technical\_work/final/NVMProgrammingModel\_v1.1.pdf</u>
- SNIA PM White Papers: <u>https://www.snia.org/education/whitepapers</u>



- Please rate this webcast and provide us with feedback
- This webcast and a PDF of the slides will be posted to the SNIA Ethernet Storage Forum (ESF) website and available on-demand at www.snia.org/forums/esf/knowledge/webcasts
- A full Q&A from this webcast, including answers to questions we couldn't get to today, will be posted to the SNIA-ESF blog: <u>sniaesfblog.org</u>
- Follow us on Twitter @SNIAESF



# Thank you!