Erik Smith

Apr 9, 2025

Have you ever wondered how RDMA (Remote Direct Memory Access) actually works? You’re not alone. That’s why the SNIA Data, Storage & Networking Community (DSN) hosted the live webinar “Everything You Wanted to Know About RDMA But Were Too Proud to Ask,” where our expert presenters, Michal Kalderon and Rohan Mehta, explained how RDMA works and the essential role it plays in AI/ML workloads thanks to its high-speed, low-latency data transfer. The presentation is available on demand, along with the webinar slides, in the SNIA Educational Library.

The live audience was not “too proud” to ask questions, and our speakers have graciously answered all of them here. 

Q: Does the DMA chip ever reside on the newer CPUs or are they a separate chip?

A: Early system designs used an external DMA controller. Modern systems have moved to an integrated DMA controller design. There is a lot of information about the evolution of the DMA controller available online; the Wikipedia article on DMA is a good place to start.

Q: Slide 51: For RoCEv2, are you routing within the rack, across the data center, or off site as well?

A: RoCEv2 operates over Layer 3 networks and typically requires a lossless network environment, achieved through mechanisms like PFC, DCQCN, and the additional congestion control methods discussed in the webinar. These mechanisms are easier to implement and manage within controlled network domains, such as those found in data centers. While RoCEv2 can theoretically span multiple data centers, this is not commonly done due to the complexity of maintaining lossless conditions across such distances.

Q: You said that WQEs have opcodes. What do they specify and how are they defined?

A: The WQE opcodes specify the actual RDMA operations referred to on slides 16 and 30: SEND, RECV, WRITE, READ, and ATOMIC. For each of these operations there are additional fields that can be set.
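To make the relationship between a work request, its opcode, and the opcode-specific fields concrete, here is a minimal sketch using the libibverbs API. It is not taken from the webinar; the helper name is hypothetical and the connection setup and error handling are omitted.

    /* Minimal sketch: the opcode field of the work request selects the RDMA
     * operation; opcode-specific fields (here the remote address and RKEY for
     * a WRITE) are filled in alongside it. qp, mr, remote_addr and rkey are
     * assumed to have been set up already. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *local_buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,   /* local buffer described by one SGE */
            .length = len,
            .lkey   = mr->lkey,               /* local key from memory registration */
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 1;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* or IBV_WR_SEND, IBV_WR_RDMA_READ,
                                                         IBV_WR_ATOMIC_CMP_AND_SWP, ... */
        wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion (CQE) */
        wr.wr.rdma.remote_addr = remote_addr;         /* extra fields for this opcode */
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);       /* builds the WQE on the send queue */
    }

A SEND uses the same structure but leaves the wr.rdma fields unset, while the atomic opcodes fill in wr.atomic instead.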

Q: Is there a mistake on slide 27 with QP-3 on the receive side (should they be labeled QP-1, QP-2, and QP-3), or am I misunderstanding something?

A: Correct, good catch. The updated deck is here.

Q: Is the latency deterministic after connection?

A: Similar to all network protocols, RDMA is subject to factors such as network congestion, interfering data packets, and other network conditions that can affect latency.

Q: How does the buffering work if a server sends data and the client is unable to receive all of it due to buffer size limitations?

A: We will split the answer for the two groups of RDMA operations.

  • (a) Channel semantics (Send and Recv): The application must make sure that the receiver has posted an RQ buffer before performing the send operation. This is typically done by the client side posting RQ buffers before initiating a connection request. If the send arrives and there is no RQ buffer posted, the packets are dropped and an RNR NAK message is sent to the sender. There is a configurable QP parameter called rnr_retry_cnt, which specifies how many times the RNIC should try resending messages if it gets an RNR NAK (see the sketch after this list).
  • (b) Memory semantics (Read, Write, Atomic): This cannot occur, since the data is placed in pre-registered memory at a specific location.
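As an illustration of both points, here is a minimal sketch using the libibverbs API (the helper names and surrounding setup are assumptions, not from the webinar): the receiver pre-posts an RQ buffer so an incoming SEND has somewhere to land, and the rnr_retry attribute is set when the QP is moved to the ready-to-send state.

    /* (a) Post a receive buffer before the peer is allowed to SEND. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_recv_buffer(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
        struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad_wr = NULL;
        return ibv_post_recv(qp, &wr, &bad_wr);
    }

    /* (b) When transitioning the QP to RTS, rnr_retry is the retry count applied
     * after an RNR NAK (the value 7 means "retry indefinitely"). */
    int move_to_rts(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTS;
        attr.timeout       = 14;
        attr.retry_cnt     = 7;
        attr.rnr_retry     = 7;     /* the rnr_retry_cnt mentioned above */
        attr.sq_psn        = 0;
        attr.max_rd_atomic = 1;
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                             IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
    }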

Q: How/where are the packet headers handled?

A: The packet headers are handled by the R-NIC (RDMA NIC). 

Q: How is Object (S3) over RDMA implemented at a high level? Does it still involve HTTP/S?

A: There are proprietary solutions that provide Object (S3) over RDMA, but there is currently no industry standard available that could be used to create non-vendor specific implementations that would be interoperable with one another.    

Q: How many packets per second can a single x86 core transport compared to a single ARM server core over TCP/IP?

A: When measuring RDMA performance, you measure messages per second rather than packets per second; unlike TCP, there is no per-packet processing in the host. The performance depends more on the R-NIC than on the host core, since RDMA bypasses CPU processing. If you’d like a performance comparison between RDMA (RoCE) and TCP, please refer to the “NVMe-oF: Looking Beyond Performance Hero Numbers” webinar.

Q: Could you clarify the reason for higher latency using interrupt method?

A: It depends upon the mode of operation:

Polling Mode: In polling mode, the CPU continuously checks (or "polls") the completion queue for new events. This constant checking eliminates the delay associated with waiting for an interrupt signal, leading to faster detection and handling of events. However, this comes at the cost of higher CPU utilization since the CPU is always active, even when there are no events to process.

Interrupt Mode: In interrupt mode, the CPU is notified of new events via interrupts. When an event occurs, an interrupt signal is sent to the CPU, which then stops its current task to handle the event. This method is more efficient in terms of CPU usage because the CPU can perform other tasks while waiting for events. However, the process of generating, delivering, and handling interrupts introduces additional latency compared to the immediate response of polling.
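For readers who want to see the difference in code, here is a minimal sketch of both modes using libibverbs; the helper names are hypothetical and error handling is omitted.

    #include <infiniband/verbs.h>

    /* Polling mode: spin on the completion queue; lowest latency, one core kept busy. */
    int wait_polling(struct ibv_cq *cq, struct ibv_wc *wc)
    {
        int n;
        do {
            n = ibv_poll_cq(cq, 1, wc);        /* returns 0 while the CQ is empty */
        } while (n == 0);
        return n;                              /* 1 on success, negative on error */
    }

    /* Interrupt (event) mode: arm the CQ, block on the completion channel, then
     * drain. The CPU is free in the meantime, at the cost of interrupt latency. */
    int wait_event(struct ibv_comp_channel *ch, struct ibv_cq *cq, struct ibv_wc *wc)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;

        ibv_req_notify_cq(cq, 0);                  /* arm: request a completion event */
        ibv_get_cq_event(ch, &ev_cq, &ev_ctx);     /* blocks until the interrupt path fires */
        ibv_ack_cq_events(ev_cq, 1);
        ibv_req_notify_cq(ev_cq, 0);               /* re-arm before draining */
        return ibv_poll_cq(ev_cq, 1, wc);          /* still poll once to retrieve the CQE */
    }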

Q: Slide 58: Does RDMA complement or compete with CXL?

A: They are not directly related. RDMA is a protocol used to perform remote DMA operations (i.e., over a network of some kind), while CXL is used to provide high-speed, coherent communication within a single system. CXL.mem allows devices to access memory directly within a single system or a small, tightly coupled group of systems. If this question were specific to DMA, as opposed to RDMA, the answer would be slightly different.

Q: The RDMA demonstration had MTU size on the RDMA NIC set to 1K. Does RDMA traffic benefit from setting the MTU size to a larger setting (3k-9k MTU size) or is that really dependent on the amount of traffic the RoCE application generates over the RDMA NIC?

A: RDMA traffic, like other protocols such as TCP, can benefit from a larger MTU setting, which reduces packet-processing overhead and can improve throughput.
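For context, the RDMA path MTU is a QP attribute chosen when the connection is set up. The sketch below (libibverbs, RoCE-style global addressing; the helper and parameter names are assumptions) shows where it is selected; the chosen value must not exceed what the NIC, the switches, and the underlying Ethernet MTU actually support.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Move a reliable-connected QP to RTR, picking the path MTU for the connection. */
    int move_to_rtr(struct ibv_qp *qp, uint32_t dest_qpn,
                    union ibv_gid *dgid, uint8_t sgid_index)
    {
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.qp_state               = IBV_QPS_RTR;
        attr.path_mtu               = IBV_MTU_4096;  /* larger MTU -> fewer packets per message */
        attr.dest_qp_num            = dest_qpn;
        attr.rq_psn                 = 0;
        attr.max_dest_rd_atomic     = 1;
        attr.min_rnr_timer          = 12;
        attr.ah_attr.is_global      = 1;             /* RoCE uses GID-based (global) addressing */
        attr.ah_attr.grh.dgid       = *dgid;
        attr.ah_attr.grh.sgid_index = sgid_index;
        attr.ah_attr.grh.hop_limit  = 1;
        attr.ah_attr.port_num       = 1;
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
                             IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
    }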

Q: When an app sends a read/write request, how does it get the remote side’s RKEY info? Also, is it possible to tweak the LKEY to point to the same buffer for debugging memory-related issues? Fairly new to the topic, so apologies in advance if any query doesn’t make sense.

A: The RKEYs can be exchanged using channel semantics, where SEND/RECV are used. In that case no RKEY is needed, as the message will arrive in the first buffer posted to the RQ on the peer. LKEY refers to local memory. For every registered memory region, the LKEY and RKEY point to the same location: the LKEY is used by the local RNIC to access the memory, and the RKEY is provided to the remote application so it can access that memory.
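To illustrate, here is a minimal sketch of memory registration with libibverbs (hypothetical helper, no error handling): a single registration produces both keys for the same buffer, and the application then sends the rkey and buffer address to the peer, typically via a SEND/RECV exchange or some out-of-band channel.

    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);
        if (mr)
            printf("lkey=0x%x (used by the local RNIC)  rkey=0x%x (handed to the peer)\n",
                   mr->lkey, mr->rkey);
        return mr;    /* both keys describe the same registered region */
    }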

Q: What does SGE stand for?

A: Scatter Gather Element. An element inside an SGL – Scatter Gather List. Used to reference non-contiguous memory.
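As a small illustration (libibverbs, hypothetical helper), the sketch below gathers one message from two non-contiguous registered buffers by passing an SGL with two SGEs in a single work request.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_gathered_send(struct ibv_qp *qp,
                           struct ibv_mr *hdr_mr,  void *hdr,  uint32_t hdr_len,
                           struct ibv_mr *data_mr, void *data, uint32_t data_len)
    {
        struct ibv_sge sgl[2] = {
            { .addr = (uintptr_t)hdr,  .length = hdr_len,  .lkey = hdr_mr->lkey  },
            { .addr = (uintptr_t)data, .length = data_len, .lkey = data_mr->lkey },
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.sg_list    = sgl;                  /* the SGL: a list of SGEs */
        wr.num_sge    = 2;                    /* two non-contiguous pieces, one WQE */
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;
        return ibv_post_send(qp, &wr, &bad_wr);
    }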

Thanks to our audience for all these great questions. We encourage you to join us at future SNIA DSN webinars. Follow us @SNIA and on LinkedIn for upcoming webinars. This webinar was part of the “Everything You Wanted to Know But Were Too Proud to Ask” SNIA webinar series. If you found the information in this RDMA webinar helpful, I encourage you to check out the many other ones we have produced. They are all available here on the SNIAVideo YouTube Channel.   


Beyond NVMe-oF Performance Hero Numbers

Erik Smith

Jan 28, 2021

When it comes to selecting the right NVMe over Fabrics™ (NVMe-oF™) solution, one should look beyond test results that demonstrate NVMe-oF’s dramatic reduction in latency and consider the other, more important, questions such as “How does the transport really impact application performance?” and “How does the transport holistically fit into my environment?”

To date, the focus has been on specialized fabrics like RDMA (e.g., RoCE) because it provides the lowest possible latency, as well as Fibre Channel because it is generally considered to be the most reliable.  However, with the introduction of NVMe-oF/TCP this conversation must be expanded to also include considerations regarding scale, cost, and operations. That’s why the SNIA Networking Storage Forum (NSF) is hosting a webcast series that will dive into answering these questions beyond the standard answer “it depends.”

The first in this series will be on March 25, 2021: “NVMe-oF: Looking Beyond Performance Hero Numbers,” where SNIA experts with deep NVMe and fabric technology expertise will discuss the thought process you can use to determine the pros and cons of a fabric for your environment, including:

  • Use cases driving fabric choices  
  • NVMe transports and their strengths
  • Industry dynamics driving adoption
  • Considerations for scale, security, and efficiency

Future webcasts will dive deeper and cover operating and managing NVMe-oF, discovery automation, and securing NVMe-oF. I hope you will register today. Our expert panel will be available on March 25th to answer your questions.

Tim Lustig

Jan 15, 2020

We kicked off our 2020 webcast program by diving into how the Storage Performance Development Kit (SPDK) fits in the NVMe landscape. Our SPDK experts, Jim Harris and Ben Walker, did an outstanding job presenting on this topic. In fact, their webcast, "Where Does SPDK Fit in the NVMe-oF Landscape," received a 4.9 rating on a scale of 1-5 from the live audience. If you missed the webcast, I highly encourage you to watch it on-demand. We had some great questions from the attendees and here are answers to them all:

Q. Which CPU architectures does SPDK support?

A. SPDK supports x86, ARM and Power CPU architectures.

Q. Are there plans to extend SPDK support to additional architectures?

A. If someone has interest in using SPDK on additional architectures, they may develop the necessary SPDK patches and submit them for review. Please note that SPDK relies on the Data Plane Development Kit (DPDK) for some aspects of CPU architecture support, so DPDK patches would also be required.

Q. Will SPDK NVMe-oF support QUIC? What advantages does it have compared to RDMA and TCP transports?

A. SPDK currently has implementations for all of the transports that are part of the NVMe and related specifications – RDMA, TCP and Fibre Channel (target only). If NVMe added QUIC (a new UDP-based transport protocol for the Internet) as a new transport, SPDK would likely add support. QUIC could be a more efficient transport than TCP, since it is a reliable transport based on multiplexed connections over UDP. On that note, the SNIA Networking Storage Forum will be hosting a webcast on April 2, 2020, "QUIC – Will it Replace TCP/IP?" You can register for it here.

Q. How do I map a locally attached NVMe SSD to an NVMe-oF subsystem?

A. Use the bdev_nvme_attach_controller RPC to create SPDK block devices for the NVMe namespaces. You can then attach those block devices to an existing subsystem using the nvmf_subsystem_add_ns RPC. You can find additional details on SPDK nvmf RPCs here.

Q. How can I present a regular file as a block device over NVMe-oF?

A. Use the bdev_aio_create RPC to create an SPDK block device for the desired file. You can then attach this block device to an existing subsystem using the nvmf_subsystem_add_ns RPC. You can find additional details on SPDK nvmf RPCs here.

Author of NVMe™/TCP Spec Answers Your Questions

J Metz

Mar 27, 2019

900 people have already watched our SNIA Networking Storage Forum webcast, What NVMe™/TCP Means for Networked Storage? where Sagi Grimberg, lead author of the NVMe/TCP specification, and J Metz, Board Member for SNIA, explained what NVMe/TCP is all about. If you haven’t seen the webcast yet, check it out on-demand.

Like any new technology, there’s no shortage of areas for potential confusion or questions. In this FAQ blog, we try to clear up both.

Q. Who is responsible for updating NVMe Host Driver?

A. We assume you are referring to the Linux host driver (independent OS software vendors are responsible for developing their own drivers). Like any device driver and/or subsystem in Linux, the responsibility of maintenance is on the maintainer(s) listed under the MAINTAINERS file. The responsibility of contributing is shared by all the community members.

Q. What is the realistic timeframe to see a commercially available NVMe over TCP driver for targets? Is one year from now (2020) fair?

A. Even this year commercial products are coming to market. The work started even before the spec was fully ratified, but now that it has been, we expect wider NVMe/TCP support to become available.

Q. Does NVMe/TCP work with 400GbE infrastructure?

A. As of this writing, there is no reason to believe that upper layer protocols such as NVMe/TCP will not work with faster Ethernet physical layers like 400GbE.

Q. Why is the NVMe CQ in the controller and not on the host?

A. The example that was shown in the webcast assumed that the fabrics controller had an NVMe backend. So the controller backend NVMe device had a local completion queue, and on the host sat the “transport completion queue” (in the NVMe/TCP case this is the TCP stream itself).

Q. So, SQ and CQ streams run asynchronously from each other, with variable ordering depending on the I/O latency of a request?

A. Correct. For a given NVMe/TCP connection, stream delivery is in-order, but commands and completions can arrive (and be processed by the NVMe controller) in any order.

Q. What TCP ports are used? Since we have many NVMe queues, I bet we need a lot of TCP ports.

A. Each NVMe queue will consume a unique source TCP port. Common NVMe host implementations will create a number of NVMe queues in the same order of magnitude as the number of CPU cores.

Q. What is the max size of Data PDU supported? Are there any restrictions on parallel writes?

A. The maximum size of an H2CData PDU (MAXH2CDATA) is negotiated and can be as large as 4GB. It is recommended that it be no less than 4096 bytes.

Q. Is immediate data negotiated between host and target?

A. The in-capsule data size (IOCCSZ) is negotiated at the NVMe level. In NVMe/TCP the admin queue command capsule size is 8K by default. In addition, the maximum size of the H2CData PDU is negotiated during connection initialization.

Q. Is NVMe/TCP hardware infrastructure cost lower?

A. This can vary widely, but we assume you are referring to Ethernet hardware infrastructure. In addition, NVMe/TCP does not require an RDMA-capable NIC, so the variety of implementations is usually wider, which typically drives down cost.

Q. What are the plans for the major OS suppliers to support NVMe over TCP (Windows, Linux, VMware)?

A. Unfortunately, we cannot comment on their behalf, but Linux already supports NVMe/TCP, which should find its way into the various distributions soon. We are working with others to support NVMe/TCP, but suggest asking them directly.

Q. Where does the overhead occur for NVMe/TCP packetization? Is it dependent on the CPU, or does the network adapter offload that heavy lifting? And what is the impact of numerous, but extremely small, transfers?

A. Indeed a software NVMe/TCP implementation will introduce an overhead resulting from the TCP stream processing. However, you are correct that common stateless offloads such as Large Receive Offload and TCP Segmentation Offload are extremely useful both for large and for small 4K transfers.

Q. What do you mean Absolute Latency is higher than RDMA by “several” microseconds? <10us, tens of microseconds, or 100s of microseconds?

A. That depends on various aspects such as the CPU model, the network infrastructure, the controller implementation, services running on top, etc. Remote access to raw NVMe devices over TCP was measured to add a range between 20-35 microseconds with Linux in early testing, but the degrees of variability will affect this.

Q. Will Wireshark support NVMe/TCP soon? Is an implementation in progress?

A. We most certainly hope so; it shouldn’t be difficult, but we are not aware of an ongoing implementation in progress.

Q. Are there any NVMe TCP drivers out there?

A. Yes, Linux and SPDK both support NVMe/TCP out-of-the-box, see: https://nvmexpress.org/welcome-nvme-tcp-to-the-nvme-of-family-of-transports/

Q. Do you recommend a dedicated IP network for the storage traffic or can you use the same corporate network with all other LAN traffic?

A. This really depends on the use case, the network utilization and other factors. Obviously if the network bandwidth is fully utilized to begin with, it won’t be very efficient to add the additional NVMe/TCP “load” on the network, but that alone might not be the determining factor. Otherwise it can definitely make sense to share the same network and we are seeing customers choosing this route. It might be useful to consider the best practices for TCP-based storage networks (iSCSI has taught valuable lessons), and we anticipate that many of the same principles will apply to NVMe/TCP. The AQM, buffer, and other tuning settings are very dependent on the traffic pattern and need to be developed based on the requirements. Base configuration is determined by the vendors.

Q. On slide 28: no, TCP needs the congestion feedback; it doesn’t have to be a drop (it could be ECN, latency variance, etc.)

A. Yes, you are correct. The question refers to how that feedback is received, though, and in the most common (traditional) TCP methods it’s done via drops.

Q. How can you find out/check what TCP stack (drop vs. zero-buffer) your network is using?

A. The use/support of DCTCP is mostly driven by the OS. The network needs to support and have ECN enabled and correctly configured for the traffic of interest. So the best way to figure this out is to talk to the network team. The use of ECN, etc. needs to be worked out between the server and network teams.

Q. On slide 33: drop is a signal of an overloaded network; congestion onset is when there is a standing queue (latency already increases). Current state of the art is to always overload the network (switches).

A. ECN is used to signal before drop happens to make it more efficient.

Q. Is it safe to assume that most current switches on the market today support DCTCP/ECN and that we can mix/match switches from vendors across product families?

A. Most modern ASICs support ECN today. Mixing different product lines needs to be carefully planned and tested. AQM, buffers, etc. need to be fine-tuned across the platforms.

Q. Is there a substantial cost savings by implementing all of what is needed to support NVMe over TCP versus just sticking with RDMA? Much like staying with Fibre Channel instead of risking performance with iSCSI not being and staying implemented correctly. Building the separately supported network just seems the best route.

A. By “sticking with RDMA” you mean that you have already selected RDMA, which means you already made the investments to make it work for your use case. We agree that changing what currently works reliably and meets the targets might be an unnecessary risk. NVMe/TCP brings a viable option for Ethernet fabrics which is easily scalable and allows you to utilize a wide variety of both existing and new infrastructure while still maintaining low latency NVMe access.

Q. It seems that with multiple flavors of TCP and especially congestion management (DCTCP, DCQCN?), is there a plan for commonality in the ecosystem to support a standard way to handle congestion management? Is that required in the switches or also in the HBAs?

A. DCTCP is an approach for L3-based congestion management, whereas DCQCN is a combination of PFC and ECN for RoCEv2 (UDP) based communication. So these are two different approaches.

Q. Who are the major players in terms of marketing this technology among storage vendors?

A. The key organization to find out about NVMe/TCP (or all NVMe-related material, in fact) is NVM Express®.

Q. Can I compare NVMe over TCP to iSCSI?

A. Easily: you can download the upstream kernel and test both of the in-kernel implementations (iSCSI and NVMe/TCP). Alternatively you can reach out to a vendor that supports either of the two to test it as well. You should expect NVMe/TCP to run substantially faster for pretty much any workload.

Q. Is network segmentation crucial as a “go to” architecture, with a host-to-storage proximity objective, to accomplish the objective of managed/throttled, close to near-lossless connectivity?

A. There is a lot to unpack in this question. Let’s see if we can break it down a little. Generally speaking, best practice is to keep the storage as close to the host as possible (and is reasonable). Not only does this reduce latency, but it reduces the variability in latency (and bandwidth) that can occur at longer distances. In cases where storage traffic shares bandwidth (i.e., links) with other types of traffic, the variable nature of different applications (some are bursty, others are more long-lived) can create unpredictability. Since storage – particularly block storage – doesn’t “like” unpredictability, different methods are used to regain some of that stability as scales increase. A common and well-understood best practice is to isolate storage traffic from “regular” Ethernet traffic. As different workloads tend to be either “North-South” but increasingly “East-West” across the network topologies, this network segmentation becomes more important. Of course, it’s been used as a typical best practice for many years with protocols such as iSCSI, so this is not new. In environments where the variability of congestion can have a profound impact on the storage performance, network segmentation will, indeed, become crucial as a “go-to” architecture. Proper techniques at L2 and L3 will help determine how close to a “lossless” environment can be achieved, of course, as well as properly configured QoS mechanisms across the network. As a general rule of thumb, though, network segmentation is a very powerful tool to have for reliable storage delivery.

Q. How close are we to shared NVMe storage, either over Fibre Channel or TCP?

A. There are several shared storage products available on the market for NVMe over Fabrics, but as of this writing (only 3 months after the ratification of the protocol) no major vendors have announced NVMe over TCP shared storage capabilities. A good place to look for updates is on the NVM Express website for interoperability and compliance products. [https://nvmexpress.org/products/]

Q. AQM -> DualQ work in IETF for coexisting L4S (DCTCP) and legacy TCP. Ongoing work at chip merchants.

A. Indeed, there are a lot of advancements around making TCP evolve as the speeds and feeds increase. This is yet another example that shows why NVMe/TCP is, and will remain, relevant in the future.

Q. Are there any major vendors who are pushing products based on these technologies?

A. We cannot comment publicly on any vendor plans. You would need to ask a vendor directly for a concrete timeframe for the technology. However, several startups have made public announcements on supporting NVMe/TCP. Lightbits Labs, to give one example, will have a high-performance, low-latency NVMe/TCP-based software-defined-storage solution out very soon.

Networking Questions for Ethernet Scale-Out Storage

Fred Zhang

Dec 7, 2018

Unlike traditional local or scale-up storage, scale-out storage imposes different and more intense workloads on the network. That's why the SNIA Networking Storage Forum (NSF) hosted a live webcast "Networking Requirements for Ethernet Scale-Out Storage." Our audience had some insightful questions. As promised, our experts are answering them in this blog.

Q. How does scale-out flash storage impact Ethernet networking requirements?

A. Scale-out flash storage demands higher bandwidth and lower latency than scale-out storage using hard drives. As noted in the webcast, it's more likely to run into problems with TCP Incast and congestion, especially with older or slower switches. For this reason it's more likely than scale-out HDD storage to benefit from higher bandwidth networks and modern datacenter Ethernet solutions, such as RDMA, congestion management, and QoS features.

Q. What are your thoughts on NVMe-oF TCP/IP and availability?

A. The NVMe over TCP specification was ratified in November 2018, so it is a new standard. Some vendors already offer this as a pre-standard implementation. We expect that several of the scale-out storage vendors who support block storage will support NVMe over TCP as a front-end (client connection) protocol in the near future. It's also possible some vendors will use NVMe over TCP as a back-end (cluster) networking protocol.

Q. Which is better: RoCE or iWARP?

A. SNIA is vendor-neutral and does not directly recommend one vendor or protocol over another. Both are RDMA protocols that run on Ethernet, are supported by multiple vendors, and can be used with Ethernet-based scale-out storage. You can learn more about this topic by viewing our recent Great Storage Debate webcast "RoCE vs. iWARP" and checking out the Q&A blog from that webcast.

Q. How would you compare use of TCP/IP and Ethernet RDMA networking for scale-out storage?

A. Ethernet RDMA can improve the performance of Ethernet-based scale-out storage for the front-end (client) and/or back-end (cluster) networks. RDMA generally offers higher throughput, lower latency, and reduced CPU utilization when compared to using normal (non-RDMA) TCP/IP networking. This can lead to faster storage performance and leave more storage node CPU cycles available for running storage software. However, high-performance RDMA requires choosing network adapters that support RDMA offloads and in some cases requires modifications to the network switch configurations. Some other types of non-Ethernet storage networking also offer various levels of direct memory access or networking offloads that can provide high-performance networking for scale-out storage.

Q. How does RDMA networking enable latency reduction?

A. RDMA typically bypasses the kernel TCP/IP stack and offloads networking tasks from the CPU to the network adapter. In essence it reduces the total path length, which consequently reduces the latency. Most RDMA NICs (rNICs) perform some level of networking acceleration in an ASIC or FPGA, including retransmissions, reordering, TCP operations, flow control, and congestion management.

Q. Do all scale-out storage solutions have a separate cluster network?

A. Logically all scale-out storage systems have a cluster network. Sometimes it runs on a physically separate network and sometimes it runs on the same network as the front-end (client) traffic. Sometimes the client and cluster networks use different networking technologies.


RDMA for Persistent Memory over Fabrics – FAQ

John Kim

Nov 14, 2018

In our most recent SNIA Networking Storage Forum (NSF) webcast, Extending RDMA for Persistent Memory over Fabrics, our expert speakers, Tony Hurson and Rob Davis, outlined extensions to RDMA protocols that confirm persistence and additionally can order successive writes to different memories within the target system. Hundreds of people have seen the webcast and have given it a 4.8 rating on a scale of 1-5! If you missed it, you can watch it on-demand at your convenience. The webcast slides are also available for download. We had several interesting questions during the live event. Here are answers from our presenters:

Q. For the RDMA Message Extensions, does the client have to qualify a WRITE completion with only Atomic Write Response and not with Commit Response?

A. If an Atomic Write must be confirmed persistent, it must be followed by an additional Commit Request. Built-in confirmation of persistence was dropped from the Atomic Request because it adds latency and is not needed for some application streams.

Q. Why do you need confirmation for writes? From my point of view, the only thing required is ordering.

A. Agreed, but only if the entire target system is non-volatile! Explicit confirmation of persistence is required to cover the “gap” between the Write completing in the network and the data reaching persistence at the target.

Q. Where are these messages being generated? Does the NIC know when the data is flushed or committed?

A. They are generated by the application that has reserved the memory window on the remote node. It can write using RDMA Writes to that window all it wants, but to guarantee persistence it must send a flush.

Q. How is RPM presented on the client host?

A. The application using it sees it as memory it can read and write.

Q. Does this RDMA Commit Response implicitly ACK any previous RDMA Sends/Writes to the same or a different MR?

A. Yes, the new Commit (and Verify and Atomic Write) Responses have the same acknowledgement coalescing properties as the existing Read Response. That is, a Commit Response is explicit (non-coalesced), but it coalesces/implies acknowledgement of prior Write and/or Send Requests.

Q. Does this one still have the current RDMA Write ACK?

A. See the previous general answer. Yes, a Commit Response implicitly acknowledges prior Writes.

Q. With respect to the Race Hazard explained to show the need for an explicit completion response, wouldn’t this be the case even with volatile memory, if the data were to be stored there? Why is this completion status required only in the non-volatile case?

A. Most networked applications that write over the network to volatile memory do not require explicit confirmation at the writer endpoint that data has actually reached there. If so, additional handshake messages are usually exchanged between the endpoint applications. On the other hand, a writer to PERSISTENT memory across a network almost always needs assurance that data has reached persistence, thus the new extension.

Q. What if you are using multiple RNICs with multiple ports to multiple ports on a 100Gb fabric for server-to-server RDMA? How is order kept there – by CPU software or “NIC teaming plus”?

A. This would depend on the RNIC vendor and their implementation.

Q. What is the time frame for these new RDMA messages to be available in the verbs API?

A. This depends on the IBTA standards approval process, which is not completely predictable; roughly sometime in the first half of 2019.

Q. Where could I find more details about the three new verbs (what are the arguments)?

A. Please poll/contact/Google the IBTA and IETF organizations towards the end of calendar year 2018, when first drafts of the extension documents are expected to be available.

Q. Do you see this technology used in a way similar to how hyperconverged systems now use storage, or could you see this used as a large shared memory subsystem in the network?

A. High-speed persistent memory, in either NVDIMM or SSD form factor, has enormous potential in speeding up hyperconverged write replication. It will, however, require substantial re-write of such storage stacks, moving for example from traditional three-phase block storage protocols (command/data/response) to an RDMA write/confirm model. More generally, the RDMA extensions are useful for distributed shared PERSISTENT memory applications.

Q. What would be the most useful performance metrics to debug performance issues in such environments?

A. Within the RNIC, basic counts for the new message types would be a baseline. These, plus total stall times encountered by the RNIC awaiting Commit Responses from the local CPU subsystem, would be useful. Within the CPU platform, basic counts of device write and read requests targeting persistent memory would be useful.

Q. Do all the RDMA NICs have to update their firmware to support these new verbs? What is the expected performance improvement with the new Commit message?

A. Both answers would depend on the RNIC vendor and their implementation.

Q. Will the three new verbs be implemented in the RNIC alone, or will they require changes in other places (processor, memory controllers, etc.)?

A. The new Commit Request requires the CPU platform and its memory controllers to confirm that prior write data has reached persistence. The new Atomic Write and Verify messages, however, may be executed entirely within the RNIC.

Q. What about the future of NVMe over TCP – this would be much simpler for people to implement. Is this a good option?

A. Again this would depend on the NIC vendor and their implementation. Different vendors have implemented various tests for performance. It is recommended that readers do their own due diligence.

Introducing the Networking Storage Forum

John Kim

Oct 9, 2018

At SNIA, we are dedicated to staying on top of storage trends and technologies to fulfill our mission as a globally recognized and trusted authority for storage leadership, standards, and technology expertise. For the last several years, the Ethernet Storage Forum has been working hard to provide high quality educational and informational material related to all kinds of storage.

From our "Everything You Wanted To Know About Storage But Were Too Proud To Ask" series, to the absolutely phenomenal (and required viewing) "Storage Performance Benchmarking" series to the "Great Storage Debates" series, we've produced dozens of hours of material.

Technologies have evolved and we've come to a point where there's a need to understand how these systems and architectures work – beyond just the type of wire that is used. Today, there are new systems that are bringing storage to completely new audiences. From scale-up to scale-out, from disaggregated to hyperconverged, RDMA, and NVMe-oF - there is more to storage networking than just your favorite transport. For example, when we talk about NVMe™ over Fabrics, the protocol is broader than just one way of accomplishing what you need. When we talk about virtualized environments, we need to examine the nature of the relationship between hypervisors and all kinds of networks. When we look at "Storage as a Service," we need to understand how we can create workable systems from all the tools at our disposal.

Bigger Than Our Britches

As I said, SNIA's Ethernet Storage Forum has been working to bring these new technologies to the forefront, so that you can see (and understand) the bigger picture. To that end, we realized that we needed to rethink the way that our charter worked, to be even more inclusive of technologies that were relevant to storage and networking.

So... Introducing the Networking Storage Forum. In this group we're going to continue producing top-quality, vendor-neutral material related to storage networking solutions. We'll be talking about:
  • Storage Protocols (iSCSI, FC, FCoE, NFS, SMB, NVMe-oF, etc.)
  • Architectures (Hyperconvergence, Virtualization, Storage as a Service, etc.)
  • Storage Best Practices
  • New and developing technologies
... and more! Generally speaking, we'll continue to do the same great work that we've been doing, but now our name more accurately reflects the breadth of work that we do. We're excited to launch this new chapter of the Forum.

If you work for a vendor, are a systems integrator, a university, or someone who manages storage, we welcome you to join the NSF. We are an active group that honestly has a lot of fun. If you're one of our loyal followers, we hope you will continue to keep track of what we're doing. And if you're new to this Forum, we encourage you to take advantage of the library of webcasts, white papers, and published articles that we have produced here. There's a wealth of unbiased, educational information there that we don't think you'll find anywhere else!

If there's something that you'd like to hear about – let us know! We are always looking to hear about headaches, concerns, and areas of confusion within the industry where we can shed some light. Stay current with all things NSF:
