Have you ever wondered how RDMA (Remote Direct Memory Access) actually works? You’re not alone. That’s why the SNIA Data, Storage & Networking Community (DSN) hosted a live webinar, “Everything You Wanted to Know About RDMA But Were Too Proud to Ask,” where our expert presenters, Michal Kalderon and Rohan Mehta, explained how RDMA works and the essential role it plays for AI/ML workloads thanks to its high-speed, low-latency data transfer. The presentation is available on demand, along with the webinar slides, in the SNIA Educational Library.
The live audience was not “too proud” to ask questions, and our speakers have graciously answered all of them here.
Q: Does the DMA chip ever reside on the newer CPUs or are they a separate chip?
A: Early system designs used an external DMA Controller. Modern systems have moved to an integrated DMA controller design. There is a lot of information about the evolution of the DMA Controller available online. The DMA Wikipedia article on this topic is a good place to start.
Q: Slide 51: For RoCEv2, are you routing within the rack, across the data center, or off site as well?
A: RoCEv2 operates over Layer 3 networks and typically requires a lossless network environment, achieved through mechanisms like PFC, DCQCN, and the additional congestion control methods discussed in the webinar. These mechanisms are easier to implement and manage within controlled network domains, such as those found in data centers. While RoCEv2 can theoretically span multiple data centers, this is not commonly done due to the complexity of maintaining lossless conditions across such distances.
Q: You said that WQEs have opcodes. What do they specify and how are they defined?
A: The WQE opcodes specify the actual RDMA operations referred to in slides 16 and 30: SEND, RECV, WRITE, READ, and ATOMIC. For each of these operations there are additional fields that can be set.
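To make this concrete, here is a minimal sketch using the libibverbs API, a common way to program RDMA (the helper name and setup variables are assumptions, and error handling is trimmed). The opcode field of the work request selects the RDMA operation, and opcode-specific fields follow:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: qp, mr, buf, remote_addr, and rkey are assumed to
 * come from earlier connection setup; error handling is trimmed. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,  /* local registered buffer */
        .length = len,
        .lkey   = mr->lkey,        /* local key from ibv_reg_mr() */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;  /* the WQE opcode: could also be
                                           IBV_WR_SEND, IBV_WR_RDMA_READ,
                                           IBV_WR_ATOMIC_CMP_AND_SWP, ... */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;  /* ask for a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;  /* opcode-specific fields */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```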
Q: Is there a mistake in slide 27 with QP-3 on the receive side? Should they be labeled QP-1, QP-2, and QP-3, or am I misunderstanding something?
A: Correct, good catch. The updated deck is here.
Q: Is the latency deterministic after connection?
A: No. As with all network protocols, RDMA is subject to factors such as network congestion, competing traffic, and other network conditions that can affect latency.
Q: How does buffering work if a server sends data and the client is unable to receive all of it due to buffer size limitations?
A: We will split the answer between the two groups of RDMA operations.
- (a) Channel semantics (SEND and RECV): The application must make sure that the receiver has posted an RQ buffer before the send operation is performed. This is typically done by the client side posting RQ buffers before initiating a connection request. If a send arrives and there is no RQ buffer posted, the packets are dropped and an RNR NAK message is returned to the sender. A configurable QP parameter, rnr_retry_cnt, specifies how many times the RNIC should retry sending a message after receiving an RNR NAK. (A minimal sketch of pre-posting appears after this list.)
- (b) Memory semantics (READ, WRITE, ATOMIC): This situation cannot occur, since the data is placed in pre-registered memory at a specific location.
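Here is the sketch referenced in (a): pre-posting a receive buffer with libibverbs (the helper name and setup variables are assumptions). The rnr_retry_cnt parameter mentioned above corresponds to the rnr_retry field of struct ibv_qp_attr when using ibv_modify_qp, or rnr_retry_count in struct rdma_conn_param when using the RDMA CM.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: qp and mr are assumed to come from earlier setup;
 * error handling is trimmed. Buffers posted here are consumed, in order,
 * by incoming SENDs, avoiding the RNR NAK path described above. */
static int pre_post_recv(struct ibv_qp *qp, struct ibv_mr *mr,
                         void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = (uintptr_t)buf;  /* application cookie returned in the CQE */
    wr.sg_list = &sge;
    wr.num_sge = 1;

    return ibv_post_recv(qp, &wr, &bad_wr);
}
```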
Q: How/where are the packet headers handled?
A: The packet headers are built and parsed by the R-NIC (RDMA NIC), not by the host CPU.
Q: How is Object (S3) over RDMA implemented at a high level? Does it still involve HTTP/S?
A: There are proprietary solutions that provide Object (S3) over RDMA, but there is currently no industry standard that could be used to create non-vendor-specific, interoperable implementations.
Q: How many packets per second can a single x86 core transport, versus a single Arm server core, over TCP/IP?
A: When measuring RDMA performance, you measure messages per second rather than packets per second; unlike TCP, there is no per-packet processing in the host. Performance depends more on the R-NIC than on the host core, since RDMA bypasses CPU processing. If you’d like a performance comparison between RDMA (RoCE) and TCP, please refer to the “NVMe-oF Looking Beyond Performance Hero Numbers” webinar.
Q: Could you clarify the reason for higher latency using interrupt method?
A: It depends upon the mode of operation:
Polling Mode: In polling mode, the CPU continuously checks (or "polls") the completion queue for new events. This constant checking eliminates the delay associated with waiting for an interrupt signal, leading to faster detection and handling of events. However, this comes at the cost of higher CPU utilization since the CPU is always active, even when there are no events to process.
Interrupt Mode: In interrupt mode, the CPU is notified of new events via interrupts. When an event occurs, an interrupt signal is sent to the CPU, which then stops its current task to handle the event. This method is more efficient in terms of CPU usage because the CPU can perform other tasks while waiting for events. However, the process of generating, delivering, and handling interrupts introduces additional latency compared to the immediate response of polling.
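A minimal libibverbs sketch of both models (the CQ and its completion channel are assumed to come from earlier setup; error handling is trimmed):

```c
#include <infiniband/verbs.h>

/* Polling mode: spin on the CQ. Lowest latency, but one core stays busy. */
static void wait_by_polling(struct ibv_cq *cq, struct ibv_wc *wc)
{
    while (ibv_poll_cq(cq, 1, wc) == 0)
        ;  /* busy-wait until a completion appears */
}

/* Interrupt mode: sleep on the completion channel until notified.
 * CPU-friendly, but the event path adds latency. */
static void wait_by_event(struct ibv_cq *cq, struct ibv_comp_channel *ch,
                          struct ibv_wc *wc)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;

    ibv_req_notify_cq(cq, 0);               /* arm the CQ for notification */
    ibv_get_cq_event(ch, &ev_cq, &ev_ctx);  /* blocks until the event fires */
    ibv_ack_cq_events(ev_cq, 1);
    while (ibv_poll_cq(cq, 1, wc) == 0)
        ;  /* drain the completion that triggered the event */
}
```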
Q: Slide 58: Does RDMA complement or compete with CXL?
A: They are not directly related. RDMA is a protocol used to perform remote DMA operations (i.e., over a network of some kind), while CXL is used to provide high-speed, coherent communication within a single system. CXL.mem allows devices to access memory directly within a single system or a small, tightly coupled group of systems. If this question were specific to DMA, as opposed to RDMA, the answer would be slightly different.
Q: The RDMA demonstration had MTU size on the RDMA NIC set to 1K. Does RDMA traffic benefit from setting the MTU size to a larger setting (3k-9k MTU size) or is that really dependent on the amount of traffic the RoCE application generates over the RDMA NIC?
A: RDMA traffic, like other protocols such as TCP, can benefit from a larger MTU setting, which reduces packet-processing overhead and improves throughput.
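One caveat worth noting: the RDMA path MTU is negotiated separately from the Ethernet MTU, and the verbs API caps it at 4096 bytes (IBV_MTU_4096), so jumbo Ethernet frames mainly help by carrying a full 4096-byte RDMA MTU plus headers without fragmentation. A minimal sketch for checking what a port actually negotiated (ctx is assumed to be an opened device context):

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Query the active RDMA path MTU for a port. The verbs enum encodes
 * 256..4096 bytes (IBV_MTU_256=1 ... IBV_MTU_4096=5), so shifting 128
 * left by the enum value yields the size in bytes. */
static void print_active_mtu(struct ibv_context *ctx, uint8_t port)
{
    struct ibv_port_attr attr;

    if (ibv_query_port(ctx, port, &attr) == 0)
        printf("port %u active path MTU: %d bytes\n",
               port, 128 << attr.active_mtu);
}
```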
Q: When an app sends a read/write request, how does it get the remote side’s RKEY info? Q2: Is it possible to tweak the LKEY to point to the same buffer for debugging memory-related issues? I’m fairly new to the topic, so apologies in advance if any query doesn’t make sense.
A: The RKEYs can be exchanged using channel semantics, i.e., SEND/RECV. No RKEY is needed for that exchange itself, since the message arrives in the first buffer posted to the RQ on the peer. The LKEY refers to local memory. For a given registered memory region, the LKEY and RKEY point to the same location: the LKEY is used by the local RNIC to access the memory, and the RKEY is provided to the remote application to access it.
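As a sketch of that exchange with libibverbs (the struct and helper names here are illustrative assumptions, and error handling is trimmed): registering memory with ibv_reg_mr yields both keys, and the RKEY plus the buffer address is what you ship to the peer, typically in a SEND during setup.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical exchange struct: what one side ships to its peer
 * (typically inside a SEND during connection setup). */
struct remote_buf_info {
    uint64_t addr;  /* remote virtual address */
    uint32_t rkey;  /* remote access key */
};

/* Registration yields both keys; error handling is trimmed. */
static struct ibv_mr *register_and_describe(struct ibv_pd *pd,
                                            void *buf, size_t len,
                                            struct remote_buf_info *out)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return NULL;

    out->addr = (uintptr_t)buf;  /* peer puts this in wr.wr.rdma.remote_addr */
    out->rkey = mr->rkey;        /* peer puts this in wr.wr.rdma.rkey */
    /* mr->lkey stays local: it goes into our own SGEs. */
    return mr;
}
```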
Q: What does SGE stand for?
A: Scatter Gather Element. An SGE is an element inside an SGL (Scatter Gather List) and is used to reference non-contiguous memory.
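For example, here is a minimal libibverbs sketch (the helper name and setup variables are assumptions; error handling is trimmed) that gathers two non-contiguous buffers into one SEND via a two-element SGL:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: mr is assumed to cover both buffers. The RNIC
 * gathers the two non-contiguous buffers, in order, into one SEND
 * message on the wire. */
static int post_gathered_send(struct ibv_qp *qp, struct ibv_mr *mr,
                              void *hdr, uint32_t hdr_len,
                              void *payload, uint32_t payload_len)
{
    struct ibv_sge sgl[2] = {
        { .addr = (uintptr_t)hdr,     .length = hdr_len,     .lkey = mr->lkey },
        { .addr = (uintptr_t)payload, .length = payload_len, .lkey = mr->lkey },
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = sgl;  /* the SGL: a list of SGEs */
    wr.num_sge    = 2;
    wr.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```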
Thanks to our audience for all these great questions. We encourage you to join us at future SNIA DSN webinars. Follow us @SNIA and on LinkedIn for upcoming webinars. This webinar was part of the “Everything You Wanted to Know But Were Too Proud to Ask” SNIA webinar series. If you found the information in this RDMA webinar helpful, I encourage you to check out the many others we have produced. They are all available here on the SNIAVideo YouTube Channel.