Oh What a Tangled Web We Weave: Extending RDMA for PM over Fabrics

John Kim

Oct 8, 2018

For datacenter applications requiring low-latency access to persistent storage, byte-addressable persistent memory (PM) technologies like 3D XPoint and MRAM are attractive solutions. Network-based access to PM, labeled here Persistent Memory over Fabrics (PMoF), is driven by data scalability and/or availability requirements. Remote Direct Memory Access (RDMA) network protocols are a good match for PMoF, allowing direct RDMA reads and writes from/to remote PM. However, the completion of an RDMA Write at the sending node offers no guarantee that data has reached persistence at the target. Join the Networking Storage Forum (NSF) on October 25, 2018 for our next live webcast, Extending RDMA for Persistent Memory over Fabrics. In this webcast, we will outline extensions to RDMA protocols that confirm such persistence and additionally can order successive writes to different memories within the target system. Learn:
  • Why we can't treat PM just like traditional storage or volatile memory
  • What happens when you write to memory over RDMA
  • Which programming model and protocol changes are required for PMoF
  • How proposed RDMA extensions for PM would work
We believe this webcast will appeal to developers of low-latency and/or high-availability datacenter storage applications and be of interest to datacenter developers, administrators and users. I encourage you to register today. Our NSF experts will be on hand to answer your questions. We look forward to your joining us on October 25th.
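To make the persistence gap concrete, here is a small, purely illustrative Python model of the problem and of a hypothetical flush-style extension. It is not the actual RDMA wire protocol or the proposed extensions; the class and method names (TargetNode, rdma_write, rdma_flush) are invented for illustration.

```python
# Illustrative-only model (not the actual RDMA wire protocol) of why an
# RDMA Write completion at the sender does not imply persistence at the
# target, and how a hypothetical "flush/commit" extension closes the gap.

class TargetNode:
    def __init__(self):
        self.volatile_buffer = {}    # data landed in NIC/CPU buffers or caches
        self.persistent_memory = {}  # data actually durable in PM

    def rdma_write(self, addr, data):
        # The target acknowledges once data is accepted; it may still sit in
        # volatile buffers, so the sender's completion is not durability.
        self.volatile_buffer[addr] = data
        return "write-completion"

    def rdma_flush(self, addr):
        # Hypothetical extension: force buffered data into PM and only then
        # return, so the initiator can rely on persistence at the target.
        if addr in self.volatile_buffer:
            self.persistent_memory[addr] = self.volatile_buffer.pop(addr)
        return "flush-completion"

target = TargetNode()
target.rdma_write(0x1000, b"journal record")
assert 0x1000 not in target.persistent_memory   # completed, not yet persistent
target.rdma_flush(0x1000)
assert target.persistent_memory[0x1000] == b"journal record"
```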


John Kim

Sep 19, 2018

In our RoCE vs. iWARP webcast, experts from the SNIA Ethernet Storage Forum (ESF) had a friendly debate on two commonly known remote direct memory access (RDMA) protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP. It turned out to be another very popular addition to our "Great Storage Debate" webcast series. If you haven't seen it yet, it's now available on-demand along with a PDF of the presentation slides. We received a lot of questions related to Performance, Scalability and Distance; Multipathing and Error Correction; Windows and SMB Direct; DCB (Data Center Bridging), PFC (Priority Flow Control), lossless networks, and Congestion Management; and more. Here are answers to them all.

Q. Are RDMA NICs and TOE NICs the same? What are the differences?
A. No, they are not, though some RNICs include a TOE. An RNIC based on iWARP uses a TCP Offload Engine (TOE) since iWARP itself is fundamentally an upper layer protocol relative to TCP/IP (it is encapsulated in TCP/IP). The iWARP-based RNIC may or may not expose the TOE. If the TOE is exposed, it can be used for other purposes/applications that require TCP/IP acceleration. However, most of the time the TOE is hidden under the iWARP verbs API and thus is only used to accelerate TCP for iWARP. An RNIC based on RoCE usually does not have a TOE in the first place and is thus not capable of statefully offloading TCP/IP, though many of them do offer stateless TCP offloads.

Q. Does RDMA use the TCP/UDP/IP protocol stack?
A. RoCE uses UDP/IP while iWARP uses TCP/IP. Other RDMA protocols like OmniPath and InfiniBand don't use Ethernet.

Q. Can Software Defined Networking features like VxLANs be implemented on RoCE/iWARP NICs?
A. Yes, most RNICs can also support VxLAN. An RNIC combines all the functionality of a regular NIC (such as VxLAN offloads and checksum offloads) with RDMA functionality.

Q. Do the BSD OSes (e.g. FreeBSD) support RoCE and iWARP?
A. FreeBSD supports both iWARP and RoCE.

Q. Any comments on NVMe over TCP?
A. The NVMe over TCP standard is not yet finalized. Once the specification is finalized, SNIA ESF will host a webcast on BrightTALK to discuss NVMe over TCP. Follow us @SNIAESF for notification of all our upcoming webcasts.

Q. What layers in the OSI model would RDMAP, DDP, and MPA map to for iWARP?
A. RDMAP/DDP/MPA stack on top of TCP, so these protocols sit above Layer 4, the Transport Layer, in the OSI model.

Q. What are the deployment percentages between RoCE and iWARP? Which has a bigger market share and by how much?
A. SNIA does not have this market share information. Today multiple networking vendors support both RoCE and iWARP. Historically more adapters supporting RoCE have been shipped than adapters supporting iWARP, but not all the iWARP/RoCE-capable Ethernet adapters deployed are used for RDMA.

Q. Who will win: RoCE, iWARP, or InfiniBand? What should we as customers choose if we want to have this today?
A. As a vendor-neutral forum, SNIA cannot recommend any specific RDMA technology or vendor. Note that RoCE and iWARP run on Ethernet, while InfiniBand (and OmniPath) do not use Ethernet.

Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB etc.) on top of RoCE or iWARP?
A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols whether using RDMA or regular TCP/IP. To prevent this, a best practice would be to use Explicit Congestion Notification (ECN), or better yet, Data Center Bridging (DCB) to minimize congestion and ensure the best performance. Likewise, designing a fully non-blocking network fabric will also assist in preventing congestion and guarantee the best performance. Finally, by prioritizing the data flows that are using RoCE or iWARP, network administrators can ensure bandwidth is available for the flows that require it the most. iWARP provides RDMA functionality over TCP/IP and inherits the loss resilience and congestion management of the underlying TCP/IP layer. Thus, it does not require specific best practices beyond those in use for TCP/IP, including not requiring any specific host or switch configuration, and it works out of the box across LAN/MAN/WAN networks.
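As a quick visual summary of the layering described in the answers above (RoCEv2 carried over UDP/IP, iWARP's RDMAP/DDP/MPA carried over TCP/IP), here is a small illustrative sketch. The layer labels are for orientation only; this is not a parser for real RoCE or iWARP frames.

```python
# Simplified view of how the two Ethernet RDMA transports are layered.
# Labels only; this is not a wire-format parser.

ROCE_V2_STACK = [
    "Ethernet",
    "IP",
    "UDP",                          # RoCEv2 rides on UDP/IP
    "InfiniBand transport headers",
    "RDMA payload",
]

IWARP_STACK = [
    "Ethernet",
    "IP",
    "TCP",                          # iWARP rides on TCP/IP (TOE-accelerated on RNICs)
    "MPA",                          # Marker PDU Aligned framing
    "DDP",                          # Direct Data Placement
    "RDMAP",                        # RDMA Protocol
    "RDMA payload",
]

def show(name, stack):
    print(f"{name}:")
    for layer in stack:
        print(f"  {layer}")

if __name__ == "__main__":
    show("RoCEv2", ROCE_V2_STACK)
    show("iWARP", IWARP_STACK)
```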
Q. On slide #14 of the RoCE vs. iWARP presentation, the slide showed SCM being 1,000 times faster than NAND flash, but the presenter stated 100 times faster. Those are both higher than I have heard. Which is correct?
A. Research on the Internet shows that both Intel and Micron have been boasting that 3D XPoint memory is 1,000 times as fast as NAND flash. However, their tests also compared a standard NAND flash-based PCIe SSD to a similar SSD based on 3D XPoint memory, which was only 7-8 times faster. Due to this, we dug in a little further and found a great article by Jim Handy, "Why 3D XPoint SSDs Will Be Slow," that could help explain the difference.

Q. What is the significance of the BTH+ and GRH headers?
A. BTH+ and GRH are both used within InfiniBand for RDMA implementations. With RoCE implementations of RDMA, packets are marked with an EtherType header that indicates the packets are RoCE, and the ip.protocol_number within the IP header is used to indicate that the packet is UDP. Both of these identify packets as RoCE packets.

Q. What sorts of applications are unique to the workstation market for RDMA, versus the server market?
A. All major OEM vendors are shipping servers with CPU platforms that include integrated iWARP RDMA, as well as offering adapters that support iWARP and/or RoCE. The main applications of RDMA are still in the server area at this moment. At the time of this writing, workstation operating systems such as Windows 10 or Linux can use RDMA when running I/O-intensive applications such as video post-production, oil/gas, and computer-aided design applications, for high-speed access to storage.

DCB, PFC, lossless networks, and Congestion Management

Q. Is slide #26 correct? I thought RoCE v1 was PFC/DCB and RoCE v2 was ECN/DCB subset. Did I get it backwards?
A. Sorry for the confusion, you've got it correct. With newer RoCE-capable adapters, customers may choose to use ECN or PFC for RoCE v2.

Q. I thought RoCE v2 did not need any DCB-enabled network, so why this DCB congestion management for RoCE v2?
A. RoCEv2 running on modern rNICs is known as Resilient RoCE because it does not need a lossless network. Instead, a RoCE congestion control mechanism is used to minimize packet loss by leveraging Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. RoCE v2 takes advantage of ECN to avoid congestion and packet loss. ECN-capable switches detect when a port is getting too busy and mark outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC backs off its sending rate temporarily to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission (this is referred to as Resilient RoCE).
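The ECN/CNP feedback loop in the answer above can be pictured with a toy simulation: the switch marks packets when a port is busy, the receiver returns a CNP, and the sender backs off and then recovers. The thresholds and rate adjustments below are arbitrary illustrative numbers, not DCQCN or any vendor's actual algorithm.

```python
# Toy model of the RoCEv2 ECN feedback loop described above:
# a busy switch port marks packets CE, the receiver returns a CNP,
# and the sender temporarily reduces its rate, then recovers.
# All numbers are arbitrary; this is not DCQCN or any real NIC algorithm.

LINE_RATE = 100.0             # Gb/s the sender would like to use
CONGESTION_THRESHOLD = 80.0   # switch marks CE above this offered load

def switch_marks_ce(offered_load_gbps: float) -> bool:
    """ECN-capable switch marks packets when the port gets too busy."""
    return offered_load_gbps > CONGESTION_THRESHOLD

def simulate(rounds: int = 10) -> None:
    rate = LINE_RATE
    for t in range(rounds):
        if switch_marks_ce(rate):
            # Receiver sees CE and sends a CNP; sender backs off.
            rate *= 0.5
            event = "CE marked -> CNP -> sender halves rate"
        else:
            # No congestion signal; sender ramps back toward line rate.
            rate = min(LINE_RATE, rate * 1.25)
            event = "no CE -> sender ramps up"
        print(f"t={t}: rate={rate:5.1f} Gb/s ({event})")

if __name__ == "__main__":
    simulate()
```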
Q. Is iWARP a lossless or lossy protocol?
A. iWARP relies on the underlying TCP/IP layer for loss resilience. This happens at silicon speeds for iWARP adapters with embedded TCP/IP offload engine (TOE) functionality.

Q. So it looks to me like iWARP can use an existing Ethernet network without modifications, and RoCEv2 would need some fine-tuning. Is this correct?
A. Generally, iWARP does not require any modification to the Ethernet switches, and RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE). However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion. iWARP delivers RDMA on top of the TCP/IP protocol, and thus TCP provides congestion management and loss resilience for iWARP, which, as a result, does not require a lossless Ethernet network. This is particularly useful in congested networks or on long-distance links.

Q. Is this a correct statement? Please clarify: RoCE v1 requires ECN and PFC, but RoCEv2 requires only ECN or PFC?
A. Remember, we called this presentation a "Great Storage Debate?" Here is an area where there are two schools of thought. Answer #1: It's recommended to deploy RoCE (v1) with PFC, which is part of the Ethernet Data Center Bridging (DCB) specification, to implement a lossless network. With the release of RoCEv2, an alternative mechanism to avoid packet loss was introduced which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. Answer #2: Generally this is correct: iWARP does not require any modification to the Ethernet switches, and RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE), and DCB. As such, and this is very important, an iWARP installation of a storage or server node is decoupled from the switch infrastructure upgrade. However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion, though in the case of an iWARP adapter this benefit is insignificant, since all loss recovery and congestion management happen at the silicon speed of the underlying TOE.

Q. Does RoCE v2 also require PFC, or how will it handle lossy networks?
A. RoCE v2 does not require PFC but performs better with either PFC or ECN activated. See the following question and answer for more details.

Q. Can a RoCEv2 lossless network be achieved with ECN only (no PFC)?
A. RoCE has built-in error correction and retransmission mechanisms, so it does not require a lossless network. With modern RoCE-capable adapters, it only requires the use of ECN. ECN in and of itself does not guarantee a lossless connection but can be used to minimize congestion and thus minimize packet loss. However, even with RoCE v2, a lossless connection (using PFC/DCB) can provide better performance and is often implemented with RoCEv2 deployments, either instead of ECN or alongside ECN.

Q. In order to guarantee lossless, do ECN and PFC both have to be used?
A. ECN can be used to avoid most packet loss, but PFC (part of DCB) is required for a truly lossless network.
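To summarize the guidance in these answers, the sketch below encodes which fabric features each transport typically relies on. This is a simplification for illustration only; as noted above, actual requirements depend on the rNICs in use and on whether a fully lossless fabric is desired.

```python
# Rough summary of the switch-side expectations discussed above.
# Simplified: real deployments depend on the rNICs and design goals.

def fabric_requirements(transport: str, want_lossless: bool) -> list[str]:
    transport = transport.lower()
    if transport == "iwarp":
        # TCP provides loss resilience and congestion management,
        # so no special switch configuration is required.
        return []
    if transport in ("roce", "rocev1"):
        # RoCE v1 is typically deployed with PFC/DCB for a lossless fabric.
        return ["PFC (DCB)"]
    if transport == "rocev2":
        # Resilient RoCE can run with ECN alone; PFC/DCB gives a truly
        # lossless fabric and is often deployed instead of or alongside ECN.
        return ["ECN", "PFC (DCB)"] if want_lossless else ["ECN"]
    raise ValueError(f"unknown transport: {transport}")

print(fabric_requirements("iWARP", want_lossless=False))   # []
print(fabric_requirements("RoCEv2", want_lossless=False))  # ['ECN']
print(fabric_requirements("RoCEv2", want_lossless=True))   # ['ECN', 'PFC (DCB)']
```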
Q. Are there real deployments that use "Resilient RoCE" without PFC configured?
A. To achieve better performance, PFC alone or both ECN and PFC are deployed in most iterations of RoCE in real deployments today. However, there is a growing number of deployments using Resilient RoCE with ECN alone that maintain high levels of performance.

Q. For RoCEv2, can ECN be implemented without PFC?
A. Yes, ECN can be implemented on its own within a RoCE v2 implementation without the need for PFC.

Q. RoCE needs to have Converged Ethernet, but iWARP does not, correct?
A. Correct. iWARP was standardized in the IETF and built upon standard TCP/IP over Ethernet, so the "Converged Ethernet" requirement doesn't apply to iWARP.

Q. It's not clear from the diagram if TCP/IP is still needed for RoCE and iWARP. Is it?
A. RoCE uses IP (UDP/IP) but not TCP. iWARP uses TCP/IP.

Q. On slide #10, does this require any support on the switch?
A. Yes, an enterprise switch with support for DCB would be required. Most enterprise switches do support DCB today.

Q. Will you cover congestion mechanisms and which of RoCEv2 or iWARP works better for different workloads?
A. With multiple vendors supporting RoCEv2 and iWARP at different speeds (10, 25, 40, 50, and 100Gb/s), we'd likely see a difference in performance from each adapter across different workloads. An apples-to-apples test of the specific workload would be required to provide an answer. If you are working with a specific vendor or OEM, we suggest you ask the vendor/OEM for comparison data on the workload you plan on deploying.

Performance, Scalability and Distance

Q. For storage-related applications, could you add a performance-based comparison of Ethernet-based RoCE/iWARP to FC-NVMe with similar link speeds (32Gbps FC to 40GbE, for example)?
A. We would like to see the results of this testing as well, and due to the overwhelming request for data comparing RoCE and iWARP, this is something we will try to provide in the future.

Q. Do you have some performance measurements which compare iWARP and RoCE?
A. Nothing is available from SNIA ESF, but a search on Google should provide you with the information you are looking for. For example, you can find this Microsoft blog.

Q. Are there performance benchmarks between RoCE vs. iWARP?
A. Debating which one is faster is beyond the scope of this webcast.

Q. Can RoCE scale to 1000s of Ceph nodes, assuming each node hosts 36 disks?
A. RoCE has been successfully tested with dozens of Ceph nodes. It's unknown if RoCE with Ceph can scale to 1000s of Ceph nodes.

Q. Is RoCE limited in the number of hops?
A. No, there is no limit on the number of hops, but as more hops are included, latencies increase and performance may become an issue.

Q. Does RoCEv2 support long distance (100km) operation, or is it only iWARP?
A. Today the practical limit of RoCE while maintaining high performance is about 40km. As different switches and optics come to market, this distance limit may increase in the future. iWARP has no distance limit, but as with any high-performance networking solution, increasing distance leads to increasing latency due to the speed of light and/or retransmission hops. Since it is a protocol on top of basic TCP/IP, it can transfer data over wireless links to satellites if need be.
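The distance answer above largely comes down to propagation delay: light in fiber travels at roughly 200,000 km/s, so every kilometer adds about 5 microseconds each way before any switch or retransmission delay. A quick back-of-the-envelope sketch:

```python
# Back-of-the-envelope propagation delay versus link distance.
# Assumes light in fiber travels at roughly 2/3 the speed of light
# (~200,000 km/s); switch, NIC, and retransmission delays are ignored.

SPEED_IN_FIBER_KM_PER_S = 200_000.0

def one_way_latency_us(distance_km: float) -> float:
    return distance_km / SPEED_IN_FIBER_KM_PER_S * 1e6

for km in (1, 40, 100):
    print(f"{km:>3} km: ~{one_way_latency_us(km):6.1f} us one way, "
          f"~{2 * one_way_latency_us(km):6.1f} us round trip")
# 40 km (the practical RoCE distance mentioned above) already adds roughly
# 0.4 ms round trip, which is significant next to microsecond-class RDMA I/O.
```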
Multipathing, Error Correction

Q. Isn't the Achilles heel of iWARP the handling of congestion on the switch? Sure, TCP/IP doesn't require lossless, but doesn't one need DCTCP, PFC, and ETS to handle buffers filling up, both point to point as well as from receiver to sender? Some vendors offload any TCP/IP traffic and consider RDMA "limited," but even if that's true, don't they have to deal with the same challenges on the switch in regards to congestion management?
A. TCP itself uses a congestion-avoidance algorithm, like TCP New Reno (RFC 6582), together with slow start and a congestion window to avoid congestion. These mechanisms are not dependent on switches, so iWARP's performance under network congestion should closely match that of TCP.

Q. If you are using RoCE v2 with UDP, how is error correction implemented?
A. Error correction is done by the RoCE protocol running on top of UDP.

Q. How does multipathing work with RDMA?
A. For single-port RNICs, multipathing, being network-based (Equal-Cost Multi-Path routing, ECMP), is transparent to the RDMA application. Both RoCE and iWARP transports achieve good network load balancing under ECMP. For multi-port RNICs, the RDMA client application can explicitly load-balance its traffic across multiple local ports. Some multi-port RNICs support link aggregation (a.k.a. bonding), in which case the RNIC transparently spreads connection load amongst physical ports.

Q. Do RoCE and iWARP work with bonded NICs?
A. The short answer is yes, but it will depend on the individual NIC vendor's implementation.

Windows and SMB Direct

Q. What is SMB Direct?
A. SMB Direct is a special version of the SMB 3 protocol. It supports both RDMA and multiple active-active connections. You can find the official definition of SMB (Server Message Block) in the SNIA Dictionary.

Q. Is there iSER support in Windows?
A. Today iSER is supported in Linux and VMware but not in Windows. Windows does support both iWARP and RoCE for SMB Direct. Chelsio is now providing an iSER (iWARP) initiator for Windows as part of the driver package, which is available at service.chelsio.com. The current driver is considered a beta, but it will go GA by the end of September 2018.

Q. When will iWARP or RoCE for NVMe-oF be supported on Windows?
A. Windows does not officially support NVMe-oF yet, but if and when Windows does support it, we believe it will support it over both RoCE and iWARP.

Q. Why is iWARP better for Storage Spaces Direct?
A. iWARP is based on TCP, which deals with flow control and congestion management, so iWARP is scalable and ideal for a hyper-converged storage solution like Storage Spaces Direct. iWARP is also the recommended configuration from Microsoft in some circumstances.

We hope that answers all your questions! We encourage you to check out the other "Great Storage Debate" webcasts in this series. To date, our experts have had friendly, vendor-neutral debates on File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, FCoE vs. iSCSI vs. iSER, and Centralized vs. Distributed Storage. Happy debating!
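The ECMP behavior mentioned in the multipathing answer can be illustrated with a toy flow-hashing sketch: the network hashes each connection's 5-tuple to pick one of several equal-cost paths, so a given RDMA connection stays on one path while the aggregate load spreads across links. This is a conceptual illustration, not the hash any particular switch implements.

```python
# Toy ECMP illustration: hash each flow's 5-tuple to one of N equal-cost
# paths. Real switches use their own hash functions; this just shows why
# per-flow load balancing is transparent to the RDMA application.

import hashlib
from collections import Counter

NUM_PATHS = 4

def pick_path(src_ip, dst_ip, proto, src_port, dst_port) -> int:
    five_tuple = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(five_tuple).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PATHS

# Spread 1,000 hypothetical RoCEv2 (UDP) connections across the paths.
load = Counter(
    pick_path("10.0.0.1", "10.0.1.1", "udp", 49152 + i, 4791)
    for i in range(1000)
)
print(dict(sorted(load.items())))  # roughly even number of flows per path
```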


RoCE vs. iWARP – The Next “Great Storage Debate”

John Kim

Jul 16, 2018

By now, we hope you’ve had a chance to watch one of the webcasts from the SNIA Ethernet Storage Forum’s “Great Storage Debate” webcast series. To date, our experts have had friendly, vendor-neutral debates on File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, and FCoE vs. iSCSI vs. iSER. The goal of this series is not to have a winner emerge, but rather to educate attendees on how the technologies work, the advantages of each, and common use cases. Our next great storage debate will be on August 22, 2018, when our experts will debate RoCE vs. iWARP. They will discuss these two commonly known RDMA protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP. Both are Ethernet-based RDMA technologies that can increase networking performance. Both reduce the amount of CPU overhead in transferring data among servers and storage systems to support network-intensive applications, like networked storage or clustered computing. Join us on August 22nd, as we’ll address questions like:
  • Both RoCE and iWARP support RDMA over Ethernet, but what are the differences?
  • What are the use cases for RoCE and iWARP, and what differentiates them?
  • UDP/IP and TCP/IP: which RDMA standard uses which protocol, and what are the advantages and disadvantages?
  • What are the software and hardware requirements for each?
  • What are the performance/latency differences of each?
Get this on your calendar by registering now. Our experts will be on hand to answer your questions on the spot. We hope to see you there! Visit snia.org to learn about the work SNIA is doing to lead the storage industry worldwide in developing and promoting vendor-neutral architectures, standards, and educational services that facilitate the efficient management, movement, and security of information.


FCoE vs. iSCSI vs. iSER: Get Ready for Another Great Storage Debate

Alex McDonald

May 1, 2018

As a follow-up to our first two hugely successful "Great Storage Debate" webcasts, Fibre Channel vs. iSCSI and File vs. Block vs. Object Storage, the SNIA Ethernet Storage Forum will be presenting another great storage debate on June 21, 2018. This time we'll take on FCoE vs. iSCSI vs. iSER. For those of you who've seen these webcasts, you know that the goal of these debates is not to have a winner emerge, but rather to provide unbiased education on the capabilities and use cases of these technologies so that attendees can become more informed and make educated decisions. Here's what you can expect from this session: One of the features of modern data centers is the ubiquitous use of Ethernet. Although many data centers run multiple separate networks (Ethernet and Fibre Channel (FC)), these parallel infrastructures require separate switches, network adapters, management utilities and staff, which may not be cost effective. Multiple options for Ethernet-based SANs enable network convergence, including FCoE (Fibre Channel over Ethernet), which allows FC protocols over Ethernet, and Internet Small Computer System Interface (iSCSI) for transport of SCSI commands over TCP/IP-Ethernet networks. There are also new Ethernet technologies that reduce the amount of CPU overhead in transferring data from server to client by using Remote Direct Memory Access (RDMA), which is leveraged by iSER (iSCSI Extensions for RDMA) to avoid unnecessary data copying. That leads to several questions about FCoE, iSCSI and iSER:
  • If we can run various network storage protocols over Ethernet, what differentiates them?
  • What are the advantages and disadvantages of FCoE, iSCSI and iSER?
  • How are they structured?
  • What software and hardware do they require?
  • How are they implemented, configured and managed?
  • Do they perform differently?
  • What do you need to do to take advantage of them in the data center?
  • What are the best use cases for each?
Register today to join our SNIA experts as they answer all these questions and more on the next Great Storage Debate: FCoE vs. iSCSI vs. iSER. We look forward to seeing you on June 21st.  


Fibre Channel vs. iSCSI – The Great Debate Generates Questions Galore

Alex McDonald

Mar 7, 2018

The SNIA Ethernet Storage Forum recently hosted the first of our "Great Debates" webcasts on Fibre Channel vs. iSCSI. The goal of this series is not to have a winner emerge, but rather provide vendor-neutral education on the capabilities and use cases of these technologies so that attendees can become more informed and make educated decisions. And it worked! Over 1,200 people have viewed the webcast in the first three weeks! And the comments from attendees were exactly what we had hoped for:

"A good and frank discussion about the two technologies that don't always need to compete!"

"Really nice and fair comparison guys. Always well moderated, you hit a lot of material in an hour. Thanks for your work!"  

"Very fair and balanced overview of the two protocols."

"Excellent coverage of the topic. I will have to watch it again."

If you missed the webcast, you can watch it on-demand at your convenience and download a copy of the slides. The debate generated many good questions and our expert speakers have answered them all:

Q. What is RDMA?
A. RDMA is an acronym for Remote Direct Memory Access. It is part of a protocol through which memory addresses are exchanged between end points so that data is able to move directly from the memory in one end point over the network to the memory in the other end point, without involving the end point CPUs in the data transfer. Without RDMA, intermediate copies (sometimes multiple copies) of the data are made on the source end point and the destination end point. RoCEv1, RoCEv2, iWARP, and InfiniBand are all protocols that are capable of performing RDMA transfers. iSER is iSCSI over RDMA and often uses iWARP or RoCE. SRP is a SCSI RDMA protocol that runs only over InfiniBand. FC uses hardware-based DMA to perform transfers without the need to make intermediate copies of the data, so RDMA is not needed for FC and does not apply to FC.

Q. Can multipathing be used for load balancing or high availability?
A. Multipathing is used both for load balancing and for high availability. In an active-passive setup it is used only for high availability, while in an active-active setup it is used for both.

Q. Some companies are structured so that iSCSI is handled by network services and the storage team supports FC, so there is storage and network overlap. Network people should be aware of storage and vice versa.
A. Correct. One of the big tradeoffs between iSCSI and FC may end up not being a technology tradeoff at all. In some environments, the political and organizational structure plays as much a part in the technology decision as the technology itself. Strong TCP/IP network departments may demand that they manage everything, or they may demand that storage traffic be kept as far from their network as possible. Strong storage network departments may demand their own private networks (either TCP/IP for iSCSI, or FC). In the end, the politics may play as important a role in the decision of iSCSI vs. FC as the actual technology itself.

Q. If you have an established storage network (i.e. FC/iSCSI), is there a compelling reason you would switch?
A. Typically, installations grow by adding to their existing configuration (iSCSI installations typically add more iSCSI, and FC installations add more FC). Switching from one technology to another may occur for various reasons (for example, the requirements of the organization have changed such that the other technology better meets the organizational needs, or a company merger dictates a change). Fibre Channel is at 32/128Gb now. iSCSI is already in products at 100Gb, with 200/400Gb next, and so on. In short, Ethernet currently has a shorter speed upgrade cycle than FC. This is especially important now that SSDs have arrived on the scene. With the performance available from SSDs, the SAN is now the potential choke point. With the arrival of persistent memory, this problem can be exacerbated yet again, and there the choice of network architecture will be important. One of the reasons why people might switch has very little to do with the technology, but more to do with other ancillary reasons. For instance, iSCSI is something of an "outside-in" management paradigm, while Fibre Channel has more of an "inside-out" paradigm. That is, management is centralized in FC, whereas iSCSI has many more touch-points [link: http://brasstacksblog.typepad.com/brass-tacks/2012/02/fc-and-fcoe-versus-iscsi-network-centric-versus-end-node-centric-provisioning.html]. When it comes to consistency at scale, there are major differences in how each storage network handles management as well as performance. Likewise, if programmability and network ubiquity are more important, then Ethernet-based iSCSI is an appealing technology to consider.
Q. Are certain storage vendors recommending FC over iSCSI for performance reasons because of how their array software works?
A. Performance is not the only criterion, and vendors should be careful to assess their customers' needs before recommending one solution over another. If you feel that a vendor is proposing X because, well, X is all they have, then push back and insist on getting some facts that support their recommendation.

Q. Which is better for a backup solution?
A. Both FC and iSCSI can be used to back up data. If a backup array is emulating a tape library, this is usually easier to do with FC than with iSCSI. Keep in mind that many backup solutions will run their own protocol over Ethernet, without using either iSCSI or FC.

Q. I disagree that Ethernet is cheaper. If you look at the cost of the 10/25Gb SFP+/SFP28 transceivers required vs. 16/32Gb transceiver costs, the FC solution is on par or in some cases cheaper than Ethernet solutions. If you limit Ethernet to 10GBASE-T, then yes, it is cheaper.
A. This is part of comparing apples to apples (and not to pineapples). iSCSI solutions are typically available in a wider range of price choices from 1Gb to 100Gb speeds (there are more lower-cost solutions available with iSCSI than with FC). But when you compare environments with comparable features, the costs of each solution are typically similar. Note that 10/25Gb Ethernet supports DAC (direct-attach copper) cables over short distances (such as within a rack or between adjacent racks), which do not require separate transceivers.

Q. Do you know of a vendor that offers storage arrays with port speeds higher than 10Gbps? How are 50Gbps and 100Gbps Ethernet relevant if they are not available from storage vendors?
A. It's available now for top-of-rack switches and from flash storage startups, as well as a few large storage OEMs supporting 40GbE connections. Additional storage systems will adopt it when it becomes necessary to support greater than 1GB (that's a gigabyte!) per second of data movement from a single port, and most storage systems already offer multiples of 10Gbps ports on a single system. 100GbE iSCSI is in qualification now, and we expect there will be offerings from tier-1 storage OEMs later this year. Similarly, higher Fibre Channel port speeds are in the works. However, it's important to note that at the port level, speed is not the only consideration: port configuration becomes increasingly important (e.g., it is possible to aggregate Fibre Channel ports up to 16x the speed of each individual port; Ethernet aggregation is possible too, but it works differently).

Q. Why are there so few vendors in the FC space?
A. Historically, FC started with many vendors. Over the life of FC development, a fair number of mergers and acquisitions has reduced the number of vendors in this space. Today, there are two primary switch vendors and two primary adapter vendors.

Q. You talk about reliable, but how about stable and predictable?
A. Both FC and iSCSI networks can be very stable and predictable. Because FC-SAN is deployed only for storage and has fewer vendors with well-known configurations, it may be easier to achieve the highest levels of stability and predictability with less effort when using FC-SAN. iSCSI/Ethernet networks have more setup options and more diagnostic and reporting tools available, so it may be easier to monitor and manage iSCSI networks at large scale once they are configured.
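The "more than 1GB per second from a single port" point above is mostly unit conversion: divide the line rate in gigabits per second by 8 and allow something for protocol overhead. A quick sketch (the 10% overhead figure is an arbitrary round number for illustration):

```python
# Rough line-rate to throughput conversion for common Ethernet speeds.
# The 10% protocol-overhead allowance is an arbitrary illustrative figure.

def usable_gigabytes_per_sec(line_rate_gbps: float, overhead: float = 0.10) -> float:
    return line_rate_gbps / 8 * (1 - overhead)

for gbps in (10, 25, 40, 100):
    print(f"{gbps:>3} Gb/s link -> ~{usable_gigabytes_per_sec(gbps):4.1f} GB/s")
# A 10 Gb/s port tops out near a gigabyte per second, which is why storage
# systems that need more than 1 GB/s per port look to 25/40/100GbE.
```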
Q. On performance, what's the comparison on speed related to IOPS for FC vs. iSCSI?
A. IOPS is largely a function of latency and secondarily related to hardware offloads and bandwidth. For this reason, FC and iSCSI connections typically offer similar IOPS performance if they run at similar speeds and with similar amounts of hardware offload from the adapter or HBA. Benchmark reports showing very high IOPS performance for both iSCSI and Fibre Channel are available from third-party analysts.

Q. Are there fewer FC ports due to the high usage of blade chassis that share access, or due to more iSCSI usage?
A. It is correct that most blade servers use Ethernet (FCoE, iSCSI, or NFS), but this is a case of comparing apples and pineapples. FC ports are used for storage in a data center. Ethernet ports can be used for storage in a data center, but they are also used in laptops and desktops for e-mail and web browsing; wireless control of IoT (Internet of Things, e.g., light bulbs, thermostats, etc.); cars (yes, modern automobiles have their own Ethernet network); and many other things. So, if you compare the number of data center storage ports to the number of every other port used for every other type of network traffic, yes, there will be a smaller number associated with only the data center storage ports.

Q. Regarding iSCSI offload cards, we used to believe that software initiators were faster because they could leverage the fast chips in the server. Have iSCSI offload cards changed significantly in recent years?
A. This has traditionally been a function of the iSCSI initiator offload architecture. A full command offload solution tends to be slower since it executes the iSCSI stack in slow processor firmware on the NIC. A modern PDU-based solution (such as that supported by Open-iSCSI on Linux) only offloads the performance-critical operations to the silicon and has latency as low as the software initiator, and perhaps lower.

Q. I think one of the more important differences between FC and iSCSI is that a pure FC network is not routable whereas iSCSI is, because of the nature of the protocol stack each one relies on. Maybe in that sense iSCSI has an advantage, especially when we think of hybrid cloud scenarios that are increasingly common today. Am I right?
A. Routability is usually discussed in the context of the TCP/IP network layering model, i.e., how traffic moves through different Ethernet switches/routers and IP domains to get from the source to the destination. iSCSI is built on top of TCP/IP and hence benefits from interoperating with existing Ethernet switching/routing infrastructure, not requiring special gateways when leaving the data center, for example in the hybrid cloud case. The industry has also developed a standard to carry Fibre Channel over IP: FCIP. FCIP is routable, and it is already part of the FC-BB-5 standard that also includes FCoE.

Q. This is all good info, but this is all contingent on the back-end storage, inclusive of the storage array/SAN/NAS/server and disks/SSD/NVMe, actually being able to take advantage of the bandwidth. SAN vendors have been very slow to adopt these larger pipes.
A. New technologies have adoption curves, and to be frank, adoption up the network speed curve has been slow since 10Gbps. A lot of that is due to disk technologies; they haven't gotten much faster in the last decade (bigger, yes, but not faster, and it's difficult to drive a big expensive pipe efficiently with slow drives). Now with SSDs and NVMe (and other persistent memory technologies to come), device latency and bandwidth have become a big issue. That will drive the adoption not only of fatter pipes for bandwidth, but also of RDMA technologies to reduce latency.
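A rough way to see why IOPS tracks latency, as the first answer above states, is Little's law: sustained IOPS is approximately the number of outstanding commands divided by per-command latency. The queue depth and latency figures below are illustrative only:

```python
# Little's law view of the latency/IOPS relationship discussed above:
# IOPS ~= outstanding commands / per-command latency. Illustrative numbers.

def iops(queue_depth: int, latency_us: float) -> float:
    return queue_depth / (latency_us / 1_000_000.0)

for latency_us in (500.0, 100.0, 20.0):   # roughly disk -> SAS SSD -> NVMe class
    print(f"latency {latency_us:5.0f} us, QD=32 -> ~{iops(32, latency_us):>10,.0f} IOPS")
# Halving latency doubles IOPS at the same queue depth, which is why the
# transport's added latency matters more than raw bandwidth for small I/O.
```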
Q. What is a good source of performance metrics for data on CPU requirements for pushing/pulling data? This is in reference to the topic of "How can a server support 100Gb/s?"
A. Once 100Gb iSCSI is offloaded via special adapter cards, there should be no additional load imposed on the server beyond what any other 100Gb link would require. Websites of independent testing companies (e.g. Demartek) should provide specific information in this regard.

Q. What about iSCSI TLV?
A. This is a construct for placing iSCSI traffic on specific classes of service in a DCBX switch environment, which in turn is used when running a no-drop environment for iSCSI traffic; i.e., it's used for "lossless iSCSI." iSCSI TLV is a configuration setting, not a performance setting. All it does is allow an Ethernet switch to tell an adapter which Class of Service (CoS) setting it's using. However, this is actually not necessary for iSCSI, and in some cases [see e.g. https://blogs.cisco.com/datacenter/the-napkin-dialogues-lossless-iscsi] may actually be undesirable. iSCSI is built on TCP, inherits the reliability features of the underlying TCP layer, and does not need a DCBX infrastructure. In the case of hardware-offloaded iSCSI, if a loss is observed in the system, the TCP retransmissions happen at silicon speeds without perturbing the host software, and the resulting performance impact to the iSCSI traffic is insignificant. Further, Ethernet speeds have been rising rapidly and have been overcoming any need for any type of traffic pacing.

Q. How far away is standards-based NVMe over 100G Ethernet? Surely once 100GbE starts to support block storage applications, 128G FC becomes unattractive?
A. NVMe over Fabrics (NVMe-oF) is a protocol that is independent of the underlying transport network; that is, the protocol can accept any speed of the transport underneath. The key thing, then, is when you will find operating system support for running the protocol over faster transport speeds. For instance, NVMe-oF over 10/25/40/50/100G Ethernet is available with RHEL 7.4 and RHEL 7.5. NVMe-oF over high-speed Fibre Channel will be dependent upon the adapter manufacturers' schedules, as the qualification process is a bit more thorough. It may be challenging for FC to keep up with the Ethernet ecosystem, either in price or in the pace of introducing new speed bumps, due to the much larger Ethernet ecosystem, but the end-to-end qualification process and the ability to run multi-protocol deterministic storage on Fibre Channel networks often matter more than raw speed in practical use.

Q. Please comment on the differences/similarities from the perspective of troubleshooting issues.
A. Both Fibre Channel and iSCSI use similar troubleshooting techniques. Tools such as traceroute, ping, and others (the names may be different, but the functionality is the same) are common across both network types. Fibre Channel's troubleshooting tools are available at both the adapter level and the switch level, but since Fibre Channel has the concept of a fabric, many of the tools are system-wide. This allows many common steps to be taken in one centralized management location. Troubleshooting the TCP/IP layer of iSCSI is no different from troubleshooting the rest of TCP/IP that IT staff are used to, and standard debugging tools work. Troubleshooting the iSCSI layer is very similar to FC, since they both essentially appear as SCSI and offer essentially the same services.
Fibre Channel's troubleshooting tools are available at both the adapter level and the switch level, but since Fibre Channel has the concept of a fabric, many of the tools are system-wide. This allows for many common steps to be taken in one centralized management location. Troubleshooting of TCP/IP layer of iSCSI is no different than the rest of TCP/IP that the IT staff is used to and standard debugging tools work. Troubleshooting the iSCSI layer is very similar to FC since they both essentially appear as SCSI and essentially offer the same services. Q. Are TOE cards required today? A. TOE cards are not required today. TCP Offload Engines (TOEs) have both advantages and disadvantages. TOEs are more expensive than ordinary non-TOE Network Interface Chips (NICs). But, TOEs reduce the CPU overhead involved with network traffic. In some workloads, the extra CPU overhead of a normal NIC is not a problem, but in other heavy network workloads, the extra CPU overhead of the normal NIC reduces the amount of work that the system is able to perform, and the TOE provides an advantage (by freeing up extra CPU cycles to perform real work). For 10Gb, you can do without an offload card if you have enough host CPU cycles at your disposal, or in the case of a target, if you are not aggregating too many initiators, or are not using SSDs and do not need the high IOPs. At 40Gb and above, you will likely need offload assist in your system. Q. Are queue depths the same for both FC and iSCSI? Or are there any differences? A.  Conceptually, the concepts of queue depth are the same.   At the SCSI layer, queue depth is the number of commands that a LUN may be concurrently operated on.   When that number of outstanding commands is achieved, the LUN refuses to accept any additional commands (any additional commands are failed with the status of TASK SET FULL). As a SCSI layer concept, the queue depth is not impacted by the transport type (iSCSI or FC).   There is no relationship between this value and concepts such as FC Buffer Credits, or iSCSI R2T (Ready to Transfer).   In addition, some adapters have a limit on the number of outstanding commands that may be present at the adapter layer. As a result of interactions between the queue depth limits of an individual LUN, and the queue depths limits of the adapters, hosts often allow for administrative management of the queue depth.   This management enables a system administrator to balance the IO load across LUNs so that a single busy LUN does not consume all available adapter resources.   In this case, the queue depth value set at the host is used by the host as a limiter of the number of concurrent outstanding commands (rather than waiting for the LUN to report the TASK SET FULL status) Again, management of these queue depth values is independent of the transport type.   However, on some hosts, the management of queue depth may appear different (for example, the commands used to set a maximum queue depth for a LUN on an FC transport vs. a LUN on an iSCSI transport may be different). Q. Is VMware happy more with FC or ISCSI, assuming almost the same speed? What about the network delay in contrast with the FC protocol which (is/was faster)? A. Unfortunately, we can't comment on individual company's best practice recommendations. However, you can refer to VMware's Best Practices Guides for Fibre Channel and iSCSI: Best Practices for Fibre Channel Storage Best Practices For Running VMware vSphere on iSCSI   Q. 
Does iSCSI have true load balancing when Ethernet links are aggregated? Meaning the links share even loads? Can it be done across switches? I'm trying to achieve load balancing and redundancy at the same time. A. In most iSCSI software as well as hardware offload implementations load-balancing is supported using "multi-pathing" between a server and storage array which provides the ability to load-balance between paths when all paths are present and to handle failures of a path at any point between the server and the storage. Multi-pathing is also a de facto standard for load-balancing and high-availability in most Fibre Channel SAN environments. Q. Between FC and iSCSI, what are the typical workloads for one or the other? A.  It's important to remember that both Fibre Channel and iSCSI are block storage protocols. That is, for applications and workloads that require block storage, both Fibre Channel and iSCSI are relevant. From a connectivity standpoint, there is not much difference between the protocols at a high level – you have an initiator in the host, a switch in the middle, and a storage target at the other end. What becomes important, then, is topologies and architectures. Fibre Channel has a tightly-controlled oversubscription ratio, which is the number of hosts that we allow to access a single storage device (ratios can fall between, typically 4:1 to 20:1, depending on the application). iSCSI, on the other hand, has a looser relationship with oversubscription ratios, and can often be several dozen to 1 storage target. Q. For IPSEC for iSCSI, are there hardware offload capabilities to do the encryption/decryption in iSCSI targets available, or is it all done in software? A. Both hardware offload and software solutions are available. The tradeoffs are typically cost. With a software solution, you pay the cost in extra overhead in the CPU. If your CPU is not already busy, then that cost is very low (you may not even notice). If however, your CPU is busy, then the overhead of IPSEC will slow down your application from getting real work done. With the hardware offload solution, the cost is the extra $$ to purchase the hardware itself. On the upside, the newest CPUs offer new instructions for reducing the overhead of the software processing of various security protocols. Chelsio's T6 offers integrated IPSec and TLS offload. This encryption capability can be used either for data-at-rest purposes (independent of the network link), or can be used in conjunction with the iSCSI (but requires a special driver). The limitation of the special driver will be removed in the next generation. Q. For any of the instruction participants: Are there any dedicated FC/iSCSI detailed installation guides (for dummies) you use or recommend from any vendor? A.  No, there isn't a single set of installation guides, as the best practices vary by storage and network vendor. Your storage or network vendor is the best place to start. Q. If iSCSI is used in a shared mode, how is the performance? A. Assuming this refers to sharing the link (pipe), iSCSI software and hardware implementations may be configured to utilize a portion of the aggregate link bandwidth without affecting performance. Q. Any info on FCoE (Fibre Channel over Ethernet)? A.  There are additional talks on FCoE available from the SNIA site: On-demand webcasts: Blogs: In summary, FCoE is an encapsulation of the FC protocol into Ethernet packets that are carried over an Ethernet wire (without the use of TCP or IP). Q. 
What is FC's typical network latency in relation to storage access and compare to iSCSI? A.  For hardware-offloaded iSCSI, the latency is essentially the same since both stacks are processed at silicon speeds. Q. With 400Gbps Ethernet on the horizon, cloud providers and enterprises adopting Hyper-converged architectures based on Ethernet, isn't it finally death of FC, at least in the mainstream, with exception of some niche verticals, which also still run mainframes? A. No, tape is still with us, and its demise has been predicted for a long time. There are still good reasons for investing in FC; for example, sunk costs, traditional environments and applications, and the other advantages explained in the presentation. The above said, the ubiquity of the Ethernet ecosystem which drives features/performance/lower-cost has been and will continue to be a major challenge for FC. And so, the FC vs. iSCSI debate continues, Ready for another "Great Debate?" So are we, register now for our next live webcast "File vs. Block vs. Object" on April 17th. We hope to see you there!                        

Olivia Rhye

Product Manager, SNIA

Find a similar article by tags

Leave a Reply

Comments

Name

Email Adress

Website

Save my name, email, and website in this browser for the next time I comment.

Fibre Channel vs. iSCSI – The Great Debate Generates Questions Galore

AlexMcDonald

Mar 7, 2018

The SNIA Ethernet Storage Forum recently hosted the first of our “Great Debates” webcasts on Fibre Channel vs. iSCSI. The goal of this series is not to have a winner emerge, but rather to provide vendor-neutral education on the capabilities and use cases of these technologies so that attendees can become more informed and make educated decisions. And it worked! Over 1,200 people have viewed the webcast in the first three weeks! And the comments from attendees were exactly what we had hoped for:

“A good and frank discussion about the two technologies that don’t always need to compete!”

“Really nice and fair comparison guys. Always well moderated, you hit a lot of material in an hour. Thanks for your work!”

“Very fair and balanced overview of the two protocols.”

“Excellent coverage of the topic. I will have to watch it again.”

If you missed the webcast, you can watch it on-demand at your convenience and download a copy of the slides. The debate generated many good questions and our expert speakers have answered them all:

Q. What is RDMA?

A. RDMA is an acronym for Remote Direct Memory Access. It is part of a protocol through which memory addresses are exchanged between end points so that data is able to move directly from the memory in one end point over the network to the memory in the other end point, without involving the end point CPUs in the data transfer. Without RDMA, intermediate copies (sometimes multiple copies) of the data are made on the source and destination end points. RoCEv1, RoCEv2, iWARP, and InfiniBand are all protocols that are capable of performing RDMA transfers. iSER is iSCSI over RDMA and typically runs over iWARP or RoCE. SRP is a SCSI RDMA protocol that runs only over InfiniBand. FC uses hardware-based DMA to perform transfers without the need to make intermediate copies of the data; therefore RDMA is not needed for FC and does not apply to it.

Q. Can multi-pathing be used for load balancing or high availability?

A. Multi-pathing is used both for load balancing and for high availability. In an active-passive setup it is used only for high availability, while in an active-active setup it is used for both.
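
To make the multi-pathing discussion concrete, here is a minimal Linux dm-multipath configuration sketch (assuming the multipath-tools / device-mapper-multipath package is installed). The vendor and product strings are placeholders; real deployments should use the settings recommended by the array vendor.

    # /etc/multipath.conf -- minimal sketch, not a vendor-validated configuration
    defaults {
        user_friendly_names yes
        find_multipaths     yes
    }

    devices {
        device {
            vendor                "EXAMPLE"        # placeholder vendor string
            product               "EXAMPLE-ARRAY"  # placeholder product string
            path_grouping_policy  multibus         # spread I/O across all paths (load balancing)
            path_selector         "round-robin 0"  # alternate I/O among the active paths
            failback              immediate        # return to preferred paths once a failure clears
            no_path_retry         12               # queue I/O briefly while paths recover (high availability)
        }
    }

After reconfiguring multipathd, multipath -ll shows the resulting path groups; the same mechanism applies whether the underlying paths are iSCSI sessions or Fibre Channel logins.
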
Q. Some companies are structured so that iSCSI is handled by network services and the storage team supports FC, so there is storage and network overlap. Network people should be aware of storage and the reverse.

A. Correct, one of the big tradeoffs between iSCSI and FC may end up not being a technology tradeoff at all. In some environments, the political and organizational structure plays as much a part of the technology decision as the technology itself. Strong TCP/IP network departments may demand that they manage everything, or they may demand that storage traffic be kept as far from their network as possible. Strong storage network departments may demand their own private networks (either TCP/IP for iSCSI, or FC). In the end, the politics may play as important a role in the decision of iSCSI vs. FC as the actual technology itself.

Q. If you have an established storage network (i.e. FC/iSCSI), is there a compelling reason you would switch?

A. Typically, installations grow by adding to their existing configuration (iSCSI installations typically add more iSCSI, and FC installations add more FC). Switching from one technology to another may occur for various reasons (for example, the requirements of the organization have changed such that the other technology better meets the organizational needs, or a company merger dictates a change). Fibre Channel is at 32/128Gb now. iSCSI is already in products at 100Gb, with 200/400Gb next, and so on. In short, Ethernet currently has a shorter speed upgrade cycle than FC. This is especially important now that SSDs have arrived on the scene. With the performance available from SSDs, the SAN is now the potential choke point. With the arrival of Persistent Memory, this problem can be exacerbated yet again, and there the choice of network architecture will be important. One of the reasons why people might switch has very little to do with the technology, but more to do with other ancillary reasons. For instance, iSCSI is something of an "outside-in" management paradigm, while Fibre Channel has more of an "inside-out" paradigm. That is, management is centralized in FC, where iSCSI has many more touch-points [link: http://brasstacksblog.typepad.com/brass-tacks/2012/02/fc-and-fcoe-versus-iscsi-network-centric-versus-end-node-centric-provisioning.html]. When it comes to consistency at scale, there are major differences in how each storage network handles management as well as performance. Likewise, if programmability and network ubiquity are more important, then Ethernet-based iSCSI is an appealing technology to consider.

Q. Are certain storage vendors recommending FC over iSCSI for performance reasons because of how their array software works?

A. Performance is not the only criterion, and vendors should be careful to assess their customers' needs before recommending one solution over another. If you feel that a vendor is proposing X because, well, X is all they have, then push back and insist on getting some facts that support their recommendation.

Q. Which is better for a backup solution?

A. Both FC and iSCSI can be used to back up data. If a backup array is emulating a tape library, this is usually easier to do with FC than with iSCSI. Keep in mind that many backup solutions will run their own protocol over Ethernet, without using either iSCSI or FC.

Q. I disagree that Ethernet is cheaper. If you look at the cost of the 10/25Gb SFP+/SFP28 transceivers required vs. 16/32Gb transceiver costs, the FC solution is on par or, in some cases, cheaper than Ethernet solutions. If you limit Ethernet to 10GBASE-T, then yes, it is cheaper.

A. This is part of comparing apples to apples (and not to pineapples). iSCSI solutions typically are available in a wider range of price choices, from 1Gb to 100Gb speeds (there are more lower-cost solutions available with iSCSI than with FC). But when you compare environments with comparable features, typically the costs of each solution are similar. Note that 10/25Gb Ethernet supports DAC (direct-attach copper) cables over short distances (such as within a rack or between adjacent racks), which do not require separate transceivers.

Q. Do you know of a vendor that offers storage arrays with port speeds higher than 10Gb/s? How is 50Gb/s and 100Gb/s Ethernet relevant if it's not available from storage vendors?

A. It's available now for top-of-rack switches and from flash storage startups, as well as a few large storage OEMs supporting 40GbE connections. Additional storage systems will adopt it when it becomes necessary to support greater than 1GB (that's a gigabyte!) per second of data movement from a single port, and most storage systems already offer multiples of 10Gbps ports on a single system. 100GbE iSCSI is in qualification now and we expect there will be offerings from tier-1 storage OEMs later this year. Similarly, higher-speed Fibre Channel port speeds are in the works. However, it's important to note that at the port level, speed is not the only consideration: port configuration becomes increasingly important (e.g., it is possible to aggregate Fibre Channel ports up to 16x the speed of each individual port; Ethernet aggregation is possible too, but it works differently).

Q. Why are there so few vendors in the FC space?

A. Historically, FC started with many vendors. Over the life of FC development, a fair number of mergers and acquisitions has reduced the number of vendors in this space. Today, there are 2 primary switch vendors and 2 primary adapter vendors.
Q. You talk about reliable, but how about stable and predictable?

A. Both FC and iSCSI networks can be very stable and predictable. Because FC-SAN is deployed only for storage and has fewer vendors with well-known configurations, it may be easier to achieve the highest levels of stability and predictability with less effort when using FC-SAN. iSCSI/Ethernet networks have more setup options and more diagnostic or reporting tools available, so it may be easier to monitor and manage iSCSI networks at large scale once they are configured.

Q. On performance, how do FC and iSCSI compare on speed related to IOPS?

A. IOPS is largely a function of latency and secondarily related to hardware offloads and bandwidth. For this reason, FC and iSCSI connections typically offer similar IOPS performance if they run at similar speeds and with similar amounts of hardware offload from the adapter or HBA. Benchmark reports showing very high IOPS performance for both iSCSI and Fibre Channel are available from 3rd-party analysts.

Q. Are there fewer FC ports due to the high usage of blade chassis that share access, or due to more iSCSI usage?

A. It is correct that most blade servers use Ethernet (FCoE, iSCSI, or NFS), but this is a case of comparing apples and pineapples. FC ports are used for storage in a data center. Ethernet ports can be used for storage in a data center, but they are also used in laptops and desktops for e-mail and web browsing; wireless control of IoT (Internet of Things; e.g., light bulbs, thermostats, etc.); cars (yes, modern automobiles have their own Ethernet network); and many other things. So, if you compare the number of data center storage ports to the number of every other port used for every other type of network traffic, yes, there will be a smaller number associated with only the data center storage ports.

Q. Regarding iSCSI offload cards, we used to believe that software initiators were faster because they could leverage the fast chips in the server. Have iSCSI offload cards changed significantly in recent years?

A. This has traditionally been a function of the iSCSI initiator offload architecture. A full-command offload solution tends to be slower since it executes the iSCSI stack in slow processor firmware on the NIC. A modern PDU-based solution (such as that supported by Open-iSCSI on Linux) offloads only the performance-critical processing to the silicon and is just as low latency as the software initiator, and perhaps lower.
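
Whichever offload architecture a NIC uses, Linux sessions are typically managed with the same Open-iSCSI tooling, with offloaded initiators showing up as additional iSCSI interfaces. Below is a small automation sketch around iscsiadm; the portal address and IQN are placeholders, and the open-iscsi package is assumed to be installed.

    import subprocess

    PORTAL = "192.0.2.10:3260"                  # placeholder target portal
    TARGET = "iqn.2018-03.com.example:array1"   # placeholder target IQN

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Discover targets exposed by the portal (SendTargets discovery).
    run(["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", PORTAL])

    # Log in to the discovered target; add "-I <iface>" to bind the session
    # to a specific (possibly hardware-offloaded) iSCSI interface.
    run(["iscsiadm", "-m", "node", "-T", TARGET, "-p", PORTAL, "--login"])

    # Show active sessions and their negotiated parameters.
    run(["iscsiadm", "-m", "session", "-P", "3"])
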
Q. I think one of the more important differences between FC and iSCSI is that a pure FC network is not routable whereas iSCSI is, because of the nature of the protocol stack each one relies on. Maybe in that sense iSCSI has an advantage, especially when we think of hybrid cloud scenarios, which are increasingly common today. Am I right?

A. Routability is usually discussed in the context of the TCP/IP network layering model, i.e. how traffic moves through different Ethernet switches/routers and IP domains to get from the source to the destination. iSCSI is built on top of TCP/IP and hence benefits from interoperating with existing Ethernet switching/routing infrastructure, without requiring special gateways when leaving the data center (for example, in the hybrid cloud case). The industry has already developed another standard to carry Fibre Channel over IP: FCIP. FCIP is routable, and it is already part of the FC-BB-5 standard that also includes FCoE.

Q. This is all good info, but this is all contingent on the back-end storage, inclusive of the storage array/SAN/NAS/server and disks/SSD/NVMe, actually being able to take advantage of the bandwidth. SAN vendors have been very slow to adopt these larger pipes.

A. New technologies have adoption curves, and to be frank, adoption up the network speed curve has been slow since 10Gbps. A lot of that is due to disk technologies; they haven't gotten that much faster in the last decade (bigger, yes, but not faster; it's difficult to drive a big, expensive pipe efficiently with slow drives). Now with SSD and NVMe (and other persistent memory technologies to come), device latency and bandwidth have become a big issue. That will drive the adoption not only of fatter pipes for bandwidth, but also of RDMA technologies to reduce latency.

Q. What is a good source of performance metrics for data on CPU requirements for pushing/pulling data? This is in reference to the topic of "How can a server support 100 Gb/s?"

A. Once 100Gb iSCSI is offloaded via special adapter cards, there should be no more load imposed on the server than any other 100Gb link would require. Websites of independent testing companies (e.g. Demartek) should provide specific information in this regard.

Q. What about iSCSI TLV?

A. This is a construct for placing iSCSI traffic on specific classes of service in a DCBX switch environment, which in turn is used when creating a no-drop environment for iSCSI traffic; i.e., it's used for "lossless iSCSI." iSCSI TLV is a configuration setting, not a performance setting. All it does is allow an Ethernet switch to tell an adapter which Class of Service (CoS) setting it's using. However, this is actually not necessary for iSCSI, and in some cases [see e.g. https://blogs.cisco.com/datacenter/the-napkin-dialogues-lossless-iscsi] may actually be undesirable. iSCSI is built on TCP; it inherits the reliability features of the underlying TCP layer and does not need a DCBX infrastructure. In the case of hardware-offloaded iSCSI, if a loss is observed in the system, the TCP retransmissions happen at silicon speeds without perturbing the host software, and the resulting performance impact to the iSCSI traffic is insignificant. Further, Ethernet speeds have been rising rapidly and have been overcoming any need for any type of traffic pacing.

Q. How far away is standards-based NVMe over 100G Ethernet? Surely once 100GE starts to support block storage applications, is 128G FC now unattractive?

A. NVMe over Fabrics (NVMe™-oF) is a protocol that is independent of the underlying transport network. That is, the protocol can accept any speed of the transport underneath. The key thing, then, is when you will find operating system support for running the protocol with faster transport speeds. For instance, NVMe-oF over 10/25/40/50/100G Ethernet is available with RHEL 7.4 and RHEL 7.5. NVMe-oF over high-speed Fibre Channel will be dependent upon the adapter manufacturers' schedules, as the qualification process is a bit more thorough. It may be challenging for FC to keep up with the Ethernet ecosystem, either in price or in the pace of introducing new speed bumps, due to the much larger Ethernet ecosystem, but the end-to-end qualification process and the ability to run multi-protocol deterministic storage with Fibre Channel networks often outweigh raw speed in practical use.
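
For readers who want to see what NVMe-oF over Ethernet looks like operationally, here is a minimal sketch that drives nvme-cli over an RDMA transport; the target address and NQN are placeholders, and it assumes nvme-cli and the nvme-rdma kernel module are available.

    import subprocess

    TRADDR = "192.0.2.20"                              # placeholder target IP address
    NQN = "nqn.2018-03.com.example:nvme-subsystem-1"   # placeholder subsystem NQN

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Load the RDMA transport for the NVMe-oF host (RoCE- or iWARP-capable NIC assumed).
    run(["modprobe", "nvme-rdma"])

    # Query the target's discovery controller for exported subsystems.
    run(["nvme", "discover", "-t", "rdma", "-a", TRADDR, "-s", "4420"])

    # Connect to a subsystem; its namespaces appear as /dev/nvmeXnY block devices.
    run(["nvme", "connect", "-t", "rdma", "-n", NQN, "-a", TRADDR, "-s", "4420"])

Once connected, nvme list shows the remote namespaces alongside any local NVMe drives.
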
Q. Please comment on the differences/similarities from the perspective of troubleshooting issues.

A. Both Fibre Channel and iSCSI use similar troubleshooting techniques. Tools such as traceroute, ping, and others (the names may be different, but the functionality is the same) are common across both network types. Fibre Channel's troubleshooting tools are available at both the adapter level and the switch level, but since Fibre Channel has the concept of a fabric, many of the tools are system-wide. This allows many common steps to be taken in one centralized management location. Troubleshooting the TCP/IP layer of iSCSI is no different from troubleshooting the rest of the TCP/IP that IT staff are used to, and standard debugging tools work. Troubleshooting the iSCSI layer is very similar to FC, since both essentially appear as SCSI and offer the same services.

Q. Are TOE cards required today?

A. TOE cards are not required today. TCP Offload Engines (TOEs) have both advantages and disadvantages. TOEs are more expensive than ordinary non-TOE Network Interface Chips (NICs), but TOEs reduce the CPU overhead involved with network traffic. In some workloads, the extra CPU overhead of a normal NIC is not a problem, but in other heavy network workloads, the extra CPU overhead of the normal NIC reduces the amount of work that the system is able to perform, and the TOE provides an advantage (by freeing up extra CPU cycles to perform real work). For 10Gb, you can do without an offload card if you have enough host CPU cycles at your disposal, or, in the case of a target, if you are not aggregating too many initiators, or are not using SSDs and do not need the high IOPS. At 40Gb and above, you will likely need offload assist in your system.

Q. Are queue depths the same for both FC and iSCSI? Or are there any differences?

A. Conceptually, queue depth works the same way for both. At the SCSI layer, queue depth is the number of commands that a LUN may be concurrently operating on. When that number of outstanding commands is reached, the LUN refuses to accept any additional commands (any additional commands are failed with the status of TASK SET FULL). As a SCSI-layer concept, the queue depth is not impacted by the transport type (iSCSI or FC). There is no relationship between this value and concepts such as FC Buffer Credits or iSCSI R2T (Ready to Transfer). In addition, some adapters have a limit on the number of outstanding commands that may be present at the adapter layer. As a result of interactions between the queue depth limits of an individual LUN and the queue depth limits of the adapters, hosts often allow for administrative management of the queue depth. This management enables a system administrator to balance the I/O load across LUNs so that a single busy LUN does not consume all available adapter resources. In this case, the queue depth value set at the host is used by the host as a limiter of the number of concurrent outstanding commands (rather than waiting for the LUN to report the TASK SET FULL status). Again, management of these queue depth values is independent of the transport type. However, on some hosts the management of queue depth may appear different (for example, the commands used to set a maximum queue depth for a LUN on an FC transport vs. a LUN on an iSCSI transport may be different).
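
As a small illustration of host-side queue depth management, the sketch below reads and adjusts the per-LUN queue depth that Linux exposes through sysfs; /dev/sdb is a placeholder device, and the same attribute appears whether the LUN arrived over FC or iSCSI.

    from pathlib import Path

    # Placeholder block device; substitute the LUN you want to inspect.
    lun = Path("/sys/block/sdb/device/queue_depth")

    current = int(lun.read_text())
    print(f"current queue depth: {current}")

    # Lower the limit so one busy LUN cannot monopolize adapter resources
    # (requires root; the right value depends on the array and workload).
    lun.write_text("32\n")
    print(f"new queue depth: {int(lun.read_text())}")
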
Q. Is VMware happier with FC or iSCSI, assuming almost the same speed? What about the network delay, in contrast with the FC protocol, which is (or was) faster?

A. Unfortunately, we can't comment on an individual company's best practice recommendations. However, you can refer to VMware's Best Practices Guides for Fibre Channel and iSCSI: Best Practices for Fibre Channel Storage and Best Practices For Running VMware vSphere on iSCSI.

Q. Does iSCSI have true load balancing when Ethernet links are aggregated, meaning the links share even loads? Can it be done across switches? I'm trying to achieve load balancing and redundancy at the same time.

A. In most iSCSI software as well as hardware offload implementations, load balancing is supported using "multi-pathing" between a server and a storage array, which provides the ability to load-balance between paths when all paths are present and to handle failures of a path at any point between the server and the storage. Multi-pathing is also a de facto standard for load balancing and high availability in most Fibre Channel SAN environments.

Q. Between FC and iSCSI, what are the typical workloads for one or the other?

A. It's important to remember that both Fibre Channel and iSCSI are block storage protocols. That is, for applications and workloads that require block storage, both Fibre Channel and iSCSI are relevant. From a connectivity standpoint, there is not much difference between the protocols at a high level: you have an initiator in the host, a switch in the middle, and a storage target at the other end. What becomes important, then, is topologies and architectures. Fibre Channel has a tightly controlled oversubscription ratio, which is the number of hosts that are allowed to access a single storage device (ratios typically fall between 4:1 and 20:1, depending on the application). iSCSI, on the other hand, has a looser relationship with oversubscription ratios, and can often run several dozen hosts to one storage target.
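
To make the oversubscription arithmetic concrete, here is a small worked example with hypothetical port counts and speeds; the host-count ratio corresponds to the 4:1 to 20:1 guidance above, and the bandwidth-weighted fan-in is a common refinement.

    # Hypothetical FC fan-in example: 20 hosts, each with one 16GFC HBA port,
    # all zoned to a storage array exposing 4 x 32GFC target ports.
    hosts, host_port_speed = 20, 16          # Gb/s per host port (hypothetical)
    target_ports, target_port_speed = 4, 32  # Gb/s per array port (hypothetical)

    host_fan_in = hosts / target_ports
    bandwidth_fan_in = (hosts * host_port_speed) / (target_ports * target_port_speed)

    print(f"hosts per target port:     {host_fan_in:.0f}:1")       # 5:1
    print(f"bandwidth-weighted fan-in: {bandwidth_fan_in:.1f}:1")  # 2.5:1
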
Q. For IPsec for iSCSI, are there hardware offload capabilities to do the encryption/decryption in iSCSI targets, or is it all done in software?

A. Both hardware offload and software solutions are available. The tradeoff is typically cost. With a software solution, you pay the cost in extra overhead in the CPU. If your CPU is not already busy, then that cost is very low (you may not even notice). If, however, your CPU is busy, then the overhead of IPsec will slow down your application from getting real work done. With the hardware offload solution, the cost is the extra money to purchase the hardware itself. On the upside, the newest CPUs offer new instructions for reducing the overhead of the software processing of various security protocols. Chelsio's T6 offers integrated IPsec and TLS offload. This encryption capability can be used either for data-at-rest purposes (independent of the network link) or in conjunction with iSCSI (but requires a special driver). The limitation of the special driver will be removed in the next generation.

Q. For any of the participants: Are there any dedicated FC/iSCSI detailed installation guides (for dummies) you use or recommend from any vendor?

A. No, there isn't a single set of installation guides, as the best practices vary by storage and network vendor. Your storage or network vendor is the best place to start.

Q. If iSCSI is used in a shared mode, how is the performance?

A. Assuming this refers to sharing the link (pipe), iSCSI software and hardware implementations may be configured to utilize a portion of the aggregate link bandwidth without affecting performance.

Q. Any info on FCoE (Fibre Channel over Ethernet)?

A. There are additional on-demand webcasts and blog posts on FCoE available from the SNIA site. In summary, FCoE is an encapsulation of the FC protocol into Ethernet packets that are carried over an Ethernet wire (without the use of TCP or IP).

Q. What is FC's typical network latency in relation to storage access, and how does it compare to iSCSI?

A. For hardware-offloaded iSCSI, the latency is essentially the same, since both stacks are processed at silicon speeds.

Q. With 400Gbps Ethernet on the horizon, and cloud providers and enterprises adopting hyper-converged architectures based on Ethernet, isn't it finally the death of FC, at least in the mainstream, with the exception of some niche verticals, which also still run mainframes?

A. No; tape is still with us, and its demise has been predicted for a long time. There are still good reasons for investing in FC; for example, sunk costs, traditional environments and applications, and the other advantages explained in the presentation. That said, the ubiquity of the Ethernet ecosystem, which drives features, performance, and lower cost, has been and will continue to be a major challenge for FC.

And so, the FC vs. iSCSI debate continues. Ready for another "Great Debate?" So are we. Register now for our next live webcast, "File vs. Block vs. Object," on April 17th. We hope to see you there!


Ethernet Networked Storage – FAQ

Fred Zhang

Dec 8, 2016


At our SNIA Ethernet Storage Forum (ESF) webcast “Re-Introduction to Ethernet Networked Storage,” we provided a solid foundation on Ethernet networked storage, the move to higher speeds, challenges, use cases and benefits. Here are answers to the questions we received during the live event.

Q. Within the iWARP protocol there is a layer called MPA (Marker PDU Aligned Framing for TCP) inserted for storage applications. What is the point of this protocol?

A. MPA is an adaptation layer between the iWARP Direct Data Placement Protocol and TCP/IP. It provides framing and CRC protection for Protocol Data Units.  MPA enables packing of multiple small RDMA messages into a single Ethernet frame.  It also enables an iWARP NIC to place frames received out-of-order (instead of dropping them), which can be beneficial on best-effort networks. More detail can be found in IETF RFC 5044 and IETF RFC 5041.

Q. What is the API for RDMA network IPC?

A. The general API for RDMA is called verbs. The OpenFabrics Verbs Working Group oversees the development of the verbs definition and functionality in the OpenFabrics Software (OFS) code. You can find the training content from the OpenFabrics Alliance here. General information about RDMA over Converged Ethernet (RoCE) is available at the InfiniBand Trade Association website. Information about the Internet Wide Area RDMA Protocol (iWARP) can be found at the IETF: RFC 5040, RFC 5041, RFC 5042, RFC 5043, RFC 5044.
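
For a sense of what the verbs API looks like from user space, below is a minimal, hedged sketch using the pyverbs Python bindings that ship with rdma-core; the device name mlx5_0 is a placeholder, and class details may vary between rdma-core releases.

    # Minimal pyverbs sketch (rdma-core assumed installed); not a complete RDMA application.
    import pyverbs.device as d
    import pyverbs.enums as e
    from pyverbs.pd import PD
    from pyverbs.mr import MR

    for dev in d.get_device_list():
        print(dev.name)                      # list the RDMA-capable devices on this host

    with d.Context(name='mlx5_0') as ctx:    # open the (placeholder) device
        with PD(ctx) as pd:                  # protection domain that owns the resources
            # Register a 4 KiB buffer so the NIC can DMA into it directly.
            mr = MR(pd, 4096, e.IBV_ACCESS_LOCAL_WRITE)
            print(mr.lkey, mr.rkey)          # keys a peer would use to address this memory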

Q. RDMA requires TCP/IP (iWARP), InfiniBand, or RoCE to operate on with respect to NVMe over Fabrics. Therefore, what are the advantages and disadvantages of iWARP vs. RoCE?

A. Both RoCE and iWARP support RDMA over Ethernet. iWARP uses TCP/IP while RoCE uses UDP/IP. Debating which one is better is beyond the scope of this webcast, but you can learn more by watching the SNIA ESF webcast, “How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics.”

Q. 100Gb Ethernet Optical Data Center solution?

A. 100Gb Ethernet optical interconnect products were first available around 2011 or 2012 in a 10x10Gb/s design (100GBASE-CR10 for copper, 100GBASE-SR10 for optical), which required thick cables and a CXP or CFP MSA housing. These were generally used only for switch-to-switch links. Starting in late 2015, the more compact 4x25Gb/s design (using the QSFP28 form factor) became available in copper (DAC), optical cabling (AOC), and transceivers (100GBASE-SR4, 100GBASE-LR4, 100GBASE-PSM4, etc.). The optical transceivers allow 100GbE connectivity up to 100m, or 2km and 10km distances, depending on the type of transceiver and fiber used.

Q. Where is FCoE being used today?

A. FCoE is primarily used in blade server deployments where there could be contention for PCI slots and only one built-in NIC. These NICs typically support FCoE at 10Gb/s speeds, passing both FC and Ethernet traffic via a connection to a Top-of-Rack FCoE switch, which directs traffic to the respective fabrics (FC and Ethernet). However, it has not gained much acceptance outside of the blade server use case.

Q. Why did iSCSI start out mostly in lower-cost SAN markets?

A. When it first debuted, iSCSI packets were processed by software initiators, which consumed CPU cycles and showed higher latency than Fibre Channel. Achieving high performance with iSCSI required expensive NICs with iSCSI hardware acceleration, and iSCSI networks were typically limited to 100Mb/s or 1Gb/s while Fibre Channel was running at 4Gb/s. Fibre Channel is also a lossless protocol, while TCP/IP is lossy, which caused concerns for storage administrators. Now, however, iSCSI can run on 25, 40, 50 or 100Gb/s Ethernet with various types of TCP/IP acceleration or RDMA offloads available on the NICs.

Q. What are some of the differences between iSCSI and FCoE?

A. iSCSI runs SCSI protocol commands over TCP/IP (except iSER which is iSCSI over RDMA) while FCoE runs Fibre Channel protocol over Ethernet. iSCSI can run over layer 2 and 3 networks while FCoE is Layer 2 only. FCoE requires a lossless network, typically implemented using DCB (Data Center Bridging) Ethernet and specialized switches.

Q. You pointed out that at least twice that people incorrectly predicted the end of Fibre Channel, but it didn’t happen. What makes you say Fibre Channel is actually going to decline this time?

A. Several things are different this time. First, Ethernet is now much faster than Fibre Channel instead of the other way around. Second, Ethernet networks now support lossless and RDMA options that were not previously available. Third, several new solutions–like big data, hyper-converged infrastructure, object storage, most scale-out storage, and most clustered file systems–do not support Fibre Channel. Fourth, none of the hyper-scale cloud implementations use Fibre Channel and most private and public cloud architects do not want a separate Fibre Channel network–they want one converged network, which is usually Ethernet.

Q. Which storage protocols support RDMA over Ethernet?

A. The Ethernet RDMA options for storage protocols are iSER (iSCSI Extensions for RDMA), SMB Direct, NVMe over Fabrics, and NFS over RDMA. There are also storage solutions that use proprietary protocols supporting RDMA over Ethernet.