Centralized vs. Distributed Storage FAQ

J Metz

Oct 2, 2018

To date, thousands have watched our "Great Storage Debate" webcast series. Our most recent installment of this friendly debate (where no technology actually emerges as a "winner") was Centralized vs. Distributed Storage. If you missed it, it's now available on-demand. The live event generated several excellent questions, which our expert presenters have thoughtfully answered here:

Q. Which performs faster, centralized or distributed storage?

A. The answer depends on the type of storage, the type of connections to the storage, and whether the compute is distributed or centralized. The stereotype is that centralized storage performs faster if the compute is local, that is, if it's in the same data center as the centralized storage. Distributed storage often uses different (less expensive) storage media and is designed for slower WAN connections, but it doesn't have to be so. Distributed storage can be built with the fastest storage and connected with the fastest networking, but it rarely is used that way. It can also outperform centralized storage if the compute is distributed in a similar way to the distributed storage, letting each compute node access the data from a local node of the distributed storage.

Q. What about facilities costs in either environment? Ultimately the data has to physically "land" somewhere and use power/cooling/floor space. There is an economy of scale in centralized data centers; how does that compare with distributed?

A. One big difference is the cost of power between various data centers. Typically, data centers tend to be in the places where businesses have had traditional office space and accommodation for staff. Unfortunately, these are also areas of power scarcity and are consequently expensive to run. Distributed data centers can be in much cheaper locations; there are a number, for instance, in Iceland, where geothermally generated electricity is very cheap and environmental cooling is effectively free. Plus, the thermal cost per byte can be substantially lower in distributed data centers by efficiently packing drives to near capacity with compressed data. Learn more about data centers in Iceland here. Another difference is that distributed storage might consume less space if its data protection method (such as erasure coding) is more efficient than the data protection method used by centralized storage (typically RAID or triple replication); for example, triple replication stores three raw bytes for every usable byte, while a 10+4 erasure code stores only 1.4. While centralized storage can also use erasure coding, compression, and deduplication, it's sometimes easier to apply these storage efficiency technologies to distributed storage.

Q. What is sharding?

A. Sharding is the process of breaking up, typically, a database into a number of partitions and then putting these pieces, or shards, on separate storage devices or systems. The partitioning is normally a horizontal partition; that is, the rows of the database remain complete in a shard, and some criterion (often a key range) is used to define each shard. Sharding is often used to improve performance, as the data is spread across multiple devices which can be accessed in parallel. Sharding should not be confused with erasure coding used for data protection. Although erasure coding also breaks data into smaller pieces and spreads it across multiple devices, each fragment is encoded and the data can only be understood once a minimum number of the fragments have been read and reconstituted on the system that requested it.
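
As a simple illustration of the horizontal, key-range sharding described above, here is a minimal sketch. The shard names, key ranges, and row layout are made up for illustration only; they are not taken from the webcast.

```python
# Minimal sketch of horizontal (key-range) sharding.
# The shard names, key ranges, and row layout below are illustrative assumptions.

SHARD_RANGES = [
    ("shard-0", 0, 1_000_000),          # rows with keys 0 .. 999,999
    ("shard-1", 1_000_000, 2_000_000),
    ("shard-2", 2_000_000, 3_000_000),
]

def shard_for_key(key: int) -> str:
    """Return the shard that owns a given row key."""
    for name, low, high in SHARD_RANGES:
        if low <= key < high:
            return name
    raise KeyError(f"key {key} is outside all shard ranges")

# Each complete row lands on exactly one shard (horizontal partitioning),
# so queries on different key ranges can run against different devices in parallel.
rows = [{"id": 42, "name": "a"}, {"id": 1_500_000, "name": "b"}]
shards: dict[str, list[dict]] = {}
for row in rows:
    shards.setdefault(shard_for_key(row["id"]), []).append(row)

print(shards)  # {'shard-0': [{'id': 42, ...}], 'shard-1': [{'id': 1500000, ...}]}
```
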
Q. What is the preferred or recommended choice of NVMe over Fabrics (NVMe-oF) for centralized vs. distributed storage systems for prioritized use-case scenarios such as data integrity, latency, number of retries for read-write/resource utilization?

A. This is a straightforward cost vs. performance question. This kind of solution only makes sense if the compute is very close to the data, so either a centralized SAN or a (well-defined) distributed system in one location with co-located compute would make sense. Geographically dispersed data centers or compute on remote data add too much latency, and bandwidth issues can often add to the cost.

Q. Is there a document that has catalogued the impact of latency on the many data types? When designing storage, I would start with how much latency an application can withstand.

A. We are not aware of any single document that has done so, but many applications (along with their vendors, integrators, and users) have documented their storage bandwidth and latency needs. Other documents show the impact of differing storage latencies on application performance. Generally speaking, one could say the following about latency requirements, though exceptions exist to each one:
  • Block storage wants lower latency than file storage, which wants lower latency than object storage
  • Large I/O and sequential workloads tolerate latency better than small I/O and random workloads
  • One-way streaming media, backup, monitoring and asynchronous replication care more about bandwidth than latency. Two-way streaming (e.g. videoconferencing or IP telephony), database updates, interactive monitoring, and synchronous replication care more about latency than bandwidth.
  • Real-time applications (remote control surgery, multi-person gaming, remote AR/VR, self-driving cars, etc.) require lower latency than non-real-time ones, especially if the real-time interaction goes both ways on the link.
One thing to note is that many factors affect the performance of a storage system. You may want to take a look at our excellent Performance Benchmark webinar series to find out more.

Q. Computation faces an analogous debate between distributed compute vs. centralized compute. Please comment on how the computation debate relates to the storage debate. Typically, distributed computation will work best with distributed storage. Ditto for centralized computation and storage. Are there important applications where a user would go for centralized compute and distributed storage? Or distributed compute and centralized storage?

A. That's a very good question, to which there is a range of not so very good answers! Here are some application scenarios that require different thinking about centralized vs. distributed storage. Video surveillance is best with distributed storage (and perhaps a little local compute to do things like motion detection or object recognition) combined with centralized compute (for doing object identification or consolidation of multiple feeds). Robotics requires lots of distributed compute; think of self-driving cars, where the analysis of a scene and the motion of the vehicle needs to be done locally, but where all the data on traffic volumes and road conditions needs multiple data sources to be processed centrally. There are lots of other (often less exciting but just as important) applications that have similar requirements: retail food sales with smart checkouts (that part is all local) and stock management and shipping systems (that part is heavily centralized). In essence, sometimes it's easier to process the data where it's born rather than move it somewhere else. Data is "sticky," and that sometimes dictates that the compute should be where the data lies. Equally, it's also true that the only way of making sense of distributed data is to centralize it; weather stations can't do weather forecasting, so the data needs to be unstuck, collected up and transmitted, and then computed centrally.

We hope you enjoyed this unbiased, vendor-neutral debate. You can check out the others in this series below. Follow us @SNIAESF for more upcoming webcasts.

An Introduction: What is Swordfish?

Barry Kittner

Oct 1, 2018

To understand Swordfish, let’s start with the basics to examine how modern data centers are managed.

A user of a PC/notebook is assumed to be in control of that PC. What happens when there are two? Or 10? Or 1,000? Today’s modern data centers can have 100,000 computers (servers) or more! That requires the ability to control or “manage” them from a central location. How does one do that? It is done via a protocol that enables remote management; today that standard is IPMI, an acronym for Intelligent Platform Management Interface, which has existed for 20 years. Among the issues with IPMI is that the scale of today’s data centers was not fully envisioned 20 years ago, so some of the components of IPMI cannot cover the tens of thousands of servers it is expected to manage. The developers also did not foresee the stringent security and increased privacy requirements expected in modern data centers.

The DMTF created, and continues to improve upon, a modern alternative standard for remote or centralized management of data centers called Redfish®. For those familiar with server management, Redfish is referred to as “schema-based,” meaning that engineers have carefully organized many different categories of information as well as the relationships between them. Schemas are structured to manage the millions of bits of information and operating characteristics that data centers create and report on a continuous basis and that managers monitor to understand the status of the datacenter. In this way, information on the operational parameters of the machines in the data center is provided, when and where needed, in a consistent, organized and reliable way.

Unlike IPMI, the new Redfish standard uses modern tools, allowing it to scale to the size of today’s data centers. Redfish produces output that is readable by datacenter operators, works across the wide variety of servers and datacenter equipment that exists today, and is extensible for the new hardware of tomorrow.

The Storage Networking Industry Association (SNIA) is a global non-profit organization dedicated to developing standards and education programs to advance storage and information technology. SNIA created the Storage Management Initiative Specification (SMI-S), currently in use in datacenters to manage interoperable storage. SNIA immediately recognized the value of the new Redfish standard and created SNIA Swordfish™, which is an extension to Redfish that seamlessly manages storage equipment and storage services in addition to the server management of Redfish. Just as most PCs have one or more storage devices, so do most servers in datacenters, and Swordfish can manage storage devices and allocation across all of the servers in a datacenter in the same structured and organized fashion.
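
To make the schema-based approach concrete, here is a minimal sketch of reading a Redfish/Swordfish service over plain HTTPS and JSON. The host name, credentials, and which storage collections the service exposes are assumptions for illustration; consult the specification and your service's service root for the resource tree it actually advertises.

```python
# Minimal sketch: walk a Redfish/Swordfish service with HTTPS + JSON.
# The management address, credentials, and storage collection names are assumptions.
import requests

BASE = "https://bmc.example.com"   # hypothetical management endpoint
AUTH = ("admin", "password")       # hypothetical credentials

def get(path: str) -> dict:
    """Fetch one Redfish/Swordfish resource as JSON."""
    # verify=False only because many BMCs ship self-signed certs; validate in production.
    resp = requests.get(BASE + path, auth=AUTH, verify=False, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Every Redfish/Swordfish service is rooted at /redfish/v1/.
service_root = get("/redfish/v1/")
print(service_root.get("Name"), service_root.get("RedfishVersion"))

# Swordfish adds storage resources; the collection names vary by version,
# so probe for the links this particular service actually advertises.
for key in ("Storage", "StorageServices", "StorageSystems"):
    if key in service_root:
        collection = get(service_root[key]["@odata.id"])
        for member in collection.get("Members", []):
            print(key, "->", member["@odata.id"])
```

A production client would establish a session token rather than use basic authentication on every call, but the point here is simply that every element of the datacenter, storage included, is reachable through the same schema-described JSON tree.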

A summary and additional information for the more technical readers is below. If you want to learn more, the items underlined and in bold below yield additional detail; you can click them, or type them into your internet browser, to explore the terms used in this tutorial:

  • For security, Swordfish employs HTTPS, a well-known and well-tested protocol that is used for secure communications over the World Wide Web.
  • JSON (JavaScript Object Notation) and OData increase the readability, compatibility and integration of the RESTful APIs that manage data collected from datacenter devices, and cover a range of information useful for beginners through experienced engineers.
  • Interoperability exists due to the use of a Common Schema Definition Language (CSDL) and common APIs from ecosystem partners including the Open Compute Project (OCP).
  • Redfish and Swordfish were created and are maintained by industry leaders that meet weekly to tune and extend management capabilities. (See DMTF.ORG, SNIA.ORG)
  • These schemas work together to allow full network discovery, provisioning, volume mapping and monitoring of block, file and object storage for all the systems in a modern datacenter.

There is so much to learn beyond this brief tutorial. Start at DMTF.ORG to learn about Redfish. Then surf over to SNIA.ORG/SWORDFISH to see how Swordfish brings the benefits of schema-based management to all your storage devices. You will learn how Swordfish works in hyperscale and cloud infrastructure environments and enables a scalable solution that grows as your datacenter requirements grow.

By Barry Kittner, Technology Initiatives Manager, Intel and SNIA Storage Management Initiative Governing Board Member

Dive into Computational Storage at Storage Developer Conference 2018 - A Chat with SNIA Technical Council Co-Chair Mark Carlson

khauser

Sep 19, 2018

An update from SNIA on Storage: Computational Storage was definitely the "buzz" at SDC! Session slides can now be downloaded. The BoF attracted over 50 attendees, and a face-to-face meeting (and WebEx) is set for October 11 at the SNIA Technology Center in Colorado Springs. Go to snia.org/computational to sign up for the meeting to plan the mission and charter for the SNIA Computational Storage Technical Work Group. All SNIA and non-SNIA members are welcome to participate in this phase of the TWG.

The SNIA Storage Developer Conference (SDC) is running September 24-27, 2018 at the Hyatt Regency Santa Clara, CA. Registration is open, and the agenda is live! SNIA On Storage is teaming up with the SNIA Technical Council to dive into major themes of the 2018 conference. The SNIA Technical Council takes a leadership role in developing the content for each SDC, so SNIA on Storage spoke with Mark Carlson, SNIA Technical Council Co-Chair and Principal Engineer, Industry Standards, Toshiba Memory America, to understand why SDC is bringing Computational Storage to conference attendees.

SNIA On Storage (SOS): Just in the last few weeks, there's been a tremendous buzz about "computational storage." Is this a new kid on the block?

Mark Carlson (MC): We all know the classic architecture of a computer as a host with CPU and memory, with attached networking, storage, and peripherals like graphics and FPGAs, often connected by a PCI Express bus. These systems operate at very high speeds with low latency and high throughput. Now, what would happen if we took some of the computational capabilities that are in a typical host and put them on the other side of the PCIe bus? You'd have a "computational peripheral."

SOS: What could you do with a "computational peripheral"?

MC: One use case is to treat the computational peripheral as an enhanced storage device on the PCIe bus. For example, even though this peripheral may not have any solid state storage on it, you could place it between the traditional host and a solid state drive. The computational peripheral could act as a compressor, taking uncompressed data from the host and compressing it as it's being sent to the SSD. The system would then not require as many (or as large) SSDs.

SOS: What if you wanted to do compression and decompression in the SSD?

MC: A computational peripheral could be combined within the SSD to do compression/decompression within the drive itself. The advantage would be additional functionality from a single device, but the disadvantage would be that the computational resource could only be used for that SSD. Any data needing to be compressed would have to go to an SSD with computational capability, which would only be available when that SSD was being used.
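
The compress-before-write flow described above is easy to picture with a small host-side sketch. A real computational peripheral would perform the same transformation in hardware on the far side of the PCIe bus, transparently to the application; here zlib stands in for that engine, and the file paths are placeholders.

```python
# Host-side sketch of the compress-on-the-way-to-the-SSD idea.
# A computational storage peripheral would do this transparently in hardware;
# zlib stands in for that engine, and the paths are illustrative only.
import zlib

def write_compressed(src_path: str, dst_path: str, level: int = 6) -> tuple[int, int]:
    """Compress a file on its way to 'storage' and report raw vs. stored bytes."""
    raw = 0
    stored = 0
    compressor = zlib.compressobj(level)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(1 << 20):      # read 1 MiB at a time
            raw += len(chunk)
            out = compressor.compress(chunk)
            stored += len(out)
            dst.write(out)
        tail = compressor.flush()
        stored += len(tail)
        dst.write(tail)
    return raw, stored

raw, stored = write_compressed("input.dat", "compressed.bin")
print(f"wrote {stored} bytes for {raw} raw bytes "
      f"({stored / max(raw, 1):.1%} of original)")
```
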
SOS: How else do you see computational storage being used?

MC: Another useful application for computational storage is offload of encryption and decryption, which could address the numerous debates about whether this functionality should be in the drive or the host. Secure computational storage can create a trusted "security box" around the drive by optimizing for encryption and decryption. Another application is analytics offload, where you move the computation to the data rather than pulling it into a "host." Because of the low latency, you would get results much faster.

SOS: So why have we not seen computational storage in computer systems before now?

MC: What enables computational storage is the PCIe bus and the NVMe standard. The NVMe interface allows you to move massive amounts of data across the PCIe bus, so leveraging NVMe and PCIe makes this whole ecosystem of computational storage possible.

SOS: How will SNIA play into this?

MC: SNIA has created a new provisional Technical Work Group, or TWG, to understand, educate, and perhaps develop standards for this new paradigm of computer architecture. Anyone can join this group at snia.org/computational. We have an open mailing list of over 40 companies to date to work on the charter and scope of work for the TWG. Once we form the TWG, you will need to be a SNIA member to join. You can contact Marty Foltyn for details on how to join SNIA.

SOS: How can I learn more about computational storage?

MC: If you are in Silicon Valley, the easiest way is to come to our open Birds of a Feather session at the SNIA Storage Developer Conference on Monday, September 24 at 7:00 pm in the Cypress room of the Hyatt Regency Santa Clara. We'll have leaders from four companies involved in computational storage, along with TWG members, available for a lively discussion on where we are and where we need to go. No badge is needed, just come on by.

SOS: Will you have a talk at SDC?

MC: In my talk, Datacenter Management of NVMe Drives, I'll be envisioning a computational storage peripheral doing SNIA Swordfish™ management, tying into Swordfish access to all the PCIe and NVMe devices, for both computation and storage, to present a comprehensive view of any systems that were composed out of those peripherals. I'll talk about what you could do with this, such as using Swordfish in-band to the NVMe device. If the drive used Ethernet for NVMe, a regular HTTP port could produce the Redfish storage schema for whoever wants to manage it that way.

SOS: What other talks should I make sure to see?

MC: At SDC next week, we will have six other sessions touching on various topics about computational storage. On Monday, September 24, check out these three sessions: On Tuesday, a general session on Compute and Storage Innovation Combine to Provide a Pathway to Composable Architecture will examine how best to tie the traditional software and hardware layers together for scale and use. Also, look for these sessions later in the day:

SOS: Looks like I have my week cut out for me, keeping up with computational storage at SDC!

MC: Absolutely. I look forward to many interesting "hallway track" discussions every year!

John Kim

Sep 19, 2018

In our RoCE vs. iWARP webcast, experts from the SNIA Ethernet Storage Forum (ESF) had a friendly debate on two commonly known remote direct memory access (RDMA) protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP. It turned out to be another very popular addition to our "Great Storage Debate" webcast series. If you haven't seen it yet, it's now available on-demand along with a PDF of the presentation slides. We received A LOT of questions related to Performance, Scalability and Distance, Multipathing, Error Correction, Windows and SMB Direct, DCB (Data Center Bridging), PFC (Priority Flow Control), lossless networks, and Congestion Management, and more. Here are answers to them all.

Q. Are RDMA NICs and TOE NICs the same? What are the differences?

A. No, they are not, though some RNICs include a TOE. An RNIC based on iWARP uses a TOE (TCP Offload Engine) since iWARP itself is fundamentally an upper layer protocol relative to TCP/IP (encapsulated in TCP/IP). The iWARP-based RNIC may or may not expose the TOE. If the TOE is exposed, it can be used for other purposes/applications that require TCP/IP acceleration. However, most of the time, the TOE is hidden under the iWARP verbs API and thus is only used to accelerate TCP for iWARP. An RNIC based on RoCE usually does not have a TOE in the first place and is thus not capable of statefully offloading TCP/IP, though many of them do offer stateless TCP offloads.

Q. Does RDMA use the TCP/UDP/IP protocol stack?

A. RoCE uses UDP/IP while iWARP uses TCP/IP. Other RDMA protocols like OmniPath and InfiniBand don't use Ethernet.

Q. Can Software Defined Networking features like VxLANs be implemented on RoCE/iWARP NICs?

A. Yes, most RNICs can also support VxLAN. An RNIC combines all the functionality of a regular NIC (like VxLAN offloads, checksum offloads, etc.) along with RDMA functionality.

Q. Do the BSD OSs (e.g. FreeBSD) support RoCE and iWARP?

A. FreeBSD supports both iWARP and RoCE.

Q. Any comments on NVMe over TCP?

A. The NVMe over TCP standard is not yet finalized. Once the specification is finalized, SNIA ESF will host a webcast on BrightTALK to discuss NVMe over TCP. Follow us @SNIAESF for notification of all our upcoming webcasts.

Q. What layers in the OSI model would the RDMAP, DDP, and MPA map to for iWARP?

A. RDMAP/DDP/MPA stack on top of TCP, so these protocols sit above Layer 4, the Transport Layer, in the OSI model.

Q. What's the deployment percentage between RoCE and iWARP? Which has a bigger market share and by how much?

A. SNIA does not have this market share information. Today multiple networking vendors support both RoCE and iWARP. Historically more adapters supporting RoCE have been shipped than adapters supporting iWARP, but not all the iWARP/RoCE-capable Ethernet adapters deployed are used for RDMA.

Q. Who will win: RoCE, iWARP, or InfiniBand? What shall we as customers choose if we want to have this today?

A. As a vendor-neutral forum, SNIA cannot recommend any specific RDMA technology or vendor. Note that RoCE and iWARP run on Ethernet while InfiniBand (and OmniPath) do not use Ethernet.

Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB, etc.) on top of RoCE or iWARP?

A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols whether using RDMA or regular TCP/IP. To prevent this from happening, a best practice would be to use explicit congestion notification (ECN), or better yet, data center bridging (DCB), to minimize congestion and ensure the best performance. Likewise, designing a fully non-blocking network fabric will also assist in preventing congestion and guarantee the best performance. Finally, by prioritizing the data flows that are using RoCE or iWARP, network administrators can ensure bandwidth is available for the flows that require it the most. iWARP provides RDMA functionality over TCP/IP and inherits the loss resilience and congestion management from the underlying TCP/IP layer. Thus, it does not require specific best practices beyond those in use for TCP/IP, including not requiring any specific host or switch configuration, as well as out-of-the-box support across LAN/MAN/WAN networks.

Q. On slide #14 of the RoCE vs. iWARP presentation, the slide showed SCM being 1,000 times faster than NAND flash, but the presenter stated 100 times faster. Those are both higher than I have heard. Which is correct?

A. Research on the Internet shows that both Intel and Micron have been boasting that 3D XPoint memory is 1,000 times as fast as NAND flash. However, their tests also compared a standard NAND flash based PCIe SSD to a similar SSD based on 3D XPoint memory, which was only 7-8 times faster. Due to this, we dug in a little further and found a great article by Jim Handy, Why 3D XPoint SSDs Will Be Slow, that could help explain the difference.

Q. What is the significance of the BTH+ and GRH headers?

A. BTH+ and GRH are both used within InfiniBand for RDMA implementations. With RoCE implementations of RDMA, packets are marked with an EtherType header that indicates the packets are RoCE, and ip.protocol_number within the IP header is used to indicate that the packet is UDP. Both of these will identify packets as RoCE packets.

Q. What sorts of applications are unique to the workstation market for RDMA, versus the server market?

A. All major OEM vendors are shipping servers with CPU platforms that include integrated iWARP RDMA, as well as offering adapters that support iWARP and/or RoCE. The main applications of RDMA are still in the server area at this moment. At the time of this writing, workstation operating systems such as Windows 10 or Linux can use RDMA when running I/O-intensive applications such as video post-production, oil/gas and computer-aided design applications, for high-speed access to storage.

DCB, PFC, lossless networks, and Congestion Management

Q. Is slide #26 correct? I thought RoCE v1 was PFC/DCB and RoCE v2 was ECN/DCB subset. Did I get it backwards?

A. Sorry for the confusion, you've got it correct. With newer RoCE-capable adapters, customers may choose to use ECN or PFC for RoCE v2.

Q. I thought RoCE v2 did not need any DCB-enabled network, so why this DCB congestion management for RoCE v2?

A. RoCEv2 running on modern rNICs is known as Resilient RoCE because it does not need a lossless network. Instead, a RoCE congestion control mechanism is used to minimize packet loss by leveraging Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. RoCE v2 takes advantage of ECN to avoid congestion and packet loss. ECN-capable switches detect when a port is getting too busy and mark outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC backs off its sending rate temporarily to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission (referred to as Resilient RoCE).
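
As a purely illustrative toy model of the ECN/CNP loop just described (the marking threshold, drain rate, and rate adjustments are invented; real adapters implement vendor-specific congestion-control algorithms), the back-off-and-recover behavior looks roughly like this:

```python
# Toy model of the RoCEv2 ECN loop: a switch marks packets with CE when its
# queue is deep, the receiver returns CNPs, and the sender backs off, then
# recovers. All thresholds and rate steps are invented for illustration only.

QUEUE_MARK_THRESHOLD = 50      # packets queued before the switch sets CE
LINE_RATE = 100.0              # sender's maximum rate (arbitrary units)

def simulate(steps: int = 20) -> None:
    rate = LINE_RATE
    queue = 0
    drain_per_step = 80        # what the congested port can actually forward
    for step in range(steps):
        queue = max(0, queue + int(rate) - drain_per_step)
        ce_marked = queue > QUEUE_MARK_THRESHOLD
        if ce_marked:          # receiver sees CE and sends a CNP back
            rate *= 0.5        # sender backs off on CNP arrival
        else:
            rate = min(LINE_RATE, rate * 1.1)   # gradually recover to full speed
        print(f"step {step:2d}  rate={rate:6.1f}  queue={queue:3d}  CE={ce_marked}")

simulate()
```
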
Q. Is iWARP a lossless or lossy protocol?

A. iWARP utilizes the underlying TCP/IP layer for loss resilience. This happens at silicon speeds for iWARP adapters with an embedded TCP/IP offload engine (TOE).

Q. So it looks to me like iWARP can use an existing Ethernet network without modifications and RoCEv2 would need some fine-tuning. Is this correct?

A. Generally, iWARP does not require any modification to the Ethernet switches, and RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE). However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion. iWARP delivers RDMA on top of the TCP/IP protocol, and thus TCP provides congestion management and loss resilience for iWARP, which, as a result, does not require a lossless Ethernet network. This is particularly useful in congested networks or long distance links.

Q. Is this a correct statement? Please clarify: RoCE v1 requires ECN and PFC, but RoCEv2 requires only ECN or PFC?

A. Remember, we called this presentation a "Great Storage Debate"? Here is an area where there are two schools of thought. Answer #1: It's recommended to deploy RoCE (v1) with PFC, which is part of the Ethernet Data Center Bridging (DCB) specification, to implement a lossless network. With the release of RoCEv2, an alternative mechanism to avoid packet loss was introduced which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. Answer #2: Generally this is correct; iWARP does not require any modification to the Ethernet switches, and RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE), and DCB. As such, and this is very important, an iWARP installation of a storage or server node is decoupled from the switch infrastructure upgrade. However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion, though in the case of an iWARP adapter, this benefit is insignificant, since all loss recovery and congestion management happen at the silicon speed of the underlying TOE.

Q. Does RoCE v2 also require PFC, or how will it handle lossy networks?

A. RoCE v2 does not require PFC but performs better with either PFC or ECN activated. See the following question and answer for more details.

Q. Can a RoCEv2 lossless network be achieved with ECN only (no PFC)?

A. RoCE has built-in error correction and retransmission mechanisms, so it does not require a lossless network. With modern RoCE-capable adapters, it only requires the use of ECN. ECN in and of itself does not guarantee a lossless connection but can be used to minimize congestion and thus minimize packet loss. However, even with RoCE v2, a lossless connection (using PFC/DCB) can provide better performance and is often implemented with RoCEv2 deployments, either instead of ECN or alongside ECN.

Q. In order to guarantee lossless operation, do ECN and PFC both have to be used?

A. ECN can be used to avoid most packet loss, but PFC (part of DCB) is required for a truly lossless network.

Q. Are there real deployments that use "Resilient RoCE" without PFC configured?

A. To achieve better performance, PFC alone or both ECN and PFC are deployed in most iterations of RoCE in real deployments today. However, there are a growing number of deployments using Resilient RoCE with ECN alone that maintain high levels of performance.

Q. For RoCEv2, can ECN be implemented without PFC?

A. Yes, ECN can be implemented on its own within a RoCE v2 implementation without the need for PFC.

Q. RoCE needs to have Converged Ethernet, but not iWARP, correct?

A. Correct. iWARP was standardized in the IETF and built upon standard TCP/IP over Ethernet, so the "Converged Ethernet" requirement doesn't apply to iWARP.

Q. It's not clear from the diagram if TCP/IP is still needed for RoCE and iWARP. Is it?

A. RoCE uses IP (UDP/IP) but not TCP. iWARP uses TCP/IP.

Q. On slide #10, does this require any support on the switch?

A. Yes, an enterprise switch with support for DCB would be required. Most enterprise switches do support DCB today.

Q. Will you cover congestion mechanisms and which one, RoCEv2 or iWARP, works better for different workloads?

A. With multiple vendors supporting RoCEv2 and iWARP at different speeds (10, 25, 40, 50, and 100Gb/s), we'd likely see a difference in performance from each adapter across different workloads. An apples-to-apples test of the specific workload would be required to provide an answer. If you are working with a specific vendor or OEM, we would suggest you ask the vendor/OEM for comparison data on the workload you plan on deploying.

Performance, Scalability and Distance

Q. For storage related applications, could you add a performance-based comparison of Ethernet-based RoCE/iWARP to FC-NVMe with similar link speeds (32Gbps FC to 40GbE, for example)?

A. We would like to see the results of this testing as well, and due to the overwhelming request for data comparing RoCE vs. iWARP, this is something we will try to provide in the future.

Q. Do you have some performance measurements which compare iWARP and RoCE?

A. Nothing is available from SNIA ESF, but a search on Google should provide you with the information you are looking for. For example, you can find this Microsoft blog.

Q. Are there performance benchmarks between RoCE vs. iWARP?

A. Debating which one is faster is beyond the scope of this webcast.

Q. Can RoCE scale to 1000s of Ceph nodes, assuming each node hosts 36 disks?

A. RoCE has been successfully tested with dozens of Ceph nodes. It's unknown if RoCE with Ceph can scale to 1000s of Ceph nodes.

Q. Is RoCE limited in the number of hops?

A. No, there is no limit on the number of hops, but as more hops are included, latencies increase and performance may become an issue.

Q. Does RoCEv2 support long distance (100km) operation or is it only iWARP?

A. Today the practical limit of RoCE while maintaining high performance is about 40km. As different switches and optics come to market, this distance limit may increase in the future. iWARP has no distance limit, but with any high-performance networking solution, increasing distance leads to increasing latency due to the speed of light and/or retransmission hops. Since it is a protocol on top of basic TCP/IP, it can transfer data over wireless links to satellites if need be.
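
For a feel of what distance alone costs, here is a small back-of-the-envelope calculation of propagation delay in optical fiber. It assumes light travels at roughly two-thirds of c in glass and ignores switch, NIC, and retransmission delays, which come on top.

```python
# Back-of-the-envelope propagation delay over fiber at various distances.
# Assumes ~2/3 the speed of light in glass; equipment latency not included.

SPEED_OF_LIGHT_KM_S = 299_792          # km per second in vacuum
FIBER_SPEED_KM_S = SPEED_OF_LIGHT_KM_S * 2 / 3

for distance_km in (1, 40, 100, 1000):
    one_way_us = distance_km / FIBER_SPEED_KM_S * 1e6
    print(f"{distance_km:>5} km: one-way {one_way_us:8.1f} us, "
          f"round trip {2 * one_way_us:8.1f} us")

# 100 km works out to roughly 0.5 ms one way (about 1 ms round trip),
# which already dwarfs the microsecond-scale latencies RDMA fabrics target.
```
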
Multipathing, Error Correction

Q. Isn't the Achilles heel of iWARP the handling of congestion on the switch? Sure, TCP/IP doesn't require lossless, but doesn't one need DCTCP, PFC, and ETS to handle buffers filling up both point to point as well as from receiver to sender? Some vendors offload any TCP/IP traffic and consider RDMA "limited," but even if that's true, don't they have to deal with the same challenges on the switch in regards to congestion management?

A. TCP itself uses a congestion-avoidance algorithm, like TCP New Reno (RFC 6582), together with slow start and a congestion window to avoid congestion. These mechanisms are not dependent on switches. So iWARP's performance under network congestion should closely match that of TCP.

Q. If you are using RoCE v2 with UDP, how is error correction implemented?

A. Error correction is done by the RoCE protocol running on top of UDP.

Q. How does multipathing work with RDMA?

A. For single-port RNICs, multipathing, being network-based (Equal-Cost Multi-Path routing, ECMP), is transparent to the RDMA application. Both RoCE and iWARP transports achieve good network load balancing under ECMP. For multi-port RNICs, the RDMA client application can explicitly load-balance its traffic across multiple local ports. Some multi-port RNICs support link aggregation (a.k.a. bonding), in which case the RNIC transparently spreads connection load amongst physical ports.

Q. Do RoCE and iWARP work with bonded NICs?

A. The short answer is yes, but it will depend on the individual NIC vendor's implementation.

Windows and SMB Direct

Q. What is SMB Direct?

A. SMB Direct is a special version of the SMB 3 protocol. It supports both RDMA and multiple active-active connections. You can find the official definition of SMB (Server Message Block) in the SNIA Dictionary.

Q. Is there iSER support in Windows?

A. Today iSER is supported in Linux and VMware but not in Windows. Windows does support both iWARP and RoCE for SMB Direct. Chelsio is now providing an iSER (iWARP) initiator for Windows as part of the driver package, which is available at service.chelsio.com. The current driver is considered a beta, but will go GA by the end of September 2018.

Q. When will iWARP or RoCE for NVMe-oF be supported on Windows?

A. Windows does not officially support NVMe-oF yet, but if and when Windows does support it, we believe it will support it over both RoCE and iWARP.

Q. Why is iWARP better for Storage Spaces Direct?

A. iWARP is based on TCP, which deals with flow control and congestion management, so iWARP is scalable and ideal for a hyper-converged storage solution like Storage Spaces Direct. iWARP is also the recommended configuration from Microsoft in some circumstances.

We hope that answers all your questions! We encourage you to check out the other "Great Storage Debate" webcasts in this series. To date, our experts have had friendly, vendor-neutral debates on File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, FCoE vs. iSCSI vs. iSER, and Centralized vs. Distributed Storage. Happy debating!

Olivia Rhye

Product Manager, SNIA

Find a similar article by tags

Leave a Reply

Comments

Name

Email Adress

Website

Save my name, email, and website in this browser for the next time I comment.

John Kim

Sep 19, 2018

title of post
In our RoCE vs. iWARP webcast, experts from the SNIA Ethernet Storage Forum (ESF) had a friendly debate on two commonly known remote direct memory access (RDMA) protocols that run over Ethernet: RDMA over Converged Ethernet (RoCE) and the IETF-standard iWARP. It turned out to be another very popular addition to our “Great Storage Debate” webcast series. If you haven’t seen it yet, it’s now available on-demand along with a PDF of the presentation slides. We received A LOT of questions related to Performance, Scalability and Distance, Multipathing, Error Correction, Windows and SMB Direct, DCB (Data Center Bridging), PFC (Priority Flow Control), lossless networks, and Congestion Management, and more. Here are answers to them all.  Q. Are RDMA NIC’s and TOE NIC’s the same? What are the differences?  A. No, they are not, though some RNICs include a TOE. An RNIC based on iWARP uses a TOE (TCP Offload Engine) since iWARP itself is fundamentally an upper layer protocol relative to TCP/IP (encapsulated in TCP/IP). The iWARP-based RNIC may or may not expose the TOE. If the TOE is exposed, it can be used for other purposes/applications that require TCP/IP acceleration. However, most of the time, the TOE is hidden under the iWARP verb’s API and thus is only used to accelerate TCP for iWARP. An RNIC based on RoCE usually does not have a TOE in the first place and is thus not capable of statefully offloading TCP/IP, though many of them do offer stateless TCP offloads. Q. Does RDMA use the TCP/UDP/IP protocol stack? A. RoCE uses UDP/IP while iWARP uses TCP/IP. Other RDMA protocols like OmniPath and InfiniBand don’t use Ethernet. Q. Can Software Defined Networking features like VxLANs be implemented on RoCE/iWARP NICs? A. Yes, most RNICs can also support VxLAN. An RNIC combined all the functionality of a regular NIC (like VxLAN offloads, checksum offloads etc.) along with RDMA functionality. Q. Do the BSD OS’s (e.g. FreeBSD) support RoCE and iWARP?   A. FreeBSD supports both iWARP and RoCE. Q. Any comments on NVMe over TCP? AThe NVMe over TCP standard is not yet finalized. Once the specification is finalized SNIA ESF will host a webcast on BrightTALK to discuss NVMe over TCP. Follow us @SNIAESF for notification of all our upcoming webcasts. Q. What layers in the OSI model would the RDMAP, DDP, and MPA map to for iWARP? A. RDMAP/DDP/MPA are stacking on top of TCP, so these protocols are sitting on top of Layer 4, Transportation Layer, based on the OSI model. Q. What’s the deployment percentages between RoCE and iWARP? Which has a bigger market share support and by how much? A. SNIA does not have this market share information. Today multiple networking vendors support both RoCE and iWARP. Historically more adapters supporting RoCE have been shipped than adapters supporting iWARP, but not all the iWARP/RoCE-capable Ethernet adapters deployed are used for RDMA. Q. Who will win RoCE or iWARP or InfiniBand? What shall we as customers choose if we want to have this today? A. As a vendor-neutral forum, SNIA cannot recommend any specific RDMA technology or vendor. Note that RoCE and iWARP run on Ethernet while InfiniBand (and OmniPath) do not use Ethernet. Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB etc.), on top of RoCE or iWARP? A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols whether using RDMA or regular TCP/IP. 
Q. Are there any best practices identified for running higher-level storage protocols (iSCSI/NFS/SMB, etc.) on top of RoCE or iWARP?
A. Congestion caused by dropped packets and retransmissions can degrade performance for higher-level storage protocols whether using RDMA or regular TCP/IP. To prevent this from happening, a best practice would be to use explicit congestion notification (ECN), or better yet, data center bridging (DCB), to minimize congestion and ensure the best performance. Likewise, designing a fully non-blocking network fabric will also assist in preventing congestion and guarantee the best performance. Finally, by prioritizing the data flows that are using RoCE or iWARP, network administrators can ensure bandwidth is available for the flows that require it the most. iWARP provides RDMA functionality over TCP/IP and inherits loss resilience and congestion management from the underlying TCP/IP layer. Thus, it does not require specific best practices beyond those in use for TCP/IP, including not requiring any specific host or switch configuration, and it offers out-of-the-box support across LAN/MAN/WAN networks.

Q. On slide #14 of the RoCE vs. iWARP presentation, the slide showed SCM being 1,000 times faster than NAND flash, but the presenter stated 100 times faster. Those are both higher than I have heard. Which is correct?
A. Research on the Internet shows that both Intel and Micron have been boasting that 3D XPoint memory is 1,000 times as fast as NAND flash. However, their tests also compared a standard NAND-flash-based PCIe SSD to a similar SSD based on 3D XPoint memory, which was only 7-8 times faster. Due to this, we dug in a little further and found a great article by Jim Handy, “Why 3D XPoint SSDs Will Be Slow,” that could help explain the difference.

Q. What is the significance of the BTH+ and GRH headers?
A. BTH+ and GRH are both used within InfiniBand for RDMA implementations. With RoCE implementations of RDMA, packets are marked with an EtherType header that indicates the packets are RoCE, and the ip.protocol_number within the IP header is used to indicate that the packet is UDP. Both of these will identify packets as RoCE packets.

Q. What sorts of applications are unique to the workstation market for RDMA, versus the server market?
A. All major OEM vendors are shipping servers with CPU platforms that include integrated iWARP RDMA, as well as offering adapters that support iWARP and/or RoCE. The main applications of RDMA are still in the server area at this moment. At the time of this writing, workstation operating systems such as Windows 10 or Linux can use RDMA when running I/O-intensive applications, such as video post-production, oil/gas and computer-aided design applications, for high-speed access to storage.

DCB, PFC, lossless networks, and Congestion Management

Q. Is slide #26 correct? I thought RoCE v1 was PFC/DCB and RoCE v2 was ECN/DCB subset. Did I get it backwards?
A. Sorry for the confusion, you’ve got it correct. With newer RoCE-capable adapters, customers may choose to use ECN or PFC for RoCE v2.

Q. I thought RoCE v2 did not need any DCB-enabled network, so why this DCB congestion management for RoCE v2?
A. RoCE v2 running on modern rNICs is known as Resilient RoCE because it does not need a lossless network. Instead, a RoCE congestion control mechanism is used to minimize packet loss by leveraging Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs. RoCE v2 takes advantage of ECN to avoid congestion and packet loss. ECN-capable switches detect when a port is getting too busy and mark outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC backs off its sending rate temporarily to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission (referred to as resilient RoCE).
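The CE-mark/CNP feedback loop described above can be illustrated with a deliberately simplified sketch. This is not the actual congestion control algorithm of any adapter (real implementations such as DCQCN use carefully tuned timers and rate parameters); the cut and recovery factors below are made-up values chosen only to show the back-off-and-recover behavior.

    # Simplified illustration of CNP-driven sender rate adjustment (not a real algorithm).
    def adjust_rate(current_gbps, cnp_received, line_rate_gbps=100.0,
                    cut_factor=0.5, recover_step_gbps=5.0):
        """Cut the sending rate when a CNP arrives; otherwise ramp back toward line rate."""
        if cnp_received:
            return max(1.0, current_gbps * cut_factor)   # back off on congestion feedback
        return min(line_rate_gbps, current_gbps + recover_step_gbps)

    rate = 100.0
    # CNPs arrive while the switch is marking CE, then stop once congestion clears.
    for cnp in (True, True, True, False, False, False, False):
        rate = adjust_rate(rate, cnp)
        print(f"rate = {rate:5.1f} Gb/s")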
Q. Is iWARP a lossless or lossy protocol?
A. iWARP utilizes the underlying TCP/IP layer for loss resilience. This happens at silicon speeds for iWARP adapters with embedded TCP/IP offload engine (TOE) functionality.

Q. So it looks to me like iWARP can use an existing Ethernet network without modifications and RoCE v2 would need some fine-tuning. Is this correct?
A. Generally iWARP does not require any modification to the Ethernet switches, and RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE). However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion. iWARP delivers RDMA on top of the TCP/IP protocol, and thus TCP provides congestion management and loss resilience for iWARP, which, as a result, does not require a lossless Ethernet network. This is particularly useful in congested networks or long distance links.

Q. Is this a correct statement? Please clarify — RoCE v1 requires ECN and PFC, but RoCE v2 requires only ECN or PFC?
A. Remember, we called this presentation a “Great Storage Debate”? Here is an area where there are two schools of thought.
Answer #1: It’s recommended to deploy RoCE (v1) with PFC, which is part of the Ethernet Data Center Bridging (DCB) specification, to implement a lossless network. With the release of RoCE v2, an alternative mechanism to avoid packet loss was introduced which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs.
Answer #2: Generally this is correct: iWARP does not require any modification to the Ethernet switches, while RoCE requires the use of either PFC or ECN (depending on the rNICs used for RoCE), and DCB. As such, and this is very important, an iWARP installation of a storage or server node is decoupled from the switch infrastructure upgrade. However, all RDMA networking will benefit from a network setup that minimizes latency, packet loss, and congestion, though in the case of an iWARP adapter this benefit is insignificant, since all loss recovery and congestion management happen at the silicon speed of the underlying TOE.

Q. Does RoCE v2 also require PFC, or how will it handle lossy networks?
A. RoCE v2 does not require PFC, but it performs better with either PFC or ECN activated. See the following question and answer for more details.

Q. Can a RoCE v2 lossless network be achieved with ECN only (no PFC)?
A. RoCE has built-in error correction and retransmission mechanisms, so it does not require a lossless network. With modern RoCE-capable adapters, it only requires the use of ECN. ECN in and of itself does not guarantee a lossless connection but can be used to minimize congestion and thus minimize packet loss. However, even with RoCE v2, a lossless connection (using PFC/DCB) can provide better performance and is often implemented with RoCE v2 deployments, either instead of ECN or alongside ECN.
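For readers who want to see where the CE mark discussed in these answers actually lives, the ECN field is the two low-order bits of the IPv4 Traffic Class (former TOS) byte, per RFC 3168. The following short sketch simply decodes that field; it is illustrative and not tied to any particular RNIC or switch.

    # Decode the 2-bit ECN field from the low-order bits of the IPv4 Traffic Class byte (RFC 3168).
    ECN_CODEPOINTS = {
        0b00: "Not-ECT (sender is not ECN-capable)",
        0b01: "ECT(1) (ECN-capable transport)",
        0b10: "ECT(0) (ECN-capable transport)",
        0b11: "CE (Congestion Experienced - set by a congested switch)",
    }

    def ecn_codepoint(traffic_class_byte: int) -> str:
        return ECN_CODEPOINTS[traffic_class_byte & 0b11]

    print(ecn_codepoint(0x02))  # ECT(0): the sender advertises ECN support
    print(ecn_codepoint(0x03))  # CE: a switch replaced ECT with CE to signal congestion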
Q. In order to guarantee lossless operation, do ECN and PFC both have to be used?
A. ECN can be used to avoid most packet loss, but PFC (part of DCB) is required for a truly lossless network.

Q. Are there real deployments that use “Resilient RoCE” without PFC configured?
A. To achieve better performance, PFC alone or both ECN and PFC are deployed in most iterations of RoCE in real deployments today. However, there are a growing number of deployments using Resilient RoCE with ECN alone that maintain high levels of performance.

Q. For RoCE v2, can ECN be implemented without PFC?
A. Yes, ECN can be implemented on its own within a RoCE v2 implementation without the need for PFC.

Q. RoCE needs to have Converged Ethernet, but iWARP does not, correct?
A. Correct. iWARP was standardized in the IETF and built upon standard TCP/IP over Ethernet, so the “Converged Ethernet” requirement doesn’t apply to iWARP.

Q. It’s not clear from the diagram if TCP/IP is still needed for RoCE and iWARP. Is it?
A. RoCE uses IP (UDP/IP) but not TCP. iWARP uses TCP/IP.

Q. On slide #10, does this require any support on the switch?
A. Yes, an enterprise switch with support for DCB would be required. Most enterprise switches do support DCB today.

Q. Will you cover congestion mechanisms and which one, RoCE v2 or iWARP, works better for different workloads?
A. With multiple vendors supporting RoCE v2 and iWARP at different speeds (10, 25, 40, 50, and 100Gb/s), we’d likely see a difference in performance from each adapter across different workloads. An apples-to-apples test of the specific workload would be required to provide an answer. If you are working with a specific vendor or OEM, we would suggest you ask the vendor/OEM for comparison data on the workload you plan on deploying.

Performance, Scalability and Distance

Q. For storage-related applications, could you add a performance-based comparison of Ethernet-based RoCE/iWARP to FC-NVMe with similar link speeds (32Gbps FC to 40GbE, for example)?
A. We would like to see the results of this testing as well, and due to the overwhelming request for data comparing RoCE vs. iWARP, this is something we will try to provide in the future.

Q. Do you have some performance measurements which compare iWARP and RoCE?
A. Nothing is available from SNIA ESF, but a search on Google should provide you with the information you are looking for. For example, you can find this Microsoft blog.

Q. Are there performance benchmarks between RoCE vs. iWARP?
A. Debating which one is faster is beyond the scope of this webcast.

Q. Can RoCE scale to 1,000s of Ceph nodes, assuming each node hosts 36 disks?
A. RoCE has been successfully tested with dozens of Ceph nodes. It’s unknown if RoCE with Ceph can scale to 1,000s of Ceph nodes.

Q. Is RoCE limited in the number of hops?
A. No, there is no limit on the number of hops, but as more hops are included, latencies increase and performance may become an issue.

Q. Does RoCE v2 support long distance (100km) operation, or is it only iWARP?
A. Today the practical limit of RoCE while maintaining high performance is about 40km. As different switches and optics come to market, this distance limit may increase in the future. iWARP has no distance limit, but with any high-performance networking solution, increasing distance leads to increasing latency due to the speed of light and/or retransmission hops. Since it is a protocol on top of basic TCP/IP, it can transfer data over wireless links to satellites if need be.
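To put the distance/latency trade-off above into rough numbers, here is a back-of-the-envelope sketch that assumes light propagates through optical fiber at roughly 200,000 km/s (about two-thirds of its speed in a vacuum); real deployments add switch, buffering, and retransmission delays on top of this propagation floor.

    # Rough one-way and round-trip propagation delay over fiber (equipment latency ignored).
    SPEED_IN_FIBER_KM_PER_MS = 200.0   # ~200,000 km/s, i.e., roughly 5 microseconds per km

    def propagation_delay_ms(distance_km: float):
        one_way = distance_km / SPEED_IN_FIBER_KM_PER_MS
        return one_way, 2 * one_way

    for km in (40, 100, 1000):
        one_way, rtt = propagation_delay_ms(km)
        print(f"{km:5d} km: one-way ~{one_way:.2f} ms, round trip ~{rtt:.2f} ms")

At 40km, the practical RoCE figure quoted above, the propagation round trip alone is already about 0.4 ms, which is large compared with the microsecond-scale latencies RDMA fabrics are typically deployed to achieve.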
Multipathing, Error Correction

Q. Isn’t the Achilles heel of iWARP the handling of congestion on the switch? Sure, TCP/IP doesn’t require lossless, but doesn’t one need DCTCP, PFC, and ETS to handle buffers filling up, both point to point as well as from receiver to sender? Some vendors offload any TCP/IP traffic and consider RDMA “limited,” but even if that’s true, don’t they have to deal with the same challenges on the switch in regards to congestion management?
A. TCP itself uses a congestion-avoidance algorithm, like TCP New Reno (RFC 6582), together with slow start and a congestion window to avoid congestion. These mechanisms are not dependent on switches. So iWARP’s performance under network congestion should closely match that of TCP.

Q. If you are using RoCE v2 with UDP, how is error correction implemented?
A. Error correction is done by the RoCE protocol running on top of UDP.

Q. How does multipathing work with RDMA?
A. For single-port RNICs, multipathing is network-based (Equal-Cost Multi-Path routing, ECMP) and is transparent to the RDMA application. Both RoCE and iWARP transports achieve good network load balancing under ECMP. For multi-port RNICs, the RDMA client application can explicitly load-balance its traffic across multiple local ports. Some multi-port RNICs support link aggregation (a.k.a. bonding), in which case the RNIC transparently spreads the connection load amongst physical ports.

Q. Do RoCE and iWARP work with bonded NICs?
A. The short answer is yes, but it will depend on the individual NIC vendor’s implementation.

Windows and SMB Direct

Q. What is SMB Direct?
A. SMB Direct is a special version of the SMB 3 protocol. It supports both RDMA and multiple active-active connections. You can find the official definition of SMB (Server Message Block) in the SNIA Dictionary.

Q. Is there iSER support in Windows?
A. Today iSER is supported in Linux and VMware but not in Windows. Windows does support both iWARP and RoCE for SMB Direct. Chelsio is now providing an iSER (iWARP) initiator for Windows as part of the driver package, which is available at service.chelsio.com. The current driver is considered a beta, but it will go GA by the end of September 2018.

Q. When will iWARP or RoCE for NVMe-oF be supported on Windows?
A. Windows does not officially support NVMe-oF yet, but if and when Windows does support it, we believe it will support it over both RoCE and iWARP.

Q. Why is iWARP better for Storage Spaces Direct?
A. iWARP is based on TCP, which deals with flow control and congestion management, so iWARP is scalable and ideal for a hyper-converged storage solution like Storage Spaces Direct. iWARP is also the recommended configuration from Microsoft in some circumstances.

We hope that answers all your questions! We encourage you to check out the other “Great Storage Debates” in this webcast series. To date, our experts have had friendly, vendor-neutral debates on File vs. Block vs. Object Storage, Fibre Channel vs. iSCSI, FCoE vs. iSCSI vs. iSER, and Centralized vs. Distributed Storage. Happy debating!


Dive into NVMe at Storage Developer Conference - a Chat with SNIA Technical Council Co-Chair Bill Martin

khauser

Sep 12, 2018

The SNIA Storage Developer Conference (SDC) is coming up September 24-27, 2018 at the Hyatt Regency Santa Clara, CA. The agenda is now live! SNIA on Storage is teaming up with the SNIA Technical Council to dive into major themes of the 2018 conference. The SNIA Technical Council takes a leadership role in developing the content for each SDC, so SNIA on Storage spoke with Bill Martin, SNIA Technical Council Co-Chair and SSD I/O Standards at Samsung Electronics, to understand why SDC is bringing NVMe and NVMe-oF to conference attendees.

SNIA On Storage (SOS): What is NVMe and why is SNIA emphasizing it as one of its key areas of focus for SDC?

Bill Martin (BM): NVMe™, also known as NVM Express®, is an open collection of standards and information to fully expose the benefits of non-volatile memory (NVM) in all types of computing environments, from mobile to data center. SNIA is very supportive of NVMe. In fact, earlier this year, SNIA, the Distributed Management Task Force (DMTF), and the NVM Express organizations formed a new alliance to coordinate standards for managing solid state drive (SSD) storage devices. This alliance brings together multiple standards to address the issue of scale-out management of SSDs. It’s designed to enable an all-inclusive management experience by improving the interoperable management of information technologies. With interest both from within and outside of SNIA from architects, developers, and implementers on how these standards work, the SNIA Technical Council decided to bring even more sessions on this important area to the SDC audience this year. We are proud to include 16 sessions on NVMe topics over the four days of the conference.

SOS: What will I learn about NVMe at SDC?

BM: Performance is always of interest to storage developers. We’ll have a session on accelerating storage with NVMe SSDs, and several others on the Storage Performance Development Kit (SPDK), with a historical perspective on key design decisions and a discussion of the driver’s advantages and limitations. In other sessions, attendees can get an update on how the Fibre Channel industry is combining the lossless, highly deterministic nature of Fibre Channel with NVMe, and learn how to deploy in-storage computing with the NVMe interface.

SOS: I’ve heard a lot recently about something called NVMe-oF. What is that, and are there any sessions on it at SDC?

BM: NVMe-oF is an acronym for NVMe over Fabrics, and SDC will cover how both the NVMe interface and the NVMe-oF underlying protocols (such as RoCE, iWARP and FC) provide highly efficient access to flash storage. Attendees will learn how architects are rethinking the Ceph architecture for disaggregation using NVMe over Fabrics, how to manage storage services using NVMe-oF based on the SNIA Swordfish™ and DMTF Redfish™ management interfaces, and how to deliver scalable distributed block storage using NVMe-oF.

SOS: I see in the SDC agenda that you’ll be speaking on Key Value Storage. Can you explain this topic and how SNIA is involved in this area?

BM: A number of applications and companies are now contemplating using a different type of storage interface. The NVMe Key Value project is a proposal for a new command structure to access data on an NVMe controller. This proposed command set provides a key and a value to store data on the non-volatile media, and provides a key to retrieve data stored on the media. The primary interface work is being done within the NVMe technical working group. What SNIA is doing within its Object Drive Technical Work Group is defining an application programming interface, known as an API. We expect to have a document ready for review by SDC, and my session on Monday, September 24 at 2:30 pm will discuss where we are in the SNIA standardization process of a Key Value API.
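To give a feel for the key value model Bill describes, as opposed to reading and writing fixed-size logical blocks, here is a purely illustrative Python sketch. It is not the NVMe Key Value command set and not the SNIA API under development; the class and method names are invented, and the point is only the store/retrieve-by-key semantics versus block addressing.

    # Toy in-memory illustration of key value store/retrieve semantics vs. block I/O.
    # Invented names; not the NVMe KV command set or the SNIA Object Drive API.

    class ToyKeyValueDevice:
        def __init__(self):
            self._media = {}

        def store(self, key: bytes, value: bytes) -> None:
            """Store a variable-length value under a key; no LBA bookkeeping needed."""
            self._media[key] = value

        def retrieve(self, key: bytes) -> bytes:
            """Retrieve the value previously stored under the key."""
            return self._media[key]

    class ToyBlockDevice:
        BLOCK_SIZE = 4096

        def __init__(self, num_blocks: int = 1024):
            self._blocks = [bytes(self.BLOCK_SIZE)] * num_blocks

        def write(self, lba: int, data: bytes) -> None:
            """Block I/O: the application must map its objects onto fixed-size LBAs itself."""
            self._blocks[lba] = data.ljust(self.BLOCK_SIZE, b"\x00")[:self.BLOCK_SIZE]

        def read(self, lba: int) -> bytes:
            return self._blocks[lba]

    kv = ToyKeyValueDevice()
    kv.store(b"sensor-42/2018-09-24", b"temperature=21.5C")
    print(kv.retrieve(b"sensor-42/2018-09-24"))

With a key value interface the application hands the device a key and a value and later asks for the value back by key, while with a block interface it must first translate that object into logical block addresses itself.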
SOS: How can I get “prepped” on NVMe before SDC?

BM: I would check out two videos on NVMe on the SNIA Video Channel, both by SNIA Board Member J Metz. One is a snapshot on NVMe, NVMe-oF, and Ethernet Storage with directions to more resources: https://www.youtube.com/watch?time_continue=11&v=yh68WIj68XI The other is his presentation from Storage Field Day on SNIA and NVMe-over-Fabrics: https://www.youtube.com/watch?v=HfcZwkPzj4w Another great resource is the SNIA Podcasts on NVMe and many other topics, located at https://www.snia.org/events/storage-developer/podcasts

SOS: Any final thoughts on NVMe?

BM: On behalf of the SNIA Technical Council, I would like to invite everyone at SDC and in the Silicon Valley technology community to come to the Birds of a Feather session on Tuesday, September 25 at 7:00 pm and join the local Bay Area NVMe Meetup group for an open discussion on NVMe technology. We’ll have leading experts from Cisco Systems, Eideticom, Samsung, SK hynix, and Toshiba Memory on hand to discuss some of their latest projects and answer questions on implementing or using solutions based on the NVMe standard. This session is open to anyone interested in NVMe and does not require an SDC badge, so invite your colleagues to join us in the Stevens Creek room at the Hyatt Regency Santa Clara.

SOS: Thanks for your time, Bill, and see you at SDC!


We’re Debating Again: Centralized vs. Distributed Storage

J Metz

Sep 4, 2018

We hope you’ve been following the SNIA Ethernet Storage Forum (ESF) “Great Storage Debates” webcast series. We’ve done four so far, and they have been incredibly popular, with 4,000 live and on-demand views to date and counting. Check out the links to all of them at the end of this blog. Although we have “versus” in the title of these presentations, the goal of this series is not to have a winner emerge, but rather to provide a “compare and contrast” that educates attendees on how the technologies work, the advantages of each, and common use cases. That’s exactly what we plan to do on September 11, 2018 when we host “Centralized vs. Distributed Storage.”

In the history of enterprise storage there has been a trend to move from local storage to centralized, networked storage. Customers found that networked storage provided higher utilization, centralized and hence cheaper management, easier failover, and simplified data protection, amongst many other advantages, which drove the move to FC-SAN, iSCSI, NAS and object storage. Recently, however, distributed storage has become more popular, where storage lives in multiple locations but can still be shared over a LAN (Local Area Network) and/or WAN (Wide Area Network). The advantages of distributed storage include the ability to scale out capacity. Additionally, in the hyperconverged use case, enterprises can use each node for both compute and storage, and scale up as more resources are needed.

What does this all mean? Register for this live webcast to find out, where my ESF colleagues and I will discuss:
  • Pros and cons of centralized vs. distributed storage
  • Typical use cases for centralized and distributed storage
  • How SAN, NAS, parallel file systems, and object storage fit in these different environments
  • How hyperconverged has introduced a new way of consuming storage
It’s sure to be another unbiased, vendor-neutral look at a storage topic many are debating within their own organizations. I hope you’ll join us on September 11th. In the meantime, I encourage you to watch our other on-demand debates.

Learn about the work SNIA is doing to lead the storage industry worldwide in developing and promoting vendor-neutral architectures, standards, and educational services that facilitate the efficient management, movement, and security of information by visiting snia.org.

