
Ethernet Networked Storage – FAQ

Fred Zhang

Dec 8, 2016


At our SNIA Ethernet Storage Forum (ESF) webcast “Re-Introduction to Ethernet Networked Storage,” we provided a solid foundation on Ethernet networked storage, the move to higher speeds, challenges, use cases and benefits. Here are answers to the questions we received during the live event.

Q. Within the iWARP protocol there is a layer called MPA (Marker PDU Aligned Framing for TCP) inserted for storage applications. What is the point of this protocol?

A. MPA is an adaptation layer between the iWARP Direct Data Placement Protocol and TCP/IP. It provides framing and CRC protection for Protocol Data Units.  MPA enables packing of multiple small RDMA messages into a single Ethernet frame.  It also enables an iWARP NIC to place frames received out-of-order (instead of dropping them), which can be beneficial on best-effort networks. More detail can be found in IETF RFC 5044 and IETF RFC 5041.

Q. What is the API for RDMA network IPC?

A. The general API for RDMA is called verbs. The OpenFabrics Verbs Working Group oversees the development of verbs definition and functionality in the OpenFabrics Software (OFS) code. You can find the training content from OpenFabrics Alliance here. General information about RDMA for Ethernet (RoCE) is available at the InfiniBand Trade Association website. Information about Internet Wide Area RDMA Protocol (iWARP) can be found at IETF: RFC 5040, RFC 5041, RFC 5042, RFC 5043, RFC 5044.

Q. RDMA requires TCP/IP (iWARP), InfiniBand, or RoCE to operate on with respect to NVMe over Fabrics. Therefore, what are the advantages and disadvantages of iWARP vs. RoCE?

A. Both RoCE and iWARP support RDMA over Ethernet. iWARP uses TCP/IP while RoCE uses UDP/IP. Debating which one is better is beyond the scope of this webcast, but you can learn more by watching the SNIA ESF webcast, “How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics.”

Q. 100Gb Ethernet Optical Data Center solution?

A. 100Gb Ethernet optical interconnect products were first available around 2011 or 2012 in a 10x10Gb/s design (100GBASE-CR10 for copper, 100GBASE-SR10 for optical), which required thick cables and bulky CXP or CFP MSA housings. These were generally used only for switch-to-switch links. Starting in late 2015, the more compact 4x25Gb/s design (using the QSFP28 form factor) became available in copper (DAC), optical cabling (AOC), and transceivers (100GBASE-SR4, 100GBASE-LR4, 100GBASE-PSM4, etc.). The optical transceivers allow 100GbE connectivity up to 100m, 2km, or 10km, depending on the type of transceiver and fiber used.

Q. Where is FCoE being used today?

A. FCoE is primarily used in blade server deployments where there can be contention for PCI slots and only one built-in NIC. These NICs typically support FCoE at 10Gb/s speeds, passing both FC and Ethernet traffic over a connection to a Top-of-Rack FCoE switch, which directs traffic to the respective fabrics (FC and Ethernet). However, FCoE has not gained much acceptance outside of the blade server use case.

Q. Why did iSCSI start out mostly in lower-cost SAN markets?

A. When it first debuted, iSCSI packets were processed by software initiators, which consumed CPU cycles and showed higher latency than Fibre Channel. Achieving high performance with iSCSI required expensive NICs with iSCSI hardware acceleration, and iSCSI networks were typically limited to 100Mb/s or 1Gb/s while Fibre Channel was running at 4Gb/s. Fibre Channel is also a lossless protocol, while TCP/IP is lossy, which caused concerns for storage administrators. Now, however, iSCSI can run on 25, 40, 50 or 100Gb/s Ethernet with various types of TCP/IP acceleration or RDMA offloads available on the NICs.

Q. What are some of the differences between iSCSI and FCoE?

A. iSCSI runs SCSI protocol commands over TCP/IP (except iSER which is iSCSI over RDMA) while FCoE runs Fibre Channel protocol over Ethernet. iSCSI can run over layer 2 and 3 networks while FCoE is Layer 2 only. FCoE requires a lossless network, typically implemented using DCB (Data Center Bridging) Ethernet and specialized switches.

Q. You pointed out that, at least twice, people incorrectly predicted the end of Fibre Channel, but it didn’t happen. What makes you say Fibre Channel is actually going to decline this time?

A. Several things are different this time. First, Ethernet is now much faster than Fibre Channel instead of the other way around. Second, Ethernet networks now support lossless and RDMA options that were not previously available. Third, several new solutions–like big data, hyper-converged infrastructure, object storage, most scale-out storage, and most clustered file systems–do not support Fibre Channel. Fourth, none of the hyper-scale cloud implementations use Fibre Channel and most private and public cloud architects do not want a separate Fibre Channel network–they want one converged network, which is usually Ethernet.

Q. Which storage protocols support RDMA over Ethernet?

A. The Ethernet RDMA options for storage protocols are iSER (iSCSI Extensions for RDMA), SMB Direct, NVMe over Fabrics, and NFS over RDMA. There are also storage solutions that use proprietary protocols supporting RDMA over Ethernet.

Storage Training Your Way - Education for 21st Century Professional Development

khauser

Dec 6, 2016

By Paul Talbut, SNIA Global Education and Regional Affiliate Program Director

It is widely accepted that those who are considered knowledge workers (those who use a screen or the internet as a part of their daily work routine) face constant disruption and distractions. The constant flow of emails, online news feeds, social media, and personal interests tends to draw people away from the need to concentrate on the job at hand. Recent studies suggest that workers are interrupted on average every five minutes, ironically often by work applications or collaboration tools. If this is happening to the work function, then it is not surprising that the opportunity to undergo training or personal development, where focus and concentration without distraction is key to learning, is severely restricted. The same studies suggest that around 1% of the working week is all that workers have available to focus on training and development.

This has led to a dramatic shift in the way people consume their training content. It is no longer practical or cost effective to lock people away in a training room for five days. Any such educational value is now much more likely to be absorbed on the train or the bus on the way to work, and so the nature of the content delivery has to change. Educational content needs to be delivered in modules and in a variety of formats to match the plethora of personal devices and platforms available today. Even without the formality of instructional design and a comprehensive curriculum, content such as podcasts, webcasts, and training materials needs to be accessible on the move and in such bite-size chunks that we can give people the information they need and make it available at the times that suit them best.

SNIA is changing the way study guides and materials are made available to our constituents by collaborating with our members and training partners to focus on a wide variety of educational channels. Materials and study guides are now available via e-book, PDF, YouTube, BrightTALK webcasts, podcasts, and online instructor-led courses. The challenge now is to make the content compelling and attractive enough to compete with all the other digital content available for consumption, and provide the opportunity to learn something new about storage rather than watching ice-buckets, mannequins or the latest cute cat.

To date, SNIA has certified over 12,000 storage professionals worldwide, and our vendor-neutral certification program continues to be the industry leader in the independent assessment of storage technology skills.



No Shortage of Container Storage Questions

Chad Hintz

Nov 29, 2016


We covered a lot of ground in our recent SNIA Ethernet Storage Forum webcast, “Current State of Storage in the Container World.” We had a technical discussion on why containers are so compelling, how Docker containers work, persistent shared storage and future considerations for container storage. We received some great questions during the live event, and as promised, here are answers to them all.

Q. Docker cannot be installed on bare metal and requires a base OS to operate upon right?

A. That is correct.

Q. Does the application code need to be changed so that it can “fit and operate” in a container?

A. No, the application code does not need to change. The challenge most people face when migrating an application to a container is how to maintain the application’s state. One of the motivations for this webcast was to explain how to allow applications within containers to persist data. Hopefully the Docker Volume construct will meet your needs.

Q. Seems like containers share one OS/kernel… That suggests that there is just one OS in the “containerized” server… And yet there is still mention of hypervisor (or at least Hyper-V)… Can you clarify? If the containers share an OS, is a hypervisor needed?

A. You are correct, containers are designed to share a single kernel; therefore a hypervisor is not required to run containers. Having said that, VMware and Microsoft both offer options that run a single container in its own virtual machine (running a minimal operating system).

Q. Can the Docker Hub be compared to something like the GitHub?

A. Yes, that is a great analogy. Docker Hub (hub.docker.com) is to container images as GitHub (github.com) is to source code.

Q. What are the differences between the base and the host image?

A. If you’re referring to the webcast slides, the box labeled “Base Image” is the first layer in an image. The box labeled “Host OS” is not a layer, but represents the hosting operating system (kernel) that is shared by the containers.

Q. So there is a separate root per container?

A. In most cases the image will provide a root, therefore each container will have a separate root. This is made possible by a kernel feature called namespaces. Alternatively, Docker does allow you to share a directory between the host operating system and any number of containers though.

Q. If Deduplication is enabled on the storage LUNs, won’t that affect the performance of the containers?

A. Well implemented data reduction features (compression and deduplication) should have little to no effect on performance and should provide significant benefit by reducing the space required to store containers.

Q. Can you please quickly review the concept of copy-on-write with one or two sentences to boil it down?

A. How copy-on-write works depends on whether the driver is file or block based. For the sake of simplicity, let’s assume a file-based implementation. Since the image layers are read-only, we need an area to store the changes that the container has made. This area is the copy-on-write layer. When a process reads a file that has not been modified, the file is read from one of the read-only layers. When that file is modified and needs to be written back to disk, the new file is written to the copy-on-write layer, as is the metadata that describes the file. The next time this file is read, it is read from the copy-on-write layer. The graph driver is responsible for this functionality and varies by implementation.
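
To make the layering concrete, here is a minimal, hypothetical sketch in Python of a file-based copy-on-write lookup (it is not how any particular graph driver is implemented): reads fall through from the writable layer to the read-only image layers, and writes always land in the writable layer.

```python
class LayeredFS:
    """Toy model of file-based copy-on-write: read-only image layers
    plus one writable container layer. Illustrative only."""

    def __init__(self, image_layers):
        # image_layers: list of dicts, ordered bottom -> top, all read-only
        self.image_layers = image_layers
        self.cow_layer = {}  # per-container writable (copy-on-write) layer

    def read(self, path):
        # Check the writable layer first, then walk the image layers top-down
        if path in self.cow_layer:
            return self.cow_layer[path]
        for layer in reversed(self.image_layers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Modifications never touch the image layers; the changed file
        # (and its metadata, omitted here) lands in the copy-on-write layer
        self.cow_layer[path] = data


base = {"/etc/hosts": b"127.0.0.1 localhost\n"}   # read-only base image layer
fs = LayeredFS([base])
fs.write("/etc/hosts", b"127.0.0.1 myapp\n")      # copy-on-write: new version
print(fs.read("/etc/hosts"))                      # served from the CoW layer
```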

Q. Can network locations be used for /data? If yes, how does the Docker Engine manage network authentication for the driver?

A. Yes, network locations can be used. The best practice is to use the Local Volume Driver, where you can pass in the required authentication via the options (see slide 15). Alternatively, the network location can be mounted on the host operating system and exposed to containers (see slides 21 & 22).
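
As a rough illustration of the Local Volume Driver approach, the sketch below drives the Docker CLI from Python to create an NFS-backed volume and attach it to a container; the server address, export path, and mount options are placeholders and will vary with your environment.

```python
import subprocess

# Hypothetical NFS server and export, used purely for illustration
nfs_server = "192.168.1.10"
nfs_export = "/exports/data"

# Create a Docker volume with the built-in "local" driver, passing
# mount options through --opt (type/o/device are mount(8)-style options)
subprocess.run([
    "docker", "volume", "create",
    "--driver", "local",
    "--opt", "type=nfs",
    "--opt", f"o=addr={nfs_server},rw",
    "--opt", f"device=:{nfs_export}",
    "nfs_data",
], check=True)

# Attach the volume to a container at /data and list its contents
subprocess.run([
    "docker", "run", "--rm",
    "-v", "nfs_data:/data",
    "alpine", "ls", "/data",
], check=True)
```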

Q. Is this where VAAI like primitives would get implemented?

A. VAAI defines several in-band primitives.  The Docker Volume plug-in framework is completely out-of-band.  There can be some overlap in features though.  For example, the XCOPY primitive can be used to offload ‘copy jobs’ to an array.  If the vendor chooses to do so, a ‘copy job’ can be offloaded through the Docker Volume plug-in as well.  For example, a plug-in might implement a “clone” option that provides this service.

Q. Could you share some details about Kubernetes storage? Persistent volumes and the difference from Docker volumes? Also, what is your perspective on Flocker?

A. Kubernetes has the concept of persistent storage. This abstraction is also called a volume. In addition, Kubernetes provides a plug-in option as well. The Kubernetes implementation predates the Docker Volume and is currently not compatible.

Q. Comment on mainframe: IBM runs Linux on zSeries, therefore can run Linux Docker containers.

A. Thanks, that’s good to know.

Q. How many operating systems changes on the x86 platform? How many on the mainframe platform? Can x86 architecture run the same code/OS from 40 years ago? Docker on mainframe?

A. The mainframe architecture has been very solid and consistent for many years.

Q. What is a big challenge for storage in container environment?

A. I don’t think storage has a challenge in the container environment. I think, with a properly implemented Docker Volume Plug-in, storage provides a solution to the persistent shared storage need in a container environment.

Q. Do you ever look into RexRay or VMDK storage drivers?

A. Yes, these are both examples of Docker Volume plug-in implementations.



Learn How to Develop Interoperable Cloud Encryption and Access Control

mac

Nov 21, 2016


SNIA Cloud is hosting a live webcast on December 20th, “Developing Interoperable Cloud Encryption and Access Control,” to discuss and demonstrate encrypted objects and delegated access control. For the data protection needs of sharing health and other data across different cloud services, this webcast will explore the capabilities of the Cloud Data Management Interface (CDMI) in addressing these requirements and show implementations of CDMI extensions for a health care example.

See it in action! This webcast will include a demonstration by Peter van Liesdonk of Philips who will share the results of testing at the SDC 2016 Cloud Plugfest for Encrypted Objects and Delegated Access Control extensions to CDMI 1.1.1.

You will see and learn:

  • New CDMI features (Encrypted Objects and Delegated Access Control)
  • Implementation experiences with new features
  • A live demo of a healthcare-based example

Register today. My colleagues, Peter van Liesdonk, David Slik and I will be on-hand to answer any questions you may have. We hope to see you there.



Common Questions on Clustered File Systems

John Kim

Nov 18, 2016


More than 350 people have already seen our SNIA Ethernet Storage Forum (ESF) webcast “Clustered File Systems: No Limits.” Our presenters, James Coomer and Jerry Lotto, did a great job explaining what clustered file systems are, key considerations, choices and performance. As we expected, there were plenty of questions, so as promised, here are answers to them all.

Q: Parallel NFS (pNFS) has been in development/standards effort for a long time, and I believe pNFS is not in the Linux kernel; it appears pNFS has yet to reach prime time.

A: pNFS has been in Linux for over a decade! Clients and servers are widely available, and you should look at the SNIA White Paper “An Updated Overview of NFSv4; NFSv4.0, NFSv4.1, pNFS, and NFSv4.2” for more information on the current state of play.

Q: Why the emphasis on parallel I/O? Any single storage server can feed results at link capacity, so you do not need multiple storage servers to feed a client at full speed. Isn’t the more critical issue the bottleneck on access to metadata for a single directory or file? Federated NAS bottlenecks updates for each directory behind a single master server?

A: Any one storage server can usually saturate one client, but often there are multiple hungry clients making requests simultaneously. Using parallel I/O allows multiple servers to feed multiple high-bandwidth clients across a narrow or wide set of data. This smooths out the I/O load on the servers in a near-perfect manner regardless of the number of clients performing I/O. It is absolutely true that metadata serving can become a bottleneck, so parallel file systems use cached and/or distributed metadata to overcome this and again, every client takes part in that interaction and shares some responsibility for managing communicating metadata updates.

Q: Can any application access parallel file system (i.e. through an agent in the driver level)? Or does it require specific code within the application?

A: Native access to a parallel file system requires a specific client or agent in the host, but many parallel file systems allow any client to access the data through a NAS protocol gateway. No changes are needed to applications to use a parallel file system – these parallel file systems are mounted as POSIX-compliant file systems and therefore adhere to essentially the same standards as, for example, an NFS mount.

Q: Are parallel file system clients compatible with scale-out NAS servers?

A: Nearly all scale-out NAS servers speak a standard NAS protocol like NFS or SMB. Clients running a parallel file system client can also access NAS via these standard protocols. Exceptions to this may possibly (but none that we know of) occur for scale-out NAS servers that support a modified NFS/SMB protocol or a custom NAS client which might conceivably conflict with the parallel file system client when installed on an OS.

Q: Of course I am biased, but I am fond of the AFS (Andrew File System) Family of File Systems.   There is OpenAFS, but there is also what we are doing at AuriStor extending beyond the core AFS global namespace model (security functionality, and performance)

A: AFS is another distributed file system which supports large scale deployments, native clients for many platforms, and strong security features. It also uses local caching of files to improve performance. It uses a weakly consistent file locking system so multiple clients can access the same file simultaneously but they cannot both update the same file at the same time. OpenAFS is an open-source implementation of AFS. Auristor (formerly Your File System, Inc.) is a startup providing a commercial parallel file system that is compatible with AFS.

Q: I am more familiar with Veritas Cluster File System, could you please do a quick compare with Lustre or GPFS?

A: The Veritas Cluster File System (formerly VxCFS, now part of Veritas InfoScale) is a distributed file system that runs on Linux and popular flavors of Unix. It supports up to 64 nodes and allows multiple nodes to share the same back-end storage hardware. Comparing it to Lustre and GPFS is beyond the scope of this webinar, but in basic terms, parallel file systems can offer far greater scalability and bandwidth for example, through the use of optimized RDMA clients for high performance networks.

Q: Why do file apps need shared access to data, but block apps do not?

A: Traditionally block storage did not offer shared access to data (except when used as shared back-end storage for a clustered file system), while apps that needed shared access to data usually chose to use a NAS protocol such as SMB or NFS. So in many cases file-based apps use file sharing protocols because they need shared access to data from multiple clients. (In other cases file-based applications do not require sharing but the storage administrators believe it’s easier to manage or less expensive than networked block storage.)

Q: Do Lustre and GPFS have SMB Direct support?

A: Not today. SMB Direct is an option to use RDMA and multi-channel with the SMB 3 protocol. Both Lustre and GPFS support the ability to export a file system via NFS or SMB, but generally they do not support SMB Direct yet. Both Lustre and GPFS support RDMA access through their clients.

Q: How do the clients avoid doing simultaneous writes to the same file?

A: Some parallel file systems allow this by letting different clients write to different parts of the same file. Others do not allow this. In either case, distributed file locking is used to prevent two clients from writing simultaneously to the same part of a file (or to the same file if it’s not allowed).
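
As a toy illustration of the idea (not the lock manager of any particular file system), the sketch below grants a byte-range write lock only when the requested range does not overlap a range already held by another client; a real parallel file system distributes this service and also handles caching, revocation, and client failure.

```python
class RangeLockManager:
    """Toy centralized byte-range lock manager, for illustration only.
    Real parallel file systems distribute this and handle lock revocation."""

    def __init__(self):
        # file path -> list of (start, end, client) write locks currently held
        self.locks = {}

    def acquire_write(self, client, path, start, end):
        held = self.locks.setdefault(path, [])
        for (s, e, owner) in held:
            # Overlapping range held by another client: deny (caller waits/retries)
            if owner != client and start < e and s < end:
                return False
        held.append((start, end, client))
        return True

    def release(self, client, path):
        self.locks[path] = [l for l in self.locks.get(path, []) if l[2] != client]


mgr = RangeLockManager()
print(mgr.acquire_write("client-A", "/data/file", 0, 4096))     # True
print(mgr.acquire_write("client-B", "/data/file", 4096, 8192))  # True: disjoint range
print(mgr.acquire_write("client-B", "/data/file", 0, 1024))     # False: overlaps A's lock
```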

Q: How can you say that the application “does not have to worry about” how the clustered file system serializes writes? Doesn’t this require continuous end-to-end connectivity?

A: When the application writes data it generally writes to a POSIX-compliant file system and does not need to worry about how the parallel file system serializes, distributes, or protects the data because this is virtualized (managed) by the file system. It usually does require continuous end-to-end connectivity from the clients to the servers, though in some cases caching could allow for brief gaps in connectivity and in some systems not every client needs to have network connectivity to every server. There are multiple mechanisms within parallel file systems to manage the various cases of clients/servers disappearing from the network, temporarily or permanently (whilst for example holding a lock).

Q: How does a parallel file system handle the sequences of write on a same file? Just append one by one? What if a client modified a line?

A: This is both the biggest challenge for a parallel file system and a key reason to use one. Beneath the covers, coherency is maintained by Spectrum Scale using a token management server process which issues locks for object requests. Similar functionality is implemented in Lustre using a distributed lock manager. These objects are most commonly blocks within files rather than entire files, but this is application controlled. The end result is a POSIX-compliant interface that scales to thousands of clients.

Q: What does FPO stand for?

A: File Placement Optimizer – a shared-nothing architecture and licensing model for IBM Spectrum Scale (aka GPFS). Learn more here.

Q: Is there a concept in parallel file systems for “auto-tuning” yet? Seems like the early days of SAN management and tuning…

A: Default tuning values are optimized for general purpose workloads, but the whole purpose of tuning parameters is to adjust away from those defaults to optimize the file system for a particular application workload or file system architecture. Both IBM and OpenSFS, with the support of Intel, have published extensive documentation on best practices for optimization and tuning for either file system. We are not aware of any work on “automating” that process, but there has been recent work (e.g. in Spectrum Scale) to simplify the tuning process.

Q: Which is better as interconnect between disk and servers, shared access or share-nothing?

A: The use of shared access in the interconnect between disks and servers is limited to providing HA functionality in Lustre or Spectrum Scale, the ability to service I/O requests to a storage device if the server which has primary responsibility for that device is not available.  This usually involves multiple server-attached external storage which can add cost to building the file system.  The alternative approach to HA is to replicate blocks of data to different disks on different servers, cutting back on the usable capacity of the file system.  If HA is not a requirement, a share-nothing architecture will generally involve less hardware and therefore be less expensive to build.
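
For a back-of-the-envelope feel of the capacity side of that trade-off, here is a deliberately simplified model (it ignores RAID overhead, spares, and metadata): shared-access HA consumes no extra raw capacity, while replication divides it by the replication factor.

```python
def usable_capacity_tb(raw_tb, ha="shared", replicas=2):
    """Rough usable capacity under two HA approaches (simplified model).

    ha="shared":     servers share dual-ported external storage; a partner
                     server takes over the devices on failure, so redundancy
                     costs extra hardware rather than raw capacity.
    ha="replicated": each block is stored `replicas` times on different
                     servers, so usable capacity is raw / replicas.
    """
    if ha == "shared":
        return raw_tb
    if ha == "replicated":
        return raw_tb / replicas
    raise ValueError(ha)


raw = 1000  # hypothetical 1 PB of raw disk across the cluster
print(usable_capacity_tb(raw, ha="shared"))                  # 1000 TB usable
print(usable_capacity_tb(raw, ha="replicated", replicas=2))  # 500 TB usable
```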

If you have more questions, please comment on this blog. And I encourage you to check out the SNIA ESF webcast library for educational, vendor-neutral content on Ethernet networked storage topics.


Recognize Volunteer Contributions - Nominations Open for the SNIA Individual and Group Recognition Program

khauser

Nov 16, 2016

Each year, at the Annual Members Symposium, SNIA members recognize their own - volunteers and organizations who have dedicated expertise and time to contribute to the important work done by SNIA technical work groups, committees, and initiatives. SNIA recognizes with a "Volunteer of the Year" award an individual contributor who has stepped up to help SNIA achieve new and groundbreaking work or significantly advanced an existing program. Past winners have included Mark Carlson of Toshiba, Jim Ryan of Intel, and Alex McDonald of NetApp.

With the Exceptional Leadership award, SNIA recognizes an individual who has advanced a cause for SNIA leading to an impact on the industry or the Association. Past winners have included Wayne Adams of EMC, Eric Hibbard of Hitachi, and Paul von Behren of Intel. SNIA also recognizes unsung heroes who work tirelessly under the radar expecting no attention but who in fact probably deserve more than the rest, and new contributors of the year who begin work in new areas.

SNIA also recognizes groups with several awards, including outstanding achievement of a SNIA Technology Community, significant contribution by a SNIA Committee or Regional Affiliate, significant impact by a previously existing SNIA Technical Work Group or Task Force, and contributions by new SNIA groups. Previous recipients have been acknowledged for their work in Persistent Memory, Solid State Storage, Storage Management, and Object Drives, and with SNIA India and the SNIA Global Steering Committee. A list of all individuals and groups recognized since 2008 can be found at http://www.snia.org/about/awards.

Also at the Annual Members Symposium, SNIA honors Deborah Kay Johnson, a SNIA member whose volunteer dedication to educating the industry on technology left a lasting impact, with the Deborah Kay Johnson Memorial Award. Past winners of this award for their outstanding contributions to education include Charles Tasse, Dell; Nancy Clay, SNIA; and David Deming, Solution Technology; all recipients are listed at http://www.snia.org/about/awards/dkj.

It's time for the 2016 awards, and SNIA encourages all members to enter their nominations for both individual and group categories. The window to submit is open until December 9 and your selections can be made at this link. Awards will be announced during the SNIA Annual Members Symposium, January 17-20, 2017, at the Westin San Jose. Register here to attend the Symposium and view the agenda.


Storage Basics Q&A and No One’s Pride was Hurt

J Metz

Nov 7, 2016


In the first webcast of our “Everything You Wanted To Know About Storage But Were Too Proud To Ask” series, “Part Chartreuse,” we covered the storage basics to break down the entire storage picture and identify the places where most of the confusion falls. It was a very well attended event and I’m happy to report, everyone’s pride stayed intact! We got some great questions from the audience, so as promised, here are our answers to all of them:

Q. What is parity? What is XOR?

A. In RAID, there are generally two kinds of data that are stored: the actual data and the parity data. The actual data is obvious; parity data is information about the actual data that you can use to reconstruct it if something goes wrong.

It’s important to note that this is not simply a copy of A and B, but rather a logical operation that is applied to the data. Commonly for RAID (other than simple mirroring) the method used is called an exclusive or, or XOR for short. The XOR function outputs true only when inputs differ (one is true, the other is false).

There’s a neat feature about XOR, and the reason it’s used by RAID. Calculate the value A XOR B (let’s call it AxB). Here’s an example on a pair of bytes.

A                                  10011100

B                                  01101100

A XOR B is AxB              11110000

Store all three values on separate disks. Now, if we lose A or B, we can use the fact that AxB XOR B is equal to A, and AxB XOR A is equal to B. For example, for A:

B                                  01101100

AxB                              11110000

A XOR AxB is A              10011100

We’ve regenerated the A we lost. (If we lose the parity bits, they can just be reconstructed from A and B.)
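
The same reconstruction can be expressed in a few lines of Python (illustrative only, operating on in-memory byte strings rather than real disk stripes):

```python
def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))

A = bytes([0b10011100])
B = bytes([0b01101100])

parity = xor_bytes(A, B)        # AxB = 11110000, stored on a third disk

# Lose disk A: rebuild it from B and the parity
rebuilt_A = xor_bytes(parity, B)
assert rebuilt_A == A

# Lose disk B: rebuild it from A and the parity
rebuilt_B = xor_bytes(parity, A)
assert rebuilt_B == B
```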

Q. What is common notation for RAID? I have seen RAID 4+1, and RAID (4,1). In the past, I thought this meant a total of 5 disks, but in your explanation it is only 4 disks.

A. RAID is notated by levels, which is determined by the way in which data is laid out on disk drives (there are always at least two). When attempting to achieve fault tolerance, there is always a trade-off between performance and capacity. Such is life.

There are 4 common RAID levels in use today (there are others, but these are the most common): RAID 0, RAID 1, RAID 5, and RAID 6. As a quick reminder from the webinar (you can see pictures of these in action there):

  • RAID 0: Data is striped across the disks without any parity. Very fast, but very unsafe (if you lose one, you lose all)
  • RAID 1: Data is mirrored between disks without any parity. Slowest, but you have an exact copy of the data so there is no need to recalculate anything to reconstruct the data.
  • RAID 5: Data is striped across multiple disks, and the parity is striped across multiple disks. Often seen as the best compromise: Fast writes and good safety net. Can withstand one disk loss without losing data.
  • RAID 6: Data is striped across multiple disks, and two parity bits are stored on all the disks. Same advantages of RAID 5, except now you can lose 2 drives before data loss.

Now, if you have enough disks, it is possible to combine RAID levels. You can, for instance, have four drives that combine mirroring and striping. In this case, you can have two sets of drives that are mirrored to each other, and the data is striped to each of those sets. That would be RAID 1+0, or often called RAID 10. Likewise, you can have two sets of RAID 5 drives, and you could stripe or mirror to each of those sets, and it would be RAID 50 or RAID 51, respectively.

Erasure Coding has a different notation, however. It does not use levels like RAID; instead, EC identifies the number of data bits and the number of parity bits.

So, with EC, you take a file or object and split it into ‘k’ blocks of equal size. Then, you take those k blocks and generate n blocks of the same size, such that any k out of n blocks suffice to reconstruct the original file. This results in a (n,k) notation for EC.

Since RAID is a subset of EC, RAID 6 is the equivalent of EC with n data disks and 2 parity disks – RAID(n,2) – and RAID(4,1) is RAID 5 with 4 data disks and 1 parity disk, and so on.
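
For a rough feel of the capacity and resilience trade-offs behind these notations, here is a small, simplified helper (it ignores hot spares, metadata overhead, and rebuild windows, and the loss count is only a crude lower bound):

```python
def layout_summary(data_disks, parity_disks, copies=1):
    """Simplified capacity/resilience model for striped+parity layouts,
    optionally mirrored `copies` times (e.g. copies=2 is a RAID 10/50/51
    style combination). Ignores spares, metadata, and rebuild behavior."""
    total = (data_disks + parity_disks) * copies
    usable_fraction = data_disks / total
    survivable_losses = parity_disks + (copies - 1)  # crude lower bound
    return total, usable_fraction, survivable_losses


# RAID 5 as 4 data + 1 parity, RAID 6 as 4 + 2, and a mirrored RAID 51
for name, cfg in [("RAID5 (4+1)", (4, 1, 1)),
                  ("RAID6 (4+2)", (4, 2, 1)),
                  ("RAID51 (4+1 mirrored)", (4, 1, 2))]:
    total, frac, losses = layout_summary(*cfg)
    print(f"{name}: {total} disks, {frac:.0%} usable, survives >= {losses} loss(es)")
```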

Q. Which RAIDs are classified/referred to as EC? I have often heard people refer to RAID 5/6 as EC. Is this only limited to 5/6?

A. All RAID levels are types of EC. The math is slightly different; traditional RAID uses XOR, and EC uses Galois Fields or polynomial arithmetic.

Q. What’s the advantage of RAID5 over RAID1?

A. As noted above, there is a tradeoff between the amount of capacity that you need in order to stay fault tolerant, and the performance you wish to have in any system.

RAID 1 is a mirrored system, where you have a single block of data being written twice – one to each disk. This is done in parallel, so it doesn’t take any extra time to do the write, but there’s no speed-up either. One advantage, however, is that if a disk fails there is no need to perform any logical calculations to reconstruct data – you already have a copy of the intact data.

RAID 5 is more distributed. That is, blocks of data are written to multiple disks simultaneously, along with a parity block. That is, you are breaking up the writing obligations across multiple disks, as well as sending parity data across multiple disks. This significantly speeds up the write process, but more importantly it also distributes the recovery capabilities as well so that any disk can fail without losing data.

Q. So RAID improves WRITES? I guess because it breaks the data into smaller pieces that can be written in parallel. If this is true, then why will READ not benefit from RAID? Wouldn’t reading those pieces from parallel sources and re-combining them into a larger piece also be faster?

A. RAID and the “striping” of IO can improve writes by reducing serialization by allowing us to write anywhere. But a specific block can only be read from the disk it was written to, and if we’re already reading or writing to that disk and it’s busy – we must wait.

Q. Why is EC better for object stores than RAID?

A. Because there’s more redundancy, EC can be made to operate across unreliable and less responsive links, and at potentially geographic scales.

Q: Can you explain about the “RAID Penalty?” I’ve heard it called “Write Penalty” or “Read before Write penalty.”

A. When updating data that’s already been written to disk, there’s a requirement to recalculate the parity data used by RAID. For example, if we update a single byte in a block, we need to read all the blocks, recalculate the parity, and write back the updated data block and the parity block (twice in the case of dual parity RAID6).

There are some techniques that can be used to improve the performance impact. For example, some systems don’t update blocks in place, but use pointer-based systems and only write new blocks. This technique is used by flash-based SSDs, as the write size is often 256KB or larger. This can be done in the drive itself, or by the RAID or storage system software. Avoiding in-place updates is especially important when using Erasure Coding, as there are so many data blocks and parity blocks to recalculate and rewrite that updates would become prohibitively expensive.
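
To see why a partial update costs extra I/O with single-parity RAID, note that the controller can recompute parity from the delta: new parity = old parity XOR old data XOR new data, so one logical write turns into two reads and two writes. A minimal sketch:

```python
def xor_bytes(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

# Hypothetical 4-byte "blocks" standing in for disk stripes
old_data   = bytes([0x10, 0x20, 0x30, 0x40])
other_data = bytes([0x01, 0x02, 0x03, 0x04])
old_parity = xor_bytes(old_data, other_data)

# Update-in-place of one block: read old data + old parity (2 reads),
# recompute parity from the delta, write new data + new parity (2 writes)
new_data   = bytes([0xAA, 0x20, 0x30, 0x40])
new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)

# The shortcut gives the same result as recomputing parity from scratch
assert new_parity == xor_bytes(new_data, other_data)
```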

Q. What is the significance of RAIN? We have not heard much about it.

A. A Redundant Array of Independent Nodes works under the same principles as RAID – that is, each node is treated as a failure domain that must be avoided as a Single Point of Failure (SPOF). Whereas RAID maintains an understanding of data placement on individual drives within a node, RAIN maintains an understanding of data placement on nodes (that contain drives) within a storage environment.

Q. Is host same as node?

A. At its core, a “node” is an endpoint. So, a host can be a node, but so can a storage device at the other end of the wire.

Q. Does it really matter what Erasure Coding (EC) technologies are named or is EC just EC?

A. Erasure Coding notation refers to the level of resilience involved. This notation underscores not only the write patterns for storage of data, but also the mechanisms necessary for recovery. What ‘matters’ really will depend upon the level of involvement for those particular tasks.

Q. Is the Volume Manager concept related to Logical Unit Numbering (LUNs)?

A. It can be. A volume manager is an abstraction layer that allows a host operating system to create a Volume out of one or more media locations. These locations can be either logical or physical. A LUN is an aggregation of media on the target/storage side. You can use a Volume Manager to create a single, logical volume out of multiple LUNs, for instance.

For additional information on this, you may want to watch our SNIA-ESF webcast, “Life of a Storage Packet (Walk).”

Q. What’s the relationship between disk controller and volume manager?

A. Following on the last question, a disk controller does exactly what it sounds like – it controls disks. A RAID controller, likewise, controls disks and the read/write mechanisms. Some RAID controllers have additional software abstraction capabilities that can act as a volume manager as well.
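
To illustrate the abstraction (a deliberately simplified, hypothetical model, not any real volume manager), the sketch below concatenates several LUNs into one logical volume and translates a logical block address into a (LUN, offset) pair:

```python
class SimpleVolumeManager:
    """Toy volume manager: concatenates LUNs into one logical volume.
    Real volume managers also handle striping, mirroring, resizing, etc."""

    def __init__(self, lun_sizes_in_blocks):
        # e.g. {"lun0": 1000, "lun1": 2000} -> a 3000-block logical volume
        self.extents = []
        start = 0
        for lun, size in lun_sizes_in_blocks.items():
            self.extents.append((start, start + size, lun))
            start += size
        self.total_blocks = start

    def map_block(self, logical_block):
        """Translate a logical block address to (LUN, block-within-LUN)."""
        for start, end, lun in self.extents:
            if start <= logical_block < end:
                return lun, logical_block - start
        raise ValueError("logical block out of range")


vm = SimpleVolumeManager({"lun0": 1000, "lun1": 2000})
print(vm.total_blocks)     # 3000 logical blocks in the volume
print(vm.map_block(500))   # ('lun0', 500)
print(vm.map_block(1500))  # ('lun1', 500)
```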

We hope these answers clear things up a bit more. As you know, “Everything You Wanted To Know About Storage, But Were Too Proud To Ask” is a series; since this Chartreuse event, we’ve done “Part Mauve – The Architecture Pod,” where we explained channel vs. bus, control plane vs. data plane, and fabric vs. network. Check it out on-demand and follow us on Twitter @SNIAESF for announcements on upcoming webcasts.
