
Q&A for Accelerating Gen AI Dataflow Bottlenecks

Erik Smith

Mar 25, 2024

Generative AI is front-page news everywhere you look. With advancements happening so quickly, it is hard to keep up. The SNIA Networking Storage Forum recently convened a panel of experts from a wide range of backgrounds to talk about Gen AI in general and, specifically, how dataflow bottlenecks can constrain Gen AI application performance well below optimal levels. If you missed this session, “Accelerating Generative AI: Options for Conquering the Dataflow Bottlenecks,” it’s available on-demand at the SNIA Educational Library. We promised to provide answers to our audience questions, and here they are.

Q: If ResNet-50 is a dinosaur from 2015, which model would you recommend using instead for benchmarking?

A: Setting aside the unfair aspersions being cast on the venerable ResNet-50, which is still used for inferencing benchmarks 😊, we suggest checking out the MLCommons website. In the benchmarks section you’ll see multiple use cases covering both training and inference. There are multiple benchmarks available that can provide more insight into how effectively your infrastructure will handle your intended workload.

Q: Even if/when we use optics to connect clusters, there is a roughly 5 ns/meter delay for the fiber between clusters. It seems like that physical distance limit almost mandates alternate ways of programming optimization to ‘stitch’ the interplay between data and compute?

A: With regard to using optics versus copper to connect clusters, signals propagate through fiber and copper at about the same speed, so moving to an all-optical cabling infrastructure for latency-reduction reasons is probably not the best use of capital. Even if there were a slight difference in propagation speed through a particular optical or copper medium, 5 ns/m is small compared to switch and NIC packet-processing latencies (e.g., 200-800 ns per hop) until you get to full metro distances. In addition, software latencies add 2-6 µs on top of the physical latencies in the most optimized systems. For AI fabrics, data and messages are pipelined, so the raw latency does not have much effect. Interestingly, the time for data to travel between nodes is only one of the factors limiting AI performance, and it is not the biggest one. Along these lines, there is a phenomenal talk by Stephen Jones (NVIDIA), “How GPU Computing Works,” that explains how latency between the GPU and memory impacts overall system efficiency much more than anything else. That said, the various collective communication libraries (NCCL, RCCL, etc.) and in-network compute (e.g., SHARP) can have a big impact on overall system efficiency by helping to avoid network contention.
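To put those numbers in perspective, here is a minimal back-of-the-envelope sketch of a one-way latency budget using the figures quoted above; the hop count, cable length, and software overhead are illustrative assumptions of our own, not measurements from the webinar.

```python
# Rough one-way latency budget using the figures quoted above.
# All inputs are illustrative assumptions.

PROPAGATION_NS_PER_M = 5        # ~5 ns/m through fiber or copper
SWITCH_HOP_NS = (200, 800)      # per-hop switch/NIC processing latency range
SOFTWARE_US = (2, 6)            # software stack overhead, microseconds

def latency_budget_ns(distance_m: float, hops: int) -> tuple[float, float]:
    """Return (best_case_ns, worst_case_ns) for a one-way transfer."""
    prop = distance_m * PROPAGATION_NS_PER_M
    best = prop + hops * SWITCH_HOP_NS[0] + SOFTWARE_US[0] * 1_000
    worst = prop + hops * SWITCH_HOP_NS[1] + SOFTWARE_US[1] * 1_000
    return best, worst

# Example: 100 m of cable across 3 switch hops.
best, worst = latency_budget_ns(distance_m=100, hops=3)
print(f"propagation only: {100 * PROPAGATION_NS_PER_M} ns")
print(f"total one-way latency: {best:.0f}-{worst:.0f} ns")
# Propagation (500 ns) is a minor share of the 3,100-8,900 ns total,
# which is why swapping copper for fiber alone buys little latency.
```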
Q: Does this mean that GPUs are more efficient to use than CPUs and DPUs?

A: GPUs, CPUs, AI accelerators, and DPUs all provide different functions and have different tradeoffs. While a CPU is good at executing arbitrary streams of instructions from applications and programs, embarrassingly parallelizable workloads (e.g., the matrix multiplications that are common in deep learning) can be performed much more efficiently by GPUs or AI accelerators, thanks to their ability to execute linear algebra operations in parallel. Similarly, I wouldn’t use a GPU or AI accelerator as a general-purpose data mover; I’d use a CPU or an IPU/DPU for that.
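As a small illustration of why matrix multiplication is called embarrassingly parallel, the NumPy sketch below (our own example, not material from the webinar) shows that each output row can be computed independently of every other row, which is exactly the property GPUs and AI accelerators exploit with thousands of parallel execution units.

```python
import numpy as np

# C = A @ B: every output row depends only on one row of A and all of B,
# so the rows can be computed independently (and therefore in parallel).
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 512))
B = rng.standard_normal((512, 128))

C_full = A @ B                                                # one fused computation
C_rowwise = np.stack([A[i] @ B for i in range(A.shape[0])])   # row by row

assert np.allclose(C_full, C_rowwise)
print("All 256 output rows are independent work items -",
      "a parallel processor can assign each one to a different core.")
```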
Q: With regard to vector engines, are there DPUs or switches (IB or Ethernet) that contain vector engines?

A: There are commercially available vector engine accelerators, but currently there are no IPUs/DPUs or switches that provide this functionality natively.

Q: One of the major bottlenecks in modern AI is GPU-to-GPU connectivity. For example, NVIDIA uses a proprietary GPU-to-GPU interconnect. With DGX-2 the focus was on 16 GPUs within a single box with NVSwitch, but with A100 NVIDIA pulled this back to 8 GPUs, then expanded to a SuperPOD with a second level of switching to reach 256 GPUs. How do NVLink, or other proprietary GPU-to-GPU interconnects, address bottlenecks? And why has the industry focused on 8-GPU rather than 16-GPU deployments, given that LLMs are not training on tens of thousands of GPUs?

A: GPU-to-GPU interconnects address bottlenecks in the same way other high-speed fabrics do: direct connections with large bandwidth, an optimized interconnect (point-to-point or parallel paths), and lightweight protocols. These interconnects have so far been proprietary and are not interoperable across GPU vendors. The number of GPUs in a server chassis depends on many practical factors; for example, 8 Gaudis per server leveraging standard RoCE ports provides a good balance to support training and inference.

Q: How do you see the future of blending memory and storage for generative AI workloads, and the direction of “unified” memory between accelerators, GPUs, DPUs, and CPUs?

A: If by unified memory you mean centralized memory that can be treated like a resource pool and consumed by GPUs in place of HBM, or by CPUs/DPUs in place of DRAM, then we do not believe we will see unified memory in the foreseeable future. The primary reason is latency. A unified memory would require centralization, and even if you were to constrain the distance between the end devices and the centralized memory to a single rack, the latency added by the extra circuitry and the physical length of the transport media (at 5 ns per meter) could be detrimental to performance. The other big problem with resource sharing is contention. Whether it is congestion in the network or contention at the centralized resource access point (interface), sharing resources requires special handling that will be challenging in the general case. For example, with 10 “compute” nodes attempting to access a pool of memory on a CXL Type 3 device, many of the nodes will end up waiting an unacceptably long time for a response. If by unified memory you mean creating a new “capacity” tier of memory that is more performant than SSD and less performant than DRAM, then CXL Type 3 devices appear to be the way the industry will address that use case, but it may be a while before we see mass adoption.

Q: Do you see hardware design becoming more specialized for the different AI/ML phases (training, inference, etc.)? In today’s enterprise deployments, the same hardware can be performing several tasks in parallel.

A: Yes. Not only have specialized hardware offerings (e.g., accelerators) already been introduced, such as consumer laptops combining CPUs with inference engines, but we also expect specialized configurations optimized for specific use cases (e.g., inferencing) to be introduced as well. The reason is the diverse set of requirements for each use case. For more information, see the OCP Global Summit 23 presentation “Meta’s evolution of network AI” (specifically starting at time stamp 4:30). They describe how different use cases stress the infrastructure in different ways. That said, there is value in accelerators and hardware being able to address any of the work types for AI, so that a given cluster can run whichever mix of jobs is required at a given time.

Q: Google leaders like Amin Vahdat have been casting doubts on the possibility of significant acceleration far from the CPU. Can you elaborate further on positioning data-centric compute in the face of that challenge?

A: This is a multi-billion-dollar question, and there isn’t an obvious answer today. You could imagine building a data processing pipeline with data transform accelerators ‘far’ from where the training and inferencing CPUs/accelerators are located. You could build a full “accelerator only” training pipeline if you consider a GPU to be an accelerator rather than a CPU. The better way to think about this problem is to recognize that there is no single answer for how to build ML infrastructure. There is also no single definition of CPU versus accelerator that matters in constructing useful AI infrastructure solutions; the distinction comes down to the role of the device within the infrastructure. With emerging ‘chiplet’ and similar approaches, the lines and distinctions will blur further. What is significant in what Vahdat and others have been discussing is that fabric/network/memory construction, plus protocols to improve bandwidth, limit congestion, and reduce tail latency when connecting the data to computational elements (CPUs, GPUs, AI accelerators, hybrids), will see significant evolution and development over the next few years.

The post Q&A for Accelerating Gen AI Dataflow Bottlenecks first appeared on SNIA on Data, Networking & Storage.


Hidden Costs of AI Q&A

Erik Smith

Mar 14, 2024

At our recent SNIA Networking Storage Forum webinar, “Addressing the Hidden Costs of AI,” our expert team explored the impacts of AI, including sustainability and areas where there are potentially hidden technical and infrastructure costs. If you missed the live event, you can watch it on-demand in the SNIA Educational Library. Questions from the audience ranged from training Large Language Models to fundamental infrastructure changes from AI and more. Here are answers to the audience’s questions from our presenters.

Q: Do you have an idea of where the best tradeoff is between high IO speed cost and GPU working cost? Is it always best to spend the maximum and get the highest IO speed possible?

A: It depends on what you are trying to do. If you are training a Large Language Model (LLM), then you’ll have a large collection of GPUs communicating with one another regularly (e.g., all-reduce) and doing so at throughput rates of up to 900 GB/s per GPU! For this kind of use case, it makes sense to use the fastest network option available. Any money saved by using a cheaper, slightly less performant transport will be more than offset by the cost of GPUs that sit idle while waiting for data. If you are more interested in fine-tuning an existing model or using Retrieval Augmented Generation (RAG), then you won’t need quite as much network bandwidth and can choose a more economical connectivity option. It’s worth noting that a group of companies have come together to work on the next generation of networking that will be well suited for use in HPC and AI environments. This group, the Ultra Ethernet Consortium (UEC), has agreed to collaborate on an open standard and has wide industry backing. This should allow even large clusters (1000+ nodes) to utilize a common fabric for all the network needs of a cluster.

Q: We (all industries) are trying to use AI for everything. Is that cost effective? Does it cost fractions of a penny to answer a user question, or is there a high cost that is being hidden or eaten by someone now because the industry is so new?

A: It does not make sense to try to use AI/ML to solve every problem. AI/ML should only be used when a more traditional, algorithmic technique cannot easily solve a problem (and there are plenty of those). Generative AI aside, one example where AI has historically provided an enormous benefit for IT practitioners is multivariate anomaly detection. These models can learn what normal looks like for a given set of telemetry streams and then alert the user when something unexpected happens (a minimal illustration of the idea appears after the resource list below). A traditional approach (e.g., hand-writing source code for an anomaly detector) would be cost and time prohibitive and probably nowhere near as good at detecting anomalies.

Q: Can you discuss typical data access patterns for model training or tuning (sequential/random, block sizes, repeated access, etc.)?

A: There is no simple answer, as the access patterns can vary from one type of training to the next. Assuming you’d like a better answer than that, I would suggest starting with two resources:
  1. Meta’s OCP Presentation: “Meta’s evolution of network for AI” includes a ton of great information about AI’s impact on the network.
  2. Blocks and Files article: “MLCommons publishes storage benchmark for AI” includes a table that provides an overview of benchmark results for one set of tests.
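To make the multivariate anomaly detection point above concrete, here is a minimal, self-contained sketch (our own illustration, not material from the webinar) that learns a “normal” baseline from telemetry samples and flags points that deviate strongly; production tools use far more sophisticated models.

```python
import numpy as np

# Toy multivariate anomaly detector: learn per-metric mean/std from a
# "normal" window of telemetry, then flag samples whose per-metric
# z-score is unusually large. Illustrative only.
rng = np.random.default_rng(1)

# Training window: 1,000 samples of 3 telemetry streams (e.g., IOPS,
# latency in ms, queue depth) behaving normally.
normal = rng.normal(loc=[50_000, 0.8, 12], scale=[5_000, 0.1, 2], size=(1000, 3))
mean, std = normal.mean(axis=0), normal.std(axis=0)

def is_anomalous(sample: np.ndarray, threshold: float = 4.0) -> bool:
    """Flag a sample whose worst per-metric z-score exceeds the threshold."""
    z = np.abs((sample - mean) / std)
    return bool(z.max() > threshold)

print(is_anomalous(np.array([51_000, 0.85, 11])))   # typical sample -> False
print(is_anomalous(np.array([51_000, 3.50, 11])))   # latency spike -> True
```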
Q: Will this video be available after the talk? I would like to forward it to my co-workers. Great info.

A: Yes. You can access the video and a PDF of the presentation slides here.

Q: Does this mean we're moving to a fewer-updates or write-once (or infrequently), read-mostly storage model? I'm excluding dynamic data from end-user inference requests.

A: For the active training and fine-tuning phases of an AI model, the data patterns are very read heavy. However, there is quite a lot of work done before a training or fine-tuning job begins that is much more balanced between read and write. This is the “data preparation” phase of an AI pipeline. Data prep takes existing data from a variety of sources (an in-house data lake, a dataset from a public repo, or a database) and performs data manipulation tasks to accomplish, at a minimum, data labeling and formatting. So, tuning for reads alone may not be optimal.

Q: Fibre Channel seems to have a lot of the characteristics required for the fabric. Could a Fibre Channel fabric with NVMe be utilized to handle data ingestion for the AI component on dedicated adapters for storage (disaggregated storage)?

A: Fibre Channel is not a great fit for AI use cases for a few reasons:
  • With AI, data is typically accessed as either Files or Objects, not Blocks, and FC is primarily used to access block storage.
  • If you wanted to use FC in place of IB (for GPU to GPU traffic) you’d need something like an FC-RDMA to make FC suitable.
  • All of that said, FC currently maxes out at 128GFC and there are two reasons why this matters:
    1. AI optimized storage starts at 200Gbps and based on some end user feedback, 400Gbps is already not fast enough.
    2. GPU-to-GPU traffic requires up to 900 GB/s (7,200 Gbps) of throughput per GPU; that’s about 56 128GFC interfaces per GPU.
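The arithmetic behind that last point is simple enough to show directly; the sketch below just re-derives the figures quoted in the answer, using nominal link rates rather than effective line rates.

```python
# Re-deriving the figures quoted above (nominal rates, illustrative only).
gpu_to_gpu_GBps = 900                  # per-GPU GPU-to-GPU throughput, GB/s
gpu_to_gpu_gbps = gpu_to_gpu_GBps * 8  # = 7,200 Gbps

fc_link_gbps = 128                     # nominal 128GFC link rate
links_per_gpu = gpu_to_gpu_gbps / fc_link_gbps

print(f"{gpu_to_gpu_gbps} Gbps per GPU")           # 7200 Gbps
print(f"~{links_per_gpu:.0f} x 128GFC links/GPU")  # ~56 links
```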
Q: Do you see something like GPUDirect Storage from NVIDIA becoming the standard? Does this mean NVMe will win (over FC or TCP)? Will other AI chip providers have to adopt their own GPUDirect-like protocol?

A: It's too early to say whether GPUDirect Storage will become a de facto standard or whether alternate approaches (e.g., pNFS) will be able to satisfy the needs of most environments. The answer is likely to be “both.”

Q: You've mentioned demand for higher throughput for training and lower latency for inference. Is there a demand for low-cost, high-capacity, archive-tier storage?

A: Not specifically for AI. Depending on what you are doing, training and inference can be latency or throughput sensitive (sometimes both). Training an LLM (which most users will never actually attempt to do) requires massive throughput from storage for reads and writes; literally the faster the better when loading data into the GPUs or when the GPUs are saving checkpoints. An inference workload wouldn't require the same throughput as training, but to the extent that it needs to access storage, it would certainly benefit from low latency. If you are trying to optimize AI storage for anything but performance (e.g., cost), you are probably going to be disappointed with the overall performance of the system.

Q: What are the presenters' views on the industry trend for where to run workloads or train a model? Is it in cloud datacenters like AWS or GCP, or on-prem?

A: It truly depends on what you are doing. If you want to experiment with AI (e.g., an AI version of a “Hello World” program), or even something a bit more involved, there are lots of options that allow you to use the cloud economically. Check out this collection of Colab notebooks for an example and give it a try for yourself. Once you get beyond simple projects, you'll find that using cloud-based services can become prohibitively expensive, and you'll quickly want to start running your training jobs on-prem. The downside is the need to manage the infrastructure elements yourself, and this assumes you can even get the right GPUs, although there are reports that supply issues are easing. The bottom line: whether to run on-prem or in the cloud still comes down to whether you can realistically get the same ease of use and freedom from hardware maintenance from your own infrastructure as you could from a CSP. Sometimes the answer is yes.

Q: Do the AI accelerators in PCs (recently advertised for new CPUs) have any impact/benefit when using large public AI models?

A: AI accelerators in PCs will be a boon for all of us, as they will enable inference at the edge. They will also allow exploration and experimentation on your local system for building your own AI work. You will, however, want to focus on small or mini models at this time. Without large amounts of dedicated GPU memory to help speed things up, only the small models will run well on your local PC. That being said, we will continue to see improvements in this area, and PCs are a great starting point for AI projects.

Q: Fundamentally, is AI radically changing what is required from storage? Or is it simply accelerating existing trends: reducing power, higher-density SSDs, and pushing faster on computational storage, new NVMe transport modes (such as RDMA), and ever more file-system optimizations?
A: From the point of view of a typical enterprise storage deployment (e.g., block storage accessed over an FC SAN), AI storage is completely different. Storage is accessed as either files or objects, not as blocks, and the performance requirements already exceed the maximum speeds FC can deliver today (i.e., 128GFC). This means most AI storage uses either Ethernet or IB as a transport. Raw performance seems to be the primary driver in this space right now, rather than reducing power consumption or increasing density. You can expect protocols such as GPUDirect and pNFS to become increasingly important to meet performance targets.

Q: What are the innovations in HDDs relative to AI workloads? This was mentioned in the SSD + HDD slide.

A: The point of the SSD + HDD slide was that the introduction of SSDs:
  1. dramatically improved overall storage system efficiency, leading to a significant performance boost. This boost increased the amount of data that a single storage port could transmit onto a SAN, which in turn heightened the need to monitor for congestion and congestion spreading.
  2. didn’t completely displace the need for HDDs, just as GPUs won’t replace the need for CPUs. They provide different functions and excel at different types of jobs.
Q: What is the difference between (1) Peak Inference, (2) Mainstream Inference, (3) Baseline Inference, and (4) Endpoint Inference, specifically from a cost perspective?

A: This question was answered live during the webinar (see timestamp 44:27); the following is a summary of the responses. Endpoint inference is inference happening on client devices (e.g., laptops, smartphones), using much smaller models that have been optimized for the very constrained power envelope of those devices. Peak inference can be thought of as something like ChatGPT or Bing's AI chatbot, where you need large, specialized infrastructure (e.g., GPUs and specialized AI hardware accelerators). Mainstream and baseline inference sit somewhere in between, where you're using much smaller or specialized models. For example, you could have a Mistral 7B model that you have fine-tuned for your enterprise use case of document summarization or finding insights in a sales pipeline; these use cases can employ much smaller models, and hence the requirements can vary. In terms of cost, deploying these models for edge inference would be low compared to peak inference like ChatGPT, which would be much higher. In terms of infrastructure requirements, some of the baseline and mainstream inference models can be served using a CPU alone, a CPU plus a GPU, a CPU plus a few GPUs, or a CPU plus a few AI accelerators. CPUs available today do have built-in AI accelerators, which can provide a cost-optimized solution for baseline and mainstream inference, which will be the typical scenario in many enterprise environments.

Q: You said utilization of network and hardware is changing significantly, but compared to what? Traditional enterprise workloads or HPC workloads?

A: AI workloads will drive network utilization unlike anything the enterprise has ever experienced before. Each GPU (of which there are currently up to 8 in a server) can generate 900 GB/s (7,200 Gbps) of GPU-to-GPU traffic. To be fair, this GPU-to-GPU traffic can and should be isolated to a dedicated “AI fabric” that has been specifically designed for this use. Along these lines, new types of network topologies are being used; Rob mentioned one of them during his portion of the presentation (i.e., the rail topology). Those end users already familiar with HPC will find that many of the same constraints and scalability issues that need to be dealt with in HPC environments also impact AI infrastructure.

Q: What are the key networking considerations for AI deployed at the edge (i.e., stores, branch offices)?

A: AI at the edge is a talk all on its own. Much as we see large differences between training, fine-tuning, and inference in the data center, inference at the edge has many flavors and performance requirements that differ from use case to use case. Compare, for example, a centralized set of servers ingesting the camera feeds for a large retail store, aggregating them, and making inferences, versus a single camera watching an intersection and using an on-chip AI accelerator to make streaming inferences. All kinds of devices, from medical test equipment to your car or your phone, are edge devices with wildly different capabilities.

The post Hidden Costs of AI Q&A first appeared on SNIA on Network Storage.


2024 Year of the Summit Kicks Off – Meet us at MemCon

SNIA CMS Community

Mar 6, 2024

2023 was a great year for SNIA CMSI to meet with IT professionals and end users in “Summits” to discuss technologies, innovations, challenges, and solutions.  Our outreach at six industry events reached over 16,000 and we thank all who engaged with our CMSI members. We are excited to continue a second “Year of the Summit” with a variety of opportunities to network and converse with you.  Our first networking event will take place March 26-27, 2024 at MemCon in Mountain View, CA. MemCon 2024 focuses on systems design for the data centric era, working with data-intensive workloads, integrating emerging technologies, and overcoming data movement and management challenges.  It’s the perfect event to discuss the integration of SNIA’s focus on developing global standards and delivering education on all technologies related to data.  SNIA and MemCon have prepared a video highlighting several of the key topics to be discussed.
MemCon 2024 Video Preview
At MemCon, SNIA CMSI member and SDXI Technical Work Group Chair Shyam Iyer of Dell will moderate a panel discussion on “How are Memory Innovations Impacting the Total Cost of Ownership in Scaling-Up and Power Consumption,” discussing impacts on hyperscalers, AI/ML compute, and cost/power. SNIA Board member David McIntyre will participate in a panel on “How are Increased Adoption of CXL, HBM, and Memory Protocol Expected to Change the Way Memory and Storage is Used and Assembled?,” with insights on the markets and emerging memory innovations. The full MemCon agenda is here. In the exhibit area, SNIA leaders will be on hand to demonstrate updates to the SNIA Persistent Memory Programming Workshop featuring new CXL® memory modules (get an early look at our programming exercises here) and to provide a first look at a Smart Data Accelerator Interface (SDXI) specification implementation. We’ll also provide updates on SNIA technical work on form factors like those used for CXL. We will feature a drawing for gift cards at the SNIA-hosted coffee receptions and at the Tuesday evening networking reception. SNIA colleagues and friends can register for MemCon with a 15% discount using code SNIA15. And stay tuned for engaging with SNIA at upcoming events in 2024, including a return of the SNIA Compute, Memory, and Storage Summit in May 2024; FMS – the Future of Memory and Storage in August 2024; SNIA SDC in September; and SC24 in Atlanta in November 2024. We’ll discuss each of these in depth in our Year of the Summit blog series. The post 2024 Year of the Summit Kicks Off – Meet us at MemCon first appeared on SNIA Compute, Memory and Storage Blog.


Power Efficiency Measurement – Our Experts Make It Clear – Part 3

Measuring power efficiency in datacenter storage is a complex endeavor. A number of factors play a role in assessing individual storage devices or system-level logical storage for power efficiency. Luckily, our SNIA experts make the measuring easier! In this SNIA Experts on Data blog series, our experts in the SNIA Solid State Storage Technical Work Group and the SNIA Green Storage Initiative explore factors to consider in power efficiency measurement, including the nature of application workloads, IO streams, and access patterns; the choice of storage products (SSDs, HDDs, cloud storage, and more); the impact of hardware and software components (host bus adapters, drivers, OS layers); and access to read and write caches, CPU and GPU usage, and DRAM utilization. Join us on our journey to better power efficiency as we continue with Part 3: Traditional Differences in Power Consumption: Hard Disk Drives vs Solid State Drives. And if you missed our earlier segments, click on the titles to read them: Part 1: Key Issues in Power Efficiency Measurement and Part 2: Impact of Workloads on Power Efficiency Measurement. Bookmark this blog and check back in April for the final installment of our four-part series. And explore the topic further in the SNIA Green Storage Knowledge Center.

Traditional Differences in Power Consumption: Hard Disk Drives vs Solid State Drives

There are significant differences in power efficiency between Hard Disk Drives (HDDs) and Solid State Drives (SSDs). While some commentators have examined differences in power efficiency measurement for HDDs vs SSDs, much of the analysis has not accounted for the key power efficiency contributing factors outlined in this blog. As a simple generalization at the individual storage device level, HDDs show higher power consumption than SSDs. In addition, SSDs have higher performance (IOPS and MB/s), often by an order of magnitude or more. Hence, a cursory consideration of device power efficiency measurement, expressed as IOPS/W or MB/s/W, will typically favor the faster SSD with its lower device power consumption. On the other hand, depending on the workload and IO transfer size, HDD devices and systems may exhibit better IOPS/W and MB/s/W when measured with large-block sequential read/write workloads where head actuators can remain on the disk OD (outer diameter) with limited seek accesses. These traditional HDD and SSD power efficiency considerations can be described at the device level in terms of the following key points. HDDs (Hard Disk Drives):
  1. Mechanical Components: HDDs consist of spinning disks and mechanical read/write heads. These moving parts consume a substantial amount of power, especially during startup and when seeking data.
  2. Idle Power Consumption: Even when not actively reading or writing data, HDDs still consume a notable amount of power to keep the disks spinning and ready to access data.
  3. Access Time Impact: The mechanical nature of HDDs leads to longer access times compared to SSDs. This means the drive remains active for longer periods during data access, contributing to higher power consumption.
SSDs (Solid State Drives):
  1. No Moving Parts: SSDs are entirely electronic and have no moving parts. As a result, they consume less power during both idle and active states compared to HDDs.
  2. Faster Access Times: SSDs have much faster access times since there are no mechanical delays. This results in quicker data retrieval and reduced active time, contributing to lower power consumption.
  3. Energy Efficiency: SSDs are generally more energy-efficient, as they consume less power during read and write operations. This is especially noticeable in laptops and portable devices, where battery life is critical.
  4. Less Heat Generation: Due to their lack of moving parts, SSDs generate less heat during operation, which can lead to better thermal efficiency in systems.
In summary, SSDs tend to be more power-efficient than HDDs due to their lack of mechanical components, faster access times, and lower energy consumption during both active and idle states. This power efficiency advantage is one of the reasons why SSDs have become increasingly popular in various computing devices, from laptops to data centers.
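As a worked illustration of the IOPS/W and MB/s/W metrics discussed above, here is a small sketch; the device figures are hypothetical round numbers chosen only to show how the arithmetic works and how the comparison narrows for large-block sequential workloads, not measured results.

```python
# Hypothetical device figures, for illustrating the metrics only.
devices = {
    #             random IOPS, seq MB/s, active power (W)
    "HDD (7.2k)": (200,         250,      8.0),
    "SSD (NVMe)": (500_000,     3_000,    12.0),
}

for name, (iops, mbps, watts) in devices.items():
    print(f"{name}: {iops / watts:>10,.0f} IOPS/W   {mbps / watts:>7,.1f} MB/s/W")

# With these assumptions the SSD wins decisively on IOPS/W (~41,667 vs 25),
# while the MB/s/W gap (250 vs ~31) is far narrower, which is why workload
# choice (random vs large-block sequential) matters when comparing devices.
```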


AIOps: The Undeniable Paradigm Shift

Michael Hoard

Mar 4, 2024

AI has entered every aspect of today’s digital world. For IT, AIOps is creating a dramatic shift that redefines how IT approaches operations. On April 9, 2024, the SNIA Cloud Storage Technologies Initiative will host a live webinar, “AIOps: Reactive to Proactive – Revolutionizing the IT Mindset.” In this webinar, Pratik Gupta, one of the industry’s leading experts in AIOps, will delve beyond the tools of AIOps to reveal how AIOps introduces intelligence into the very fabric of IT thinking and processes, discussing:
  • From Dev to Production and Reactive to Proactive: Revolutionizing the IT Mindset: We’ll move beyond the “fix it when it breaks” mentality, embracing a future-proof approach where AI analyzes risk, anticipates issues, prescribes solutions, and learns continuously.
  • Beyond Siloed Solutions: Embracing Holistic Collaboration:  AIOps fosters seamless integration across departments, applications, and infrastructure, promoting real-time visibility and unified action.
  • Automating the Process: From Insights to Intelligent Action: Dive into the world of self-healing IT, where AI-powered workflows and automation resolve issues and optimize performance without human intervention.
Register here to join us on April 9, 2024 for what will surely be a fascinating discussion on the impact of AIOps.


Emerging Memories Branch Out – a Q&A

SNIA CMSI

Feb 19, 2024

Our recent SNIA Persistent Memory SIG webinar explored in depth the latest developments and futures of emerging memories, which are now found in multiple applications both as stand-alone chips and embedded into systems on chips. We got some great questions from our live audience, and our experts Arthur Sainio, Tom Coughlin, and Jim Handy have taken the time to answer them in depth in this blog. And if you missed the original live talk, watch the video and download the PDF here.

Q: Do you expect Persistent Memory to eventually gain the speeds that exist today with DRAM?

A: It appears that that has already happened with the hafnium ferroelectrics that SK Hynix and Micron have shown. Ferroelectric memory is a very fast technology, and with very fast write cycles there should be every reason for it to go that way. With the hooks that are in CXL™, though, that shouldn’t be much of a problem, since it’s a transactional protocol. Reads, then, will probably rival DRAM speeds for MRAM and for resistive RAM (MRAM might get up to DRAM speeds with its writes too). In fact, there are technologies like spin-orbit torque and even voltage-controlled magnetic anisotropy that promise higher performance and low write latency for MRAM technologies. Most applications are probably read intensive, so reads are where the real focus is, but it does look like we are going to get there.

Q: Are all the new memory technology protocols (electrically) compatible with DRAM interfaces like DDR4 or DDR5? If not, shouldn’t those technologies have lower chances of adoption, since they add a dependency on a custom in-memory controller?

A: That’s just a logic problem. There’s nothing innate about any memory technology that couples it tightly with any kind of a bus, and because NOR flash and SRAM have been the easy targets so far, most emerging technologies have used a NOR flash or SRAM type interface. However, in the future they could use DDR. There are some special twists because you don’t have to refresh emerging memory technologies, but in general they could use DDR. One of the beauties of CXL, though, is that you can put anything you want, with any kind of interface, on the other side of CXL, and CXL erases the differences. It moderates them, so although the memories may have different performance, that is hidden behind the CXL network. The burden then falls on the CXL controller designers to make sure that those emerging technologies, whether MRAM or others, can be adopted behind the CXL protocol. My expectation is for a few companies early on to provide CXL controllers that do have some kind of specialty interface on them, whether for MRAM or resistive RAM or something like that, and then eventually for those to move their way into the mainstream. Another interesting thing about CXL is that we may even see a hierarchy of different memories within CXL itself, which also includes domain-specific processors or accelerators that operate close to memory, so there are very interesting opportunities there as well. If you can do processing close to memory, you lower the amount of data you’re moving around and you save a lot of power for the computing system.

Q: Emerging memory technologies have a byte-level direct access programming model, in contrast to block-based NAND flash. Do you think this new programming model will eventually replace NAND flash, as it reduces the overhead and the power of transferring data?
A: It’s a question of cost, and that’s something that was discussed very much in our webinar. If you haven’t got a cost that’s comparable to NAND flash, then you can’t really displace it. But as far as the interface is concerned, the NAND interface is incredibly clumsy. All of these technologies have byte interfaces rather than a block interface, and they can also write in place; they don’t need a pre-erased block to write into. From a technical standpoint that is a huge advantage, and now it’s just a question of whether they can get the cost down, which means getting the volume up.

Q: Can you discuss the High Bandwidth Memory (HBM) trends? What about memories used with Graphics Processing Units (GPUs)?

A: That topic isn’t the subject of this webinar, which is about emerging memory technologies. But, to comment, we don’t expect emerging memory technologies to adopt an HBM interface anytime in the really near future, because HBM springboards off DRAM and, as we discussed on one of the slides, DRAM has a transition, whose timing we don’t know, to another emerging memory technology. We’ve put it in the early 2030s in our chart, but it could be much later than that, and HBM won’t convert over to an emerging memory technology until long after that. However, HBM involves stacking of chips, and that ultimately could happen. It’s a more expensive process right now, a way of getting a lot of memory very close to a processor, and if you look at some of the NVIDIA applications, for example, this is an example of chiplet technology, and HBM can play a role in those chiplet technologies for GPUs. That’s another area that’s going to be using emerging memories as well: in the chiplets. While we didn’t talk about that so much in this webinar, it is another place for emerging memories to play a role. There’s one other advantage to using an emerging memory that we did not talk about: emerging memories don’t need refresh. As a matter of fact, none of the emerging memory technologies need refresh. More power is consumed by DRAM refreshing than by actual data accesses. And so, if you can cut that out, you might be able to stack more chips on top of each other and get even more performance, but we still wouldn’t see that as a reason for DRAM to be displaced early on in HBM and then later on in the mainstream DRAM market. Although, if you’re doing all those refreshes, there’s a fair amount of potential heat generation, which may have packaging implications as well. So, there may be some niche areas in which some of these emerging memories could first be used for those kinds of applications, if the performance is good enough.

Q: Why have some memory companies failed? Apart from the cost/speed considerations you mention, what are the other minimum envelope features that a new emerging memory should have? Is capacity (I heard 32Gbit multiple times) one of those criteria?

A: Shipping a product is probably the single most important activity for success. Companies don’t have to make a discrete or standalone SRAM or emerging memory chip, but what they need to do is have their technology adopted by somebody who is shipping something, if they’re not going to ship it themselves. That’s what we see in the embedded market as a good path for emerging memory IP: to get used and to build up volume.
And as the volume and comfort with manufacturing those memories increase, it opens up the possibility down the road of lower costs with higher volume standalone memory as well. Q:  What are the trends in DRAM interfaces?  Would you discuss CXL's role in enabling composable systems with DRAM pooling? A:  CXL, especially CXL 3.0, has particularly pointed at pooling. Pooling is going to be an extremely important development in memory with CXL, and it's one of the reasons why CXL probably will proliferate. It allows you to be able to allocate memory which is not attached to particular server CPUs and therefore to make more efficient and effective use of those memories. We mentioned this earlier when we said that right now DRAM is that memory with some NAND Flash products out there too. But this could expand into other memory technologies behind CXL within the CXL pool as well as accelerators (domain specific processors) that do some operations closer to where the memory lives. So, we think there's a lot of possibilities in that pooling for the development and growth of emerging memories as well as conventional memories. Q: Do you think molecular-based technologies (DNA or others) can emerge in the coming years as an alternative to some of the semiconductor-based memories? A: DNA and other memory technologies are in a relatively early stage but there are people who are making fairly aggressive plans on what they can do with those technologies. We think the initial market for those molecular memories are not in this high performance memory application; but especially with DNA, the potential density of storage and the fact that you can make lots of copies of content by using genetic genomic processes makes them very attractive potentially for archiving applications.  The things we’ve seen are mostly in those areas because of the performance characteristics. But the potential density that they're looking at is actually aimed at that lower part of the market, so it has to be very, very cost effective to be able to do that, but the possibilities are there.  But again, as with the emerging high performance memories, you still have the economies of scale you have to deal with - if you can't scale it fast enough the cost won't go down enough that will actually will be able to compete in those areas. So it faces somewhat similar challenges, though in a different part of the market. Earlier in the webcast, we said when showing the orb chart, that for something to fit into the computing storage hierarchy it has to be cheaper than the next faster technology and faster than the next cheaper technology. DNA is not a very fast technology and so that automatically says it has to be really cheap for it to catch on and that puts it in a very different realm than the emerging memories that we're talking about here. On the other hand, you never know what someone's going to discover, but right now the industry doesn’t know how to make fast molecular memories. Q:  What is your intuition on how tomorrow's highly dense memories might impact non-load/store processing elements such as AI accelerators? As model sizes continue to grow and energy density becomes more of an issue, it would seem like emerging memories could thrive in this type of environment. Your thoughts? A:  Any memory would thrive in an environment where there was an unbridled thirst for memory. as artificial intelligence (AI) currently is. 
But AI is undergoing some pretty rapid changes, not only in the number of the parameters that are examined, but also in the models that are being used for it. We recently read a paper that was written by Apple* where they actually found ways of winnowing down the data that was used for a large language model into something that would fit into an Apple MacBook Pro M2 and they were able to get good performance by doing that.  They really accelerated things by ignoring data that didn't really make any difference. So, if you take that kind of an approach and say: “Okay.  If those guys keep working on that problem that way, and they take it to the extreme, then you might not need all that much memory after all.”  But still, if memory were free, I'm sure that there'd be a ton of it out there and that is just a question of whether or not these memories can get cheaper than DRAM so that they can look like they're free compared to what things look like today. There are three interesting elements of this:  First, CXL, in addition allowing mixing of memory types, again allows you to put in those domain specific processors as well close to the memory. Perhaps those can do some of the processing that's part of the model, in which case it would lower the energy consumption. The other thing it supports is different computing models than what we traditionally use. Of course there is quantum computing, but there also is something called neural networks which actually use the memory as a matrix multiplier, and those are using these emerging memories for that technology which could be used for AI applications.  The other thing that's sort of hidden behind this is that spin tunnelling is changing processing itself in that right now everything is current-based, but there's work going on in spintronic based devices that instead of using current would use the spin of electrons for moving data around, in which case we can avoid resistive heating and our processing could run a lot cooler and use less energy to do so.  So, there's a lot of interesting things that are kind of buried in the different technologies being used for these emerging memories that actually could have even greater implications on the development of computing beyond just the memory application themselves.  And to elaborate on spintronics, we’re talking about logic and not about spin memory - using spins rather than that of charge which is current. Q:  Flash has an endurance issue (maximum number of writes before it fails). In your opinion, what is the minimum acceptable endurance (number of writes) that an emerging memory should support? It’s amazing how many techniques have fallen into place since wear was an issue in flash SSDs.  Today’s software understands which loads have high write levels and which don’t, and different SSDs can be used to handle the two different kinds of load.  On the SSD side, flash endurance has continually degraded with the adoption of MLC, TLC, and QLC, and is sometimes measured in the hundreds of cycles.  What this implies is that any emerging memory can get by with an equally low endurance as long as it’s put behind the right controller. In high-speed environments this isn’t a solution, though, since controllers add latency, so “Near Memory” (the memory tied directly to the processor’s memory bus) will need to have higher endurance.  
Still, an area that can help to accommodate that is the practice of putting code into memories that have low endurance and data into higher-endurance memory (which today would be DRAM).  Since emerging memories can provide more bits at a lower cost and power than DRAM, the write load to the code space should be lower, since pages will be swapped in and out more frequently.  The endurance requirements will depend on this swapping, and I would guess that the lowest-acceptable level would be in the tens of thousands of cycles. Q: It seems that persistent memory is more of an enterprise benefit rather than a consumer benefit. And consumer acceptance helps the advancement and cost scaling issues. Do you agree? I use SSDs as an example. Once consumers started using them, the advancement and prices came down greatly. Anything that drives increased volume will help.  In most cases any change to large-scale computing works its way down to the PC, so this should happen in time here, too. But today there’s a growing amount of MRAM use in personal fitness monitors, and this will help drive costs down, so initial demand will not exclusively come from enterprise computing. At the same time, the IBM FlashDrive that we mentioned uses MRAM, too, so both enterprise and consumer are already working to simultaneously grow consumption. Q: The CXL diagram (slide 22 in the PDF) has 2 CXL switches between the CPUs and the memory. How much latency do you expect the switches to add, and how does that change where CXL fits on the array of memory choices from a performance standpoint? The CXL delay goals are very aggressive, but I am not sure that an exact number has been specified.  It’s on the order of 70ns per “Hop,” which can be understood as the delay of going through a switch or a controller. Naturally, software will evolve to work with this, and will move data that has high bandwidth requirements but is less latency-sensitive to more remote areas, while managing the more latency-sensitive data to near memory. Q: Where can I learn more about the topic of Emerging Memories? Here are some resources to review   * LLM in a Flash: Efficient Large Language Model Inference with Limited Memory, Kevin Avizalideh, et. al.,             arXiv:2312.11514 [cs.CL]

Olivia Rhye

Product Manager, SNIA

Find a similar article by tags

Leave a Reply

Comments

Name

Email Adress

Website

Save my name, email, and website in this browser for the next time I comment.

Emerging Memories Branch Out – a Q&A

SNIA CMS Community

Feb 19, 2024

title of post
Our recent SNIA Persistent Memory SIG webinar explored in depth the latest developments and futures of emerging memories – now found in multiple applications, both as stand-alone chips and embedded into systems on chips. We got some great questions from our live audience, and our experts Arthur Sainio, Tom Coughlin, and Jim Handy have taken the time to answer them in depth in this blog. And if you missed the original live talk, watch the video and download the PDF here.

Q: Do you expect persistent memory to eventually gain the speeds that exist today with DRAM?

A: It appears that this has already happened with the hafnium ferroelectrics that SK Hynix and Micron have shown. Ferroelectric memory is a very fast technology, and with very fast write cycles there is every reason for it to get there. Even where writes lag, the hooks that are in CXL™ mean that shouldn't be much of a problem, since CXL is a transactional protocol. Reads will probably rival DRAM speeds for MRAM and for resistive RAM (MRAM might reach DRAM speeds with its writes too). In fact, there are technologies such as spin-orbit torque and even voltage-controlled magnetic anisotropy that promise higher performance and low write latency for MRAM. Most applications are read-intensive, so reads are where the real focus is, but it does look like we are going to get there.

Q: Are all the new memory technology protocols electrically compatible with DRAM interfaces like DDR4 or DDR5? If not, shouldn't those technologies have lower chances of adoption, since they add a dependency on a custom memory controller?

A: That's just a logic problem. There's nothing innate about any memory technology that couples it tightly to any particular bus; because NOR flash and SRAM have been the easy targets so far, most emerging technologies have used a NOR-flash or SRAM-style interface, but in the future they could use DDR. There are some special twists because you don't have to refresh emerging memory technologies, but in general they could use DDR. One of the beauties of CXL is that you can put anything you want, with any kind of interface, on the other side of it, and CXL hides the differences: the media may perform differently, but that is moderated behind the CXL network. The burden then falls on the CXL controller designers to make sure that those emerging technologies, whether MRAM or others, can be adopted behind the CXL protocol. Our expectation is that a few companies will initially offer CXL controllers with a specialty media interface – for MRAM, resistive RAM, or something similar – and that these will eventually work their way into the mainstream. Another interesting thing about CXL is that we may even see a hierarchy of different memories within CXL itself, including domain-specific processors or accelerators that operate close to memory, so there are very interesting opportunities there as well. If you can do processing close to memory, you move less data around and save a lot of power in the computing system.
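The "anything behind CXL" point lends itself to a small illustration. The sketch below is not a real CXL API – the class names, latency figures, and media list are invented for this example – it just shows how a transactional load/store contract can keep host-side code identical while the media behind the controller differ:

```python
# Toy model only: illustrative names and latencies, not a real CXL controller API.

class MediaBackend:
    """Any memory technology behind the controller: DRAM, MRAM, ReRAM, ..."""
    def __init__(self, name, read_ns, write_ns, needs_refresh):
        self.name = name
        self.read_ns = read_ns        # media-specific quirks stay down here
        self.write_ns = write_ns
        self.needs_refresh = needs_refresh
        self.cells = {}

    def read(self, addr):
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.cells[addr] = value


class TransactionalController:
    """The host only sees load/store transactions; it never sees the media type."""
    def __init__(self, backend):
        self.backend = backend

    def load(self, addr):
        return self.backend.read(addr)

    def store(self, addr, value):
        self.backend.write(addr, value)


# The host-side loop is identical for every media type behind the controller.
for media in (MediaBackend("DRAM", 50, 50, needs_refresh=True),
              MediaBackend("MRAM", 60, 120, needs_refresh=False),
              MediaBackend("ReRAM", 80, 300, needs_refresh=False)):
    ctrl = TransactionalController(media)
    ctrl.store(0x1000, 42)
    assert ctrl.load(0x1000) == 42
```

The host loop at the bottom never changes, which is the sense in which the protocol "erases the differences" between media: the timing and refresh quirks become the controller designer's problem.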
Q: Emerging memory technologies have a byte-level, direct-access programming model, in contrast to block-based NAND flash. Do you think this new programming model will eventually replace NAND flash, since it reduces the overhead and the power of transferring data?

A: It's a question of cost, and that's something we discussed at length in the webinar. If you haven't got a cost that's comparable to NAND flash, you can't really displace it. As far as the interface is concerned, though, the NAND interface is incredibly clumsy. All of these technologies have byte interfaces rather than a block interface, and they can also write in place – they don't need a pre-erased block to write into. From a technical standpoint that is a huge advantage; now it's just a question of whether they can get the cost down, which means getting the volume up.
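To see why write-in-place matters, here is a deliberately simplified sketch of the erase-before-write constraint. The block size and the worst-case "rewrite the whole block" model are assumptions chosen for illustration – real flash controllers use flash translation layers and page-level programming to soften this – but the contrast in write amplification is the point:

```python
# Simplified worst-case model of a small in-place update on two kinds of media.
# The 256 KiB erase-block size is an assumed, illustrative figure.

BLOCK_SIZE = 256 * 1024  # bytes in one NAND erase block (assumed)

def nand_bytes_rewritten(update_bytes):
    # NAND cannot overwrite in place: in the worst case the block is read,
    # erased, and reprogrammed in full (an FTL remaps instead, but it still
    # programs far more than the bytes that actually changed).
    return BLOCK_SIZE

def write_in_place_bytes_rewritten(update_bytes):
    # Byte-addressable emerging memory programs only the bytes that changed.
    return update_bytes

update = 16  # bytes the application actually modified
print("NAND-style update programs:    ", nand_bytes_rewritten(update), "bytes")
print("Write-in-place update programs:", write_in_place_bytes_rewritten(update), "bytes")
print("Write amplification (worst case):",
      nand_bytes_rewritten(update) // write_in_place_bytes_rewritten(update))
```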
Q: Can you discuss the High Bandwidth Memory (HBM) trends? What about memories used with Graphics Processing Units (GPUs)?

A: That topic isn't the subject of this webinar, which is about emerging memory technologies, but we'll comment briefly. We don't expect emerging memory technologies to adopt an HBM interface in the near future, because HBM springboards off DRAM and, as we discussed on one of the slides, DRAM has a transition to another emerging memory technology whose timing we don't know. We've put it in the early 2030s in our chart, but it could be much later than that, and HBM won't convert to an emerging memory technology until long after that. However, HBM involves stacking of chips, and that ultimately could happen. It's a more expensive process right now – a way of getting a lot of memory very close to a processor – and if you look at some of the NVIDIA applications, for example, this is an example of chiplet technology, and HBM can play a role in those chiplet designs for GPUs. That's another area that's going to use emerging memories as well – in the chiplets. While we didn't talk about that much in this webinar, it is another place for emerging memories to play a role. There's one other advantage to using an emerging memory that we did not talk about: emerging memories don't need refresh. In fact, none of the emerging memory technologies need refresh, and more power is consumed by DRAM refresh than by actual data accesses. If you can cut that out, you might be able to stack more chips on top of each other and get even more performance. We still wouldn't see that as a reason for DRAM to be displaced early on in HBM and later in the mainstream DRAM market, but all those refreshes generate a fair amount of heat, which may have packaging implications as well. So there may be some niche areas which could be among the first places some of these emerging memories are used, if the performance is good enough.

Q: Why have some memory companies failed? Apart from the cost/speed considerations you mention, what are the other minimum features that a new emerging memory should have? Is capacity (I heard 32 Gbit multiple times) one of those criteria?

A: Shipping a product is probably the single most important activity for success. Companies don't have to make a discrete or standalone SRAM or emerging memory chip, but if they're not going to ship it themselves, they need their technology to be adopted by somebody who is shipping something. That's what we see in the embedded market as a good path for emerging memory IP: to get used and to build up volume. And as the volume of, and comfort with, manufacturing those memories increases, it opens up the possibility of lower costs with higher-volume standalone memory down the road as well.

Q: What are the trends in DRAM interfaces? Would you discuss CXL's role in enabling composable systems with DRAM pooling?

A: CXL, especially CXL 3.0, is particularly pointed at pooling. Pooling is going to be an extremely important development in memory with CXL, and it's one of the reasons why CXL will probably proliferate. It allows you to allocate memory that is not attached to a particular server CPU and therefore to make more efficient and effective use of that memory. We mentioned earlier that right now that memory is DRAM, with some NAND flash products out there too, but this could expand to other memory technologies behind CXL within the CXL pool, as well as to accelerators (domain-specific processors) that do some operations closer to where the memory lives. So we think there are a lot of possibilities in pooling for the development and growth of emerging memories as well as conventional memories.

Q: Do you think molecular-based technologies (DNA or others) can emerge in the coming years as an alternative to some of the semiconductor-based memories?

A: DNA and other molecular memory technologies are at a relatively early stage, but there are people making fairly aggressive plans for what they can do with them. We think the initial market for molecular memories is not in high-performance memory applications; with DNA especially, the potential storage density, and the fact that you can make many copies of content using genomic processes, make it very attractive for archiving applications. The things we've seen are mostly in those areas because of the performance characteristics. The density they're targeting is aimed at the lower part of the market, so it has to be very, very cost-effective, but the possibilities are there. And again, as with the emerging high-performance memories, there are economies of scale to deal with – if you can't scale fast enough, the cost won't come down enough to compete in those areas. So it faces somewhat similar challenges, though in a different part of the market. Earlier in the webcast, when showing the orb chart, we said that for something to fit into the computing storage hierarchy it has to be cheaper than the next faster technology and faster than the next cheaper technology. DNA is not a fast technology, so it has to be really cheap to catch on, and that puts it in a very different realm from the emerging memories we're talking about here. On the other hand, you never know what someone's going to discover, but right now the industry doesn't know how to make fast molecular memories.
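The hierarchy rule quoted above – cheaper than the next faster technology, and faster than the next cheaper one – can be written down directly. The prices and latencies in this sketch are placeholders chosen only to illustrate the test, not market data:

```python
# Rule of thumb from the webcast: a new tier earns a spot in the hierarchy only
# if it is cheaper than the next faster tier AND faster than the next cheaper tier.
# All cost and latency figures below are illustrative placeholders.

def fits_between(candidate, faster_tier, cheaper_tier):
    cheaper_than_faster = candidate["cost_per_gb"] < faster_tier["cost_per_gb"]
    faster_than_cheaper = candidate["latency_ns"] < cheaper_tier["latency_ns"]
    return cheaper_than_faster and faster_than_cheaper

dram     = {"name": "DRAM",            "latency_ns": 1e2,  "cost_per_gb": 3.00}
nand_ssd = {"name": "NAND SSD",        "latency_ns": 1e5,  "cost_per_gb": 0.10}
archive  = {"name": "archival tier",   "latency_ns": 1e13, "cost_per_gb": 0.01}
emerging = {"name": "emerging memory", "latency_ns": 5e2,  "cost_per_gb": 1.00}
dna      = {"name": "DNA storage",     "latency_ns": 1e12, "cost_per_gb": 0.50}

print(fits_between(emerging, dram, nand_ssd))  # True: slots between DRAM and NAND
print(fits_between(dna, nand_ssd, archive))    # False: not yet cheaper than NAND
```

On these made-up numbers the emerging-memory candidate slots in between DRAM and NAND, while DNA fails the test until its cost per gigabyte drops well below flash – which is the "it has to be really cheap" observation above.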
Q: What is your intuition on how tomorrow's highly dense memories might impact non-load/store processing elements such as AI accelerators? As model sizes continue to grow and energy density becomes more of an issue, it would seem that emerging memories could thrive in this type of environment. Your thoughts?

A: Any memory would thrive in an environment with an unbridled thirst for memory, as artificial intelligence (AI) currently has. But AI is undergoing some pretty rapid changes, not only in the number of parameters that are examined but also in the models that are being used. We recently read a paper written by Apple* in which they found ways of winnowing down the data used for a large language model into something that would fit into an Apple MacBook Pro M2, and they got good performance by doing that – they accelerated things by ignoring data that didn't really make any difference. If that kind of approach is taken to the extreme, you might not need all that much memory after all. Still, if memory were free, there would be a ton of it out there, and it is really a question of whether these memories can get cheaper than DRAM so that they look close to free compared with today.

There are three interesting elements here. First, CXL, in addition to allowing the mixing of memory types, lets you put domain-specific processors close to the memory; perhaps those can do some of the processing that's part of the model, which would lower energy consumption. Second, these technologies support computing models different from the ones we traditionally use. There is quantum computing, of course, but there are also neural-network approaches that use the memory itself as a matrix multiplier, and those use these emerging memories – a technique that could be applied to AI. Third, and somewhat hidden behind all this, spin-based devices may change processing itself: today everything is current-based, but there is work on spintronic devices that would use the spin of electrons rather than current to move data around, avoiding resistive heating so that processing runs cooler and uses less energy. So there are a lot of interesting things buried in the technologies being used for these emerging memories that could have even greater implications for the development of computing, beyond just the memory applications themselves. To elaborate on the spintronics point: we're talking about logic, not spin memory – using spin rather than charge (current).
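The "memory as a matrix multiplier" remark refers to analog in-memory computing, where a resistive array computes a matrix-vector product in place: word-line voltages multiply cell conductances, and the column currents sum the products by Kirchhoff's current law. Below is a minimal, idealized numerical model of that behavior; it ignores noise, device precision, and ADC overhead, and the values are arbitrary examples:

```python
# Idealized model of a resistive crossbar performing a matrix-vector multiply.
# Conductances encode the weight matrix; word-line voltages encode the input.
import numpy as np

rng = np.random.default_rng(0)
conductances = rng.uniform(0.0, 1.0e-6, size=(4, 3))  # G[i, j] in siemens (example values)
voltages = np.array([0.2, 0.5, 0.1, 0.9])             # applied word-line voltages (volts)

# Each column current is sum_i V[i] * G[i, j]; the multiply-accumulate happens
# inside the memory array itself, so the weights never move to a separate ALU.
column_currents = voltages @ conductances

# Sanity check against an explicit loop over rows and columns.
reference = np.array([sum(voltages[i] * conductances[i, j] for i in range(4))
                      for j in range(3)])
assert np.allclose(column_currents, reference)
print(column_currents)
```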
Q: Flash has an endurance issue (a maximum number of writes before it fails). In your opinion, what is the minimum acceptable endurance (number of writes) that an emerging memory should support?

A: It's amazing how many techniques have fallen into place since wear was first an issue in flash SSDs. Today's software understands which loads have high write levels and which don't, and different SSDs can be used to handle the two kinds of load. On the SSD side, flash endurance has continually degraded with the adoption of MLC, TLC, and QLC, and is sometimes measured in the hundreds of cycles. What this implies is that any emerging memory can get by with an equally low endurance as long as it's put behind the right controller. In high-speed environments that isn't a solution, though, since controllers add latency, so "Near Memory" (the memory tied directly to the processor's memory bus) will need higher endurance. Still, one practice that can help is putting code into memories that have low endurance and data into higher-endurance memory (which today would be DRAM). Since emerging memories can provide more bits at a lower cost and power than DRAM, the write load to the code space should be lower, because pages will be swapped in and out less frequently. The endurance requirement will depend on this swapping, and I would guess that the lowest acceptable level would be in the tens of thousands of cycles.

Q: It seems that persistent memory is more of an enterprise benefit than a consumer benefit, and consumer acceptance helps with advancement and cost scaling. Do you agree? I use SSDs as an example: once consumers started using them, advancement accelerated and prices came down greatly.

A: Anything that drives increased volume will help. In most cases any change to large-scale computing works its way down to the PC, so that should happen in time here, too. But today there is a growing amount of MRAM use in personal fitness monitors, and this will help drive costs down, so initial demand will not come exclusively from enterprise computing. At the same time, the IBM FlashDrive that we mentioned uses MRAM too, so enterprise and consumer uses are already working simultaneously to grow consumption.

Q: The CXL diagram (slide 22 in the PDF) has two CXL switches between the CPUs and the memory. How much latency do you expect the switches to add, and how does that change where CXL fits in the array of memory choices from a performance standpoint?

A: The CXL delay goals are very aggressive, but we are not sure that an exact number has been specified. It's on the order of 70 ns per "hop," which can be understood as the delay of going through a switch or a controller. Naturally, software will evolve to work with this, and will move data that has high bandwidth requirements but is less latency-sensitive to more remote areas, while keeping the more latency-sensitive data in near memory.

Q: Where can I learn more about the topic of emerging memories?

A: Here are some resources to review:
  * LLM in a Flash: Efficient Large Language Model Inference with Limited Memory, Keivan Alizadeh, et al., arXiv:2312.11514 [cs.CL]


Here’s Why Ceph is the Linux of Storage Today

Erin Farr

Feb 14, 2024

title of post
Data is one of the most critical resources of our time. Storage for data has always been a critical architectural element of every data center, requiring careful consideration of storage performance, scalability, reliability, data protection, durability and resilience. A decade ago, the market was aggressively embracing public cloud storage because of its agility and scalability. In the last few years, people have been rethinking that approach, moving toward on-premises storage with cloud consumption models. The new on-premises cloud-native architecture promises the traditional data center's security and reliability together with cloud agility and scalability. Ceph, an open source project for enterprise unified software-defined storage, represents a compelling solution for this cloud-native on-premises architecture and will be the topic of our next SNIA Cloud Storage Technologies Initiative webinar, “Ceph: The Linux of Storage Today.” This webinar will discuss:
  • How Ceph targets important characteristics of modern software-defined data centers
  • Use cases that illustrate how Ceph has evolved, along with future use cases
  • Quantitative data points that exemplify Ceph’s community success
We will describe how Ceph is gaining industry momentum, satisfying enterprise architectures’ data storage needs and how the technology community is investing to enable the vision of “Ceph, the Linux of Storage Today.” Register today to join us for this timely discussion.
