Last month, the SNIA Data, Storage & Networking (DSN) Community launched an in-depth, educational “AI Stack” webinar series. It’s an ambitious project that continues to grow, covering multiple aspects of artificial intelligence (AI). A sample of topics is highlighted at the end of this blog.
The first webinar, “An Introduction to AI and Machine Learning,” set the stage for the series. It provided a foundational understanding of AI and Machine Learning (ML). We explored the basic concepts, history, and applications of AI/ML, and presented a live demo of training a neural network to recognize written letters. The response from the audience was fantastic, with a 5.0 rating and lots of questions. Our experts, Tim Lustig, Justin Potuznik, Erik Smith, and Jayanthi Ramakalanjiyam, have answered them in this Q&A.
Q: Why is the number of tokens used to determine usage?
A: In an inference system, the model is fixed in GPU memory, but how active it is really depends on how many tokens or queries you put into it. And as we said, tokens are kind of the lowest common denominator for what gets fed through the model. You take that input, you tokenize it, and then you feed it through the model. The model then outputs response tokens. That token input and output count will ultimately be the thing that determines how much work the system is doing. That's why you'll see many cloud-based systems charge by the token.
| GenAI Modality | Performance Metric (Engineering Focus) | Billing Metric (Business/Usage Focus) |
|---|---|---|
| Text / Chat (LLMs) | • Tokens per second (throughput) • Time to first token (latency) • Batch throughput (tokens/sec across multiple queries) | • Tokens processed (input + output) • Sometimes flat subscription tiers (per seat for copilots) |
| Image Generation | • Images per second • Steps per second (diffusion steps) • VRAM usage at target resolution | • Per image generated • Cost scales by resolution (e.g., 256×256 vs 1024×1024) • Extras billed (variations, inpainting) |
| Audio Generation (speech/music) | • Seconds of audio per second (RTF = real-time factor) • Samples per second | • Per second of audio generated • Music models may charge per track length |
| Video Generation | • Frames per second generated • Time per clip • VRAM load at resolution | • Per second of video • Cost tiers by resolution (720p vs 4K) and length |
| 3D / Simulation (NeRFs, CAD, game assets) | • Vertices/points processed per second • Render time per frame | • Per object generated • Complexity tiered (polygon count, resolution) |
| Code Generation | • Tokens/sec (under the hood) • Problems solved per minute • Test cases passed | • Per token (standard) • Sometimes per developer seat (e.g., GitHub Copilot) |
| RAG / Multimodal Retrieval | • Latency per query (ms) • Queries per second (QPS) • Vector index size handled | • Per query • Sometimes scaled by docs retrieved or vector DB size |
| Agentic AI / Automation | • Tasks completed per hour • End-to-end latency | • Per workflow run • Sometimes per action/API call |
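To make token-based usage accounting concrete, here is a minimal sketch of counting input and output tokens for one query. It assumes the open-source tiktoken tokenizer, and the per-token prices are purely illustrative; the actual tokenizer and rates depend on the model and provider.

```python
# Minimal sketch of token-based usage accounting.
# Assumes the open-source `tiktoken` tokenizer; prices are illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common byte-pair encoding

prompt = "Puppies want to eat because they are growing quickly."
response = "That's right: growing puppies need frequent, nutrient-dense meals."

input_tokens = len(enc.encode(prompt))     # tokens fed into the model
output_tokens = len(enc.encode(response))  # response tokens the model generated

# Hypothetical per-token rates; providers usually price input and output differently.
cost = input_tokens * 0.000001 + output_tokens * 0.000002

print(f"input={input_tokens} tokens, output={output_tokens} tokens, cost≈${cost:.6f}")
```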
Q: What kind of storage system is optimal for AI/ML workloads?
A: To know what kind of storage system is optimal, the first question you have to answer is: are you building an AI system for training or for inference? You generally do one or the other, and the storage requirements are different for each. For training, the biggest thing you want is high-bandwidth storage with good parallel write capability for the training process. We'll discuss this more in the next session. You spend a noticeable percentage of your time during training doing checkpointing. Checkpointing is when you freeze your weights as you're training and copy all of that data out of GPU memory into storage. That way, if something goes wrong or you end your training session, you have the latest version of the model in nonvolatile storage. Especially as the training environment grows, you're writing out the data from each GPU’s memory simultaneously, so the quicker you can do that, and the better you can handle many different GPUs all trying to write to the same place, the more you can improve performance. On the inference side, you have almost the opposite problem. Instead of a structured set of demands on the storage system, where every so many minutes it dumps a fixed amount of data from a fixed number of nodes and then just does reads the rest of the time, you're in a constant read/write battle with a changing I/O pattern and size, retrieving from different systems. It's much more IOPS- and latency-sensitive. A previous webinar, “AI Storage: The Critical Role of Storage in Optimizing AI Training Workloads,” provides a deep dive on this topic and is a great resource if you want to get into it in more detail.
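As a back-of-the-envelope illustration of why parallel write bandwidth matters for checkpointing, here is a small sketch. The GPU count, per-GPU state size, and storage bandwidth are hypothetical placeholders, not measurements from any particular system.

```python
# Back-of-the-envelope checkpoint timing; every figure below is a hypothetical placeholder.
num_gpus = 512               # GPUs writing their state simultaneously
state_per_gpu_gb = 60        # model weights + optimizer state held in each GPU's memory (GB)
storage_write_gbps = 400     # aggregate parallel write bandwidth of the storage system (GB/s)

checkpoint_size_gb = num_gpus * state_per_gpu_gb
checkpoint_seconds = checkpoint_size_gb / storage_write_gbps

print(f"Checkpoint size: {checkpoint_size_gb / 1024:.1f} TB")
print(f"Time per checkpoint: {checkpoint_seconds / 60:.1f} minutes")
# Doubling the aggregate write bandwidth halves the time the GPUs spend draining
# the checkpoint, which is why parallel write performance matters during training.
```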
Q: Earlier example for application in search was "puppies want to eat...", and the guesses in the screen shot came back as "puppy wants to eat...". This type of swap sometimes changes context and quality of search results. Is there a computational or financial advantage to displaying something already queried & answered instead of using the actual human input?
A: What you are pointing out here is an example of two things: similarity searching and answer caching. Both techniques are used to speed up response time while minimizing compute requirements and therefore cost, all with a minimum of quality loss in the answer. If a system can match to an existing answer or next token based on what it has in cache that is far less computationally expensive than actually running the query tokens through the model and generating new response tokens.
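Here is a minimal sketch of that idea. The embed() function is a hypothetical placeholder (a real system would call an embedding model), and run_model() stands in for full inference; the point is simply that a cache hit avoids generating new response tokens.

```python
# Sketch of a similarity-based answer cache. embed() and run_model() are
# hypothetical placeholders standing in for an embedding model and full inference.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: derive a unit vector from the text (a real system would use an embedding model)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def run_model(query: str) -> str:
    """Placeholder for actually running the query tokens through the model."""
    return f"(model-generated answer to: {query})"

cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def answer(query: str, threshold: float = 0.9) -> str:
    q = embed(query)
    for vec, cached in cache:
        if float(q @ vec) >= threshold:  # cosine similarity (both vectors are unit length)
            return cached                # cheap: reuse an existing answer
    result = run_model(query)            # expensive: generate new response tokens
    cache.append((q, result))
    return result

print(answer("puppies want to eat"))
print(answer("puppies want to eat"))  # second call is served from the cache
```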
Q: In general, will each input neuron receive simultaneous independent feedback from the hidden layers?
A: Input neurons do not receive feedback from the hidden layers; they are used to store the input value only. During training, the first layer of hidden neurons (which are connected to all of the input neurons) will receive the input values and multiply each input value by the weight associated with the connection from the input neuron to that specific hidden neuron. The resulting products from each connection are summed along with the bias value for that hidden neuron before an activation function (e.g., Sigmoid) is applied to the value. The result is then passed to the next layer (in the case of the demo, this would be the output layer). Each output neuron follows a similar process, in that it multiplies the value of each hidden neuron by the weight of the connection to that output neuron. These values are summed along with the bias value, and then an activation function is applied to the result. In our demo, we used Softmax. The resulting value is compared to the expected value and a loss function is applied (we used cross-entropy loss for the demo). Once the loss has been calculated, the error value is first used to calculate a gradient, the gradient is used to adjust the weights and biases of all output neurons, and then finally the error is propagated back to the hidden layer so the hidden neurons can update their weights and bias values. All output neurons are updated at the same point in the training process, followed by the updates to the hidden neurons. If you really want to dig into this in more detail, check out the Binary Digit Trainer app!
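For readers who prefer code, here is a small sketch of the forward pass and loss calculation described above, using the same layer sizes as the demo (81 inputs, 24 hidden neurons, 2 outputs). The weights are randomly initialized for illustration, and the demo's actual implementation may differ in detail.

```python
# Forward pass of a tiny 81-24-2 network, mirroring the description above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Randomly initialized weights and biases for illustration (81 inputs -> 24 hidden -> 2 outputs)
W1, b1 = rng.standard_normal((24, 81)) * 0.1, np.zeros(24)
W2, b2 = rng.standard_normal((2, 24)) * 0.1, np.zeros(2)

x = rng.integers(0, 2, size=81).astype(float)  # a 9x9 binary input, flattened
target = np.array([1.0, 0.0])                  # one-hot expected output

# Each hidden neuron: weighted sum of all inputs + bias, then Sigmoid
hidden = sigmoid(W1 @ x + b1)
# Each output neuron: weighted sum of all hidden values + bias, then Softmax
output = softmax(W2 @ hidden + b2)

# Compare to the expected value with cross-entropy loss
loss = -np.sum(target * np.log(output + 1e-12))
print(f"output = {output}, loss = {loss:.4f}")
```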
Q: For comparison, how many inputs and layers is ChatGPT?
A: The demo we reviewed was a simple feed-forward neural network, ChatGPT uses a transformer-based neural network. Since the exact details about ChatGPT’s latest architectures are not public, we asked ChatGPT if it could provide a ballpark estimate for comparison purposes. Here’s the response:
🔹 Binary Digit Trainer
- Type: Feedforward Neural Network
- Layers: 3 total
  - Input layer: 81 neurons
  - Hidden layer: 24 neurons
  - Output layer: 2 neurons
- Purpose: Simple binary classification (e.g., recognizing patterns in binary digits)
🔹 ChatGPT (based on GPT-4)
- Type: Transformer-based Neural Network
- Layers: Estimated 120–160 transformer blocks
- Inputs:
  - Each input is a token (a word or part of a word)
  - Can process sequences of thousands of tokens (e.g., 8,000 to 32,000 tokens depending on the version)
- Neurons per layer: Each layer has thousands of hidden units (e.g., ~12,288 in GPT-3)
- Purpose: Natural language understanding and generation across a wide range of tasks (conversation, summarization, coding, reasoning, etc.)
Q: In the tiny network demo, can you show the logic in one of the nodes. Is the logic the same in each node? And is it the same in the hidden and output nodes?
A: Yes, Binary Digit Trainer allows you to follow the training process and see the calculations being used at each step. If you take the Guided Tour by clicking on the “Take Guided Tour” button in the upper right corner, it will step you through one complete training iteration. There are also “?” icons sprinkled throughout the app that provide much more detail. Here’s a link to Binary Digit Trainer.
For the second part of your question, they are different. If you look at the two output neurons, they are almost the inverse of one another: the inhibitory and excitatory groups are almost literally flipped, and that comes as a result of the error calculation and back propagation.
Q: In the demo, I did not fully understand how back propagation and gradient descent were implemented.
A: Binary Digit Trainer allows you to follow the training process and see the calculations being used at each step (this includes how back propagation is done). If you take the Guided Tour by clicking on the “Take Guided Tour” button in the upper right corner, it will step you through one complete training iteration. There are also “?” icons sprinkled throughout the app that provide much more detail. Here’s a link to Binary Digit Trainer.
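For readers who want a code-level view to go with the guided tour, here is a hedged sketch of one back propagation and gradient-descent update for a network shaped like the demo's (softmax output with cross-entropy loss). The learning rate and initialization are arbitrary, and the app's exact implementation may differ.

```python
# One back propagation + gradient-descent step for an 81-24-2 network
# (setup repeated from the forward-pass sketch above so this block runs on its own).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((24, 81)) * 0.1, np.zeros(24)
W2, b2 = rng.standard_normal((2, 24)) * 0.1, np.zeros(2)
lr = 0.1  # arbitrary learning rate for illustration

x = rng.integers(0, 2, size=81).astype(float)
target = np.array([1.0, 0.0])

# Forward pass
hidden = sigmoid(W1 @ x + b1)
output = softmax(W2 @ hidden + b2)

# Output-layer error: with softmax + cross-entropy this is simply (output - target)
d_out = output - target
grad_W2, grad_b2 = np.outer(d_out, hidden), d_out

# Propagate the error back to the hidden layer (sigmoid derivative is h * (1 - h))
d_hidden = (W2.T @ d_out) * hidden * (1.0 - hidden)
grad_W1, grad_b1 = np.outer(d_hidden, x), d_hidden

# Gradient descent: move every weight and bias a small step against its gradient
W2 -= lr * grad_W2; b2 -= lr * grad_b2
W1 -= lr * grad_W1; b1 -= lr * grad_b1
```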
Q: For the model in the demo, which resource is most utilized? CPU, memory, or storage?
A: The model and the training data are all running in memory and do not require persistent storage. We did add the ability to checkpoint the model and save it to persistent storage and also included a way to load the model into memory from persistent storage, but this isn’t strictly required.
That said, the app is very much CPU bound due to the extensive mathematical operations performed for each training sample. Memory utilization for the app is about 1 MB.
Q: In the demo, what is the RAG? If RAG is not defined, can hallucinations happen often?
A: RAG or Retrieval Augmented Generation will be discussed in future sessions. With regards to the demo, it doesn’t use RAG and really doesn’t need it for this use case.
Q: Do you need a GPU to do AI? Do CPUs have the ability to do AI?
A: Both CPUs and GPUs can support AI models. As models expand in size and capability, GPUs generally become highly recommended. In future parts of our “AI Stack” webinar series, we'll dive into scale and deployment and cover some of this.
Q: Is there any formula to calculate the number of neurons or number of layers for a specific problem?
A: There are rules of thumb that help with that. For example, for the Binary Digit Trainer used in the demo, we did some experimentation and found that we needed 81 input neurons because we had a 9 by 9 input. For the hidden layer, based on the results we were seeing, it seemed like the sweet spot was going to be somewhere between 24 and 40. We chose 24 because it fit nicely on the screen for the demo; there wasn’t that big a difference in the accuracy of the model when we used 24 versus 40. That said, in general the answer is no, because there are so many different kinds of problems. As you experiment, you’ll find AI requires a lot of experimentation at every level. The nice part is that, with so much in the public domain, you can see that many models doing the same task end up in about the same size range. How small a model can be while still maintaining accuracy and quality can be one of the differentiators between models.
Q: Really good AI 101 presentation for beginners. Question: How is storage changing with the AI explosion, in particular SSD storage? What are the use cases?
A: The AI explosion has made a huge impact on the storage roadmap. Models like deep learning networks and LLMs (Large Language Models) have some primary requirements, including:
- Scale of data: The scale of data is growing from gigabytes to terabytes to petabytes and now to exabytes, for storing training data, checkpoints (intermediate snapshots of the trained model), and fully trained models. Typically, file and object storage are used in these scenarios, for “data at rest.” The physical medium can be HDDs, SSDs, or magnetic tape depending on priorities like cost, speed, and durability.
- The need for low latency and high performance: Faster data access is required during:
  - The training phase, to satisfy the requirements of multiple GPUs running in parallel
  - The inference phase, driven mainly by RAG and KV cache requirements in the actual deployment. Typically, AFAs (All Flash Arrays) are used for this purpose, for “data in use,” and the medium is SSDs.
Hence, the SSD landscape is evolving rapidly, coupled with technologies like NVMe-oF and CSDs (computational storage devices).
To understand more about the role of storage in optimizing AI workloads, please watch this earlier webinar from SNIA, AI Storage: The Critical Role of Storage in Optimizing AI Training Workloads.
Q: Do AI systems ever reset back to some point in time and have to re-learn some things?
A: Yes, that's why we have checkpointing. You could be running a training job for weeks, encounter an error, and have to rewind your training to a point before the error. Checkpoints let you do that, and you can take them at different frequencies. Some people checkpoint every couple of hours; the idea is to find a sweet spot between recovery time and the amount of time it takes to take the checkpoint.
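A minimal sketch of that trade-off in code, assuming PyTorch: the checkpoint interval is the knob that balances time lost on failure against time spent writing checkpoints. The tiny model, optimizer, and interval below are placeholders, not recommendations.

```python
# Minimal periodic-checkpoint sketch; assumes PyTorch. The model, optimizer,
# and checkpoint interval are placeholders, not recommendations.
import torch

model = torch.nn.Linear(81, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
CHECKPOINT_EVERY = 1000  # steps; the sweet spot depends on recovery time vs. write cost

def save_checkpoint(step: int, path: str = "checkpoint.pt") -> None:
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path: str = "checkpoint.pt") -> int:
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]  # resume from here instead of rewinding to step 0

for step in range(1, 10001):
    # ... forward pass, loss, backward pass, optimizer.step() would go here ...
    if step % CHECKPOINT_EVERY == 0:
        save_checkpoint(step)
```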
Q: Are HDDs of any use in AI storage systems or is it pure SSDs?
A: It really depends on the use case. For the things that need high throughput with lots of training data coming, images for example, latency and throughput are going to be critical and that would point to SSDs or something that gives you really low latency and high throughput. HDDs can be useful for taking a checkpoint and storing it, maybe eventually reloading it, but typically latency and throughput are incredibly important with these types of environments.
Q: What methods or mechanisms are available to prevent injecting incorrect data into the training models (e.g., deliberate malware)?
A: When preventing incorrect or "bad" data from getting into AI training models, several straightforward methods can be employed. Ensuring the quality of data used is crucial, and this can be achieved by using trusted data sources, automating checks with human oversight, and educating teams on data quality importance. Key strategies include data validation and sanitizing, anomaly and outlier detection, monitoring data sources, and regular model testing. By implementing these practices, the risk of "bad" data affecting AI models can be significantly reduced, protecting them from both unintentional mistakes and deliberate attempts to mislead. There are also many companies specializing in providing tools and services for data integrity and AI security, focusing on data preparation and validation.
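As one small, hedged example of the anomaly and outlier detection step mentioned above, here is a simple z-score filter over a numeric training feature; real pipelines combine several such checks with schema validation and human review.

```python
# Toy z-score outlier filter for one numeric training feature.
import numpy as np

def drop_outliers(values: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
    """Keep only samples within z_threshold standard deviations of the mean."""
    z = np.abs((values - values.mean()) / values.std())
    return values[z < z_threshold]

# One injected, implausible value stands out against the rest and is filtered.
feature = np.array([4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 5.1, 5.3, 4.7, 95.0])
print(drop_outliers(feature))  # the 95.0 sample is removed
```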
Q: What is vectorization?
A: Vectorization converts data into numerical vectors, enabling similarity searches. It's basically the process of taking data from the way a human would read it and turning it into a form the machine can look at from a relationship perspective. We will discuss RAG and vectorization further in upcoming sessions!
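As a toy illustration, here is a tiny bag-of-words vectorizer that turns sentences into numeric vectors so their similarity can be compared. Real systems use learned embedding models rather than raw word counts, so treat this purely as a sketch of the concept.

```python
# Toy bag-of-words vectorization: turn text into numeric vectors and compare them.
import numpy as np

docs = ["puppies want to eat", "a puppy wants to eat", "storage systems for ai training"]

# Build a vocabulary, then represent each document as a vector of word counts.
vocab = sorted({word for doc in docs for word in doc.split()})

def vectorize(doc: str) -> np.ndarray:
    words = doc.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = [vectorize(d) for d in docs]

# The two puppy sentences score much closer to each other than to the storage sentence.
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```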
Q: How can biases be introduced in the algorithm?
A: Bias can be introduced at several different stages of the training process. The table below provides some examples.
| Stage | How Bias is Introduced | Examples |
|---|---|---|
| Training Data | Over/under-representation, historical/social bias, skewed sampling | Face datasets skewed toward lighter skin tones; English-dominant text corpora |
| Data Labeling | Human annotators bring personal/cultural perspectives | Political text labeled as “toxic” depending on annotator’s worldview |
| Model Objectives | Optimization for prediction/loss amplifies majority patterns | Language models learning stereotypes because they reduce prediction error |
| Fine-Tuning (RLHF) | Human raters’ judgments shape model outputs | Reinforcement to prefer “polite” or “safe” answers, embedding cultural/political leanings |
| Deployment Filters | Provider moderation and policy guardrails | Blocking certain topics, rephrasing answers to fit corporate values |
| Feedback Loops | User feedback reinforces dominant group perspectives | Heavier feedback from one demographic shifts model alignment toward their views |
Beyond this, even the language the model is trained on can introduce cultural biases. This is one of the reasons that Sovereign AI is rapidly becoming a thing. For example, take a look at articles written on “Malaysia’s Sovereign AI Strategy”.
This is a Series!
We have an ambitious line-up of webinars in this "AI Stack" webinar series that continues to grow. Here are topics planned:
- Introduction to AI and Machine Learning
- Understanding Model Training
- Model Inferencing and Deployment Options
- Impact of AI on Network Interconnects
- Parallelism in AI (Model, Data, Tensor)
- Collective Communication Libraries (NCCL and RCCL)
- In-Network Collective Operations (SHARP and UET)
- MLOps Frameworks
- AI Infrastructure
- Management and Orchestration
- Security Considerations for AI
Learn more about the AI Stack series and questions from this webinar in this SNIA Experts on Data Podcast Interview and follow us on LinkedIn and X for upcoming dates.