Last month, the SNIA Data, Storage & Networking (DSN) Community launched an in-depth, educational “AI Stack” webinar series. It’s an ambitious project that continues to grow, covering multiple aspects of artificial intelligence (AI). A sample of topics is highlighted at the end of this blog.
The first webinar, “An Introduction to AI and Machine Learning,” set the stage for the series. It provided a foundational understanding of AI and Machine Learning (ML). We explored the basic concepts, history, and applications of AI/ML, and presented a live demo of training a neural network to recognize written letters. The response from the audience was fantastic, with a 5.0 rating and lots of questions. Our experts, Tim Lustig, Justin Potuznik, Erik Smith, and Jayanthi Ramakalanjiyam, have answered them in this Q&A.
Q: Why is the number of tokens used to determine usage?
A: In an inference system, the model is fixed in GPU memory, but how active it is really depends on how many tokens or queries you put into it. And as we said, tokens are kind of the lowest common denominator for what gets fed through the model. You take that input, you tokenize it, and then you feed it through the model. The model then outputs response tokens. That token input and output count will ultimately be the thing that determines how much work the system is doing. That's why you'll see many cloud-based systems charge by the token.
| GenAI Modality | Performance Metric (Engineering Focus) | Billing Metric (Business/Usage Focus) |
|---|---|---|
| Text / Chat (LLMs) | • Tokens per second (throughput) • Time to first token (latency) • Batch throughput (tokens/sec across multiple queries) | • Tokens processed (input + output) • Sometimes flat subscription tiers (per seat for copilots) |
| Image Generation | • Images per second • Steps per second (diffusion steps) • VRAM usage at target resolution | • Per image generated • Cost scales by resolution (e.g., 256×256 vs 1024×1024) • Extras billed (variations, inpainting) |
| Audio Generation (speech/music) | • Seconds of audio per second (RTF = real-time factor) • Samples per second | • Per second of audio generated • Music models may charge per track length |
| Video Generation | • Frames per second generated • Time per clip • VRAM load at resolution | • Per second of video • Cost tiers by resolution (720p vs 4K) and length |
| 3D / Simulation (NeRFs, CAD, game assets) | • Vertices/points processed per second • Render time per frame | • Per object generated • Complexity tiered (polygon count, resolution) |
| Code Generation | • Tokens/sec (under the hood) • Problems solved per minute • Test cases passed | • Per token (standard) • Sometimes per developer seat (e.g., GitHub Copilot) |
| RAG / Multimodal Retrieval | • Latency per query (ms) • Queries per second (QPS) • Vector index size handled | • Per query • Sometimes scaled by docs retrieved or vector DB size |
| Agentic AI / Automation | • Tasks completed per hour • End-to-end latency | • Per workflow run • Sometimes per action/API call |
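To make token-based usage accounting concrete, here is a minimal sketch of counting input and output tokens for one query. It assumes the open-source tiktoken tokenizer, and the per-token prices are purely illustrative; the actual tokenizer and rates depend on the model and provider.

```python
# Minimal sketch of token-based usage accounting.
# Assumes the open-source `tiktoken` tokenizer; prices are illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common byte-pair encoding

prompt = "Puppies want to eat because they are growing quickly."
response = "That's right: growing puppies need frequent, nutrient-dense meals."

input_tokens = len(enc.encode(prompt))     # tokens fed into the model
output_tokens = len(enc.encode(response))  # response tokens the model generated

# Hypothetical per-token rates; providers usually price input and output differently.
cost = input_tokens * 0.000001 + output_tokens * 0.000002

print(f"input={input_tokens} tokens, output={output_tokens} tokens, cost≈${cost:.6f}")
```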
Q: What kind of storage system is optimal for AI/ML workloads?
A: To know what kind of storage system is optimal, the first question you have to answer is: are you building an AI system for training or for inference? You generally do one or the other, and the storage requirements are different for each. For training, the biggest thing you want is high-bandwidth storage with good parallel write capability for the training process. We'll discuss this more in the next session. You spend a noticeable percentage of your time during training doing checkpointing. Checkpointing is when you freeze your weights as you're training and copy all of that data out of GPU memory into storage. That way, if something goes wrong or you end your training session, you have the latest version of the model in nonvolatile storage. Especially as the training environment grows, you're writing out the data from each GPU’s memory simultaneously, so the quicker you can do that, and the better you can handle many different GPUs all trying to write to the same place, the more you can improve performance. On the inference side, you have almost the opposite problem. Instead of a structured set of demands on the storage system, where every so many minutes it dumps a fixed amount of data from a fixed number of nodes and then just does reads the rest of the time, you're in a constant read/write battle with a changing I/O pattern and size, retrieving from different systems. It's much more IOPS- and latency-sensitive. A previous webinar, “AI Storage: The Critical Role of Storage in Optimizing AI Training Workloads,” provides a deep dive on this topic and is a great resource if you want to get into it in more detail.
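As a back-of-the-envelope illustration of why parallel write bandwidth matters for checkpointing, here is a small sketch. The GPU count, per-GPU state size, and storage bandwidth are hypothetical placeholders, not measurements from any particular system.

```python
# Back-of-the-envelope checkpoint timing; every figure below is a hypothetical placeholder.
num_gpus = 512               # GPUs writing their state simultaneously
state_per_gpu_gb = 60        # model weights + optimizer state held in each GPU's memory (GB)
storage_write_gbps = 400     # aggregate parallel write bandwidth of the storage system (GB/s)

checkpoint_size_gb = num_gpus * state_per_gpu_gb
checkpoint_seconds = checkpoint_size_gb / storage_write_gbps

print(f"Checkpoint size: {checkpoint_size_gb / 1024:.1f} TB")
print(f"Time per checkpoint: {checkpoint_seconds / 60:.1f} minutes")
# Doubling the aggregate write bandwidth halves the time the GPUs spend draining
# the checkpoint, which is why parallel write performance matters during training.
```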
Q: Earlier example for application in search was "puppies want to eat...", and the guesses in the screen shot came back as "puppy wants to eat...". This type of swap sometimes changes context and quality of search results. Is there a computational or financial advantage to displaying something already queried & answered instead of using the actual human input?
A: What you are pointing out here is an example of two things: similarity searching and answer caching. Both techniques are used to speed up response time while minimizing compute requirements and therefore cost, all with a minimum of quality loss in the answer. If a system can match to an existing answer or next token based on what it has in cache that is far less computationally expensive than actually running the query tokens through the model and generating new response tokens.
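Here is a minimal sketch of that idea. The embed() function is a hypothetical placeholder (a real system would call an embedding model), and run_model() stands in for full inference; the point is simply that a cache hit avoids generating new response tokens.

```python
# Sketch of a similarity-based answer cache. embed() and run_model() are
# hypothetical placeholders standing in for an embedding model and full inference.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: derive a unit vector from the text (a real system would use an embedding model)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def run_model(query: str) -> str:
    """Placeholder for actually running the query tokens through the model."""
    return f"(model-generated answer to: {query})"

cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def answer(query: str, threshold: float = 0.9) -> str:
    q = embed(query)
    for vec, cached in cache:
        if float(q @ vec) >= threshold:  # cosine similarity (both vectors are unit length)
            return cached                # cheap: reuse an existing answer
    result = run_model(query)            # expensive: generate new response tokens
    cache.append((q, result))
    return result

print(answer("puppies want to eat"))
print(answer("puppies want to eat"))  # second call is served from the cache
```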
Q: In general, will each input neuron receive simultaneous independent feedback from the hidden layers?
A: Input neurons do not receive feedback from the hidden layers; they are used to store the input value only. During training, the first layer of hidden neurons (which are connected to all of the input neurons) will receive the input values and multiply each input value by the weight associated with the connection from the input neuron to that specific hidden neuron. The resulting products from each connection are summed along with the bias value for that hidden neuron before an activation function (e.g., Sigmoid) is applied to the value. The result is then passed to the next layer (in the case of the demo, this would be the output layer). Each output neuron follows a similar process, in that it multiplies the value of each hidden neuron by the weight of the connection to that output neuron. These values are summed along with the bias value, and then an activation function is applied to the result. In our demo, we used Softmax. The resulting value is compared to the expected value and a loss function is applied (we used cross-entropy loss for the demo). Once the loss has been calculated, the error value is first used to calculate a gradient, the gradient is used to adjust the weights and biases of all output neurons, and then finally the error is propagated back to the hidden layer so the hidden neurons can update their weights and bias values. All output neurons are updated at the same point in the training process, followed by the updates to the hidden neurons. If you really want to dig into this in more detail, check out the Binary Digit Trainer app!
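For readers who prefer code, here is a small sketch of the forward pass and loss calculation described above, using the same layer sizes as the demo (81 inputs, 24 hidden neurons, 2 outputs). The weights are randomly initialized for illustration, and the demo's actual implementation may differ in detail.

```python
# Forward pass of a tiny 81-24-2 network, mirroring the description above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Randomly initialized weights and biases for illustration (81 inputs -> 24 hidden -> 2 outputs)
W1, b1 = rng.standard_normal((24, 81)) * 0.1, np.zeros(24)
W2, b2 = rng.standard_normal((2, 24)) * 0.1, np.zeros(2)

x = rng.integers(0, 2, size=81).astype(float)  # a 9x9 binary input, flattened
target = np.array([1.0, 0.0])                  # one-hot expected output

# Each hidden neuron: weighted sum of all inputs + bias, then Sigmoid
hidden = sigmoid(W1 @ x + b1)
# Each output neuron: weighted sum of all hidden values + bias, then Softmax
output = softmax(W2 @ hidden + b2)

# Compare to the expected value with cross-entropy loss
loss = -np.sum(target * np.log(output + 1e-12))
print(f"output = {output}, loss = {loss:.4f}")
```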
Q: For comparison, how many inputs and layers is ChatGPT?
A: The demo we reviewed was a simple feed-forward neural network, ChatGPT uses a transformer-based neural network. Since the exact details about ChatGPT’s latest architectures are not public, we asked ChatGPT if it could provide a ballpark estimate for comparison purposes. Here’s the response:
🔹 Binary Digit Trainer
- Type: Feedforward Neural Network
- Layers: 3 total
  - Input layer: 81 neurons
  - Hidden layer: 24 neurons
  - Output layer: 2 neurons
- Purpose: Simple binary classification (e.g., recognizing patterns in binary digits)
🔹 ChatGPT (based on GPT-4)
- Type: Transformer-based Neural Network
- Layers: Estimated 120–160 transformer blocks
- Inputs:
  - Each input is a token (a word or part of a word)
  - Can process sequences of thousands of tokens (e.g., 8,000 to 32,000 tokens depending on the version)
- Neurons per layer: Each layer has thousands of hidden units (e.g., ~12,288 in GPT-3)
- Purpose: Natural language understanding and generation across a wide range of tasks (conversation, summarization, coding, reasoning, etc.)
Q: In the tiny network demo, can you show the logic in one of the nodes. Is the logic the same in each node? And is it the same in the hidden and output nodes?
A: Yes, Binary Digit Trainer allows you to follow the training process and see the calculations being used at each step. If you take the Guided Tour by clicking on the “Take Guided Tour” button in the upper right corner, it will step you through one complete training iteration. There are also “?” icons sprinkled throughout the app that provide much more detail. Here’s a link to Binary Digit Trainer.
For the second part of your question, they are different. If you look at the two output neurons, they are almost the inverse of one another: the inhibitory and excitatory groups are almost literally flipped, and that comes as a result of the error calculation and back propagation.
Q: In the demo, I did not fully understand how back propagation and gradient descent were implemented.
A: Binary Digit Trainer allows you to follow the training process and see the calculations being used at each step (this includes how back propagation is done). If you take the Guided Tour by clicking on the “Take Guided Tour” button in the upper right corner, it will step you through one complete training iteration. There are also “?” icons sprinkled throughout the app that provide much more detail. Here’s a link to Binary Digit Trainer.
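For readers who want a code-level view to go with the guided tour, here is a hedged sketch of one back propagation and gradient-descent update for a network shaped like the demo's (softmax output with cross-entropy loss). The learning rate and initialization are arbitrary, and the app's exact implementation may differ.

```python
# One back propagation + gradient-descent step for an 81-24-2 network
# (setup repeated from the forward-pass sketch above so this block runs on its own).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((24, 81)) * 0.1, np.zeros(24)
W2, b2 = rng.standard_normal((2, 24)) * 0.1, np.zeros(2)
lr = 0.1  # arbitrary learning rate for illustration

x = rng.integers(0, 2, size=81).astype(float)
target = np.array([1.0, 0.0])

# Forward pass
hidden = sigmoid(W1 @ x + b1)
output = softmax(W2 @ hidden + b2)

# Output-layer error: with softmax + cross-entropy this is simply (output - target)
d_out = output - target
grad_W2, grad_b2 = np.outer(d_out, hidden), d_out

# Propagate the error back to the hidden layer (sigmoid derivative is h * (1 - h))
d_hidden = (W2.T @ d_out) * hidden * (1.0 - hidden)
grad_W1, grad_b1 = np.outer(d_hidden, x), d_hidden

# Gradient descent: move every weight and bias a small step against its gradient
W2 -= lr * grad_W2; b2 -= lr * grad_b2
W1 -= lr * grad_W1; b1 -= lr * grad_b1
```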
Q: For the model in the demo, which resource is most utilized? CPU, memory, or storage?
A: The model and the training data are all running in memory and do not require persistent storage. We did add the ability to checkpoint the model and save it to persistent storage and also included a way to load the model into memory from persistent storage, but this isn’t strictly required.
That said, the app is very much CPU bound due to the extensive mathematical operations performed for each training sample. Memory utilization for the app is about 1 MB.
Q: In the demo, what is the RAG? If RAG is not defined, can hallucinations happen often?
A: RAG or Retrieval Augmented Generation will be discussed in future sessions. With regards to the demo, it doesn’t use RAG and really doesn’t need it for this use case.
Q: Do you need a GPU to do AI? Do CPUs have the ability to do AI?
A: Both CPUs and GPUs can support AI models. As models expand in size and capability, GPUs generally become highly recommended. In future parts of our “AI Stack” webinar series, we'll dive into scale and deployment and cover some of this.
Q: Is there any formula to calculate the number of neurons or number of layers for a specific problem?
A: There are rules of thumb that help with that. For example, for the Binary Digit Trainer used in the demo, we did some experimentation and found that we needed 81 input neurons because we had a 9 by 9 input. For the hidden layer, based on the results we were seeing, it seemed like the sweet spot was going to be somewhere between 24 and 40. We chose 24 because it fit nicely on the screen for the demo; there wasn’t that big a difference in the accuracy of the model when we used 24 versus 40. That said, in general the answer is no, because there are so many different kinds of problems. As you experiment, you’ll find AI requires a lot of experimentation at every level. The nice part is that, with so much in the public domain, you can see that many models doing the same task end up in about the same size range. How small a model can be while still maintaining accuracy and quality can be one of the differentiators between models.
Q: Really good AI 101 presentation for beginners. Question: How is storage changing with the AI explosion, in particular SSD storage? What are the use cases?
A: The AI explosion has made a huge impact on the storage roadmap. Models like deep learning networks and LLMs (Large Language Models) have some primary requirements, including:
- Scale of data: The scale of data is growing from gigabytes to terabytes to petabytes and now to exabytes, for storing training data, checkpoints (intermediate snapshots of the trained model), and fully trained models. Typically, file and object storage are used in these scenarios, for “data at rest.” The physical medium can be HDDs, SSDs, or magnetic tape depending on priorities like cost, speed, and durability.
- The need for low latency and high performance: Faster data access is required during:
  - The training phase, to satisfy the requirements of multiple GPUs running in parallel
  - The inference phase, driven mainly by RAG and KV cache requirements in the actual deployment. Typically, AFAs (All Flash Arrays) are used for this purpose, for “data in use,” and the medium is SSDs.
Hence, the SSD landscape is evolving rapidly, coupled with technologies like NVMe-oF and CSDs (computational storage devices).
To understand more about the role of storage in optimizing AI workloads, please watch this earlier webinar from SNIA, AI Storage: The Critical Role of Storage in Optimizing AI Training Workloads.
Q: Do AI systems ever reset back to some point in time and have to re-learn some things?
A: Yes, that's why we have checkpointing. You could be running a training job for weeks, encounter an error, and have to rewind your training to a point before the error. Checkpoints let you do that, and you can take them at different frequencies. Some people checkpoint every couple of hours; the idea is to find a sweet spot between recovery time and the amount of time it takes to take the checkpoint.
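A minimal sketch of that trade-off in code, assuming PyTorch: the checkpoint interval is the knob that balances time lost on failure against time spent writing checkpoints. The tiny model, optimizer, and interval below are placeholders, not recommendations.

```python
# Minimal periodic-checkpoint sketch; assumes PyTorch. The model, optimizer,
# and checkpoint interval are placeholders, not recommendations.
import torch

model = torch.nn.Linear(81, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
CHECKPOINT_EVERY = 1000  # steps; the sweet spot depends on recovery time vs. write cost

def save_checkpoint(step: int, path: str = "checkpoint.pt") -> None:
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path: str = "checkpoint.pt") -> int:
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]  # resume from here instead of rewinding to step 0

for step in range(1, 10001):
    # ... forward pass, loss, backward pass, optimizer.step() would go here ...
    if step % CHECKPOINT_EVERY == 0:
        save_checkpoint(step)
```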
Q: Are HDDs of any use in AI storage systems or is it pure SSDs?
A: It really depends on the use case. For the things that need high throughput with lots of training data coming, images for example, latency and throughput are going to be critical and that would point to SSDs or something that gives you really low latency and high throughput. HDDs can be useful for taking a checkpoint and storing it, maybe eventually reloading it, but typically latency and throughput are incredibly important with these types of environments.
Q: What methods or mechanisms are available to prevent injecting incorrect data into the training models (e.g., deliberate malware)?
A: When preventing incorrect or "bad" data from getting into AI training models, several straightforward methods can be employed. Ensuring the quality of data used is crucial, and this can be achieved by using trusted data sources, automating checks with human oversight, and educating teams on data quality importance. Key strategies include data validation and sanitizing, anomaly and outlier detection, monitoring data sources, and regular model testing. By implementing these practices, the risk of "bad" data affecting AI models can be significantly reduced, protecting them from both unintentional mistakes and deliberate attempts to mislead. There are also many companies specializing in providing tools and services for data integrity and AI security, focusing on data preparation and validation.
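As one small, hedged example of the anomaly and outlier detection step mentioned above, here is a simple z-score filter over a numeric training feature; real pipelines combine several such checks with schema validation and human review.

```python
# Toy z-score outlier filter for one numeric training feature.
import numpy as np

def drop_outliers(values: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
    """Keep only samples within z_threshold standard deviations of the mean."""
    z = np.abs((values - values.mean()) / values.std())
    return values[z < z_threshold]

# One injected, implausible value stands out against the rest and is filtered.
feature = np.array([4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 5.1, 5.3, 4.7, 95.0])
print(drop_outliers(feature))  # the 95.0 sample is removed
```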
Q: What is vectorization?
A: Vectorization converts data into numerical vectors, enabling similarity searches. It's basically the process of taking data from the way a human would read it and turning it into a form the machine can look at from a relationship perspective. We will discuss RAG and vectorization further in upcoming sessions!
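As a toy illustration, here is a tiny bag-of-words vectorizer that turns sentences into numeric vectors so their similarity can be compared. Real systems use learned embedding models rather than raw word counts, so treat this purely as a sketch of the concept.

```python
# Toy bag-of-words vectorization: turn text into numeric vectors and compare them.
import numpy as np

docs = ["puppies want to eat", "a puppy wants to eat", "storage systems for ai training"]

# Build a vocabulary, then represent each document as a vector of word counts.
vocab = sorted({word for doc in docs for word in doc.split()})

def vectorize(doc: str) -> np.ndarray:
    words = doc.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = [vectorize(d) for d in docs]

# The two puppy sentences score much closer to each other than to the storage sentence.
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```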
Q: How can biases be introduced in the algorithm?
A: Bias can be introduced at several different stages of the training process. The table below provides some examples.
| Stage | How Bias is Introduced | Examples |
|---|---|---|
| Training Data | Over/under-representation, historical/social bias, skewed sampling | Face datasets skewed toward lighter skin tones; English-dominant text corpora |
| Data Labeling | Human annotators bring personal/cultural perspectives | Political text labeled as “toxic” depending on annotator’s worldview |
| Model Objectives | Optimization for prediction/loss amplifies majority patterns | Language models learning stereotypes because they reduce prediction error |
| Fine-Tuning (RLHF) | Human raters’ judgments shape model outputs | Reinforcement to prefer “polite” or “safe” answers, embedding cultural/political leanings |
| Deployment Filters | Provider moderation and policy guardrails | Blocking certain topics, rephrasing answers to fit corporate values |
| Feedback Loops | User feedback reinforces dominant group perspectives | Heavier feedback from one demographic shifts model alignment toward their views |
Beyond this, even the language the model is trained on can introduce cultural biases. This is one of the reasons that Sovereign AI is rapidly becoming a thing. For example, take a look at articles written on “Malaysia’s Sovereign AI Strategy”.
This is a Series!
We have an ambitious line-up of webinars in this "AI Stack" webinar series that continues to grow. Here are topics planned:
- Introduction to AI and Machine Learning
- Understanding Model Training
- Model Inferencing and Deployment Options
- Impact of AI on Network Interconnects
- Parallelism in AI (Model, Data, Tensor)
- Collective Communication Libraries (NCCL and RCCL)
- In-Network Collective Operations (SHARP and UET)
- MLOps Frameworks
- AI Infrastructure
- Management and Orchestration
- Security Considerations for AI
Learn more about the AI Stack series and questions from this webinar in this SNIA Experts on Data Podcast Interview and follow us on LinkedIn and X for upcoming dates.