Agentic AI is progressing at an incredibly rapid pace, moving into real-world deployments. At our recent SNIA Cloud Storage Technologies webinar, “How Agentic AI Transforms the Role of Storage,” our SNIA experts, Himabindu Tummala, John Cardente and Fatemeh Azmandian examined how Agentic AI operates in, on, with, and for storage systems—transforming them from passive data repositories into active participants in decision-making and workflow execution. If you missed the live event, you can access it at the SNIA Educational Library along with the presentation slides. Of course, with a technology as disruptive as Agentic AI, there were a lot of questions from our audience, which our speakers have kindly answered here.
Q: “How is KV Cache stored?
A: Each transformer layer keeps its own K/V tensors that grow with tokens, typically laid out [num_heads, seq_len, head_dim], and during decoding each layer appends the new keys/values so it can attend to all prior tokens without recomputation. In a hierarchical setup, the hot working set lives in HBM (GPU VRAM) for low‑latency reads; colder pages of the per‑layer KV can spill to CPU DRAM, and, for very long contexts, further offload to NVMe/SSD storage tiers with prefetching (paged KV) to bring blocks back to HBM when needed. This paged design lets multiple requests share prompt pages, reduces fragmentation, and balances latency vs. capacity by keeping the most‑needed KV in HBM while treating DRAM/NVMe as overflow.
Q: Does Mixture of Experts (MoE) require or need A2A, versus MCP, or are both still needed?
A: MoE is a technique, called sparse activation, that reduces the amount of computation required to do inference with an LLM. The way it works is that layers in the model are partitioned into two or more "experts." During inference, each token is routed to one of these experts. The net effect is that only a subset of model is computed. This allows the model to "learn more information" with processing all that information at inference time. All of this happens regardless of how the inference request is received. So, either MCP or A2A can be used to talk to an agent that uses an MoE reasoning model.
Q: Is there an MCP-based API that represents agent interaction for read/write data? "put" + "get"? "POSIX-like"?
A: The Model Context Protocol (MCP) supports resource access (like a read operation) and tool calls (for write operations), but it doesn’t directly mirror traditional REST or POSIX-style file operations. Instead, MCP uses a context-based, discoverable tool system, so it’s a bit more flexible than simple PUT or GET calls.
There are two ways we can achieve this: first, you can write your own MCP server that performs read and write operations on the file system using tools—such as a write_data tool for writing data and a read_data tool for reading it. Second, you can use prebuilt open‑source filesystem MCP servers that already provide this functionality, so you don’t need to implement your own server or tools. Once you have an MCP server in place, you simply integrate it into your agent and then invoke it to perform the required data operations.
Q: What will be the key performance and feature improvement needed for the Agentic AI comparing with the current NVMe SSD?
A: Looking at the different tiers of data in the agentic pipeline, NVMe SSDs will be asked to play a significantly different role, than just persistent block storage. The push into KeyValue data types as well as ephemeral storage, will push SSDs into a "Storage Adjacent" space, that is neither memory nor storage. This will change the economics of data placement but also change the perspective on common attributes of SSDs that have held true for decades, with persistence and endurance being two key attributes ripe for reinspection.
Q: Since we're all storage nerds, how do these workflows interact with various storage protocols like NVMe/TCP, NFS v4.2, S3, etc....
A: Agentic workloads span data retrieval, tool use, code execution, and long‑running plans, so teams combine NVMe/TCP, NFS v4.2, and S3 based on practical needs like locality, latency, throughput, durability, cost, and operational fit.
Q: When would you use an MCP vs A2A?
A: Agents should use MCP to access tools that, generally speaking, execute as deterministic and synchronous function calls. Agents should use A2A to interact with other agents that may execute delegated tasks non-deterministically and asynchronously with the possibility of requesting human-in-the-loop feedback or approval. Although future versions of MCP may support asynchronous tool calls, the A2A and MCP projects are actively collaborating within the Linux Foundation to ensure they remain complementary and focused on solving different aspects of agentic interoperability.
Q: As a storage vendor what type of workloads should be validated.
A: You should validate vector databases, knowledge graph databases along with inference and training.
Q: How is the KV Cache stored - as blocks or objects or files?
A: KV Cache stores KV on storage systems which is (G4 in NVIDIA terms) as file and objects with key as file name or object key.
Q: Do reasoning models follow the complete observe-reason-act loop for each and every iteration of the auto-regressive loop?
A: Auto-regressive token generation happens during the reasoning stage. Basically, the model will process the context, reason about it, and then generate a response, which may be a tool call or a final response. If it's a tool call, that is done and the results are added to the context that is processed by the next auto-regressive token generation step.
Q: How does the model decide when it’s done or takes another action?
A: When a language model generates text, it does so one token at a time: it looks at everything it has produced so far, runs that context through fixed mathematical machinery like self‑attention, and comes up with probabilities for what the next token could be. One of those options is then chosen based on sampling settings such as temperature, added to the context, and the process repeats. There’s no inner reflection or moment where the model “decides” it’s time to stop or call a tool — it’s just continuing the same prediction step over and over. When a model seems to switch to a tool or hand off to another expert, what’s really happening is that it outputs a particular learned pattern of tokens, and the surrounding system notices that pattern and takes action. In the same way, generation only ends because the model emits an end‑of‑sequence marker or because the external system stops it based on preset rules, not because the model itself knows it’s finished.
Q: Can you please address in more detail the 7x+ storage requirement you mentioned? Why is that required?
A: Agentic AI drives a major increase in context‑building workloads, persisting multiple representations—embeddings and vector indexes, knowledge graphs, and AI‑enriched metadata. This context layer can add ~7×–10× more stored data than the raw corpus.
Leave a Reply