SNIA Developer Conference September 15-17, 2025 | Santa Clara, CA

Disrupting the GPU Hegemony: Can Smart Memory and Storage Redefine AI Infrastructure?

Abstract

AI infrastructure is dominated by GPUs—but should it be? As foundational model inference scales, performance bottlenecks are shifting away from compute and toward memory and I/O. HBM sits underutilized, KV-cache footprints explode (a rough sizing sketch follows this abstract), and model transfer times dominate pipeline latency. Meanwhile, compression, CXL fabrics, computational memory, and SmartNIC-enabled storage are emerging as powerful levers to close the tokens-per-second-per-watt gap.

This panel assembles voices from across the AI hardware and software stack to ask the hard question: can memory and storage innovation disrupt the GPU-centric status quo, or is AI destined to remain homogeneous? You'll hear from panelists spanning the stack: a computational HBM vendor (Numem), an AI accelerator startup (Recogni), a compression IP company (MaxLinear), a foundational model provider (Zyphra), and a cloud-scale storage architect (Solidigm). Together, they'll explore:

- Why decode-heavy inference is choking accelerators, even with massive FLOPs
- Whether inline decompression and memory tiering can fix HBM underutilization
- How model developers should (or shouldn't) design for memory-aware inference
- Whether chiplet- and UCIe-based systems can reset the balance of power in AI

Expect live debate, real benchmark data, and cross-layer perspectives on a topic that will define AI system economics in the coming decade. If you care about performance per watt, memory bottlenecks, or building sustainable AI infrastructure, don't miss this conversation.
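To give a sense of scale behind the "KV-cache footprints explode" claim, here is a minimal back-of-envelope sketch. Every figure in it (layer count, head count, context length, batch size, precision) is an illustrative assumption, not data from the panel or any named vendor.

# Back-of-envelope KV cache sizing for a decoder-only transformer.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Keys and values are both cached:
    # 2 * layers * batch * seq_len * kv_heads * head_dim elements.
    return 2 * num_layers * batch * seq_len * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical 70B-class model: 80 layers, 8 KV heads, head_dim 128, fp16,
# serving a batch of 32 requests at an 8K-token context.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8192, batch=32)
print(f"KV cache: {size / 2**30:.0f} GiB")  # ~80 GiB, on par with a single accelerator's entire HBM

Under these assumptions the cache alone competes with the model weights for HBM capacity, which is why compression, tiering, and CXL pooling keep coming up in this discussion.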

Learning Objectives

- Understand compression's role across memory, storage, and compute tiers in inference.
- Explore the real bottlenecks behind HBM underutilization and decoder latency in LLMs.
- Discover how inline decompression, CXL pooling, and new memory formats interact.
- Get an end-to-end view, from model compression to the DRAM subsystem, of optimizing tokens/sec/$ (a rough bandwidth-bound estimate follows below).
- Debate where responsibility lies for solving memory inefficiency in AI inference.
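As a companion to the tokens/sec/$ objective, a rough roofline-style estimate shows why decode throughput (and hence tokens per second per watt or per dollar) tends to track memory bandwidth rather than FLOPs. All bandwidth, power, and model-size numbers here are assumptions chosen for illustration, not benchmark results.

# Rough ceiling for autoregressive decode: each new token streams the full
# weight set plus the sequence's KV cache from HBM, so throughput is bounded
# by bandwidth / bytes-moved-per-token rather than by FLOPs.
# All numbers are illustrative assumptions, not benchmark data.

hbm_bandwidth_bytes_s = 3.35e12   # hypothetical accelerator HBM bandwidth (~3.35 TB/s)
weight_bytes = 70e9 * 2           # hypothetical 70B-parameter model in fp16
kv_bytes_per_step = 2.5 * 2**30   # per-sequence KV cache read at 8K context (see sketch above)
board_power_w = 700               # hypothetical board power in watts

tokens_per_s = hbm_bandwidth_bytes_s / (weight_bytes + kv_bytes_per_step)
print(f"~{tokens_per_s:.0f} tokens/s per unbatched sequence, "
      f"~{tokens_per_s / board_power_w:.3f} tokens/s/W")

Batching amortizes the weight reads, but the KV-cache term grows with batch size and context length, which is exactly the tension between abundant FLOPs and scarce memory bandwidth that the panel examines.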