SNIA Developer Conference September 15-17, 2025 | Santa Clara, CA
Emerging deep learning/machine learning and cloud-native applications at data center scale demand terabytes of data flowing across the storage/memory hierarchy, straining interconnect bandwidth and component capacities. The industry has responded with a wide range of solutions: process node shrinks, higher-capacity devices, new tiers, innovative form factors, new interconnect technologies and fabrics, new types of compute architectures, new algorithms, and more to creatively leverage storage/memory tiering.
New paradigms like computational storage and memory accelerator offloads are under intense exploration to process data where it resides and ease the movement of exponentially growing data. At the same time, progress has hit the proverbial wall: practical hurdles limit scalability at every level of the memory hierarchy. On-die SRAM scaling appears to have stalled completely going from 5nm to 3nm, limiting processor IPC (instructions per cycle) performance. Main memory bandwidth per processor core has grown dramatically more slowly than compute FLOPS. New memory tiers like CXL memory dramatically increase capacity per core, but at the expense of latency and the need for all-new infrastructure. QLC SSDs provide terabytes of capacity in a single device, but are limited by endurance and overprovisioning requirements. Staying within established power, thermal, and cost budgets at each level of the hierarchy, and at the system envelope level, is critical to easing new technology introductions.
To address these challenges, data center customers, component manufacturers, and researchers alike are investigating or have implemented innovations, such as lossless compression technology, at various levels of the hierarchy to increase capacity, enhance effective bandwidth, and stay within cost and power budgets. Compression requires more than just an algorithmic implementation: compaction, management, and software compatibility are critical considerations if it is to be widely deployable at scale.
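As a rough illustration of the appeal, the sketch below (a hypothetical back-of-the-envelope example, not drawn from any product in the session) compresses a sample page with Python's standard zlib and scales nominal device capacity and bandwidth by the measured ratio; real gains depend entirely on workload compressibility.

```python
import zlib

# Hypothetical device parameters (illustrative assumptions only).
RAW_CAPACITY_TB = 16.0     # physical capacity
RAW_BANDWIDTH_GBS = 7.0    # sequential read bandwidth

# Sample 4 KB "page" of text-like, compressible data. Encrypted or
# already-compressed data would see no gain at all.
page = (b"user_id=12345,event=click,ts=1700000000;" * 100)[:4096]

ratio = len(page) / len(zlib.compress(page, 6))

# If the device stores compressed data but exposes logical bytes,
# effective capacity and effective bandwidth scale with the ratio.
print(f"compression ratio:   {ratio:.2f}x")
print(f"effective capacity:  {RAW_CAPACITY_TB * ratio:.1f} TB")
print(f"effective bandwidth: {RAW_BANDWIDTH_GBS * ratio:.1f} GB/s")
```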
One size does not fit all: choices must be made between various industry-standard and proprietary algorithms operating at varying granularities: cache line, page, or file. CXL memory-semantic SSDs are emerging; compression technology must integrate with CXL.io and CXL.mem semantics, and dynamic capacity has to be addressed. Offload accelerators are now available within several platform ingredients, but choices need to be made carefully between processor-integrated accelerators, cores on SmartNICs ("DPUs", "IPUs"), IP/firmware integrated into SSD and CXL controllers/switches, AFUs (Accelerator Functional Units) on specialized FPGAs, and pure software offloads.
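Granularity is one of the sharpest trade-offs: a compressor sees far less context in a 64 B cache line than in a 4 KiB page or a whole file, and per-block framing overhead weighs more heavily on small blocks. Below is a minimal sketch of the effect, using zlib on synthetic data purely as a stand-in for whichever algorithm a design actually adopts.

```python
import zlib

# Synthetic, moderately compressible stream (illustrative only).
data = b"temp=21.5C,fan=1200rpm,status=OK;" * 4096

def overall_ratio(buf: bytes, chunk: int) -> float:
    """Compress buf in independent chunks, as a device operating at that
    granularity would, and return the overall compression ratio."""
    compressed = sum(len(zlib.compress(buf[i:i + chunk]))
                     for i in range(0, len(buf), chunk))
    return len(buf) / compressed

for name, chunk in [("cache line (64 B)", 64),
                    ("page (4 KiB)", 4096),
                    ("file (whole stream)", len(data))]:
    print(f"{name:20s} -> {overall_ratio(data, chunk):.2f}x")
```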
In this panel session, we will explore the need for, and the opportunities, challenges, and implications of, emerging data compression techniques and accelerators associated with storage and memory technologies, through the diverse viewpoints of ecosystem participants: an SoC architect, technologists in the storage/memory device and controller space, an academic researcher in the storage and systems domain, and a hardware IP provider. We will simulate the type of discussion that typically takes place between technologists, architects, and end customers to meet design and TCO requirements, as well as the requirements for integrating into existing kernel and application software stacks. Attendees will have an opportunity to ask questions of the panel and share their collective industry and research insights.
Compute, memory, storage, and connectivity demands are forcing the industry to adapt as it meets the expanding needs of cloud, edge, enterprise, 5G, and high-performance computing. UCIe (Universal Chiplet Interconnect Express) is an open industry standard founded by leaders in semiconductors, packaging, IP supply, foundries, and cloud services to address customer requests for more customizable package-level integration. The organization is also fostering an open chiplet ecosystem by offering high-bandwidth, low-latency, power-efficient, and cost-effective on-package connectivity between chiplets. The UCIe 1.0 specification provides a fully defined stack that enables plug-and-play interoperability of chiplets on a package, similar to the seamless interplay on a board enabled by off-package interconnect standards such as PCI Express®, Universal Serial Bus (USB)®, and Compute Express Link™ (CXL™). This presentation explores the industry demand that brought about the UCIe specification and shares how end users can easily mix and match chiplet components provided by a multi-vendor ecosystem for System-on-Chip (SoC) construction, including customized SoCs.
For the past three decades, PCI-SIG® has delivered a succession of industry-leading PCI Express® (PCIe®) specifications that remain ahead of the increasing demand for a high-bandwidth, low-latency interconnect for compute-intensive systems in diverse market segments, including data centers, Artificial Intelligence and Machine Learning (AI/ML), high-performance computing (HPC), and storage applications. In early 2022, PCI-SIG released the PCIe 6.0 specification to members, doubling the data rate of the PCIe 5.0 specification to 64 GT/s (up to 256 GB/s for a x16 configuration). To achieve high data transfer rates with low latency, PCIe 6.0 technology adds innovative new features such as Pulse Amplitude Modulation with 4 levels (PAM4) signaling, low-latency Forward Error Correction (FEC), and Flit-based encoding. PCIe 6.0 technology is an optimal solution for Artificial Intelligence and Machine Learning applications, which often require high-bandwidth, low-latency transport channels. This presentation will explore the benefits of PCIe 6.0 architecture for storage and AI/ML workloads and its impact on next-generation cloud data centers. Attendees will also learn about potential AI/ML use cases for PCIe 6.0 technology. Finally, the presentation will provide a preview of what is coming next for PCIe specifications.
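To make the headline figure concrete, the arithmetic below reproduces the 256 GB/s number for a x16 link; note that it counts both directions and ignores flit framing and FEC overhead, so it is a best-case approximation.

```python
# PCIe 6.0 back-of-the-envelope bandwidth (protocol overheads ignored).
GT_PER_LANE = 64       # 64 GT/s per lane; PAM4 carries 2 bits per symbol,
                       # so the underlying symbol rate is 32 GBaud.
LANES = 16             # x16 configuration
BITS_PER_BYTE = 8

per_direction = GT_PER_LANE * LANES / BITS_PER_BYTE   # GB/s, one direction
print(f"x16, one direction:   {per_direction:.0f} GB/s")      # 128 GB/s
print(f"x16, both directions: {2 * per_direction:.0f} GB/s")  # 256 GB/s
```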
Modern AI systems usually require diverse data processing and feature engineering at tremendous scale, and they employ heavy, complex deep learning models that require expensive accelerators or GPUs. This leads to the typical design of running data processing and AI on two separate platforms, which causes severe data movement issues and creates big challenges for efficient AI solutions. One purpose of AI democratization is to converge the software and hardware infrastructure and unify data processing and training on the same cluster, where a high-performance, scalable data platform is a foundational component. In this session, we will introduce the motivations and challenges of AI democratization, then propose a data platform architecture for end-to-end (E2E) AI systems from software and hardware infrastructure perspectives. It includes a distributed compute and storage platform, parallel data processing, and a connector to the deep learning training framework. We will also showcase how this data platform improved the pipeline efficiency of democratized AI solutions on a commodity CPU cluster for several recommender system workloads, such as DLRM, DIEN, and WnD, with orders-of-magnitude performance speedups.
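One concrete piece of such a platform is the connector between the parallel data-processing stage and the training framework. A minimal sketch of that idea, assuming PyTorch and hypothetical preprocessed shard files (the session's actual connector is not shown here), streams shards straight into the training loop:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class ShardStream(IterableDataset):
    """Streams preprocessed feature shards into training.

    Illustration only: a real connector would pull shards from the
    distributed storage layer and split work across loader workers.
    """
    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        for path in self.shard_paths:
            shard = torch.load(path)  # e.g. {"features": ..., "labels": ...}
            yield from zip(shard["features"], shard["labels"])

# Hypothetical shards written by the parallel preprocessing stage.
loader = DataLoader(ShardStream(["shard_000.pt", "shard_001.pt"]),
                    batch_size=256)

for features, labels in loader:
    pass  # forward/backward pass of the recommender model goes here
```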
AI/ML is not new, but innovations in ML model development have made it possible to process data at unprecedented speeds. Data scientists have used standard POSIX file systems for years, but as scale and performance needs have grown, many face new storage challenges. Samsung has been working with customers on new approaches to these storage issues with object storage designed for use with AI/ML. Hear how software and hardware are evolving to allow unprecedented performance and scale of storage for machine learning.
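Many such object stores expose an S3-compatible interface, so a training pipeline can fetch data with a few calls. A minimal sketch follows; the endpoint, bucket, and key are hypothetical, and boto3 is assumed as the client rather than any vendor-specific SDK.

```python
import boto3

# Hypothetical endpoint; any S3-compatible object store would work.
s3 = boto3.client("s3", endpoint_url="http://object-store.local:9000")

resp = s3.get_object(Bucket="training-data", Key="shards/epoch0/part-0000")
payload = resp["Body"].read()  # raw bytes handed to the ML input pipeline
print(f"fetched {len(payload)} bytes")
```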
We present RAINBLOCK, a public blockchain that achieves high transaction throughput without modifying the proof-of-work consensus. The chief insight behind RAINBLOCK is that while consensus controls the rate at which new blocks are added to the blockchain, the number of transactions in each block is limited by I/O bottlenecks. Public blockchains like Ethereum keep the number of transactions in each block low so that all participating servers (miners) have enough time to process a block before the next block is created. By removing the I/O bottlenecks in transaction processing, RAINBLOCK allows miners to process more transactions in the same amount of time. RAINBLOCK makes two novel contributions: the RAINBLOCK architecture that removes I/O from the critical path of processing transactions (txs), and the distributed, multi-versioned DSM-TREE data structure that stores the system state efficiently. We evaluate RAINBLOCK using workloads based on public Ethereum traces (including smart contracts). We show that a single RAINBLOCK miner processes 27.4K txs per second (27× higher than a single Ethereum miner). In a geo-distributed setting with four regions spread across three continents, RAINBLOCK miners process 20K txs per second.
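The DSM-TREE's defining property is multi-versioning: readers fetch state as of a given version while writers append newer ones. The toy sketch below is not the paper's distributed data structure, merely a single-node illustration of versioned reads under that idea.

```python
import bisect

class MultiVersionedMap:
    """Toy multi-versioned key-value map (illustration only; RAINBLOCK's
    DSM-TREE is distributed, which this is not)."""

    def __init__(self):
        self._store = {}  # key -> [(version, value), ...], versions ascending

    def put(self, key, version, value):
        # Assumes versions arrive in increasing order, as block heights do.
        self._store.setdefault(key, []).append((version, value))

    def get(self, key, version):
        """Return the value as of `version` (latest entry <= version)."""
        entries = self._store.get(key, [])
        versions = [v for v, _ in entries]
        i = bisect.bisect_right(versions, version)
        return entries[i - 1][1] if i else None

m = MultiVersionedMap()
m.put("balance/alice", 1, 100)
m.put("balance/alice", 5, 80)
print(m.get("balance/alice", 3))  # -> 100: the version-1 state is preserved
print(m.get("balance/alice", 7))  # -> 80
```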
The next generation of automobiles is moving to the adoption of PCIe for data communications in vehicles, and the JEDEC Automotive SSD solution provides a high-performance, high-reliability option for this shared, centralized storage. Features such as SR-IOV highlight the requirements of these computers on wheels, with multiple SoC functions for vehicle control, sensors, communications, entertainment, and artificial intelligence.