SNIA Developer Conference September 15-17, 2025 | Santa Clara, CA
Rapidly increasing data sizes, the high cost of data movement, and the advent of fast, NVMe-over-Fabrics based flash enclosures have led to the exploration of computation near flash for more efficient and economical storage solutions. Ordered key-value stores such as LevelDB and RocksDB, commonly developed as software libraries that run inside application processes, are one of many storage functions that can potentially benefit from offloaded processing. Traditional host-managed key-value stores often exhibit long processing delays when their background worker threads cannot sort data as fast as a foreground writer application can write it, due to the large amount of data moved between the host and storage during the sort. Offloading key-value store computation to storage is interesting because it allows those data-intensive background tasks to be deferred and performed asynchronously on storage rather than on the host; this better hides background-work latency and prevents it from blocking foreground writes. Offloaded key-value stores are also interesting because the key-value interface itself provides sufficient knowledge of the data without requiring external metadata, leaving room for building additional kinds of indexes such as secondary indexes and histograms. In this talk, we present KV-CSD, a research collaboration between SK hynix and Los Alamos National Laboratory (LANL) that explores the lab's next-generation performance-tier storage designs. A KV-CSD is a key-value based computational storage device consisting of a ZNS NVMe SSD and a System-on-a-Chip (SoC) that implements an ordered key-value store atop the SSD. It supports insertion, deletion, histogram generation, point/range queries over primary keys, and point/range queries over user-defined secondary index keys, a capability that is often missing from today's popular software key-value stores. We show why computational storage in the form of a hardware-accelerated key-value store is particularly interesting to LANL's simulation-based science workflows, how it fits into LANL's overall storage infrastructure designs, and how we implement KV-CSD to address the bottlenecks scientists experience when high volumes of small data records, previously written by a massively parallel simulation, are later read for interactive data analytics with potentially very selective queries against multiple data dimensions.
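To make the interface concrete, below is a minimal, in-memory sketch of an ordered key-value store with a user-defined secondary index. The names and signatures (OrderedKVStore, put, get, range, secondary_range) are hypothetical illustrations, not the actual KV-CSD API; the device itself implements comparable functionality in hardware and firmware atop a ZNS SSD.

```python
import bisect

class OrderedKVStore:
    """Toy in-memory model of an ordered KV store with an optional
    user-defined secondary index (illustrative only, not KV-CSD's API)."""

    def __init__(self, secondary_key=None):
        self._keys = []               # sorted primary keys
        self._data = {}               # primary key -> value
        self._secondary_key = secondary_key
        self._sec_index = []          # sorted (secondary key, primary key) pairs

    def put(self, key, value):
        if key in self._data:
            if self._secondary_key:
                # drop the stale secondary-index entry for the old value
                old = (self._secondary_key(self._data[key]), key)
                del self._sec_index[bisect.bisect_left(self._sec_index, old)]
        else:
            bisect.insort(self._keys, key)
        self._data[key] = value
        if self._secondary_key:
            bisect.insort(self._sec_index, (self._secondary_key(value), key))

    def get(self, key):
        """Point query over the primary key."""
        return self._data.get(key)

    def range(self, lo, hi):
        """Range query over primary keys: yield (key, value) for lo <= key <= hi."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        for k in self._keys[i:j]:
            yield k, self._data[k]

    def secondary_range(self, lo, hi):
        """Range query over the user-defined secondary index."""
        i = bisect.bisect_left(self._sec_index, (lo,))
        for sec, k in self._sec_index[i:]:
            if sec > hi:
                break
            yield k, self._data[k]

# Hypothetical usage: index simulation records by a data dimension ("energy")
# so a very selective analytics query does not scan the whole dataset.
store = OrderedKVStore(secondary_key=lambda v: v["energy"])
store.put(b"particle-0001", {"energy": 4.2, "x": 0.1})
store.put(b"particle-0002", {"energy": 1.7, "x": 0.9})
hits = list(store.secondary_range(1.0, 2.0))   # -> [(b"particle-0002", ...)]
```

The secondary index simply maps a value-derived key back to the primary key, which is what allows selective queries against additional data dimensions without a full scan; in an offloaded store this index maintenance happens on the device rather than on the host.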
Compute, memory, storage, and connectivity demands are forcing the industry to adapt as it meets the expanding needs of cloud, edge, enterprise, 5G, and high-performance computing. UCIe (Universal Chiplet Interconnect Express) is an open industry standard founded by leaders in semiconductors, packaging, IP, foundries, and cloud services to address customer requests for more customizable package-level integration. The organization is also fostering an open chiplet ecosystem by offering high-bandwidth, low-latency, power-efficient, and cost-effective on-package connectivity between chiplets. The UCIe 1.0 specification provides a fully defined stack that comprehends plug-and-play interoperability of chiplets on a package, similar to the seamless interplay on a board with off-package interconnect standards such as PCI Express®, Universal Serial Bus (USB), and Compute Express Link™ (CXL™). This presentation explores the industry demand that brought about the UCIe specification and shares how end users can mix and match chiplet components from a multi-vendor ecosystem for System-on-Chip (SoC) construction, including customized SoCs.
For the past three decades, PCI-SIG® has delivered a succession of industry-leading PCI Express® (PCIe®) specifications that remain ahead of the increasing demand for a high-bandwidth, low-latency interconnect for compute-intensive systems in diverse market segments, including data centers, Artificial Intelligence and Machine Learning (AI/ML), high-performance computing (HPC), and storage applications. In early 2022, PCI-SIG released the PCIe 6.0 specification to members, doubling the data rate of the PCIe 5.0 specification to 64 GT/s (up to 256 GB/s for a x16 configuration). To achieve high data transfer rates with low latency, PCIe 6.0 technology adds innovative new features such as Pulse Amplitude Modulation with 4 levels (PAM4) signaling, low-latency Forward Error Correction (FEC), and Flit-based encoding. PCIe 6.0 technology is an optimal solution for Artificial Intelligence and Machine Learning applications, which often require high-bandwidth, low-latency transport channels. This presentation will explore the benefits of PCIe 6.0 architecture for storage and AI/ML workloads and its impact on next-generation cloud data centers. Attendees will also learn about potential AI/ML use cases for PCIe 6.0 technology. Finally, the presentation will provide a preview of what is coming next for PCIe specifications.
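As a rough back-of-the-envelope check on those headline numbers (raw signaling rate only, ignoring FLIT framing, FEC, and other protocol overhead), the following sketch shows how the x16 figures fall out; the 256 GB/s number quoted above is the bidirectional aggregate.

```python
# Rough PCIe 6.0 bandwidth arithmetic (raw signaling rate, before
# FLIT/FEC and other protocol overhead).
GT_PER_S = 64            # 64 GT/s per lane; PAM4 carries 2 bits per UI,
                         # so the channel runs at 32 GBaud
LANES = 16               # x16 configuration

gbits_per_direction = GT_PER_S * LANES           # 1024 Gb/s
gbytes_per_direction = gbits_per_direction / 8   # 128 GB/s each way
aggregate = 2 * gbytes_per_direction             # 256 GB/s bidirectional

print(f"{gbytes_per_direction:.0f} GB/s per direction, "
      f"{aggregate:.0f} GB/s aggregate for x16")
```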
Modern AI systems usually require diverse data processing and feature engineering at tremendous scale and employ heavy, complex deep learning models that require expensive accelerators or GPUs. This leads to the typical design of running data processing and AI training on two separate platforms, which causes severe data movement issues and creates big challenges for efficient AI solutions. One purpose of AI democratization is to converge the software and hardware infrastructure and unify data processing and training on the same cluster, where a high-performance, scalable data platform is a foundational component. In this session, we will introduce the motivations and challenges of AI democratization, then propose a data platform architecture for end-to-end (E2E) AI systems from both software and hardware infrastructure perspectives. It includes a distributed compute and storage platform, parallel data processing, and a connector to deep learning training frameworks. We will also showcase how this data platform improves the pipeline efficiency of democratized AI solutions on a commodity CPU cluster for several recommender system workloads, such as DLRM, DIEN, and WnD, with orders-of-magnitude performance speedups.
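As a loose, hypothetical sketch of the connector idea (the abstract does not prescribe a specific framework; all function names below are illustrative stand-ins), preprocessed partitions can be streamed straight into a training loop as batches, so feature engineering and training share the same cluster rather than shuttling data between two platforms.

```python
from multiprocessing import Pool

def transform(record):
    # Placeholder feature transform; real pipelines would do feature
    # engineering such as hashing categorical features for DLRM-style models.
    return record * 2

def preprocess(partition):
    # Per-partition stage standing in for parallel data processing.
    return [transform(r) for r in partition]

def batches(partitions, workers=4, batch_size=4):
    # "Connector": stream preprocessed partitions to the training loop as
    # batches, so data processing and training run on the same cluster.
    with Pool(workers) as pool:
        for processed in pool.imap(preprocess, partitions):
            for i in range(0, len(processed), batch_size):
                yield processed[i:i + batch_size]

def train_step(batch):
    # Stand-in for a deep learning framework's training step.
    return sum(batch)

if __name__ == "__main__":
    partitions = [list(range(p * 10, (p + 1) * 10)) for p in range(4)]
    for batch in batches(partitions):
        train_step(batch)
```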
AI/ML is not new, but innovations in ML model development have made it possible to process data at unprecedented speeds. Data scientists have used standard POSIX file systems for years, but as scale and performance requirements have grown, many face new storage challenges. Samsung has been working with customers on new ways of approaching storage issues with object storage designed for use with AI/ML. Hear how software and hardware are evolving to allow unprecedented performance and scale of storage for machine learning.
We present RAINBLOCK, a public blockchain that achieves high transaction throughput without modifying the proof-of-work consensus. The chief insight behind RAINBLOCK is that while consensus controls the rate at which new blocks are added to the blockchain, the number of transactions in each block is limited by I/O bottlenecks. Public blockchains like Ethereum keep the number of transactions in each block low so that all participating servers (miners) have enough time to process a block before the next block is created. By removing the I/O bottlenecks in transaction processing, RAINBLOCK allows miners to process more transactions in the same amount of time. RAINBLOCK makes two novel contributions: the RAINBLOCK architecture that removes I/O from the critical path of processing transactions (txs), and the distributed, multi-versioned DSM-TREE data structure that stores the system state efficiently. We evaluate RAINBLOCK using workloads based on public Ethereum traces (including smart contracts). We show that a single RAINBLOCK miner processes 27.4K txs per second (27× higher than a single Ethereum miner). In a geo-distributed setting with four regions spread across three continents, RAINBLOCK miners process 20K txs per second.
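To illustrate just the "multi-versioned" aspect in isolation (a toy, single-node model; the DSM-TREE described here is a distributed data structure, and this is not its actual design), each update below creates a new version of the state, and readers can query any earlier version without blocking writers.

```python
class MultiVersionedStore:
    """Toy multi-versioned key-value state: every put creates a new version,
    and reads can be served as of any past version (illustrative only)."""

    def __init__(self):
        self.version = 0
        self._history = {}   # key -> list of (version, value), ascending

    def put(self, key, value):
        self.version += 1
        self._history.setdefault(key, []).append((self.version, value))
        return self.version

    def get(self, key, at_version=None):
        """Read the latest value at or before `at_version` (default: newest)."""
        if at_version is None:
            at_version = self.version
        for v, value in reversed(self._history.get(key, [])):
            if v <= at_version:
                return value
        return None

# Example: transactions can execute against a fixed snapshot of the state
# while newer versions continue to be appended.
state = MultiVersionedStore()
snap = state.put("balance:alice", 100)
state.put("balance:alice", 70)
assert state.get("balance:alice", at_version=snap) == 100
assert state.get("balance:alice") == 70
```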
The next generation of automobiles is adopting PCIe for in-vehicle data communications, and the JEDEC Automotive SSD enables a high-performance, high-reliability solution for this shared, centralized storage. Features such as SR-IOV highlight the requirements of these computers on wheels, with multiple SoC functions for vehicle control, sensors, communications, entertainment, and artificial intelligence.