In our webinar, Accelerating AI with Real-World CXL Platforms, SNIA Persistent Memory Special Interest Group chair Arthur Sainio and members Andy Mills of SMART Modular Technologies, Anil Godbole of Intel Corporation, and Steve Scargall of MemVerge discussed the CXL® advantage for real-world AI workloads, how to boost AI workload performance using CXL memory, and how to deploy CXL in next-generation AI/ML systems. In this blog post, we answer your questions and welcome any additional questions at askcms@snia.org.

Q. Was any analysis done on the total power consumption of a DRAM/VRAM-only setup vs. a CXL-interleaved setup?

A. The goal of disaggregating memory is not simply to move it outside the CPU; it is to control more efficiently how much memory we distribute across those CPUs, so we are no longer overprovisioning. To answer your question directly, there are no specific power measurements yet, but that is something we will be working on and publishing. To achieve the same bandwidth we get with interleaved CXL memory, a DRAM-only configuration has to run its DRAM channels at higher speeds, so we expect the overall power to be roughly the same between the two cases, given that the total memory capacity is the same. Obviously, you also have to offset that against the networking, the additional appliance, and the other components that disaggregation adds.

Now we are moving into the practical phase of CXL. Yes, you always have to watch the Thermal Design Power (TDP) budget of the CPU: once the system runs hot, the CPU will come down in frequency. When we ran our tests, we assumed about 25 watts per CXL module, so we did not blow the power budget. We were able to operate the CPU at its rated speed and run these tests within the CPU's TDP envelope. If the CXL-interleaved case had drawn significantly more power, the CPU would have throttled down its speed.

Q.  Can GPU memory have a memory hierarchy of HBM -> Localized CXL type memory -> Local NVMe -> Networked NVMe? 

A. This is entirely doable. We are very interested in the NVIDIA Inference Xfer Library (NIXL) and the architecture frameworks highlighted in this webinar because they support this kind of tiering. The key-value (KV) block manager in NIXL is dedicated to managing data across such a hierarchy. Companies are looking at introducing CXL into that memory/storage hierarchy so that NVIDIA Dynamo and NIXL can read and write CXL as a new tier of memory. Weighted interleaving in Linux, which combines the capacity and bandwidth of DRAM plus CXL, also gives us the ability to use CXL in LLM solutions; a sketch of how that is configured follows below. So stay tuned.
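
Weighted interleaving is available in recent Linux kernels (6.9 and later) through the mempolicy interface, with per-node weights exposed under /sys/kernel/mm/mempolicy/weighted_interleave/. The following minimal sketch (not code from the webinar) asks the kernel to distribute a process's allocations across a DRAM node and a CXL node according to those weights; the node numbers (0 for DRAM, 1 for CXL) are assumptions and will differ by platform.

```cpp
/* Minimal sketch: request MPOL_WEIGHTED_INTERLEAVE for this process so its
 * allocations are spread across NUMA node 0 (DRAM, assumed) and node 1
 * (CXL memory, assumed) in proportion to the weights configured in
 * /sys/kernel/mm/mempolicy/weighted_interleave/nodeN (Linux 6.9+). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6   /* from linux/mempolicy.h, kernel 6.9+ */
#endif

int main(void)
{
    /* Bit mask of NUMA nodes to interleave across: nodes 0 and 1. */
    unsigned long nodemask = (1UL << 0) | (1UL << 1);

    /* set_mempolicy(2) applies the policy to this task's future allocations. */
    if (syscall(SYS_set_mempolicy, MPOL_WEIGHTED_INTERLEAVE,
                &nodemask, sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        return EXIT_FAILURE;
    }

    /* Pages faulted in from here on are placed on DRAM and CXL in the
     * configured ratio, combining the bandwidth and capacity of both tiers. */
    size_t len = 1UL << 30;                   /* 1 GiB working buffer */
    char *buf = (char *)malloc(len);
    if (buf == NULL)
        return EXIT_FAILURE;
    memset(buf, 1, len);                      /* touch pages so they are actually placed */

    printf("Touched %zu bytes under weighted interleave policy\n", len);
    free(buf);
    return 0;
}
```

In practice the per-node weights are typically chosen to reflect the relative bandwidth of the DRAM and CXL tiers, so that neither tier becomes the bottleneck.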

Q: Even though CXL has many advantages, I haven't seen much use of CXL in AI compute servers. What do you think are the reasons for this slow start?

A: CXL devices and GPUs share the same PCIe bus, so the server must be designed to accommodate the hardware that the intended application requires. However, CXL has advantages over other technologies such as NVMe, Ethernet, and RDMA, as it offers lower latency and higher bandwidth. When combined with CXL pooling or sharing, multiple AI servers can share memory and use CXL as a low-latency transport/communication path. CXL can also be used to offload the KV-cache and model weights from GPUs, with DMA operations allowing the GPU to control when data is moved; a hedged example of that flow follows below.
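
As an illustration of that offload path, here is a hedged CUDA host-code sketch (not any vendor's implementation): a host buffer is bound to a CXL memory NUMA node with libnuma, pinned, and then a KV-cache block is moved out of GPU HBM into it with an asynchronous DMA copy. The CXL node ID, the buffer size, and the assumption that CXL memory appears as a CPU-less NUMA node all depend on the platform.

```cpp
// Hedged sketch: offload a KV-cache block from GPU HBM to host memory that
// the OS has placed on a CXL NUMA node. Build with nvcc and link with -lnuma.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>
#include <numa.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

int main() {
    const size_t kv_block = 256UL << 20;   // 256 MiB KV-cache block (assumed size)
    const int cxl_node = 2;                // CXL memory as a CPU-less NUMA node (assumed ID)

    if (numa_available() < 0) { fprintf(stderr, "libnuma not available\n"); return 1; }

    // GPU-resident KV-cache block (stand-in for cache pages produced during inference).
    void *d_kv = nullptr;
    CHECK(cudaMalloc(&d_kv, kv_block));

    // Host buffer bound to the CXL node, touched so its pages land there, then
    // pinned so the GPU's DMA engine can write into it directly.
    void *h_kv = numa_alloc_onnode(kv_block, cxl_node);
    if (h_kv == nullptr) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }
    memset(h_kv, 0, kv_block);
    CHECK(cudaHostRegister(h_kv, kv_block, cudaHostRegisterDefault));

    // Asynchronous DMA: the GPU controls when the block moves, and compute on
    // other streams can overlap with the transfer.
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMemcpyAsync(h_kv, d_kv, kv_block, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    printf("Offloaded %zu bytes of KV-cache to CXL-backed host memory\n", kv_block);

    CHECK(cudaHostUnregister(h_kv));
    numa_free(h_kv, kv_block);
    CHECK(cudaFree(d_kv));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}
```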

Thanks for your interest in SNIA and its educational content. Learn more about SNIA's work in memory and with CXL in our Educational Library. Simply enter CXL and Memory as search terms to see the list of webinars, slide PDFs, and white papers on these and many other subjects.