Enterprises are rushing to adopt AI inference solutions with RAG to solve business problems, but enthusiasm for the technology's potential is outpacing infrastructure readiness. Using more complex models and larger RAG data sets quickly becomes prohibitively expensive, or simply impossible, because of the cost of memory. Using open-source software components and high-performance NVMe SSDs, we explore two different but related approaches to solving these challenges and unlocking new levels of scale: offloading model weights to storage with DeepSpeed, and offloading RAG data to storage with DiskANN. Combining the two lets us (a) run more complex models on GPUs than was previously possible, and (b) achieve greater cost efficiency when working with large amounts of RAG data. We'll talk through the approach, share benchmarking results, and demo how the solution works in an example use case.
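To make the model-weight side concrete, the sketch below shows one way DeepSpeed's ZeRO stage-3 parameter offload can be pointed at an NVMe path for inference. This is a minimal illustration, not the configuration used in our benchmarks; the model name, NVMe mount point, and tuning values are placeholder assumptions.

```python
# Minimal sketch: DeepSpeed ZeRO stage-3 inference with parameters offloaded to NVMe.
# Model name, nvme_path, and tuning values are illustrative placeholders, not our benchmark settings.
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # required config key even for inference-only runs
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                         # partition parameters so they can be offloaded
        "offload_param": {
            "device": "nvme",               # stream weights from SSD instead of pinning them in DRAM/HBM
            "nvme_path": "/mnt/nvme0/ds_offload",   # placeholder NVMe mount point
            "pin_memory": True,
        },
    },
    "aio": {"block_size": 1048576, "queue_depth": 16},  # async I/O tuning for the NVMe reads
}

model_name = "facebook/opt-30b"             # placeholder: any model too large for GPU memory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Wrap the model in a DeepSpeed engine; ZeRO-3 handles gathering weights from NVMe on demand.
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()

inputs = tokenizer("Summarize the incident in camera feed 12:", return_tensors="pt").to(engine.device)
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```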
Learning Objectives
The opportunities and challenges associated with AI inference with RAG
The solution stack that enables offload of significant amounts of AI data from memory to SSD
The impact of SSD offload on DRAM usage, QPS, index build time, and recall (a DiskANN sketch follows this list)
The results of the SSD offload approach in an example use case (traffic video)
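On the RAG-data side, a minimal sketch of building and querying an SSD-resident vector index through DiskANN's diskannpy bindings might look like the following. The embedding dimension, paths, and tuning parameters are assumptions for illustration, not the values behind the benchmark numbers presented in the session.

```python
# Minimal sketch: build a DiskANN index on SSD and query it via the diskannpy bindings.
# Dimension, paths, data, and tuning parameters are illustrative assumptions.
import numpy as np
import diskannpy

DIM = 768                                   # placeholder embedding width
index_dir = "/mnt/nvme0/diskann_index"      # the index lives on NVMe, not in DRAM

# Stand-in for embeddings of RAG chunks (e.g. captions of traffic-video frames).
vectors = np.random.default_rng(0).random((100_000, DIM), dtype=np.float32)

diskannpy.build_disk_index(
    data=vectors,
    distance_metric="l2",
    index_directory=index_dir,
    complexity=64,                # candidate list size during build (higher = better graph, slower build)
    graph_degree=32,              # maximum out-degree of graph nodes
    search_memory_maximum=4.0,    # GB of DRAM allowed at search time
    build_memory_maximum=16.0,    # GB of DRAM allowed during build
    num_threads=8,
)

index = diskannpy.StaticDiskIndex(
    index_directory=index_dir,
    num_threads=8,
    num_nodes_to_cache=10_000,    # small hot set kept in DRAM; the rest is read from SSD per query
)

query = np.random.default_rng(1).random(DIM, dtype=np.float32)
ids, dists = index.search(query, k_neighbors=10, complexity=64, beam_width=4)
print(ids, dists)
```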
Main Speaker / Moderator
Track
Webform Submission ID
145