Towards memory-efficient RAG pipelines with CXL technology
Several stages of an AI inference RAG pipeline process large amounts of data. In particular, preparing data to create vector embeddings and inserting them into a Vector DB consumes a large amount of transient memory. The search phase of a RAG pipeline likewise increases memory consumption, depending on the size of the index trees, the number of parallel queries, and other factors.
We observe that peak memory consumption depends on the load the RAG pipeline is under: whether vectors are being inserted or updated, and other transient, dynamic behaviors. Provisioning locally attached memory to cover this peak is therefore inefficient, since most of that memory sits idle outside the bursts.
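To make the provisioning argument concrete, the following sketch compares DRAM cost when each server is sized for its own peak against a configuration sized for steady state with a shared CXL pool absorbing transient spikes. All capacities, server counts, and per-GB costs here are hypothetical placeholders, not measurements from the paper.

```python
# Hypothetical sketch: peak-provisioned local DRAM vs. steady-state DRAM
# plus a shared CXL pool. All numbers below are illustrative assumptions.

DRAM_COST_PER_GB = 4.0      # assumed local DRAM cost ($/GB)
CXL_COST_PER_GB = 2.0       # assumed pooled CXL memory cost ($/GB)
NUM_SERVERS = 8

steady_state_gb = 128       # per-server need outside ingest/search bursts
peak_gb = 512               # per-server peak during embedding + insertion
pool_gb = 768               # shared CXL pool sized for non-overlapping spikes

# Baseline: every server carries enough local DRAM for its own peak.
cost_peak_provisioned = NUM_SERVERS * peak_gb * DRAM_COST_PER_GB

# Pooled: servers carry DRAM for steady state; spikes spill to the pool.
cost_pooled = (NUM_SERVERS * steady_state_gb * DRAM_COST_PER_GB
               + pool_gb * CXL_COST_PER_GB)

savings = 1 - cost_pooled / cost_peak_provisioned
print(f"peak-provisioned: ${cost_peak_provisioned:,.0f}")
print(f"pooled:           ${cost_pooled:,.0f}")
print(f"savings:          {savings:.0%}")
```

Under these assumed numbers the pooled configuration cuts memory cost by roughly two thirds; the actual saving depends on how correlated the spikes across servers are, which determines how small the shared pool can be.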
To improve the efficiency of RAG pipelines under these scenarios, we propose using CXL-based memory to absorb the memory peaks while reducing the amount of statically provisioned local memory.
Specifically, we explore two approaches:
1.) CXL Memory Pooling: provisioning memory on demand to meet dynamic and transient needs, thereby reducing locally attached memory costs.
2.) CXL Memory Tiering: using cheaper, higher-capacity memory as a second tier to reduce locally attached memory costs.
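The tiering approach above can be sketched as a simple capacity-based placement policy: the hottest data stays in fast local DRAM while the remainder spills to the larger, cheaper CXL tier. The object names, sizes, and access counts below are hypothetical, and real systems (for example, Linux kernel memory tiering with page demotion) make this decision transparently at page granularity rather than per object.

```python
# Minimal sketch of a hot/cold tiering policy. Hottest objects are kept in
# local DRAM up to its capacity; everything else goes to the CXL tier.
# All data below is illustrative, not taken from a real RAG deployment.

def place(objects, dram_capacity_gb):
    """Greedily assign objects (hottest first) to DRAM; the rest to CXL."""
    dram, cxl, used = [], [], 0
    for name, size_gb, accesses in sorted(objects, key=lambda o: -o[2]):
        if used + size_gb <= dram_capacity_gb:
            dram.append(name)
            used += size_gb
        else:
            cxl.append(name)
    return dram, cxl

# (name, size in GB, access frequency) -- hypothetical workload
objects = [
    ("index_top_levels", 16, 9000),   # hot: upper levels of the index tree
    ("query_buffers",     8, 7000),   # hot: per-query scratch space
    ("index_leaves",     96, 1200),   # warm: bulk of the vector index
    ("raw_documents",   200,  150),   # cold: source corpus for embedding
]

dram, cxl = place(objects, dram_capacity_gb=32)
print("DRAM tier:", dram)
print("CXL tier: ", cxl)
```

With a 32 GB DRAM budget, only the two hot structures fit locally and the bulk index and corpus land on the CXL tier, which is exactly the shape of deployment where a small, fast local tier plus a large, cheap far tier pays off.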
We survey the current state of open-source infrastructure supporting both solutions, and show that they can yield significant DRAM cost savings for a minimal tradeoff in performance. Additionally, we identify gaps in the open-source infrastructure and discuss ideas for bridging them going forward.