Towards Memory-Efficient RAG Pipelines with CXL Technology
Various stages of an AI-inference RAG pipeline process large amounts of data. In particular, preparing data to create vector embeddings and inserting them into a Vector DB consumes a large amount of transient memory: the embedding buffers are needed only until insertion completes, yet they must be resident in full during ingestion. Furthermore, the memory demand of the search phase of a RAG pipeline grows with the size of the index structures, the number of parallel queries, and similar factors.
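To make this memory behavior concrete, the sketch below is a minimal, hypothetical example (not taken from the source): it uses FAISS as a stand-in Vector DB and random vectors in place of a real embedding model, and the DIM, N_DOCS, and N_QUERIES values are illustrative assumptions.

    import numpy as np
    import faiss  # assumed vector index library; any index with a similar API works

    DIM = 768          # embedding dimensionality (model-dependent)
    N_DOCS = 100_000   # number of document chunks to ingest
    N_QUERIES = 64     # concurrent query batch in the search phase

    # --- Ingestion phase ---
    # Stand-in for a real embedding model (e.g., a transformer encoder).
    # Materializing the full embedding matrix in host DRAM is the transient
    # memory spike described above: N_DOCS * DIM * 4 bytes (~300 MB here),
    # held only until the vectors are inserted into the index.
    embeddings = np.random.rand(N_DOCS, DIM).astype("float32")

    index = faiss.IndexFlatL2(DIM)  # flat L2 index; tree/graph indexes can be larger still
    index.add(embeddings)           # copies vectors into the index's own memory
    del embeddings                  # the transient buffer can be freed after insertion

    # --- Search phase ---
    # Memory here scales with the resident index size and the number of
    # concurrent queries being served.
    queries = np.random.rand(N_QUERIES, DIM).astype("float32")
    distances, ids = index.search(queries, 10)  # top-10 neighbors per query
    print(ids.shape)                            # (64, 10)

Under these assumptions, both the short-lived ingestion buffer and the long-lived but capacity-hungry index are natural candidates for placement in a CXL-attached memory tier, keeping scarce local DRAM free for latency-critical state.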