Global Distributed Client-Side Caching for HPC/AI Storage Systems
HPC and AI workloads process massive datasets and execute complex computations at exascale to deliver time-critical insights. In distributed environments where compute nodes coordinate and share results through the storage system, communication overhead can become a critical bottleneck. This challenge underscores the need for storage solutions that give compute clusters scalable, parallel access at microsecond latencies.
Caching can reduce communication costs when implemented on either servers or clients. Servers, in this context, are the data servers that provide file system and object store functionality, while clients are the storage clients running on the compute nodes of HPC/AI clusters that access and retrieve data from those servers. Server-side caching, however, is limited by the fixed memory and network bandwidth of individual servers. Traditional client-side caching, on the other hand, is typically node-local, which limits data reuse across the cluster and leads to redundant caching of the same data on many nodes. Furthermore, without a shared global view, keeping caches consistent across nodes becomes challenging, further diminishing their effectiveness.
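To make the duplication problem concrete, a shared global view can assign each object a single owner among the client nodes, so every client resolves a key to the same cache location instead of populating one copy per node. The sketch below is illustrative only: the node names, hash choice, and virtual-node count are assumptions for the example, not details of our framework.

import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: every client maps a key to the same
    single owner node, so the cluster holds one cached copy instead of
    one per node. All names and parameters here are illustrative."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node) ring points
        for node in nodes:
            for v in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{node}#{v}"), node))

    @staticmethod
    def _hash(s):
        # 64-bit prefix of SHA-1; any uniform hash would do.
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def owner(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        i = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[i][1]

ring = HashRing([f"client-{i}" for i in range(8)])
print(ring.owner("/data/block-42"))  # same owner from every client node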
Global distributed client-side caching over high-speed interconnects is attractive because it leverages the larger aggregate resources available across client nodes, such as DRAM, local SSDs, network bandwidth, and RDMA capabilities, and it scales independently of the number of server nodes. However, fully realizing these benefits demands an efficient caching framework underpinned by carefully tuned policies for managing these valuable resources. In this presentation, we detail the design and implementation of an efficient distributed client-side caching framework that addresses these challenges.
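As a rough illustration of the lookup path such a framework might follow, the sketch below layers a local DRAM tier over a peer tier reached through the global view, with the data server as the final fallback. The class name, the capacity parameter, and both read helpers are hypothetical placeholders; a real implementation might replace the peer read with, for example, a one-sided RDMA read into the owner's registered cache memory.

from collections import OrderedDict

class GlobalClientCache:
    """Sketch of a tiered lookup: local DRAM, then the owning peer's
    cache over the interconnect, then the data server. Class and method
    names, the capacity, and both read helpers are hypothetical."""

    def __init__(self, ring, capacity=1024):
        self.ring = ring           # e.g. the HashRing from the sketch above
        self.capacity = capacity   # entries the local DRAM tier may hold
        self.dram = OrderedDict()  # LRU-ordered local DRAM tier

    def get(self, key):
        # Tier 1: local DRAM hit.
        if key in self.dram:
            self.dram.move_to_end(key)          # refresh LRU recency
            return self.dram[key]
        # Tier 2: the single owner identified by the shared global view.
        data = self._read_from_peer(self.ring.owner(key), key)
        if data is None:
            # Tier 3: fall back to the data server.
            data = self._read_from_server(key)
        self._insert_local(key, data)
        return data

    def _insert_local(self, key, data):
        while len(self.dram) >= self.capacity:  # evict least-recently used
            self.dram.popitem(last=False)
        self.dram[key] = data

    def _read_from_peer(self, owner, key):
        # Placeholder: a real framework might issue a one-sided RDMA read
        # against the owner's cache; a miss is modeled here as None.
        return None

    def _read_from_server(self, key):
        # Placeholder for a parallel file system / object store request.
        return f"<data for {key}>"

cache = GlobalClientCache(ring)        # reuses the ring defined above
cache.get("/data/block-42")            # peer miss -> server fetch -> cached
print(cache.get("/data/block-42"))     # now served from the local DRAM tier

The tiering order mirrors the resource argument above: local DRAM is cheapest, a peer's cache over the interconnect is still far cheaper than a server round trip, and the servers are touched only on a cluster-wide miss.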