From NASD to DeltaFS: CMU and Los Alamos's Efforts in Building Large-Scale Filesystem Metadata | SNIA

Abstract

It has been a tradition that, every once in a while, we stop and reassess whether we need to build our next filesystems differently. A key previous effort was Carnegie Mellon University's NASD project, which decoupled filesystem data communication from metadata management and leveraged object storage devices for scalable data access. Now, as we enter the exascale age, we once again need bold ideas to advance parallel filesystem performance if we are to keep up with the rapidly increasing scale of today's massively parallel computing environments. In this presentation, we introduce DeltaFS, a research project at Carnegie Mellon University and Los Alamos National Lab. DeltaFS is based on the premise that at exascale and beyond, synchronization of anything global should be avoided. Conventional parallel filesystems, with fully synchronous and consistent namespaces, mandate synchronization with every file create and every other filesystem metadata operation. This must stop. At the same time, the idea of dedicating a single filesystem metadata service to meet the needs of all applications running on a single shared computing environment is archaic and inflexible. This too must stop. DeltaFS allows parallel computing jobs to self-commit their namespace changes to logs that are later published to a registry, avoiding the cost of global synchronization. Followup jobs selectively merge logs produced by previous jobs as needed, a principle we term No Ground Truth, which allows for scalable data sharing without requiring a global filesystem namespace. By following this principle, DeltaFS leans on the parallelism found when utilizing resources at the nodes where job processes run, improving metadata operation throughput as the number of job processes increases.
Synchronization is limited to an as-needed basis determined by the needs of followup jobs, through an efficient, log-structured format that lends itself to deep metadata writeback buffering and deferred metadata merging and compaction. Our evaluation shows that No Ground Truth enables more efficient inter-job communication, reducing overall workflow runtime by significantly improving client metadata operation latency and resource usage.
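
The log-and-merge workflow described above can be pictured with a small toy model. The sketch below is only an illustration under stated assumptions: `ChangeLog`, `Registry`, and `merge_view` are hypothetical names invented here, not part of DeltaFS, and an in-memory dictionary stands in for the log-structured on-disk format. It shows jobs recording namespace changes in private logs, publishing them to a registry, and a followup job building its own namespace view from only the logs it selects.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Append-only record of one job's namespace changes (hypothetical toy model)."""
    job_id: str
    entries: list = field(default_factory=list)  # (op, path, metadata) tuples

    def record(self, op, path, metadata=None):
        # The job commits its own change locally; no global lock is taken.
        self.entries.append((op, path, metadata))


class Registry:
    """Hypothetical registry that maps job ids to their published logs."""

    def __init__(self):
        self._logs = {}

    def publish(self, log):
        # Freeze the entries so a published log is immutable.
        self._logs[log.job_id] = tuple(log.entries)

    def fetch(self, job_id):
        return self._logs[job_id]


def merge_view(registry, job_ids):
    """Replay the selected logs, in the given order, into a private namespace view."""
    view = {}
    for job_id in job_ids:
        for op, path, metadata in registry.fetch(job_id):
            if op == "create":
                view[path] = metadata
            elif op == "delete":
                view.pop(path, None)
    return view


# Two jobs publish their logs independently.
registry = Registry()

sim = ChangeLog("sim")
sim.record("create", "/out/a.dat", {"size": 10})
sim.record("create", "/out/b.dat", {"size": 20})
registry.publish(sim)

cleanup = ChangeLog("cleanup")
cleanup.record("delete", "/out/b.dat")
registry.publish(cleanup)

# Followup jobs choose which logs to merge, so each may see a different namespace.
view_sim = merge_view(registry, ["sim"])
view_both = merge_view(registry, ["sim", "cleanup"])
```

In this toy model the two followup views differ (`view_sim` contains both files, `view_both` only `/out/a.dat`), which is the point of the sketch: there is no single authoritative namespace, only the view each job assembles from the logs it chooses to merge.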

HPC metadata challenges
Current status of filesystem metadata management
New ways of managing distributed filesystem metadata in HPC environments