Data Deduplication for Distributed Segmented Parallel Filesystem | SNIA

Abstract

This presentation explores the design ideas behind data de-duplication in a distributed segmented parallel file system (Ibrix). The large scale of the file system poses special challenges: many entry point servers generate new content simultaneously; metadata and directories are widely distributed; and the system can grow in both capacity and performance by adding new storage segments and destination storage servers. In adding the ability to de-duplicate data content, we have to preserve the flexibility and scalability of the original design. This presentation shows the key points of our de-duplication design: how to balance de-duplication efficiency against index size, how to use RAM efficiently, how to preserve the parallelism and efficiency of I/O streams, and how to avoid bottlenecks and scale linearly by adding more storage and servers.

Learning Objectives

Expose fundamentals of the highly distributed segmented parallel file system architecture
Review the challenges associated with data de-duplication in such an environment
Explore details of the design: indexes, data containers, representative indexing, evolution of the index
Review effectiveness of data placement and parallelism of I/O streams
Review the basis for scalability and parallelism
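
The trade-off between de-duplication efficiency and index size named in the abstract can be illustrated with a small sketch. The abstract does not describe how representative indexing works in Ibrix, so everything below is an assumption made for illustration only: fixed-size chunks, SHA-256 chunk hashes, a sampling rule that keeps a hash in RAM when its low bits are zero, and the hypothetical names SampledIndex, chunk_hashes and is_representative. It is not the presenters' design.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for the example


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks and return one SHA-256 digest per chunk."""
    return [hashlib.sha256(data[i:i + chunk_size]).digest()
            for i in range(0, len(data), chunk_size)]


def is_representative(digest: bytes, sample_bits: int) -> bool:
    """Assumed sampling rule: keep a hash if its lowest sample_bits bits are zero.

    Roughly 1 in 2**sample_bits hashes qualifies (sample_bits must be 0..8 here).
    """
    mask = (1 << sample_bits) - 1
    return (digest[-1] & mask) == 0


class SampledIndex:
    """Hypothetical in-RAM index that holds only representative chunk hashes."""

    def __init__(self, sample_bits: int = 4):
        self.sample_bits = sample_bits
        self.ram_index = {}    # representative hash -> container id
        self.containers = []   # container id -> set of all chunk hashes stored there

    def store(self, data: bytes) -> tuple[int, int]:
        """Store data and return (new_chunks, duplicate_chunks)."""
        hashes = chunk_hashes(data)

        # Use representative hashes to find containers likely to share content.
        candidates = {self.ram_index[h] for h in hashes
                      if is_representative(h, self.sample_bits) and h in self.ram_index}

        # Load the full hash lists of the candidate containers only.
        known = set()
        for cid in candidates:
            known |= self.containers[cid]

        new, dup = [], 0
        seen = set()
        for h in hashes:
            if h in known or h in seen:
                dup += 1
            else:
                new.append(h)
                seen.add(h)

        # New chunks go into a fresh container; only its samples enter RAM.
        if new:
            cid = len(self.containers)
            self.containers.append(set(new))
            for h in new:
                if is_representative(h, self.sample_bits):
                    self.ram_index.setdefault(h, cid)
        return len(new), dup
```

With sample_bits set to 0 every chunk hash is kept in RAM and all duplicates are found; larger values shrink the RAM index at the cost of possibly missing duplicates whose containers have no sampled hash in common with the incoming data.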