Reducing the amount of stored data is an effective way to lower the total cost of ownership of a distributed storage system. Deduplication, which removes redundant data, is one of the promising techniques for saving storage capacity. In practice, however, traditional deduplication methods designed for local storage systems are not well suited to distributed storage systems because of several challenges. First, the I/O overhead of additional data and metadata processing can significantly degrade performance, and the deduplication ratio is not high enough because data is distributed across multiple nodes. Second, because of the scale-out nature of such systems, it is difficult to design efficient metadata management for deduplicated data alongside the legacy metadata management. To address these challenges, in this talk we propose a global deduplication method with a multi-tiered storage design and a self-contained metadata structure. Tiering with a deduplication-aware replacement policy can improve deduplication efficiency by identifying the more important chunks, those with a high deduplication ratio. In addition, adopting a self-contained metadata structure provides compatibility with existing storage features such as recovery and snapshots. As a result, the proposed tiering-based global deduplication can reduce I/O traffic and save storage cost in a distributed storage system.
- Deduplication
- Tiering
- Distributed Storage System
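
As a rough illustration of the chunk-level deduplication the abstract refers to, the following minimal sketch stores each unique chunk once under its content fingerprint and keeps a per-object list of fingerprints. It is not the design proposed in the talk: the fixed-size chunking, the SHA-256 fingerprint, the in-memory dictionaries, and every name (`ChunkStore`, `CHUNK_SIZE`, `write`, `read`, `delete`) are assumptions made for this example, and tiering and distribution across nodes are left out entirely.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny, for illustration only (assumption)


class ChunkStore:
    """Hypothetical single-process chunk store with reference counting."""

    def __init__(self):
        self.chunks = {}    # fingerprint -> chunk bytes (stored once)
        self.refcount = {}  # fingerprint -> number of references
        self.objects = {}   # object name -> ordered list of fingerprints

    def write(self, name, data):
        # Overwriting an object first releases its old chunk references.
        if name in self.objects:
            self.delete(name)
        manifest = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:
                self.chunks[fp] = chunk  # new content: store it
            self.refcount[fp] = self.refcount.get(fp, 0) + 1
            manifest.append(fp)
        self.objects[name] = manifest

    def read(self, name):
        # Reassemble the object from its chunk fingerprints.
        return b"".join(self.chunks[fp] for fp in self.objects[name])

    def delete(self, name):
        # Drop references; reclaim chunks no object points to anymore.
        for fp in self.objects.pop(name):
            self.refcount[fp] -= 1
            if self.refcount[fp] == 0:
                del self.refcount[fp]
                del self.chunks[fp]
```

In this sketch, two objects that share a chunk consume space for that chunk only once, and the chunk is reclaimed only after the last object referencing it is deleted.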