Reducing Replication Bandwidth for Distributed Document-oriented Databases

Publish Date: Thursday, September 22, 2016
Abstract: 

With the rise of large-scale, Web-based applications, users are increasingly adopting a new class of document-oriented database management systems (DBMSs) that allow for rapid prototyping while also achieving scalable performance. As in other distributed storage systems, replication is important for document DBMSs in order to guarantee availability. The network bandwidth required to keep replicas synchronized is expensive and is often a performance bottleneck. As such, there is a strong need to reduce the replication bandwidth, especially for geo-replication scenarios where wide-area network (WAN) bandwidth is limited. This talk presents a deduplication system called sDedup that reduces the amount of data transferred over the network for replicated document DBMSs. sDedup uses similarity-based deduplication to remove redundancy in replication data by delta encoding against similar documents selected from the entire database. It exploits key characteristics of document-oriented workloads, including small item sizes, temporal locality, and the incremental nature of document edits. Our experimental evaluation of sDedup with real-world datasets shows that it is able to significantly outperform traditional chunk-based deduplication techniques in reducing data sent over the network while incurring negligible performance overhead.
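The idea of delta encoding a new document against a similar one already in the database can be illustrated with a small sketch. This is not the sDedup implementation. It assumes fixed-size character shingles hashed with SHA-1 as similarity features, an in-memory feature index, and Python's difflib as a stand-in for a real delta compressor. All names and parameters here (sketch, SimilarityIndex, encode_for_replication, k, shingle) are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def sketch(doc, k=8, shingle=8):
    """Return the k largest shingle hashes as a similarity sketch (assumed scheme)."""
    if len(doc) <= shingle:
        shingles = {doc}
    else:
        shingles = {doc[i:i + shingle] for i in range(len(doc) - shingle + 1)}
    hashes = sorted((hashlib.sha1(s.encode()).hexdigest() for s in shingles),
                    reverse=True)
    return frozenset(hashes[:k])


class SimilarityIndex:
    """Hypothetical in-memory index mapping sketch features to document ids."""

    def __init__(self):
        self.docs = {}      # doc_id -> content
        self.features = {}  # feature hash -> set of doc_ids

    def add(self, doc_id, content):
        self.docs[doc_id] = content
        for feature in sketch(content):
            self.features.setdefault(feature, set()).add(doc_id)

    def most_similar(self, content):
        """Return the id of the stored document sharing the most features, or None."""
        counts = {}
        for feature in sketch(content):
            for doc_id in self.features.get(feature, ()):
                counts[doc_id] = counts.get(doc_id, 0) + 1
        return max(counts, key=counts.get) if counts else None


def delta_encode(base, target):
    """Describe target as copies from base plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no op: the deleted range is simply never copied
    return ops


def delta_decode(base, ops):
    """Rebuild the target on the replica from the base document and the ops."""
    parts = []
    for op in ops:
        parts.append(base[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(parts)


def literal_bytes(ops):
    """Count the literal characters that must actually cross the network."""
    return sum(len(op[1]) for op in ops if op[0] == "insert")


def encode_for_replication(index, content):
    """Pick a similar base document and delta encode against it."""
    base_id = index.most_similar(content)
    base = index.docs[base_id] if base_id is not None else ""
    return base_id, delta_encode(base, content)
```

In this sketch the primary would send only the base document id and the ops, and the replica, which already holds the base, would call delta_decode to reconstruct the new document. When a document is an incremental edit of an earlier one, the literal inserts are typically much smaller than the full document.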

Learning Objectives

Replication in distributed databases
Techniques for network bandwidth reduction
Similarity detection in sDedup
sDedup system design