Abstract
HDFS is the storage backbone that supports Hadoop's map/reduce jobs. Designed to run on inexpensive commodity hardware, HDFS uses triple replication to achieve reliability and availability. We have applied the scalability, reliability, and efficiency benefits of information dispersal to produce a new implementation of Hadoop's FileSystem interface. Built on our existing object storage API, this implementation eliminates the overhead of replication while achieving superior fault tolerance. It also retains data-local computation, a primary benefit offered by Hadoop. This presentation describes the challenges encountered in creating this implementation and explains how each was overcome.
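For orientation, the work centers on a subclass of Hadoop's org.apache.hadoop.fs.FileSystem that talks to a dispersed object store rather than HDFS DataNodes. The skeleton below is a minimal sketch of what such a subclass looks like; the class name DispersedFileSystem, the host name, and the stub bodies are illustrative assumptions, not the implementation described in this presentation, and only the Hadoop API signatures are real. Overriding getFileBlockLocations is what preserves data-local computation: it tells the MapReduce scheduler which hosts can serve each range of a file.

```java
// Sketch only: DispersedFileSystem, the host names, and the stubbed bodies are
// illustrative placeholders, not the implementation described in this presentation.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

public class DispersedFileSystem extends FileSystem {
  private URI uri;
  private Path workingDir = new Path("/");

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf);
    this.uri = name;
    // Connect to the dispersed object store here (client setup omitted).
  }

  @Override
  public URI getUri() { return uri; }

  // Data-local computation: report, for each range of a file, the hosts that can
  // serve enough slices to rebuild it, so map tasks are scheduled near the data.
  @Override
  public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
      throws IOException {
    String[] hosts = { "storage-node-1.example.com" };  // placeholder lookup result
    return new BlockLocation[] { new BlockLocation(hosts, hosts, start, len) };
  }

  // The remaining FileSystem operations would translate into object-store and
  // namespace calls; they are stubbed here only to keep the sketch compilable.
  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    throw new UnsupportedOperationException("read object slices and rebuild data");
  }

  @Override
  public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite,
      int bufferSize, short replication, long blockSize, Progressable progress)
      throws IOException {
    throw new UnsupportedOperationException("encode data into slices and store them");
  }

  @Override
  public FSDataOutputStream append(Path f, int bufferSize, Progressable progress)
      throws IOException {
    throw new UnsupportedOperationException("append to an existing object");
  }

  @Override
  public boolean rename(Path src, Path dst) throws IOException {
    throw new UnsupportedOperationException("rename an entry in the namespace");
  }

  @Override
  public boolean delete(Path f, boolean recursive) throws IOException {
    throw new UnsupportedOperationException("remove objects and namespace entries");
  }

  @Override
  public FileStatus[] listStatus(Path f) throws IOException {
    throw new UnsupportedOperationException("list a directory in the namespace");
  }

  @Override
  public void setWorkingDirectory(Path dir) { this.workingDir = dir; }

  @Override
  public Path getWorkingDirectory() { return workingDir; }

  @Override
  public boolean mkdirs(Path f, FsPermission permission) throws IOException {
    throw new UnsupportedOperationException("create a directory in the namespace");
  }

  @Override
  public FileStatus getFileStatus(Path f) throws IOException {
    throw new UnsupportedOperationException("look up file metadata in the namespace");
  }
}
```

A class like this would be registered for its URI scheme in the job configuration (for example, setting fs.&lt;scheme&gt;.impl to the class name) so that paths using that scheme resolve to the dispersed store instead of HDFS.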
Learning Objectives
Design of HDFS and the special requirements of Hadoop
Basics of Information Dispersal and the namespace design (see the storage-overhead sketch after this list)
Problems overcome through our new design for Hadoop's FileSystem
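As background for the information-dispersal objective above, here is a quick worked comparison of storage overhead and fault tolerance. An information dispersal algorithm encodes each data object into a "width" of slices such that any "threshold" of them suffice to rebuild it; the width and threshold values below are illustrative assumptions, not the configuration of the presented system.

```java
public class DispersalOverhead {
  public static void main(String[] args) {
    // Illustrative dispersal parameters (not the presented system's configuration):
    // each object is encoded into "width" slices, any "threshold" of which are
    // sufficient to rebuild the original data.
    int width = 16;
    int threshold = 10;

    double dispersalExpansion = (double) width / threshold; // raw-storage multiplier
    int dispersalFaultTolerance = width - threshold;        // slice losses tolerated

    double replicationExpansion = 3.0;  // HDFS triple replication: three full copies
    int replicationFaultTolerance = 2;  // two replica losses tolerated

    System.out.printf("Dispersal (%d of %d): %.1fx storage, tolerates %d lost slices%n",
        threshold, width, dispersalExpansion, dispersalFaultTolerance);
    System.out.printf("Triple replication:  %.1fx storage, tolerates %d lost replicas%n",
        replicationExpansion, replicationFaultTolerance);
  }
}
```

Under these illustrative parameters, the dispersed layout stores roughly half the raw bytes of triple replication while tolerating three times as many simultaneous losses, which is the intuition behind the abstract's claim of eliminating replication overhead with superior fault tolerance.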