Abstract
Hadoop’s usage patterns, along with the underlying hardware technology and platform, are rapidly evolving. Further, cloud infrastructure (public and private) and the use of virtual machines are influencing Hadoop. This talk describes how HDFS is evolving to deal with this flux.
We start with HDFS architectural changes that take advantage of platform changes such as SSDs and virtual machines. We discuss the unique challenges of virtual machines and the need to move MapReduce temp storage into HDFS to avoid storage fragmentation.
Second, we focus on real-time and streaming use cases and the HDFS changes that enable them, such as moving from node locality to storage locality, caching layers, and structure-aware data serving.
Finally, we examine the trend toward on-demand and shared infrastructure, where HDFS changes are necessary to bring up and later freeze clusters in a cloud environment. How will Hadoop and OpenStack work together? While use cases such as spinning up development or test clusters are obvious, one needs to avoid resource fragmentation. We discuss the subtle storage problems and their solutions. Another interesting use case we cover is Hadoop as a service supplemented by valuable data from the Hadoop service provider. Here we contrast a couple of solutions and their trade-offs, including one that we deployed for a Hadoop service provider.