Abstract
Companies are generating more and more information, filling an ever-growing collection of storage devices. The information may concern customers, web sites, security, or company logistics. As it grows, so does the need for businesses to sift through it for insights that will lead to increased sales, better security, lower costs, and other benefits. The Hadoop system was developed to enable the transformation and analysis of vast amounts of structured and unstructured information. It does this by running a programming model called MapReduce across compute clusters that may consist of hundreds or even thousands of nodes. This presentation looks at Hadoop from a storage perspective. It describes the key aspects of Hadoop storage, the built-in Hadoop Distributed File System (HDFS), and other options for Hadoop storage that exist in the commercial, academic, and open source communities.
Learning Objectives
Understand the basics of Hadoop from both a compute and a storage perspective.
Understand how Hadoop uses storage, and how this use is strongly tied to its native filesystem, HDFS.
Understand what other storage options have been adapted to work with Hadoop and how they differ from HDFS.
Understand the key tradeoffs among storage options, including performance, reliability, efficiency, flexibility, and manageability.