BIG DATA

The Abstracts

Introduction to Analytics and Big Data - Hadoop
Rob Peglar
Download

This tutorial lays a foundation for the field of analytics and Big Data, with an emphasis on Hadoop. It presents an overview of current data analysis techniques, the emerging science around Big Data, and Hadoop itself. Storage techniques and file system design for the Hadoop Distributed File System (HDFS), along with implementation tradeoffs, are discussed in detail. The tutorial is a blend of non-technical and introductory-level technical material.
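As a minimal sketch of the HDFS storage model the tutorial covers (not material from the tutorial itself): files are split into fixed-size blocks, and each block is replicated across several DataNodes. The block size, replication factor, and node names below are illustrative assumptions; 128 MB blocks and 3 replicas are common defaults, and both are configurable.

    # Toy model of the HDFS design: a file is split into fixed-size blocks,
    # and each block is replicated on several DataNodes.
    import random

    BLOCK_SIZE = 128 * 1024 * 1024  # bytes per block (assumed default)
    REPLICATION = 3                 # replicas per block (assumed default)

    def place_blocks(file_size, datanodes):
        """Map each block index to the DataNodes holding a replica."""
        num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
        placement = {}
        for block in range(num_blocks):
            # Real HDFS placement is rack-aware; random choice is a simplification.
            placement[block] = random.sample(datanodes, min(REPLICATION, len(datanodes)))
        return placement

    if __name__ == "__main__":
        nodes = ["datanode-%d" % i for i in range(1, 7)]
        for block, replicas in place_blocks(500 * 1024 * 1024, nodes).items():
            print("block %d: %s" % (block, replicas))

Because each block lives on several nodes, the loss of any single DataNode leaves every block recoverable, which is the property underlying the implementation tradeoffs discussed above.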

Learning Objectives

  • Gain a further understanding of the field and science of data analytics
  • Comprehend the essential differences surrounding Big Data and why it represents a change in traditional IT thinking
  • Understand introductory-level technical detail around Hadoop and the Hadoop Distributed File System (HDFS)

Big Data Storage Options for Hadoop
Dr. Sam Fineberg
Download

Companies are generating more and more information today, filling an ever-growing collection of storage devices.  The information may be about customers, web sites, security, or company logistics.  As this information grows, so does the need for businesses to sift through it for insights that will lead to increased sales, better security, lower costs, etc.  The Hadoop system was developed to enable the transformation and analysis of vast amounts of structured and unstructured information.  It does this by implementing a programming model called MapReduce across compute clusters that may consist of hundreds or even thousands of nodes.  This presentation looks at Hadoop from a storage perspective.  It describes the key aspects of Hadoop storage, the built-in Hadoop file system (HDFS), and other options for Hadoop storage that exist in the commercial, academic, and open source communities.
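To make the MapReduce model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases that Hadoop distributes across a cluster. This is an illustration of the model only; a real Hadoop job would express the same logic through Hadoop's Java APIs or its streaming interface.

    # Word count, the canonical MapReduce example: map emits key/value
    # pairs, the framework shuffles them by key, and reduce aggregates
    # each key's values.
    from collections import defaultdict

    def map_phase(documents):
        for doc in documents:
            for word in doc.split():
                yield word.lower(), 1

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        return {word: sum(counts) for word, counts in groups.items()}

    if __name__ == "__main__":
        docs = ["big data needs big storage", "Hadoop stores big data"]
        print(reduce_phase(shuffle(map_phase(docs))))
        # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}

Because map and reduce are pure functions over key/value pairs, the framework can run them on whichever nodes hold the data, which is why Hadoop's compute model is so tightly tied to its storage layer.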

Learning Objectives

  • Understand the basics of Hadoop, from both a compute and a storage perspective.  Understand how Hadoop uses storage, and how this is strongly tied to its native file system, HDFS (a small access sketch follows this list).
  • Understand which other storage options have been adapted to work with Hadoop and how they differ from HDFS, in particular SAN storage and tightly coupled distributed file systems.
  • Understand the key tradeoffs between storage options, including performance, reliability, efficiency, flexibility, and manageability.
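By way of illustration, one common route into HDFS data without native client libraries is Hadoop's WebHDFS REST API. The sketch below lists an HDFS directory over HTTP; the NameNode address and path are placeholder assumptions, and error handling is omitted.

    # Listing an HDFS directory through the WebHDFS REST API.
    import json
    import urllib.request

    NAMENODE = "http://namenode.example.com:50070"  # assumed address; adjust to your cluster

    def list_status(path):
        """Return the file statuses for an HDFS directory via WebHDFS."""
        url = "%s/webhdfs/v1%s?op=LISTSTATUS" % (NAMENODE, path)
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        return payload["FileStatuses"]["FileStatus"]

    if __name__ == "__main__":
        for status in list_status("/user/hadoop"):
            print(status["type"], status["pathSuffix"], status["length"])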

Combining SNIA Cloud, Tape and Container Format Technologies for the Long Term Retention of Big Data
Roger Cummings
Simona Rabinovici-Cohen

Download

Generating and collecting very large data sets is becoming a necessity in many domains that also need to keep that data for long periods. Examples include astronomy, atmospheric science, genomics, medical records, photographic archives, video archives, and large-scale e-commerce. While this presents significant opportunities, a key challenge is providing economically scalable storage systems to efficiently store and preserve the data, as well as to enable search, access, and analytics on that data in the far future.

Both cloud and tape technologies are viable alternatives for the storage of big data, and SNIA supports their standardization. The SNIA Cloud Data Management Interface (CDMI) provides a standardized interface to create, retrieve, update, and delete objects in a cloud. The SNIA Linear Tape File System (LTFS) takes advantage of a new generation of tape hardware to provide efficient access to tape using standard, familiar system tools and interfaces. In addition, the SNIA Self-contained Information Retention Format (SIRF) defines a storage container for long term retention that will enable future applications to interpret stored data regardless of the application that originally produced it.
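As a concrete illustration of the CDMI interface described above, the following hedged sketch creates and then retrieves a data object over CDMI's RESTful HTTP interface. The endpoint URL and object path are placeholder assumptions; the headers and JSON body shape follow the SNIA CDMI specification for data objects.

    # Creating and retrieving a CDMI data object over HTTP.
    import json
    import urllib.request

    ENDPOINT = "https://cloud.example.com/cdmi"  # assumed CDMI endpoint
    HEADERS = {
        "X-CDMI-Specification-Version": "1.0.2",
        "Content-Type": "application/cdmi-object",
        "Accept": "application/cdmi-object",
    }

    def create_object(path, value, mimetype="text/plain"):
        body = json.dumps({"mimetype": mimetype, "value": value}).encode()
        req = urllib.request.Request(ENDPOINT + path, data=body,
                                     headers=HEADERS, method="PUT")
        with urllib.request.urlopen(req) as resp:
            return resp.status

    def read_object(path):
        req = urllib.request.Request(ENDPOINT + path, headers=HEADERS)
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["value"]

    if __name__ == "__main__":
        create_object("/archive/record1.txt", "data to preserve")
        print(read_object("/archive/record1.txt"))

Because the interface is a standard over plain HTTP and JSON, an application written against CDMI today remains decodable by future tooling, which is the premise behind combining it with SIRF for long term retention.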

This tutorial will present advantages and challenges in long term retention of big data, as well as initial work on how to combine SIRF with LTFS and SIRF with CDMI to address some of those challenges. SIRF with CDMI will also be examined in the European Union integrated research project ENSURE – Enabling kNowledge, Sustainability, Usability and Recovery for Economic value.

Learning Objectives

  • Recognize the challenges and value in the long-term preservation of big data, and the role of new cloud and tape technologies in addressing them
  • Identify the need for, use cases of, and proposed architecture of SIRF. Also, review the latest activities in the SNIA Long Term Retention (LTR) Technical Working Group to combine SIRF with LTFS and SIRF with CDMI for long term retention and mining of big data.
  • Discuss the usage of SIRF with CDMI in the ENSURE project that draws on actual commercial use cases from health care, clinical trials, and financial services.

Protecting Data in the "Big Data" World
Thomas Rivera
Download

Data is growing explosively, and these "Big Data" repositories need to be protected. In addition, new regulations are mandating longer data retention, making the job of protecting these ever-growing repositories even more daunting. This presentation will outline the challenges and the methods that can be used for protecting "Big Data" repositories.

Learning Objectives

  • Understand the unique challenges of managing and protecting "Big Data" repositories.
  • Understand the various technologies available for protecting "Big Data" repositories.
  • Understand the data protection considerations for "Big Data" repositories in various environments, including disaster recovery/replication, capacity optimization, etc.

Massively Scalable File Storage
Philippe Nicolas
Download

The Internet changed the world and continues to revolutionize how people connect, exchange data, and do business. This radical change is one cause of the rapid explosion in data volume, which has required a new approach to data storage design. One common element is that unstructured data now rules the IT world. How can the famous Internet services we all use every day absorb thousands of new users daily and continue to deliver an enterprise-class SLA? What technologies allow a cloud storage service to support hundreds of millions of users? This tutorial covers the technologies introduced by the famous papers on the Google File System, BigTable, Amazon Dynamo, and Apache Hadoop. Parallel, scale-out, distributed, and P2P approaches such as Lustre, PVFS, and pNFS, along with several proprietary systems, are presented as well. The tutorial also covers key features that are essential at large scale, to help attendees understand and differentiate industry vendors' offerings.
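As one concrete example of the techniques behind the Dynamo paper referenced above, the sketch below implements consistent hashing, the data-placement scheme that lets such systems grow by remapping only a small fraction of keys when nodes join or leave. The node names and parameters are illustrative assumptions, not drawn from the tutorial.

    # Consistent hashing: nodes and keys hash onto the same ring, and each
    # key is stored on the first node clockwise from its position.
    import bisect
    import hashlib

    def ring_hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes=64):
            # Virtual nodes smooth out the distribution across physical nodes.
            self.ring = sorted((ring_hash("%s#%d" % (n, i)), n)
                               for n in nodes for i in range(vnodes))
            self.points = [p for p, _ in self.ring]

        def node_for(self, key):
            idx = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
            return self.ring[idx][1]

    if __name__ == "__main__":
        ring = ConsistentHashRing(["storage-a", "storage-b", "storage-c"])
        for k in ["user:42", "photo:9", "order:7"]:
            print(k, "->", ring.node_for(k))

Adding a fourth node to the ring moves only the keys that now hash between the new node's points and their former owners, which is what makes this approach attractive at megascale.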

Learning Objectives

  • Understand the various technologies behind file storage at megascale
  • Anticipate the recent wave of distributed storage technologies whose designs are based on the Google and Amazon research papers
  • Receive key elements and arguments for selecting the right solution for various needs