Optimizing Sequence Alignment in Cloud Using Hadoop and MPP Database

Author(s)/Presenter(s):
Library Content Type:
Publish Date: Tuesday, September 18, 2012
Event Name: 
Abstract: 

In bioinformatics, a sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. This information can be used effectively for medical and biological research only if one can extract functional insight from it. To obtain that insight, three factors must be considered while aligning sequences: optimized querying of sequences, high-speed matching, and accuracy of alignment. The FAST-All (FASTA) program, which handles both proteins and nucleotides, considers all of these factors and follows a largely heuristic method, which contributes to its high speed of execution. The program first observes the pattern of word hits (word-to-word matches of a given length) and marks potential matches, rather than performing a more time-consuming, optimized search using a Smith-Waterman type of algorithm. This proposal targets an optimized approach to sequence alignment using the FASTA algorithm, which incorporates high-speed word-to-word matching. With data now growing by petabytes a day and processing requiring state-of-the-art technologies, the Greenplum Massively Parallel Processing (MPP) database and Hadoop are the emerging parallel technologies that form the backbone of this proposal. Combining the algorithm with the data and computational parallelism of a Hadoop grid, and with a massively parallel processing database for querying big datasets containing petabytes of sequences, improves the accuracy and speed of sequence alignment and optimizes querying from big datasets.
Bioinformatics labs and centers across the globe today upload enormous amounts of data and sequences to a central location for scientific analysis, and the transfer of such large datasets can be simplified with Cloud approaches. Cloud Computing is therefore a strong candidate as the end point for sequences and data gathered from sources such as medical research centers, scientists, and biomedical labs around the globe. The plan for the final “publicly consumable” form of the program is to make it web-based and run it on the Cloud.
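
The word-hit step described in the abstract can be illustrated with a minimal sketch. It is not the FASTA program itself: the function names, the word length `k`, and the toy sequences are assumptions made for illustration. It indexes every length-`k` word of the query, looks up each word of the subject, and counts hits per diagonal (subject position minus query position). A diagonal with many hits marks a potential match that a fuller method could then examine more carefully.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map each length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count word hits per diagonal, where diagonal = subject_pos - query_pos."""
    index = index_words(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        # Every query position holding the same word is a hit on some diagonal.
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query, subject, k=2):
    """Return (diagonal, hit_count) for the densest diagonal, or None if no hits."""
    hits = diagonal_hits(query, subject, k)
    if not hits:
        return None
    return max(hits.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical toy sequences: the query occurs in the subject at offset 2.
    print(best_diagonal("ACGTACGT", "TTACGTACGTTT", k=4))
```

A larger `k` gives fewer, more specific hits and runs faster; a smaller `k` is more sensitive but produces more chance matches. That speed-versus-sensitivity trade-off is what makes the approach heuristic.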

Learning Objectives

Learn about the problems and challenges that medical and biological research organisations face in bioinformatics sequence alignment.
Learn how Cloud Computing offers an attractive solution, providing massively scalable computational power along with green credentials.
Learn how the complexity of the FASTA algorithm is handled with Hadoop grids, which provide data and computational parallelism, and with an MPP database for querying big datasets containing large sequences, improving performance and optimizing querying.
Learn why a massively parallel processing database in the form of Greenplum, coupled with the computational power of Hadoop and an optimized FASTA algorithm, all built on a foundation of Cloud and virtualization, is proposed as “the next generation solution”.
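
The data parallelism mentioned in the objectives above can be sketched as a map step and a reduce step. This is plain Python that only imitates the pattern; it does not use the Hadoop or Greenplum APIs. The function names, the record layout, and the stand-in scoring function (a count of shared words, not a real alignment score) are all assumptions. Each data split is scored independently, which is the part a grid could run in parallel, and the reduce step keeps the best-scoring subject.

```python
def shared_words(a, b, k=3):
    """Stand-in score (assumption): distinct length-k words common to a and b."""
    def words(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(words(a) & words(b))


def map_partition(query, records):
    """Map step: score every (subject_id, sequence) record in one data split."""
    for subject_id, sequence in records:
        yield "query", (shared_words(query, sequence), subject_id)


def reduce_best(pairs):
    """Reduce step: keep the highest-scoring (score, subject_id) per key."""
    best = {}
    for key, value in pairs:
        if key not in best or value > best[key]:
            best[key] = value
    return best


def run(query, splits):
    """Run the map step over each split in turn, then reduce the emitted pairs."""
    emitted = []
    for split in splits:  # on a real grid, each split would go to its own worker
        emitted.extend(map_partition(query, split))
    return reduce_best(emitted)


if __name__ == "__main__":
    # Hypothetical splits of a sequence collection.
    splits = [[("s1", "TTTTTT"), ("s2", "ACGTTT")], [("s3", "ACGTAC")]]
    print(run("ACGTAC", splits))
```

Because no split depends on another, adding workers shortens the map phase without changing the result.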