Abstract
Illumina's CEO recently announced the availability of whole-genome sequencing for just under $1,000, and by 2020 whole-genome sequencing could cost about $200. Today, using these technologies, a typical research program can generate tens of terabytes to petabytes of data for a single study. Within ten years, a large genomic research program may need to analyze many petabytes to an exabyte of data.
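A rough calculation illustrates how cohort size drives these study-level volumes. The per-genome figure and cohort sizes below are illustrative assumptions for the sake of the arithmetic, not numbers taken from this abstract:

```python
# Illustrative per-study data volume estimate.
# Assumption: ~100 GB of raw plus aligned data per whole genome.
GB_PER_GENOME = 100

for cohort in (500, 10_000, 1_000_000):
    total_tb = cohort * GB_PER_GENOME / 1_000   # GB -> TB
    total_pb = total_tb / 1_000                 # TB -> PB
    print(f"{cohort:>9,} genomes -> {total_tb:>10,.0f} TB ({total_pb:,.1f} PB)")
```

Under these assumptions, a 500-genome study already produces tens of terabytes, a 10,000-genome study about a petabyte, and a million-genome program on the order of 100 petabytes.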
Adding a patient's genomic data to the Electronic Health Record (EHR) will increase the per-patient dataset size from at most a few gigabytes today to several terabytes. Consequently, in a mid-size to large hospital, storage capacity, and the associated computing power and network infrastructure performance, will need to increase by at least three orders of magnitude. Because of patient privacy, regulatory requirements, and cybersecurity concerns, healthcare institutions such as major hospitals are very reluctant to use public cloud computing; at the same time, private cloud technology is not well suited to distributed research collaboration or large-scale interoperability across many organizations.
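A back-of-envelope sketch makes the three-orders-of-magnitude claim concrete. The per-patient sizes and patient count below are illustrative assumptions, not measured figures:

```python
# Hospital-wide storage scale-up, under assumed illustrative figures.
EHR_TODAY_GB = 3            # assumed: typical per-patient EHR today
EHR_WITH_GENOME_GB = 3_000  # assumed: EHR plus whole-genome data (~3 TB)
PATIENTS = 500_000          # assumed: mid-size hospital patient population

today_pb = PATIENTS * EHR_TODAY_GB / 1e6        # GB -> PB
future_pb = PATIENTS * EHR_WITH_GENOME_GB / 1e6

print(f"Storage today:   {today_pb:,.1f} PB")                 # ~1.5 PB
print(f"With genomics:   {future_pb:,.0f} PB")                # ~1,500 PB (1.5 EB)
print(f"Scale-up factor: {future_pb / today_pb:,.0f}x")       # ~1,000x
```

Moving each patient record from gigabytes to terabytes multiplies total storage by roughly 1,000, i.e., three orders of magnitude, pushing a single mid-size hospital toward the exabyte class.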
The current computing infrastructure of most life-sciences research centers and healthcare organizations/hospitals has not been architected or designed to handle Big Data analytics at this scale, which requires managing datasets in the many-petabyte to exabyte class, especially when requirements for research collaboration across many organizations must also be addressed.
Learning Objectives
Review current technologies and common systems architectures used for Big Data analytics in the health sciences versus other industries.
Discuss issues, challenges, and potential solutions for real-time and archival data storage management.
Review data integrity, privacy, and cybersecurity concerns of major healthcare and research centers.
Present a scalable open-source computing platform for managing exabyte-class datasets.