DNA-Based Storage of RDF Graph Data: A Futuristic Approach to Data Analytics | SNIA

Abstract

Future data analytics will require enormous storage capacity to support data-driven decisions, which calls for alternative storage media for massive data archives. Because existing media have limitations, new storage solutions are always in demand. Deoxyribonucleic acid (DNA) is an emerging storage medium well suited to archiving rapidly growing digital volumes. Owing to its longevity, DNA storage technology has led to numerous applications, particularly in the network biology and medicine domains, that store and retrieve data in its entirety. DNA synthesis and sequencing costs can be reduced by compressing the entire data object before storing it.

Previous work has stored compressed data objects in DNA to reduce operational costs. In one study, for example, two images totaling 10,894 bytes were compressed to 3,633 bytes, and a total of $2,540 was spent on testing and synthesis. Although a movie, an image, or a book can be retrieved as a whole from DNA storage when needed, the same approach does not carry over to complex graph data. Compressed storage does help both synthesis and sequencing operations handle data cost-efficiently. However, when only partial information is required, sequencing and decoding the complete archive of a complex graph is unnecessary, expensive, and impractical. Once future query demands are considered, none of the existing proposed DNA storage models is appropriate. Prior work has not used DNA storage to retrieve partial information from complex graphs in a way that supports advanced data analytics cost-effectively.

This paper presents a DNA-based query processing system that retrieves partial information efficiently from RDF graph data by sequencing only a subset of the DNA strands rather than all of them. Specifically, we use binary search to fetch and decode significantly fewer DNA strands when running SPARQL queries on RDF graph data. Our experimental analysis (based on two datasets comprising eight graphs) shows that, on average, each query retrieves less than 1% of the data for RDF graphs larger than 1 megabyte. Sequencing costs are therefore significantly lower than when all the data is retrieved from a DNA library. However, the additional sequencing runs and index structures adversely affect sequencing time and one-time synthesis costs.
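
The idea of using binary search to limit how many strands are read can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not describe the index layout or the strand encoding, so the code assumes that triples are sorted by subject and packed into fixed-size blocks, with one block per addressable strand. The names `StrandPool`, `build_pool`, and `lookup_subject` are hypothetical, and a call to `read` stands in for one selective sequencing and decoding step.

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class StrandPool:
    """Hypothetical stand-in for a DNA library.

    Each "strand" is modelled as a sorted block of RDF triples, and
    `sequenced` records which strand addresses have been read.
    """
    strands: list[list[Triple]]
    sequenced: set[int] = field(default_factory=set)

    def read(self, address: int) -> list[Triple]:
        # Stands in for selectively sequencing and decoding one strand.
        self.sequenced.add(address)
        return self.strands[address]


def build_pool(triples: list[Triple], block_size: int) -> StrandPool:
    """Sort triples by subject and pack them into fixed-size blocks (assumed layout)."""
    ordered = sorted(triples)
    blocks = [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
    return StrandPool(blocks)


def lookup_subject(pool: StrandPool, subject: str) -> list[Triple]:
    """Return all triples with the given subject, reading as few strands as possible."""
    lo, hi = 0, len(pool.strands) - 1
    # Binary search for the first strand whose last subject is >= the target.
    while lo < hi:
        mid = (lo + hi) // 2
        if pool.read(mid)[-1][0] < subject:
            lo = mid + 1
        else:
            hi = mid
    # Scan forward only while the subject may continue into the next strand.
    matches: list[Triple] = []
    address = lo
    while address < len(pool.strands):
        block = pool.read(address)
        matches.extend(t for t in block if t[0] == subject)
        if block[-1][0] != subject:
            break
        address += 1
    return matches


if __name__ == "__main__":
    data = [(f"s{i:04d}", "p", f"o{j}") for i in range(1000) for j in range(2)]
    pool = build_pool(data, block_size=4)
    result = lookup_subject(pool, "s0500")
    print(result)
    print(f"strands read: {len(pool.sequenced)} of {len(pool.strands)}")
```

In this toy model the number of strands read grows roughly with the logarithm of the pool size, which is the intuition behind retrieving only a small fraction of the archive per query. A real system would also have to account for the cost of each sequencing round, which the sketch ignores.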