Collaborative Research: Supporting Efficient Discrete Box Queries for Sequence Analysis on Large Scale Genome Databases
Sponsored by the National Science Foundation
Principal Investigator at UM-D
Dr. Qiang Zhu
Department of Computer and Information Science,
The University of Michigan, Dearborn, MI, USA
qzhu@umich.edu
(NSF Grant for UM-D: IIS-1320078)
Principal Investigator at MSU
Dr. Sakti Pramanik
Department of Computer Science and Engineering,
Michigan State University, East Lansing, MI, USA
pramanik@msu.edu
(NSF Grant for MSU: IIS-1319909)
Co-Principal Investigator at MSU
Dr. C. Titus Brown
Department of Computer Science and Engineering,
Michigan State University, East Lansing, MI, USA
ctb@msu.edu
Senior Personel
Dr. James R. Cole
Department of Plant, Soil and Mirobial Sciences,
Michigan State University, East Lansing, MI, USA
colej@msu.edu
Graduate Students
Xianying Liu
Yarong Gu
Sibi Vinayak Muthu Seenuthurai
AKM Tauhidul Islam
Xinge Ji
Undergraduate Students
Ramblin Cherniak
Jason Russell
Yangyue Wan
Dong-Yoon Choi
Project Overview
This collaborative research project, conducted jointly by the investigators from the Michigan State University (MSU) and the University of Michigan at Dearborn (UM-D), investigates the issues and techniques for storing and searching/querying large scale k-mer data sets for sequence analysis in bioinformatics. Efficient k-mer indexing, storage and retrieval are vital to sequence analysis tasks like error correction as sequencing data set sizes increase vastly. Most existing methods for storing and searching k-mers are optimized for exact or range queries. However, this reliance limits the types of sequence analysis that can be done efficiently. Moreover, most existing methods for storing k-mers do not support efficient storage of k-mers at multiple word lengths. For many sequence analysis problems, searches with multiple word lengths enable better sensitivity and specificity. In this project, various techniques for efficiently supporting so-called (discrete) box queries and other related queries (e.g., hybrid queries) on large scale k-mer data sets for sequence analysis are investigated. In particular, a new index tree, named the BoND-tree, specially designed for a non-ordered discrete data space characterized by k-mer data sets is developed. The unique properties of the space are exploited to develop new node splitting heuristics for the index tree, and theoretical analysis is performed to show the optimality of the proposed heuristics. Besides the BoND-tree, which is based on data partitioning, space-partitioning based index schemes for box quieres in such a space are also developed. To support a more flexible type of query (i.e., hybrid box and range queries), hybrid index schemes integrating strengths of both box query indexes and range query indexes are studied. To facilitate an efficient index construction for large scale k-mer data sets, bulk loading techniques are also developed for the proposed index trees. In addition, the approaches to optimizing box queries in solving sequence analysis problems like the error correction are examined. The storage structure and adoption of box queries for supporting searches with multiple word lengths on k-mer data sets are also explored. The research in the project will result in the discovery of fundamental properties of the data space for sequence data in bioinformatics, the development of a number of novel storage, indexing and retrieval techniques exploiting the properties of such a data space, and the applications of the proposed techniques for solving important problems in sequence analysis. These results will advance the state of knowledge for storage, indexing and retrieval techniques for genome sequence databases. They are expected to significantly impact current practice in bioinformatics by making available new efficient on-disk solutions for sequence analysis. They will also impact a number of other popular application areas including biometrics, image processing, social network, and E-commerce, where processing non-ordered discrete multidimentional data is crucial.
Project Publications
-
X. Liu, Qiang Zhu, S. Pramanik, C. T. Brown and G. Qian, "VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses", IEEE Transactions on Knowlegdge and Data Engineering (TKDE), 2019 (to appear).
-
Y. Gu, Qiang Zhu, X. Liu, Y. Dong, C. T. Brown and S. Pramanik, "Using Disk Based Index and Box Queries for Genome Sequencing Error Correction", Proc. of the 6th International Conference on Bioinformatics and Computational Biology (BICoB'16), pp. 69 - 76, Las Vegas, Negada, April 2016.
-
AKM Tauhidul Islam, Sakti Pramanik, Xinge Ji, James R. Cole, and Qiang Zhu, "Back Translated Peptide k-mer Search and Local Alignment in Large DNA Sequence Databases Using BoND-SD-tree Indexing", Proc. of the 15th IEEE International Conference on Bioinformatics and BioEngineering (BIBE'15), pp. 1 - 6, Belgrade, Serbia, Nov. 2015.
-
Yarong Gu, Xinying Liu, Qiang Zhu, Youchao Dong, C. Titus Brown and Sakti Pramanik, "A New Method for DNA Sequencing Error Verification and Correction via an On-Disk Index Tree",
Proc. of the 6th ACM International Conference on Bioinformatics, Computational Biology, and Biomedical Informatics (ACM-BCB 2015), pp. 503 - 504, Atlanta, GA, September 2015.
-
Alok Watve, Sakti Pramanik, Salman Shahid, Chad R. Meiners, and Alex Liu, "Topological Transformation Approaches to Database Query Processing", IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 27, No. 5, pp. 1438-1451, 2015.
References
-
Hyun-Jeong Seok, Gang Qian, Qiang Zhu, Alexander Oswald and Sakti Pramanik,
"Bulk-Loading the ND-Tree in Non-ordered Discrete Data Spaces", Proc. of 13th International Conference on Database Systems for Advanced Applications (DASFAA'08), pp 156-171, New Delhi, India, March 19 - 22, 2008.
-
Qiang Xue, James Cole and Sakti Pramanik, "Sequence Homology Search Based on Database Indexing Using the Profile Hidden Markov Model", Proc. of IEEE International Conference on Bioinformatics and Bioengineering (BIBE'06), pp 135-140, Washington, DC, Oct. 2006.
Project Web Sites
Web site
at MSU
Web site
at UM-D