Collaborative Research: Supporting Efficient Discrete Box Queries for Sequence Analysis on Large Scale Genome Databases

Sponsored by the National Science Foundation

Principal Investigator at UM-D

Dr. Qiang Zhu
Department of Computer and Information Science,
The University of Michigan, Dearborn, MI, USA
qzhu@umich.edu
(NSF Grant for UM-D: IIS-1320078)

Principal Investigator at MSU

Dr. Sakti Pramanik
Department of Computer Science and Engineering,
Michigan State University, East Lansing, MI, USA
pramanik@msu.edu
(NSF Grant for MSU: IIS-1319909)

Co-Principal Investigator at MSU

Dr. C. Titus Brown
Department of Computer Science and Engineering,
Michigan State University, East Lansing, MI, USA
ctb@msu.edu

Senior Personnel

Dr. James R. Cole
Department of Plant, Soil and Microbial Sciences,
Michigan State University, East Lansing, MI, USA
colej@msu.edu

Graduate Students

Xianying Liu
Yarong Gu
Sibi Vinayak Muthu Seenuthurai
AKM Tauhidul Islam
Xinge Ji

Undergraduate Students

Ramblin Cherniak
Jason Russell
Yangyue Wan
Dong-Yoon Choi

Project Overview

This collaborative research project, conducted jointly by investigators from Michigan State University (MSU) and the University of Michigan at Dearborn (UM-D), investigates issues and techniques for storing and searching/querying large-scale k-mer data sets for sequence analysis in bioinformatics. Efficient k-mer indexing, storage, and retrieval are vital to sequence analysis tasks such as error correction, especially as sequencing data sets grow rapidly in size. Most existing methods for storing and searching k-mers are optimized for exact or range queries, which limits the types of sequence analysis that can be done efficiently. Moreover, most existing methods do not support efficient storage of k-mers at multiple word lengths, even though searches with multiple word lengths enable better sensitivity and specificity for many sequence analysis problems. In this project, various techniques for efficiently supporting so-called (discrete) box queries and other related queries (e.g., hybrid queries) on large-scale k-mer data sets for sequence analysis are investigated. In particular, a new index tree, named the BoND-tree, is developed specifically for the non-ordered discrete data space that characterizes k-mer data sets. The unique properties of this space are exploited to develop new node splitting heuristics for the index tree, and theoretical analysis is performed to show the optimality of the proposed heuristics. Besides the BoND-tree, which is based on data partitioning, space-partitioning-based index schemes for box queries in such a space are also developed. To support a more flexible type of query (i.e., hybrid box and range queries), hybrid index schemes integrating the strengths of both box query indexes and range query indexes are studied. To facilitate efficient index construction for large-scale k-mer data sets, bulk loading techniques are also developed for the proposed index trees.
In addition, approaches to optimizing box queries for solving sequence analysis problems such as error correction are examined. The storage structure and the adoption of box queries for supporting searches with multiple word lengths on k-mer data sets are also explored. The research in the project will result in the discovery of fundamental properties of the data space for sequence data in bioinformatics, the development of a number of novel storage, indexing, and retrieval techniques exploiting the properties of such a data space, and the application of the proposed techniques to important problems in sequence analysis. These results will advance the state of knowledge for storage, indexing, and retrieval techniques for genome sequence databases. They are expected to significantly impact current practice in bioinformatics by making available new efficient on-disk solutions for sequence analysis. They will also impact a number of other popular application areas, including biometrics, image processing, social networks, and e-commerce, where processing non-ordered discrete multidimensional data is crucial.
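
To make the notion of a box query on k-mer data concrete, the following is a minimal illustrative sketch, not the project's BoND-tree or any of its on-disk structures. It assumes that a box query specifies, for each position of a k-mer, a set of allowed letters, and that a k-mer matches when every letter falls in the set for its position. The query is answered here by a plain linear scan over an in-memory set; all function names (kmers, matches_box, box_query, substitution_box) are hypothetical and chosen only for this example.

```python
ALPHABET = frozenset("ACGT")  # assumed DNA alphabet for this sketch


def kmers(sequence, k):
    """Yield every length-k substring (k-mer) of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def matches_box(kmer, box):
    """Return True if each letter lies in the allowed set for its position."""
    return len(kmer) == len(box) and all(
        letter in allowed for letter, allowed in zip(kmer, box)
    )


def box_query(kmer_set, box):
    """Answer a box query by scanning every stored k-mer (no index is used)."""
    return sorted(km for km in kmer_set if matches_box(km, box))


def substitution_box(kmer, position):
    """Build a box that fixes every position except one, which may be any letter."""
    return [
        ALPHABET if i == position else frozenset(letter)
        for i, letter in enumerate(kmer)
    ]


# Hypothetical toy data: two short reads that differ in one base.
reads = ["ACGTAC", "ACGAAC"]
stored = {km for read in reads for km in kmers(read, 4)}

# Find stored 4-mers equal to "ACGT" except possibly at the last position.
result = box_query(stored, substitution_box("ACGT", 3))
print(result)
```

A box of this shape retrieves all stored k-mers that differ from a given k-mer by at most one substitution at a chosen position, which is one plausible way such queries could serve error correction. An index tree such as the one described above would aim to answer the same query without scanning every k-mer.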

Project Publications

X. Liu, Qiang Zhu, S. Pramanik, C. T. Brown and G. Qian, "VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses", IEEE Transactions on Knowledge and Data Engineering (TKDE), 2019 (to appear).

Y. Gu, Qiang Zhu, and S. Pramanik, "An Online Approach for DNA Sequencing Error Correction via Disk Based Index", Proc. of the 27th International Conference on Software Engineering and Data Engineering (SEDE'18), pp. 31 - 38, New Orleans, October 2018.

R. Cherniak, Qiang Zhu, Y. Gu and S. Pramanik, "Exploring Deletion Strategies for the BoND-Tree in Multidimensional Non-ordered Discrete Data Spaces", Proc. of the 21st International Database Applications & Engineering Symposium (IDEAS'17), pp. 153 - 160, Bristol, UK, July 2017.

D.-Y. Choi, AKM T. Islam, S. Pramanik and Qiang Zhu, "A Bulk-Loading Algorithm for the BoND-tree Index Scheme for Non-ordered Discrete Data Spaces", Proc. of 25th International Conference on Software Engineering and Data Engineering (SEDE'16), pp. 123-128, Denver, September 2016.

Y. Gu, Qiang Zhu, X. Liu, Y. Dong, C. T. Brown and S. Pramanik, "Using Disk Based Index and Box Queries for Genome Sequencing Error Correction", Proc. of the 6th International Conference on Bioinformatics and Computational Biology (BICoB'16), pp. 69 - 76, Las Vegas, Nevada, April 2016.

AKM Tauhidul Islam, Sakti Pramanik, Xinge Ji, James R. Cole, and Qiang Zhu, "Back Translated Peptide k-mer Search and Local Alignment in Large DNA Sequence Databases Using BoND-SD-tree Indexing", Proc. of the 15th IEEE International Conference on Bioinformatics and BioEngineering (BIBE'15), pp. 1 - 6, Belgrade, Serbia, Nov. 2015.

Yarong Gu, Xianying Liu, Qiang Zhu, Youchao Dong, C. Titus Brown and Sakti Pramanik, "A New Method for DNA Sequencing Error Verification and Correction via an On-Disk Index Tree", Proc. of the 6th ACM International Conference on Bioinformatics, Computational Biology, and Biomedical Informatics (ACM-BCB 2015), pp. 503 - 504, Atlanta, GA, September 2015.

Alok Watve, Sakti Pramanik, Salman Shahid, Chad R. Meiners, and Alex Liu, "Topological Transformation Approaches to Database Query Processing", IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 27, No. 5, pp. 1438-1451, 2015.

Dashiell Kolbe, Qiang Zhu and Sakti Pramanik, "k-Nearest Neighbor Searching in Hybrid Spaces", Information Systems (IS), Vol. 43, pp. 55-64, 2014.

Gang Qian, Qiang Zhu, and Sakti Pramanik, "A Performance-Guaranteed Approximate Range Query Algorithm for the ND-Tree", Proc. of the 23rd International Conference on Software Engineering and Data Engineering (SEDE'14), pp. 97-104, New Orleans, October 2014.

Changqing Chen, Alok Watve, Sakti Pramanik and Qiang Zhu, "The BoND-tree: An Efficient Indexing Method for Box Queries in Non-ordered Discrete Data Spaces", IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 25, No. 11, pp. 2629-2643, 2013.

References

Dashiell Kolbe, Qiang Zhu and Sakti Pramanik, "Efficient k-Nearest Neighbor Searching in Non-ordered Discrete Data Spaces", ACM Transactions on Information Systems (TOIS), Vol. 28, No. 2, pp. 7:1 - 7:33, 2010.

Dashiell Kolbe, Qiang Zhu and Sakti Pramanik, "Reducing Non-Determinism of k-NN Searching in Non-Ordered Discrete Data Spaces", Information Processing Letters (IPL), Vol. 110, No. 10, pp 420-423, 2010.

Changqing Chen, Sakti Pramanik, Qiang Zhu, Alok Watve and Gang Qian, "The C-ND Tree: A Multidimensional Index for Hybrid Continuous and Non-ordered Discrete Data Spaces", Proc. of the 12th International Conference on Extending Database Technology (EDBT'09), pp 462-471, St. Petersburg, Russia, March 2009.

Hyun-Jeong Seok, Qiang Zhu, Gang Qian, Sakti Pramanik and Wen-Chi Hou, "Deletion Techniques for the ND-tree in Non-ordered Discrete Data Spaces", Proc. of the 18th International Conference on Software Engineering and Data Engineering (SEDE'09), pp 1-6, Las Vegas, Nevada, June 2009.

Changqing Chen, Sakti Pramanik, Qiang Zhu and Gang Qian, "A Study of Indexing Strategies for Hybrid Data Spaces", Proc. of the 11th International Conference on Enterprise Information Systems (ICEIS'09), pp 149-159, Milan, Italy, May 2009.

Gang Qian, Hyun-Jeong Seok, Qiang Zhu and Sakti Pramanik, "Space-Partitioning-Based Bulk-Loading for the NSP-Tree in Non-ordered Discrete Data Spaces", Proc. of the 19th International Conference on Database and Expert Systems Applications (DEXA'08), pp 404-418, Turin, Italy, Sept. 2008.

Hyun-Jeong Seok, Gang Qian, Qiang Zhu, Alexander Oswald and Sakti Pramanik, "Bulk-Loading the ND-Tree in Non-ordered Discrete Data Spaces", Proc. of 13th International Conference on Database Systems for Advanced Applications (DASFAA'08), pp 156-171, New Delhi, India, March 19 - 22, 2008.

Qiang Zhu, Brahim Medjahed, Anshuman Sharma and Henry Huang, "The Collective Index: A Technique for Efficient Processing of Progressive Queries", The Computer Journal, Vol. 51, No. 6, pp 662-676, 2008.

Dashiell Kolbe, Qiang Zhu and Sakti Pramanik, "On k-Nearest Neighbor Searching in Non-Ordered Discrete Data Spaces", Proc. of the 23rd IEEE International Conference on Data Engineering (ICDE'2007), pp 426-435, Turkey, April 2007.

Gang Qian, Qiang Zhu, Qiang Xue and Sakti Pramanik, "Dynamic Indexing for Multidimensional Non-ordered Discrete Data Spaces Using a Data-Partitioning Approach", ACM Transactions on Database Systems (TODS), Vol. 31, No. 2, pp 439-484, 2006.

Gang Qian, Qiang Zhu, Qiang Xue and Sakti Pramanik, "A Space-Partitioning-Based Indexing Method for Multidimensional Non-ordered Discrete Data Spaces", ACM Transactions on Information Systems (TOIS), Vol. 23, No. 1, pp 79-110, 2006.

Qiang Xue, James Cole and Sakti Pramanik, "Sequence Homology Search Based on Database Indexing Using the Profile Hidden Markov Model", Proc. of IEEE International Conference on Bioinformatics and Bioengineering (BIBE'06), pp 135-140, Washington, DC, Oct. 2006.

Gang Qian, Qiang Zhu, Qiang Xue and Sakti Pramanik, "The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces", Proc. of 29th International Conference on Very Large Data Bases (VLDB'03), pp 620-631, Berlin, Germany, Sept. 9 - 12, 2003.

Project Web Sites

Web site at MSU

Web site at UM-D