My research deals with high performance big data mining and its applications in materials science, healthcare, social media, bioinformatics, etc., with most of the applications research done in collaboration with researchers in the respective fields.
Our ability to collect huge amounts of data (popularly known as big data) in practically all fields has greatly surpassed our analytical capability to make sense of it, underscoring the emergence and popularity of the fourth paradigm of science, which is data-driven science and discovery. The challenge in big data mining lies not only in the size and scale of the data, but also in its complexity: high-dimensional, multi-scale, spatio-temporal, and other types of complex data are becoming more commonplace. Further, different application domains introduce their own challenges and constraints. My research on high performance data mining aims at a coherent integration of high performance computing (HPC) and data mining, so as to address these challenges and enable large-scale data-guided discovery in various application domains.
High performance data mining
Collaborators: Prof. Alok Choudhary, Prof. Wei-keng Liao, Dr. Mostofa Ali Patwary
We envision a library of highly optimized data mining algorithms that can handle big data on tens of thousands of processors with good scalability. As an example, we recently redesigned two density-based clustering algorithms (DBSCAN and OPTICS), as well as hierarchical clustering, using graph algorithmic techniques; this enabled their otherwise difficult parallelization, and we demonstrated scalability on the order of ten thousand processors. We plan to scale other data mining algorithms similarly and demonstrate them on real-world applications. We also plan to leverage emerging architectures and memory and storage technologies to make architecture-aware optimizations that further enhance performance.
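To illustrate how a graph view can make density-based clustering easier to parallelize, here is a minimal serial sketch of DBSCAN expressed with a disjoint-set (union-find) structure. It is an illustration only, not the released implementation: the choice of union-find, the function name `dbscan_union_find`, the brute-force neighbor search, and the toy data are all assumptions made for this example. The point is that merging core points is a set of independent union operations, which can in principle be performed in any order.

```python
from math import dist


def find(parent, i):
    """Return the root of i, compressing the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root index wins."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def dbscan_union_find(points, eps, min_pts):
    """Hypothetical helper: label each point with a cluster id, or -1 for noise."""
    n = len(points)
    # Brute-force eps-neighborhoods (each point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]

    # Core points within eps of each other belong to the same cluster.
    parent = list(range(n))
    for i in range(n):
        if core[i]:
            for j in neighbors[i]:
                if core[j]:
                    union(parent, i, j)

    labels = [-1] * n
    for i in range(n):
        if core[i]:
            labels[i] = find(parent, i)
        else:
            # Border point: adopt the cluster of the first core neighbor, if any.
            for j in neighbors[i]:
                if core[j]:
                    labels[i] = find(parent, j)
                    break
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan_union_find(points, eps=1.5, min_pts=3)
```

Here each cluster id is the smallest point index in the cluster, and the isolated point at (50, 50) is labeled as noise. A scalable version would replace the quadratic neighbor search with a spatial index and distribute the union operations across processors.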
Materials science
Collaborators: Prof. Marc De Graef, Prof. Christopher Wolverton, Prof. Surya Kalidindi, Prof. Veera Sundararaghavan, Prof. Gregory B. Olson, Prof. Peter W. Voorhees
The overarching goal here is to better understand processing-structure-property-performance (PSPP) linkages, and to develop analytical tools and methods that enable automated discovery of materials with desired target properties. For example, we recently developed a data-driven framework for ultra-high-throughput discovery of stable materials using predictive modeling. We plan to attempt similar discoveries for numerous other materials properties, such as hardness and band gaps, and also to develop methodologies for mining microstructure data to facilitate process design. We have developed and released several software tools for materials property prediction.
Healthcare
Collaborators: Dr. Jai Raman, Dr. David Baker, Dr. Karl Bilimoria, Dr. Mark Russo, Dr. Margaret Danilovich
The goal here is to develop methods to effectively analyze the huge amounts of clinical and biomedical data. Ongoing projects include developing predictive models for outcomes of lung cancer, colorectal cancer surgery, and lung transplant. We plan to conduct large-scale, high-dimensional, data-driven analytics on heterogeneous healthcare-related data, such as electronic health records, genomics/proteomics data, and professional biomedical literature, and to incorporate the resulting insights into healthcare practice with the help of our collaborators. We have developed and released several software tools for healthcare outcome prediction.
Social media analytics
Collaborator: Prof. Alok Choudhary
There is an increasing need to uncover the wealth of information hidden in the huge amounts of publicly available text on social media websites, forums, and blogs, and in research publications, reports, and so on. Over the last few years, we have done significant research on sentiment analysis, web-text clustering, recommendation systems, behavioral targeting, and other related problems. We plan to build on our previous work by improving accuracy and scaling our approaches to larger and real-time data. Further, we would like to apply these techniques in specific domains to discover interesting insights.
Bioinformatics
Collaborators: Prof. Xiaoqiu Huang, Dr. Sanchit Misra
Sequence-structure-function relationships form the basis of almost everything in bioinformatics, but they are far from well understood. Sequence data is the most widely available and ever-increasing form of data in bioinformatics. I have conducted significant research in applying high performance data mining techniques to sequence data, and released several serial and parallel codes for pairwise statistical significance estimation for biological sequence alignment, which is used to identify related sequences. We have also released a fast sequence mapping tool for long read mapping, and are working on developing context-sensitive transcriptional networks from heterogeneous big data.
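As a rough illustration of what estimating the statistical significance of a pairwise alignment involves, the sketch below scores a local alignment and compares it against scores obtained after shuffling one of the sequences. It is a simplified stand-in rather than the released codes: it uses a plain empirical (permutation) p-value instead of fitting a score distribution, and the scoring values, function names, and example sequences are assumptions chosen for this example.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def empirical_p_value(query, subject, n_shuffles=200, seed=0):
    """Hypothetical helper: fraction of shuffled subjects scoring >= the observed score."""
    rng = random.Random(seed)
    observed = local_alignment_score(query, subject)
    letters = list(subject)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        if local_alignment_score(query, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_shuffles + 1)


observed, p_value = empirical_p_value("ACGTACGTAC", "ACGTACGTAC")
```

Each shuffled alignment is independent of the others, so this kind of estimation is a natural candidate for parallel execution; a small p-value suggests the two sequences are more similar than expected by chance for their composition.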
I am very interested in applying high performance big data mining approaches to challenging research problems in other application domains, such as climate science, astrophysics, and cyber-physical systems.