Following are the software tools that I have (co-)developed. Unless otherwise stated, all tools are provided under the CC-BY-NC license. Users are welcome to negotiate other license terms with us if needed. If you use any of these tools for academic purposes, please cite the related publications.
Parallel Data Clustering Algorithms
Clustering is a data mining technique that groups data into meaningful subclasses, known as clusters, so as to minimize intra-cluster differences and maximize inter-cluster differences. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Even though DBSCAN is well known and popular for its capability of discovering arbitrarily shaped clusters and eliminating noise data, parallelizing DBSCAN is challenging as it exhibits an inherently sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy, which can easily cause an unbalanced workload and hence result in low parallel efficiency. We present a new parallel DBSCAN algorithm (PDSDBSCAN) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of DBSCAN. In addition, we use a tree-based bottom-up approach to construct the clusters, which yields a better-balanced workload distribution. We implement the algorithm for both shared and distributed memory. Using data sets containing up to several hundred million high-dimensional points, we show that PDSDBSCAN significantly outperforms the master-slave approach, achieving speedups of up to 25.97 using 40 cores on a shared-memory architecture, and up to 5,765 using 8,192 cores on a distributed-memory architecture.
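To illustrate the core idea, here is a minimal, single-threaded Python sketch of the disjoint-set (union-find) structure that lets density-connected points be merged in any order. The class and variable names are my own for illustration, not taken from the PDSDBSCAN code:

```python
class DisjointSet:
    """Union-find with path compression and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        """Return the representative of x's set, compressing the path."""
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets containing a and b (order-independent)."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1


# A core point can be unioned with its eps-neighbors in any order, so
# threads can build partial cluster trees independently and merge later.
ds = DisjointSet(5)
ds.union(0, 1)
ds.union(1, 2)          # points 0, 1, 2 become one cluster
assert ds.find(0) == ds.find(2)
assert ds.find(3) != ds.find(0)
```

Because unions commute, the order in which neighborhoods are processed no longer matters, which is precisely what removes DBSCAN's sequential access constraint.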
Software for parallel DBSCAN in MPI and OpenMP, along with a sample dataset, is available here.
Steel Fatigue Strength Predictor
The tool estimates the fatigue strength of a steel based on the composition and processing parameters entered by the user, and is built using the Japan NIMS experimental database.
The online steel fatigue strength predictor is available here.
Formation Energy Predictor
This tool takes a compound's composition as input (without any structure information) and predicts its formation energy. These models were recently used to scan (almost) the entire ternary composition space, resulting in a first-of-its-kind computational discovery of about 4,500 new stable compounds. Newer, more accurate models developed later are also deployed in this tool.
The online formation energy predictor is available here.
Thermoelectric Toolkit
The tool estimates the Seebeck coefficient, which is a key thermoelectric property of a material, based on the composition and production method of the material, and is built using a set of 300 experimental thermoelectric materials.
The online thermoelectric toolkit is available here.
Five Year Life Expectancy Calculator
The calculator estimates a given patient's chances of surviving at least 5 years beyond the time of the last hospital visit, using a small non-redundant subset of 24 patient attributes. The patient-specific 5-year survival probability is depicted alongside the survival probabilities of a healthy and a sick patient.
The online five year life expectancy calculator for older adults is available here.
Lung Cancer Outcome Calculator
We have developed an online lung cancer outcome calculator to estimate the patient-specific risk of mortality due to lung cancer, using ensemble data mining on data from the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute. The calculator includes data mining models that predict 6-month, 9-month, 1-year, 2-year, and 5-year survival for lung cancer, which can aid physicians in decision-making. One design objective for the calculator was to use a minimal set of attributes (to facilitate easy data entry) without sacrificing accuracy. For this purpose, we used several attribute-selection techniques in conjunction with domain knowledge. The calculator uses 13 attributes, and has a prediction accuracy of 91.2% with a discriminatory c-statistic of 0.937.
The online lung cancer outcome calculator is available here.
Sentiment Analysis for Social Media Data
Social media data from platforms like Facebook, Twitter, and blogs is currently growing at an explosive rate. Automated sentiment analysis techniques can be extremely helpful in mining this ever-increasing data and extracting actionable insights in real time. We have developed a multi-lingual sentiment elicitation system, which is capable of 1) identifying the language of an input sentence, and 2) identifying the sentiment of the input sentence. Experimental results on social media data such as Facebook comments and Twitter tweets show highly accurate sentiment identification.
An API for the sentiment analysis service, along with benchmark data, is available here.
Analyzing the Variation in Hospital Billing using Medicare Data
Analysis of recently released Medicare data reveals interesting insights about arbitrary variation in hospital billing and Medicare payments. Not only was the variation in hospital billing found to be much greater than the variation in Medicare payments, but the two variations are also poorly correlated, raising important questions as to how hospitals determine the value of their services.
Visually depicted results in the form of US state heat maps can be found here.
Real-Time Disease Surveillance using Social Media Data
Social media is producing massive amounts of data at an unprecedented scale, as people share their experiences and opinions on a variety of topics, including healthcare-related ones such as health conditions, symptoms, treatments, and side effects. This makes publicly available social media data an invaluable resource for discovering interesting and actionable healthcare insights. We have developed an online resource for real-time disease surveillance using spatial, temporal, and text mining on Twitter data. The real-time analysis results are reported visually as a US disease surveillance map, with distributions and timelines of disease types, symptoms, and treatments, in addition to an overall disease activity timeline. Such a surveillance system can be very useful for early prediction of disease outbreaks, which in turn can facilitate faster and better response preparation. The resulting insights are also expected to be very useful for both patients and doctors in making informed decisions.
The real-time disease surveillance tool for flu is available here.
Poll: Identifying High-Impact Contributions of an Article
The body of scientific literature grows every year, presenting new challenges in the accurate retrieval of relevant publications. Citation sentences offer a concise way to represent the main contributions of a publication. We have developed an online tool called Poll, a prototype academic search engine that utilizes citation sentences to identify the most important contributions of a cited publication.
The prototype version of Poll is available here.
Pairwise Statistical Significance Estimation
Biologists often use pairwise alignment programs to identify similar, or more specifically, related sequences or homologs (sequences sharing a common ancestor). A typical pairwise alignment program aligns two sequences and constructs an alignment with the maximum similarity score. In general, more closely related sequences will have higher similarity scores. However, the alignment score depends on various factors, such as the alignment method, scoring scheme, sequence lengths, and sequence compositions. Thus, judging the relationship between two sequences solely based on the score may lead to a wrong conclusion. The biological significance of a pairwise sequence alignment, or the potential relatedness of the two sequences being aligned, is better gauged by the statistical significance of the alignment score rather than by the alignment score alone. The statistical significance of an alignment score can be expressed as the probability (i.e., the P-value) that random or unrelated sequences could be aligned to generate the same or a higher score.
In general, there are two methods to estimate the statistical significance of a local sequence alignment. One is database statistical significance (DSS), reported by many popular database search programs, such as BLAST, FASTA, and SSEARCH (which uses a full implementation of the Smith-Waterman algorithm). This method depends on the size and composition of the database being searched. The other is pairwise statistical significance (PSS), which is specific to the sequence pair being aligned and independent of any database. Pairwise statistical significance makes the statistical significance estimation procedure more specific to the sequence pair being compared. In addition to not needing a database, pairwise statistical significance has been shown to be more accurate than the database statistical significance reported by popular database search programs like BLAST, PSI-BLAST, and SSEARCH.
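As a hedged illustration of the P-value computation (a sketch of the standard extreme-value formulation, not the released code): once statistical parameters K and λ have been fitted for the sequence pair, the P-value of a local alignment score follows the classic Gumbel-distribution formula. The parameter values in the example below are made up purely for illustration; in pairwise statistical significance they would be estimated per sequence pair rather than taken from database-wide statistics:

```python
import math


def evd_pvalue(score, K, lam, m, n):
    """P-value that two random sequences of lengths m and n achieve a
    local alignment score >= `score`, under the extreme-value (Gumbel)
    distribution: P = 1 - exp(-K * m * n * exp(-lambda * score))."""
    e_value = K * m * n * math.exp(-lam * score)
    return 1.0 - math.exp(-e_value)


# Illustrative (made-up) parameters and scores:
p_low = evd_pvalue(48, K=0.13, lam=0.31, m=250, n=300)
p_high = evd_pvalue(60, K=0.13, lam=0.31, m=250, n=300)
assert 0.0 < p_high < p_low < 1.0   # higher scores are more significant
```

A smaller P-value indicates that unrelated sequences are unlikely to reach the observed score by chance, i.e., stronger evidence of homology.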
Sequential codes for pairwise statistical significance estimation and its variants are available here. Parallel codes for optimized HPC implementations (on MPI, OpenMP, Hybrid, GPU) of pairwise statistical significance estimation are available here.
AGILE: Long Read Sequence Mapping
Recent advances in Next Generation Sequencing (NGS) technology have led to affordable desktop-sized sequencers with low running costs and high throughput. These sequencers produce small fragments (reads) of the genome being sequenced. By mapping these reads to a reference genome, we can sequence the DNA of a new individual. NGS technologies are making it possible for such studies to be conducted at a mass scale. This is believed to usher in an era of personal genomics, when each individual can have his/her DNA sequenced and studied to come up with more personalized ways of anticipating, diagnosing, and treating diseases.
AGILE is a sequence mapping tool specifically designed to map longer reads (read length > 200) to a given reference genome. AGILE offers both high sensitivity (~99.8%) and high speed (it maps more than 1 million reads of length 500 to a reference human genome per hour). At this rate, AGILE needs only about 6 hours for 1X coverage of the human genome.
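The 6-hour figure follows directly from the quoted throughput, assuming a human genome size of roughly 3 billion base pairs (the round number is an assumption for this back-of-the-envelope check):

```python
genome_bp = 3_000_000_000   # approximate human genome size (assumption)
read_length = 500           # read length quoted above
reads_per_hour = 1_000_000  # AGILE throughput quoted above

reads_for_1x = genome_bp // read_length   # reads needed for 1X coverage
hours = reads_for_1x / reads_per_hour
assert hours == 6.0
```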
Software and data for AGILE are available here.
DNAlignTT: DNA Alignment with Sequence-Specific Transition-Transversion Ratio
DNAlignTT is an iterative approach to DNA pairwise sequence alignment with sequence-specific transition-transversion ratios. In each iteration, appropriate substitution matrices are generated using the transition-transversion ratio for the given sequence pair, estimated from the alignment constructed in the previous iteration. The iterations continue until the alignment converges or a maximum number of iterations is reached.
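The ratio-estimation step can be sketched as follows (function names are hypothetical, not from the DNAlignTT source): the current alignment is scanned, counting transitions (A<->G or C<->T mismatches) versus transversions (all other mismatches), ignoring gaps and matches:

```python
PURINES = {"A", "G"}  # C and T are the pyrimidines


def is_transition(a, b):
    """True when a substitution stays within a base class
    (purine<->purine or pyrimidine<->pyrimidine)."""
    return (a in PURINES) == (b in PURINES)


def ts_tv_ratio(aligned_a, aligned_b):
    """Transition/transversion ratio over an existing alignment,
    skipping gap columns and matching positions."""
    ts = tv = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-" or x == y:
            continue
        if is_transition(x, y):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("inf")


# A<->G is a transition, G<->T a transversion: one of each gives 1.0
assert ts_tv_ratio("ACGT", "GCTT") == 1.0
```

In each iteration this ratio would parameterize a fresh substitution matrix (for instance, Kimura-style scoring of transitions versus transversions) and the pair is realigned, repeating until the alignment stops changing.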
Source code for DNAlignTT is available here.
FATBEP: Fuzzy Adaptive Thresholding Based Exon Predictor
Threshold selection is often critical and decisive in problem solving. We have developed a fuzzy logic-based adaptive thresholding approach for the exon prediction problem (FATBEP), an important problem in bioinformatics. A well-known method to identify coding regions (exons) in nucleotide sequences applies a threshold to the frequency component at f = 1/3 of the nucleotide sequence, declaring regions with a high period-3 component to be exons. The proposed approach allows the thresholds to vary along the dataset based on local statistical properties. We incorporate it in a soft computing framework of training and testing in an attempt to determine the optimum adaptive thresholds. The search space of the trained database is reduced by determining a dynamic range of thresholds using fuzzy logic rules formulated for the exon prediction problem.
An implementation of the approach as a user-friendly GUI in MATLAB is freely available for download here.