# Accelerate Pairwise Statistical Significance Estimation using HPC Technologies

## 1. Introduction

### 1.1 Why Pairwise Statistical Significance?

Biologists often use pairwise alignment programs to identify similar, or more specifically, related sequences or homologs (having common ancestor). A typical pairwise alignment program aligns two sequences and constructs an alignment with maximum similarity score. In general, more related sequences will have higher similarity scores. However, the alignment score depends on various factors such as alignment methods, scoring schemes, sequence lengths, and sequence compositions.

Thus, judging the relationship between two sequences solely based on the scores may lead to a wrong conclusion. The biological significance of a pairwise sequence alignment or the potential relatedness of the two sequences being aligned is better gauged by the statistical significance of the alignment score rather than by the alignment score alone.

The statistical significance of the resulting sequence alignment score between the sequences can be presented by the probability (i.e., P-value) that random or unrelated sequences could be aligned to generate the same score.

In general, there are two primary methods to estimate the statistical significance of local sequence alignment. One is called database statistical significance (DSS) reported by many popular database search programs, such as BLAST, FASTA, and SSEARCH (using full implementation of Smith-Waterman algorithm). This method depends on the size and composition of the database being searched. The other method is called the pairwise statistical significance (PSS) proposed by Dr. Ankit Agrawal , which is specific to the sequence-pair being aligned, and independent of any database.

Pairwise statistical significance is an attempt to make the statistical significance estimation procedure more specific to the sequence pair being compared. In addition to not needing a database to estimate the statistical significance of an alignment, pairwise statistical significance is shown to be more accurate than database statistical significance reported by popular database search programs like BLAST, PSI-BLAST, and SSEARCH.

PSS has been used to reorder the hits from a fast database search program like PSI-BLAST. It has also been used for DNA alignment in some cases as well. There also exist some other approaches to estimate PSS, including a clustering-classification approach, which is a lookup-based approach and therefore not very accurate. More details can be found at the websites of Dr. Ankit Agrawal.

### 1.2 Solutions to Accelerate PSSE

Although pairwise statistical significance estimation (PSSE) has been shown to be accurate, it involves thousands of such permutations and alignments, which are enormously time consuming and can be impractical for estimating pairwise statistical significance of a large number of sequence pairs. For instance, for our experiments with 86 query sequences and 2771 subject sequences, the sequential implementation takes more than 32 hours. Bigger data sets demand more computing power.

Therefore, applying high performance computing (HPC) techniques provides more opportunities of accelerating the estimation in reasonable time. Multi-core computers and clusters have become increasingly ubiquitous and more powerful. Therefore, it is of interest to unlock the potential of multi-core desktops and clusters using high performance technologies. On the other hand, with general purpose graphics processing units (GPGPUs) becoming increasingly powerful, inexpensive, and relatively easy to program, it has become a very attractive hardware acceleration platform. Those strongly motivate the use of multi-core CPUs,many-core GPUs,or FPGAs to accelerate the PSSE.

#### 2.1 Software: MPI-PSSE

The PSSE of single-pair sequences processes only one pair of query and subject sequences. Its implementations and some variants can be found in the webpage of Dr. Agrawal.

The multi-pair PSSE processes multiple queries and subject sequences in parallel. The benefits of HPC techniques are harvested usually when the bigger data are available. The acceleration strategy of Dr.Agrawal can be extended by Dr. Yuhong Zhang to support multi-pair PSSE. The source code of the MPI implementation can be downloaded here.

The software also outputs a number of statistics about the sequence alignment technique used, which was used to compare retrieval accuracy of PSSE.

#### How to run the program

``` % aprun ./mpi-psse
VersionDate: 06/20/11
SSM: Standard substitution matrix
PSSM: Position-specific substitution matrix
---------------------------------------------------------------
Usage:
./EXENAME [options]
-query <string> : specify sequence queryname file
-db <string>    : specify database
-cls <string>   : specify classification file
-mtype <integer>        : specify the type of matrix
0: SSM, 1: PSSM
-mat <string>   : specify the path for substitution matrix
For SSM, need to specify the file name
For PSSM, JUST need specify the path that includes PSSM files
-etype <integer>        : specify the type of estimation
0 : for only normal; 1 : for only non-conservative.
2 : conservative pairwise statistical significance
3 : for average statistical significance
-num <integer>  : specify number of shuffles (default: 1000)

```

Example:

(1) use 128 MPI tasks to accelerate PSSE

```aprun -n 128 ./mpi-psse -query CATH/queries -db CATH/database_seqs.fa -cls CATH/classification -mtype 1 -mat PSSM-All/ -etype 0 -num 1000
```

Notes:

1. aprun is the command of wrapper of mpirun in NERSC platform. in the case of your own machine, you may need to use 'mpirun' command to run a MPI program.

2. As for the parameter '-cls', it specifies the classification file, which is just for evaluation of pairwise statistical significance. The classification file describes the accurary classification of each sequence in the test sequence set (2,771 sequence in total). PSSE also can detect the homolog of the pairwise sequences. Therefore, we can verify the output of PSSE with the true classification (usually given by human expert). See here for more details of the evaluation.

If you just plan to obtain the P value via PSSE,leave the parameter alone. However, you may need to revise the source code accordingly. For example, you can use '#define NEED_STAT 0' instead of '#define NEED_STAT 1' in MPIPairwiseStatSig.cpp.

#### 2.2 Software: Par-PSSE (using hybrid paradigms:OpenMP/MPI)

Par-PSSE is a multi-core CPU based implementation to accelerate PSSE for local sequence alignment. This software reaps the benefits of multi-core CPU by using MPI or/and OpenMP and their hybrid paradigms.

##### Software Requirements:

1. GCC 4.4+

2. Linux-based Operating System

#### Subject sequences 2,771 in total, -db specifies database in the example above)( Query sequences (86 of 2,771, -query specifies queries in the example above) Classification file (for 2771 sequences) PSSM (for 2,771 sequences) Standard Substitution Matrix ( BLOSUM45/50/61/80/100) The latest Swiss-Prot protein database

##### Usage:
```
./Par-PSSE [options]
-query <string> : specify sequence query name file
-db <string>    : specify database
-cls <string>   : specify classification file
-mtype <integer>        : specify the type of matrix
0: SSM, 1: PSSM
-mat <string>   : specify the path for substitution matrix
For SSM, need to specify the file name
For PSSM, JUST need specify the path that include PSSM files
-num <integer>  : specify number of shuffles (default: 1000)
-thr <integer>  : specify number of threads (default: 1)
-ptype <integer>: specify parallelism type(default: 0)
( 0 : inter-parallelism, 1: intra-parallelism )
```

Example :

1. use 1 process and 24 threads to accelerate PSSE. Assume that the substitution matrix is PSSM, the number of shuffles is 1000, and parallelism type is inter-parallelism.

```aprun -n 1 -d 24 ./Par-PSSE -query CATH/queries -db CATH/database_seqs.fa -cls CATH/classification -mtype 1 -mat PSSM/ -num 1000  -thr 24 -ptye 0
```

Note:

"aprun -n 1 -d 24" is the configurations of NERSC machine (NUMA), if you just try the program in your own machin, simply replace it with "mpirun 1". There is no need to specify the number of threads at the begining of the command.

#### 2.4 Software : CUDA-PSSE

CUDA-PSSE is a GPU implementation to accelerate PSSE for local sequence alignment using standard substitution matrix and position specific score matrix (PSSM). This software of PSSE takes advantage of NVIDIA's CUDA programming environment.

#### Software Requirements:

1. CUDA toolkits and SDK 4.0
2. Single or multiple CUDA-enabled GPUs with compute capability 2.0 and higher
3. GCC 4.4+
4. Linux-based Operating System

```
\$ ./CUDA-PSSE
Usage:./CUDA-PSSE [options]Standard options:
-query <string> : specify sequence query-name file
-db <string>    : specify the pair-sequence file
-cls <string>   : specify the classification file
-pssm <string>  : specify the PATH that include PSSM files
-num <string>   : specify the number of shuffling second sequnece
-gapo <integer> : specify the gap open panelty (0 ~ 255), (default 11)
-gape <integer> : specify the gap extension panelty (0 ~ 255), (default 1)
-use_single <integer>   : force to use single-gpu (default: false)

```

Sample of output statistics file(.scv file)

## 3. Citations

1. A. Agrawal, V. Brendel, and X. Huang. Pairwise statistical significance versus database statistical significance for local alignment of protein sequences. In Bioinformatics Research and Applications, vol. 4983, LNCS(LNBI), pp. 50–61, Springer Berlin/Heidelberg (2008).
2. A. Agrawal and X. Huang, PSIBLAST PairwiseStatSig: reordering PSI-BLAST hits using pairwise statistical significance, Bioinformatics. 25(8),1082–1083 (2009).
3. A. Agrawal and X. Huang, Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices, IEEE Transactions on Computational Biology and Bioinformatics. 8(1), 194–205 (2011).
4. A. Agrawal and X. Huang, Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty, BMC Bioinformatics. 10(Suppl 3), S1 (2009).
5. A. Agrawal, A. Ghosh, and X. Huang. Estimating pairwise statistical significance of protein local alignments using a clustering-classification approach based on amino acid composition. In eds. I. Mandoiu, R. Sunderraman, and A. Zelikovsky, Bioinformatics Research and Applications, vol. 4983, LNCS(LNBI), pp. 62–73, Springer Berlin/Heidelberg (2008).
6. A. Agrawal, V. P. Brendel, and X. Huang, Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment, International Journal of Computational Biology and Drug Design. 1(4), 347–367 (2008).
7. A. Agrawal, S. Misra, D. Honbo, and A. N. Choudhary, Parallel pairwise statistical significance estimation of local sequence alignment using message passing interface library, Concurrency and Computation: Practice and Experience. 23(17), 2269–2279 (2011).
8. Yuhong Zhang, Sanchit Misra, Daniel Honbo, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. Efficient Pairwise Statistical Significance Estimation for Local Sequence Alignment Using GPU. In the 1st IEEE International Conference on Computational Advances in Bio and medical Sciences, pp. 226-231, February 2011.
9. Yuhong Zhang, Sanchit Misra, Ankit Agrawal, Md. Mostofa Ali Patwary, Wei-keng Liao, Zhiguang Qin, and Alok Choudhary. Accelerating Pairwise Statistical Significance Estimation for Local Alignment by Harvesting GPU's Power. BMC Bioinformatics, 2012, BMC Bioinformatics, 13(Suppl 5):S3 doi:10.1186/1471-2105-13-S5-S3. PDF
10. Yuhong Zhang, Md. Mostofa Ali Patwary, Sanchit Misra, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. Enhancing Parallelism of Pairwise Statistical Significance Estimation for Local Sequence Alignment. Proceedings of HiPC Workshops, Workshop on Hybrid Multi-core Computing, pp. 1-8, Dec. 2011.

11. Yuhong Zhang, Yunbo Rao. Mining genomic sequence data for related sequences using pairwise statistical significance. (Book Chapter) in "Biological Data Mining and Its Applications in Healthcare" pp. 67-101, 2013, World Scientific. PDF

12. Yuhong Zhang, Feng Chen. High performance sequence mining using pairwise statistical significance. (Book Chapter) Advances in Parallel Computing, Volume 24: Transition of HPC Towards Exascale Computing, pp. 194 - 214, 2013, IOS press (DOI10.3233/978-1-61499-324-7-194).PDF