Introduction
Sequence alignment
Sequence alignment is a standard method to compare two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences [1]. Also, it is a way of arranging two or more sequences of characters to recognize regions of similarity [2].
Importance of sequence alignment
Sequence alignment is significant because in biomolecular sequences (DNA, RNA, or protein), high sequence similarity usually implies important functional or structural similarity, and establishing that similarity is the first step of many biological analyses [3]. In addition, sequence alignment can address significant questions such as detecting gene sequences that cause disease or susceptibility to disease, identifying changes in gene sequences that drive evolution, finding relationships between gene sequences that can indicate common ancestry [4], detecting functionally important sites, and demonstrating mutation events [5].
Analysis of the alignment can reveal important information. If the proteins are involved in similar processes, it is possible to identify the parts of the sequences that are likely to be important for function. Random mutations accumulate more easily in parts of a protein sequence that are not essential for its function. In the parts that are essential, hardly any mutations are accepted, because nearly all changes in such regions destroy the function [6]. Moreover, sequence alignment is important for assigning function to unknown proteins [7]. Aligning two residues implies that those residues perform similar roles in the two different proteins [8].
Methods
The main purpose of sequence alignment methods is to find the maximum degree of similarity and the minimum evolutionary distance. Generally, computational approaches to sequence alignment can be divided into two categories: global alignments and local alignments. Global alignments span the entire length of all query sequences and match as many characters as possible from end to end. These methods are most useful when the sequences are similar or have approximately the same length. On the other hand, local alignments find local regions with a high level of similarity. They are more useful for sequences that are suspected to contain regions of similarity within their larger sequence context [9].
In addition, pairwise sequence alignment is used to find the regions of similarity between two sequences. As the number of sequences increases, comparing every sequence with every other becomes impractical. We therefore need multiple sequence alignment, where all similar sequences can be compared in a single figure or table. The basic idea is that the sequences are aligned on top of each other so that a coordinate system is set up, where each row is the sequence of one protein and each column is the same position in each sequence [10].
There are many different approaches to sequence alignment and many implementations of them. These include dynamic programming, heuristic algorithms (BLAST and FASTA similarity searching), probabilistic methods, dot-matrix methods, and progressive methods, as well as programs such as ClustalW, MUSCLE, T-Coffee, and DIALIGN.
Dynamic programming
Dynamic programming (DP) is a problem-solving method for a class of problems that can be solved by breaking them down into simpler subproblems. It finds the alignment by assigning scores to matches and mismatches (scoring matrices). This method is widely used in sequence alignment problems [11]. However, when there are more than two sequences, multidimensional dynamic programming is infeasible because of its large storage and computational requirements [16].
Dynamic programming algorithms use gap penalties to make alignments more biologically meaningful [9]. There are different gap penalty schemes: linear, constant, and affine (gap opening plus gap extension). The gap score is a penalty given to the alignment when there is an insertion or deletion. During evolution a single event may insert or delete a run of consecutive residues, so a linear gap penalty, which charges every gap position equally, is not always suitable. Therefore, a gap opening penalty and a gap extension penalty were introduced for runs of consecutive gaps. The gap opening penalty is applied at the start of the gap, and each following gap position receives a gap extension penalty, which is smaller than the opening penalty. Different gap penalty functions require different dynamic programming algorithms [12]. In addition, a substitution matrix is used to score aligned residue pairs. The most commonly used predefined scoring matrices for sequence alignment are PAM (Point Accepted Mutation) and BLOSUM (Blocks Substitution Matrix).
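The three gap penalty schemes can be compared with a short Python sketch. The function names and the default penalty values are illustrative assumptions, not values taken from the cited sources; penalties are given as positive costs.

```python
def linear_gap(length, gap=2):
    """Linear scheme: every gap position costs the same, so the total is gap * length."""
    return gap * length


def constant_gap(length, gap=2):
    """Constant scheme: a gap costs a fixed amount regardless of its length."""
    return gap if length > 0 else 0


def affine_gap(length, gap_open=10, gap_extend=1):
    """Affine scheme: the first position pays gap_open, each further one gap_extend."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    for length in (1, 3, 10):
        print(length, linear_gap(length), constant_gap(length), affine_gap(length))
```

With these assumed values, one long gap is cheaper under the affine scheme than several short gaps of the same total length, which is the behaviour the opening and extension penalties are meant to produce.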
Two algorithms, Smith-Waterman for local alignment and Needleman-Wunsch for global alignment, are based on dynamic programming.
The Needleman-Wunsch algorithm requires the alignment score for a pair of residues to be greater than or equal to zero. No gap penalty is required, and the score cannot decrease between two cells of a pathway. Smith-Waterman requires a gap penalty to work efficiently. The residue alignment score may be positive or negative, and the score can increase, decrease, or stay level between two cells of a pathway [13].
Sequence Alignment Problems
For an n-character sequence s and an m-character sequence t, we construct an (n+1)×(m+1) matrix F.
Global alignment: F(i, j) = score of the best alignment of s[1…i] with t[1…j]
Local alignment: F(i, j) = score of the best alignment of a suffix of s[1…i] with a suffix of t[1…j]
There are three steps in the sequence alignment algorithms:
In the initialization phase, we assign values to the first row and column of the alignment matrix. The next step of the algorithm depends on these values.
In the fill stage, the entire matrix is filled with scores from top to bottom and left to right, with values that depend on the gap penalties and the scoring matrix.
In the traceback stage, we use the pointers saved for each F(i, j) to the cell that produced its best score. For global alignment, we trace pointers back from F(n, m) to F(0, 0) to recover the alignment. For local alignment, we look for the maximum value of F(i, j), which can be anywhere in the matrix, trace pointers back from that cell, and stop when we reach a cell with value 0.
After creating and initializing the alignment matrix (F) and the traceback matrix, the score F(i, j) of every cell is calculated as follows:
For i = 1 to n:
    For j = 1 to m:
        left_score = F[i][j-1] - gap
        diagonal_score = F[i-1][j-1] + PAM250(s[i], t[j])
        up_score = F[i-1][j] - gap
        scores = [0, left_score, diagonal_score, up_score]
        F[i][j] = max(scores)
We also keep a reference for each cell so that we can perform backtracking:
        traceback_matrix[i][j] = scores.index(F[i][j])
After filling the F matrix, we find the optimal alignment score and the optimal end point by finding the highest-scoring cell, max_{i,j} F(i, j). best_score has a default value of -1.
if F[i][j] > best_score:
    best_score = F[i][j]
    i_maximum_score, j_maximum_score = i, j
To recover the optimal alignment, we trace back from position (i_maximum_score, j_maximum_score), terminating the traceback when we reach a cell with score 0.
The time and space complexity of this algorithm is O(mn), where n is the length of sequence s and m is the length of sequence t.
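The fill, maximum search, and traceback steps described above can be put together in one runnable Python sketch. It is a minimal illustration under stated assumptions: a simple match/mismatch score stands in for the PAM250 lookup, and the function name local_align and the default scores are hypothetical.

```python
def local_align(s, t, match=2, mismatch=-1, gap=2):
    """Local alignment with a linear gap penalty (Smith-Waterman style).

    Assumption: match/mismatch scores replace a PAM250 lookup.
    Returns (best_score, aligned_s, aligned_t).
    """
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Traceback codes follow the order of the scores list:
    # 0 = stop, 1 = left, 2 = diagonal, 3 = up.
    trace = [[0] * (m + 1) for _ in range(n + 1)]
    best_score, best_i, best_j = 0, 0, 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            scores = [0,
                      F[i][j - 1] - gap,      # left: gap in s
                      F[i - 1][j - 1] + sub,  # diagonal: residue pair
                      F[i - 1][j] - gap]      # up: gap in t
            F[i][j] = max(scores)
            trace[i][j] = scores.index(F[i][j])
            if F[i][j] > best_score:
                best_score, best_i, best_j = F[i][j], i, j

    # Trace back from the best cell until a cell with score 0 is reached.
    out_s, out_t = [], []
    i, j = best_i, best_j
    while F[i][j] > 0:
        move = trace[i][j]
        if move == 2:
            out_s.append(s[i - 1]); out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif move == 1:
            out_s.append("-"); out_t.append(t[j - 1])
            j -= 1
        else:
            out_s.append(s[i - 1]); out_t.append("-")
            i -= 1
    return best_score, "".join(reversed(out_s)), "".join(reversed(out_t))


if __name__ == "__main__":
    print(local_align("TTACGTT", "GGACGGG"))
```

Python strings are 0-indexed, so s[i - 1] in the code corresponds to s[i] in the recurrence.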
For this problem, local alignment with an affine gap penalty, there are a gap opening penalty and a gap extension penalty. The gap opening penalty is applied at the start of the gap, and each following gap position receives a gap extension penalty.
Initialization:
There are four different matrices: up_score, left_score, m_score, and trace_back.
Filling matrix:
For i = 1 to n:
    up_score[i][0] = -gap_opening_penalty - (i-1)*gap_extension_penalty
For j = 1 to m:
    left_score[0][j] = -gap_opening_penalty - (j-1)*gap_extension_penalty
For i = 1 to n:
    For j = 1 to m:
        up_score[i][j] = max(
            up_score[i-1][j] - gap_extension_penalty,
            m_score[i-1][j] - gap_opening_penalty
        )
        left_score[i][j] = max(
            left_score[i][j-1] - gap_extension_penalty,
            m_score[i][j-1] - gap_opening_penalty
        )
        m_score[i][j] = BLOSUM62(s[i], t[j]) + max(
            m_score[i-1][j-1],
            left_score[i-1][j-1],
            up_score[i-1][j-1]
        )
        scores = [left_score[i-1][j-1], m_score[i-1][j-1], up_score[i-1][j-1], 0]
We find the highest-scoring cell, its position, and the best alignment by following the same steps as in the previous problem.
The time and space complexity of this algorithm is O(mn).
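The three-matrix recurrence for local alignment with opening and extension penalties can be sketched in Python as follows. This is an illustration under assumptions, not the exact procedure of the text: match/mismatch scores replace BLOSUM62, only the score is computed (no traceback), the gap matrices start at minus infinity on the borders because a local alignment never begins with a gap, and the name local_affine_score is hypothetical.

```python
NEG_INF = float("-inf")


def local_affine_score(s, t, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score with an affine gap penalty.

    Assumption: a gap of length L costs gap_open + (L - 1) * gap_extend.
    """
    n, m = len(s), len(t)
    # m_score: alignments ending in a residue pair (0 = start a new alignment)
    m_score = [[0] * (m + 1) for _ in range(n + 1)]
    # up_score: alignments ending with s[i] against a gap (vertical move)
    up_score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    # left_score: alignments ending with t[j] against a gap (horizontal move)
    left_score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            up_score[i][j] = max(up_score[i - 1][j] - gap_extend,
                                 m_score[i - 1][j] - gap_open)
            left_score[i][j] = max(left_score[i][j - 1] - gap_extend,
                                   m_score[i][j - 1] - gap_open)
            sub = match if s[i - 1] == t[j - 1] else mismatch
            m_score[i][j] = max(0, sub + max(m_score[i - 1][j - 1],
                                             up_score[i - 1][j - 1],
                                             left_score[i - 1][j - 1]))
            # An optimal local alignment ends in a residue pair,
            # so tracking the maximum of m_score is sufficient.
            best = max(best, m_score[i][j])
    return best


if __name__ == "__main__":
    print(local_affine_score("AAACCC", "AAAGGCCC"))
```

In the example, the two matching blocks AAA and CCC are joined by one gap of length 2, which costs 3 + 1 = 4 under the assumed penalties.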
In this case every gap receives a fixed score, regardless of the gap length.
For i = 1 to n:
    alignment_matrix[i][0] = -gap_penalty
For j = 1 to m:
    alignment_matrix[0][j] = -gap_penalty
For i = 1 to n:
    For j = 1 to m:
        scores = [alignment_matrix[i][j-1] - gap_penalty, alignment_matrix[i-1][j] - gap_penalty, alignment_matrix[i-1][j-1] + BLOSUM62(s[i], t[j])]
        alignment_matrix[i][j] = max(scores)
alignment_matrix[n][m] holds the optimal alignment score.
Note that this single-matrix recurrence charges gap_penalty for every gap position inside the matrix. Charging each gap exactly once requires tracking whether a gap is already open, as in the three-matrix formulation with the gap extension penalty set to zero.
The time and space complexity of this algorithm is O(mn), where n is the length of sequence s and m is the length of sequence t.
In this problem the gap penalty is linear: each inserted or deleted symbol is charged g, so a gap of length L receives a total penalty of gL.
For i = 1 to n:
    alignment_matrix[i][0] = -i*gap_penalty
For j = 1 to m:
    alignment_matrix[0][j] = -j*gap_penalty
For i = 1 to n:
    For j = 1 to m:
        scores = [alignment_matrix[i][j-1] - gap_penalty, alignment_matrix[i-1][j] - gap_penalty, alignment_matrix[i-1][j-1] + BLOSUM62(s[i], t[j])]
        alignment_matrix[i][j] = max(scores)
alignment_matrix[n][m] holds the optimal alignment score.
The time and space complexity of this algorithm is O(mn), where n is the length of sequence s and m is the length of sequence t.
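A runnable Python version of global alignment with a linear gap penalty might look like the sketch below. The match/mismatch scores are an assumption standing in for BLOSUM62, and the function name global_score is hypothetical; only the optimal score is returned.

```python
def global_score(s, t, match=1, mismatch=-1, gap=1):
    """Global alignment score with a linear gap penalty (Needleman-Wunsch style).

    Assumption: match/mismatch scores replace a BLOSUM62 lookup.
    """
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: a prefix aligned entirely against gaps costs gap per symbol.
    for i in range(1, n + 1):
        F[i][0] = -i * gap
    for j in range(1, m + 1):
        F[0][j] = -j * gap

    # Fill stage: each cell takes the best of the three possible predecessors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i][j - 1] - gap,
                          F[i - 1][j] - gap,
                          F[i - 1][j - 1] + sub)
    return F[n][m]


if __name__ == "__main__":
    print(global_score("ACGT", "AGT"))
```

For "ACGT" against "AGT" the best alignment pairs A, G, and T and places C against a gap, giving 3 matches minus one gap position.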
There are four different matrices: up_score, left_score, m_score, and trace_back.
Filling matrix:
For i = 1 to n:
    up_score[i][0] = -gap_opening_penalty - (i-1)*gap_extension_penalty
For j = 1 to m:
    left_score[0][j] = -gap_opening_penalty - (j-1)*gap_extension_penalty
All other boundary cells are set to minus infinity, and m_score[0][0] = 0.
For i = 1 to n:
    For j = 1 to m:
        up_score[i][j] = max(
            up_score[i-1][j] - gap_extension_penalty,
            m_score[i-1][j] - gap_opening_penalty
        )
        left_score[i][j] = max(
            left_score[i][j-1] - gap_extension_penalty,
            m_score[i][j-1] - gap_opening_penalty
        )
        m_score[i][j] = BLOSUM62(s[i], t[j]) + max(
            m_score[i-1][j-1],
            left_score[i-1][j-1],
            up_score[i-1][j-1]
        )
maximum_alignment_score = max(m_score[n][m], left_score[n][m], up_score[n][m])
The time and space complexity of this algorithm is O(mn), where n is the length of sequence s and m is the length of sequence t.
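The global alignment with opening and extension penalties can be sketched in Python as follows. The sketch makes several assumptions: match/mismatch scores replace BLOSUM62, uninitialized boundary cells are minus infinity, a gap of length L costs gap_open + (L - 1) * gap_extend, and the name global_affine_score is hypothetical. Only the score is computed.

```python
NEG_INF = float("-inf")


def global_affine_score(s, t, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Global alignment score with an affine gap penalty (three matrices)."""
    n, m = len(s), len(t)
    m_score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    up_score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]    # s[i] against a gap
    left_score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # t[j] against a gap

    m_score[0][0] = 0
    for i in range(1, n + 1):
        up_score[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        left_score[0][j] = -gap_open - (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one after a residue pair.
            up_score[i][j] = max(up_score[i - 1][j] - gap_extend,
                                 m_score[i - 1][j] - gap_open)
            left_score[i][j] = max(left_score[i][j - 1] - gap_extend,
                                   m_score[i][j - 1] - gap_open)
            sub = match if s[i - 1] == t[j - 1] else mismatch
            m_score[i][j] = sub + max(m_score[i - 1][j - 1],
                                      up_score[i - 1][j - 1],
                                      left_score[i - 1][j - 1])
    return max(m_score[n][m], up_score[n][m], left_score[n][m])


if __name__ == "__main__":
    print(global_affine_score("ACGT", "AT"))                # affine gap
    print(global_affine_score("ACGT", "AT", gap_extend=0))  # constant gap
```

Setting gap_extend to 0 makes every gap cost gap_open regardless of its length, so the same sketch also covers the constant gap penalty case.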
The above algorithms require too much time for searching large databases, so they are impractical for that purpose. There are several methods to overcome this problem.
Heuristic Method
A heuristic method is an algorithm that gives only an approximate solution to a problem. Sometimes we cannot formally prove that this solution actually solves the problem, but since heuristic methods are much faster than exact algorithms, they are commonly used. FASTA is a heuristic method for sequence alignment. Its main idea is to choose regions of the two sequences that have some degree of similarity and to use dynamic programming to compute a local alignment in these regions. The disadvantage of these methods is a significant loss of sensitivity. Parallelization of the exact algorithms is a possible solution to this problem [14].
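The idea of first choosing promising regions can be illustrated with a small Python sketch that looks for short exact word matches (k-tuples) shared by the two sequences and counts them per diagonal. This is only a simplified illustration of the seeding idea, not FASTA's actual implementation; the function name best_diagonal and the default k are assumptions.

```python
from collections import Counter, defaultdict


def best_diagonal(s, t, k=3):
    """Return (diagonal, hits) for the diagonal i - j with the most shared k-tuples.

    Returns None when the sequences share no k-tuple.
    """
    # Index every k-tuple of t by its start positions.
    positions = defaultdict(list)
    for j in range(len(t) - k + 1):
        positions[t[j:j + k]].append(j)

    # Each shared k-tuple at (i, j) votes for the diagonal i - j.
    hits = Counter()
    for i in range(len(s) - k + 1):
        for j in positions.get(s[i:i + k], []):
            hits[i - j] += 1

    if not hits:
        return None
    return hits.most_common(1)[0]


if __name__ == "__main__":
    print(best_diagonal("TTTACGTACGT", "ACGTACGT"))
```

A diagonal with many hits marks a region where the two sequences are likely to be similar, so dynamic programming can then be restricted to a band around it instead of the full matrix.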
Parallel Algorithm
In [15], a parallel method is introduced to reduce the running time of the dynamic programming algorithm for pairwise sequence alignment. The running time of the sequential algorithm depends mainly on the computation of the score matrix. The computation of F(i, j) can start only when F(i-1, j-1), F(i-1, j), and F(i, j-1) have acquired their values. Consequently, the score matrix can be computed in order of anti-diagonals, and the values in the same anti-diagonal can be calculated simultaneously (Figure 1).
          A    G    C    T
     0    1    2    3    *
A    1    0    1    *
G    2    1    *
C    3    *
T    *
Figure 1. Computing the score matrix in parallel. The values of the cells marked by * can be computed simultaneously.
There are two parallel models that improve the performance of the pairwise alignment algorithm.
Pipeline model: Each row of the score matrix is computed successively by a processor, which blocks until the required values in the row above are computed.
Anti-diagonal model: From the top-left corner to the bottom-right corner of the score matrix, all processors compute concurrently along an anti-diagonal of the matrix. Each idle processor selects a cell from the current anti-diagonal and computes its value. When all values in the current anti-diagonal are computed, the computation moves on to the next anti-diagonal.
In the algorithm that is based on the pipeline model, the score matrix is partitioned into several blocks by column and several bands by row. The bands are distributed to multiple processors, and the processors compute the blocks in their own bands simultaneously.
With the parallel algorithm, the time complexity is O(n) when n processors are used [15].
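The order of computation in the anti-diagonal model can be sketched in Python. The code below runs sequentially; it only shows which cells belong to the same anti-diagonal and could therefore be handed to different processors. The function names, the match/mismatch scoring, and the use of a global alignment with a linear gap penalty are assumptions made for illustration.

```python
def antidiagonals(n, m):
    """Yield, for each anti-diagonal, the interior cells (i, j) with i + j constant."""
    for d in range(2, n + m + 1):
        yield [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]


def wavefront_score(s, t, match=1, mismatch=-1, gap=1):
    """Global alignment score computed anti-diagonal by anti-diagonal."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * gap
    for j in range(1, m + 1):
        F[0][j] = -j * gap

    for cells in antidiagonals(n, m):
        # Every cell in this list depends only on cells of the two previous
        # anti-diagonals, so the loop body could run on separate processors.
        for i, j in cells:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i][j - 1] - gap,
                          F[i - 1][j] - gap,
                          F[i - 1][j - 1] + sub)
    return F[n][m]


if __name__ == "__main__":
    print(list(antidiagonals(2, 2)))
    print(wavefront_score("ACGT", "AGT"))
```

The result is the same as with row-by-row filling, because only the visiting order of the cells changes, not the recurrence.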
Progressive Method
For solving multiple sequence alignment problems, the most commonly used algorithm is the progressive method. This algorithm consists of three main stages. The first stage compares all the sequences with each other and produces similarity scores (a distance matrix). This stage is parallelized. The second stage groups the most similar sequences together using the similarity scores and a clustering method such as Neighbor-Joining to create a guide tree. Finally, the third stage sequentially aligns the most similar sequences and groups of sequences until all the sequences are aligned. Before alignment with a pairwise dynamic programming algorithm, groups of aligned sequences are converted into profiles. A profile represents the character frequencies for each column in an alignment. In the final stage, aligning groups of sequences requires the traceback information from the full pairwise alignment [17].
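The notion of a profile can be made concrete with a short Python sketch that computes the character frequencies of each column of a group of aligned sequences. The function name build_profile and the use of "-" as the gap character are assumptions for illustration.

```python
from collections import Counter


def build_profile(aligned):
    """Return one dict per alignment column mapping each character to its frequency.

    Assumption: all rows have equal length and "-" marks a gap.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for column in zip(*aligned):  # iterate over alignment columns
        counts = Counter(column)
        profile.append({ch: c / len(aligned) for ch, c in counts.items()})
    return profile


if __name__ == "__main__":
    for column in build_profile(["AC-T", "ACGT"]):
        print(column)
```

When two groups are aligned, a column-against-column score can then be computed from these frequencies instead of from single residues.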
ClustalW
ClustalW, which has become the most popular program for multiple sequence alignment, implements the progressive method. Its time complexity is O(N^{4} + L^{2}) and its space complexity is O(N^{2} + L^{2}), where N is the number of sequences and L is the sequence length [18].
Conclusion
By comparing the different methods for pairwise and multiple sequence alignment, we can conclude that parallel algorithms that implement the pipeline model or the anti-diagonal model are effective for performing pairwise sequence alignment. Algorithms that implement the progressive method, such as ClustalW, are effective for solving multiple sequence alignment problems.
References