Class: | Sequence alignment |
Time: | O(mn) |
Space: | O(mn) |
The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970.[1] The algorithm essentially divides a large problem (e.g. the full sequence) into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem.[2] It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.
This algorithm can be used for any two strings. This guide will use two small DNA sequences as examples as shown in Figure 1: GCATGCG GATTACA
First construct a grid such as one shown in Figure 1 above. Start the first string in the top of the third column and start the other string at the start of the third row. Fill out the rest of the column and row headers as in Figure 1. There should be no numbers in the grid yet.
G | C | A | T | G | C | G | |||
G | |||||||||
A | |||||||||
T | |||||||||
T | |||||||||
A | |||||||||
C | |||||||||
A |
Next, decide how to score each individual pair of letters. Using the example above, one possible alignment candidate might be: 12345678
The letters may match, mismatch, or be matched to a gap (a deletion or insertion (indel)):
Each of these scenarios is assigned a score and the sum of the scores of all the pairings is the score of the whole alignment candidate. Different systems exist for assigning scores; some have been outlined in the Scoring systems section below. For now, the system used by Needleman and Wunsch[1] will be used:
For the Example above, the score of the alignment would be 0: +−++−−+− −> 1*4 + (−1)*4 = 0
Start with a zero in the first row, first column (not including the cells containing nucleotides). Move through the cells row by row, calculating the score for each cell. The score is calculated by comparing the scores of the cells neighboring to the left, top or top-left (diagonal) of the cell and adding the appropriate score for match, mismatch or indel. Take the maximum of the candidate scores for each of the three possibilities:
The resulting score for the cell is the highest of the three candidate scores.
Given there is no 'top' or 'top-left' cells for the first row only the existing cell to the left can be used to calculate the score of each cell. Hence −1 is added for each shift to the right as this represents an indel from the previous score. This results in the first row being 0, −1, −2, −3, −4, −5, −6, −7. The same applies to the first column as only the existing score above each cell can be used. Thus the resulting table is:
G | C | A | T | G | C | G | |||
0 | −1 | −2 | −3 | −4 | −5 | −6 | −7 | ||
G | −1 | ||||||||
A | −2 | ||||||||
T | −3 | ||||||||
T | −4 | ||||||||
A | −5 | ||||||||
C | −6 | ||||||||
A | −7 |
The first case with existing scores in all 3 directions is the intersection of our first letters (in this case G and G). The surrounding cells are below:
G | |||
0 | −1 | ||
G | −1 | X |
This cell has three possible candidate sums:
The highest candidate is 1 and is entered into the cell:
G | |||
0 | −1 | ||
G | −1 | 1 |
The cell which gave the highest candidate score must also be recorded. In the completed diagram in figure 1 above, this is represented as an arrow from the cell in row and column 2 to the cell in row and column 1.
In the next example, the diagonal step for both X and Y represents a mismatch:
G | C | |||
0 | −1 | −2 | ||
G | −1 | 1 | X | |
A | −2 | Y |
X:
Y:
For both X and Y, the highest score is zero:
G | C | |||
0 | −1 | −2 | ||
G | −1 | 1 | 0 | |
A | −2 | 0 |
The highest candidate score may be reached by two of the neighboring cells:
T | G | ||
T | 1 | 1 | |
A | 0 | X |
In this case, all directions reaching the highest candidate score must be noted as possible origin cells in the finished diagram in figure 1, e.g. in the cell in row and column 6.
Filling in the table in this manner gives the scores of all possible alignment candidates, the score in the cell on the bottom right represents the alignment score for the best alignment.
Mark a path from the cell on the bottom right back to the cell on the top left by following the direction of the arrows. From this path, the sequence is constructed by these rules:
Following these rules, the steps for one possible alignment candidate in figure 1 are:
G → CG → GCG → -GCG → T-GCG → AT-GCG → CAT-GCG → GCAT-GCG A → CA → ACA → TACA → TTACA → ATTACA → -ATTACA → G-ATTACA ↓ (branch) → TGCG → -TGCG → ... → TACA → TTACA → ...
The simplest scoring schemes simply give a value for each match, mismatch and indel. The step-by-step guide above uses match = 1, mismatch = −1, indel = −1. Thus the lower the alignment score the larger the edit distance, for this scoring system one wants a high score. Another scoring system might be:
For this system the alignment score will represent the edit distance between the two strings.Different scoring systems can be devised for different situations, for example if gaps are considered very bad for your alignment you may use a scoring system that penalises gaps heavily, such as:
More complicated scoring systems attribute values not only for the type of alteration, but also for the letters that are involved. For example, a match between A and A may be given 1, but a match between T and T may be given 4. Here (assuming the first scoring system) more importance is given to the Ts matching than the As, i.e. the Ts matching is assumed to be more significant to the alignment. This weighting based on letters also applies to mismatches.
In order to represent all the possible combinations of letters and their resulting scores a similarity matrix is used. The similarity matrix for the most basic system is represented as:
A | G | C | T | ||
---|---|---|---|---|---|
A | 1 | −1 | −1 | −1 | |
G | −1 | 1 | −1 | −1 | |
C | −1 | −1 | 1 | −1 | |
T | −1 | −1 | −1 | 1 |
A | G | C | T | ||
---|---|---|---|---|---|
A | 1 | −1 | −1 | −1 | |
G | −1 | 1 | −1 | −1 | |
C | −1 | −1 | 1 | −1 | |
T | −1 | −1 | −1 | 4 |
Different scoring matrices have been statistically constructed which give weight to different actions appropriate to a particular scenario. Having weighted scoring matrices is particularly important in protein sequence alignment due to the varying frequency of the different amino acids. There are two broad families of scoring matrices, each with further alterations for specific scenarios:
When aligning sequences there are often gaps (i.e. indels), sometimes large ones. Biologically, a large gap is more likely to occur as one large deletion as opposed to multiple single deletions. Hence two small indels should have a worse score than one large one. The simple and common way to do this is via a large gap-start score for a new indel and a smaller gap-extension score for every letter which extends the indel. For example, new-indel may cost -5 and extend-indel may cost -1. In this way an alignment such as: GAAAAAAT G--A-A-Twhich has multiple equal alignments, some with multiple small alignments will now align as: GAAAAAAT GAA----Tor any alignment with a 4 long gap in preference over multiple small gaps.
Scores for aligned characters are specified by a similarity matrix. Here, is the similarity of characters a and b. It uses a linear gap penalty, here called .
For example, if the similarity matrix was
A | G | C | T | ||
---|---|---|---|---|---|
A | 10 | −1 | −3 | −4 | |
G | −1 | 7 | −5 | −3 | |
C | −3 | −5 | 9 | 0 | |
T | −4 | −3 | 0 | 8 |
then the alignment: AGACTAGTTAC CGA---GACGTwith a gap penalty of −5, would have the following score:
= −3 + 7 + 10 − (3 × 5) + 7 + (−4) + 0 + (−1) + 0 = 1
To find the alignment with the highest score, a two-dimensional array (or matrix) F is allocated. The entry in row i and column j is denoted here by
Fij
O(nm)
\Theta(min\{n,m\})
O(nm)
As the algorithm progresses, the
Fij
i=0,...c,n
j=0,...c,m
F0j=d*j
Fi0=d*i
Fij=max(Fi-1,j-1+S(Ai,Bj), Fi,j-1+d, Fi-1,j+d)
The pseudo-code for the algorithm to compute the F matrix therefore looks like this: d ← Gap penalty score for i = 0 to length(A) F(i,0) ← d * i for j = 0 to length(B) F(0,j) ← d * j for i = 1 to length(A) for j = 1 to length(B) Once the F matrix is computed, the entry
Fnm
Ai
Bj
Ai
Bj
Computing the score
Fij
O(1)
n
m
O(mn)
O(mn/logn)
n x m
O(mn).
The original purpose of the algorithm described by Needleman and Wunsch was to find similarities in the amino acid sequences of two proteins.
Needleman and Wunsch describe their algorithm explicitly for the case when the alignment is penalized solely by the matches and mismatches, and gaps have no penalty (d=0). The original publication from 1970 suggests the recursion
Fij=maxh<i,k<j\{Fh,j-1+S(Ai,Bj),Fi-1,k+S(Ai,Bj)\}
The corresponding dynamic programming algorithm takes cubic time. The paper also points out that the recursion can accommodate arbitrary gap penalization formulas:
A penalty factor, a number subtracted for every gap made, may be assessed as a barrier to allowing the gap. The penalty factor could be a function of the size and/or direction of the gap. [page 444]
A better dynamic programming algorithm with quadratic running time for the same problem (no gap penalty) was introduced later[5] by David Sankoff in 1972.Similar quadratic-time algorithms were discovered independentlyby T. K. Vintsyuk[6] in 1968 for speech processing("time warping"), and by Robert A. Wagner and Michael J. Fischer[7] in 1974 for string matching.
Needleman and Wunsch formulated their problem in terms of maximizing similarity. Another possibility is to minimize the edit distance between sequences, introduced by Vladimir Levenshtein. Peter H. Sellers showed[8] in 1974 that the two problems are equivalent.
The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. However, the algorithm is expensive with respect to time and space, proportional to the product of the length of two sequences and hence is not suitable for long sequences.
Recent development has focused on improving the time and space cost of the algorithm while maintaining quality. For example, in 2013, a Fast Optimal Global Sequence Alignment Algorithm (FOGSAA),[9] suggested alignment of nucleotide/protein sequences faster than other optimal global alignment methods, including the Needleman–Wunsch algorithm. The paper claims that when compared to the Needleman–Wunsch algorithm, FOGSAA achieves a time gain of 70–90% for highly similar nucleotide sequences (with > 80% similarity), and 54–70% for sequences having 30–80% similarity.
See main article: Computer stereo vision. Stereo matching is an essential step in the process of 3D reconstruction from a pair of stereo images. When images have been rectified, an analogy can be drawn between aligning nucleotide and protein sequences and matching pixels belonging to scan lines, since both tasks aim at establishing optimal correspondence between two strings of characters.
Although in many applications image rectification can be performed, e.g. by camera resectioning or calibration, it is sometimes impossible or impractical since the computational cost of accurate rectification models prohibit their usage in real-time applications. Moreover, none of these models is suitable when a camera lens displays unexpected distortions, such as those generated by raindrops, weatherproof covers or dust. By extending the Needleman–Wunsch algorithm, a line in the 'left' image can be associated to a curve in the 'right' image by finding the alignment with the highest score in a three-dimensional array (or matrix). Experiments demonstrated that such extension allows dense pixel matching between unrectified or distorted images.[10]