In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of the Needleman–Wunsch algorithm that uses divide and conquer.[1] Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.
Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment. BLAST and FASTA are suboptimal heuristics. If
X
Y
\operatorname{length}(X)=n
\operatorname{length}(Y)=m
O(nm)
O(nm)
O(nm)
O(min\{n,m\})
The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that:[3]
(Z,W)=\operatorname{NW}(X,Y)
(X,Y)
X=Xl+Xr
X
Yl+Yr
Y
\operatorname{NW}(X,Y)=\operatorname{NW}(Xl,Yl)+\operatorname{NW}(Xr,Yr)
Xi
X
1\leqslanti\leqslant\operatorname{length}(X)
Xi:j
j-i+1
X
\operatorname{rev}(X)
X
X
Y
x
X
y
Y
\operatorname{Del}(x)
\operatorname{Ins}(y)
\operatorname{Sub}(x,y)
x
y
x
y
We define
\operatorname{NWScore}(X,Y)
Score(i,j)
function NWScore(X, Y) Score(0, 0) = 0 // 2 * (length(Y) + 1) array for j = 1 to length(Y) Score(0, j) = Score(0, j - 1) + Ins(Yj) for i = 1 to length(X) // Init array Score(1, 0) = Score(0, 0) + Del(Xi) for j = 1 to length(Y) scoreSub = Score(0, j - 1) + Sub(Xi, Yj) scoreDel = Score(0, j) + Del(Xi) scoreIns = Score(1, j - 1) + Ins(Yj) Score(1, j) = max(scoreSub, scoreDel, scoreIns) end // Copy Score[1] to Score[0] Score(0, :) = Score(1, :) end for j = 0 to length(Y) LastLine(j) = Score(1, j) return LastLine
Note that at any point,
\operatorname{NWScore}
\operatorname{NWScore}
O(min\{\operatorname{length}(X),\operatorname{length}(Y)\})
The Hirschberg algorithm follows:
function Hirschberg(X, Y) Z = "" W = "" if length(X)
0 for i = 1 to length(X) Z = Z + Xi W = W + '-' end else if length(X)
1 (Z, W) = NeedlemanWunsch(X, Y) else xlen = length(X) xmid = length(X) / 2 ylen = length(Y) ScoreL = NWScore(X1:xmid, Y) ScoreR = NWScore(rev(Xxmid+1:xlen), rev(Y)) ymid = arg max ScoreL + rev(ScoreR) (Z,W) = Hirschberg(X1:xmid, y1:ymid) + Hirschberg(Xxmid+1:xlen, Yymid+1:ylen) end return (Z, W)
In the context of observation (2), assume that
Xl+Xr
X
ymid
Yl=Y1:ymid
Yr=Yymid+1:\operatorname{length(Y)}
Let
\begin{align} X&=AGTACGCA,\\ Y&=TATGC,\\ \operatorname{Del}(x)&=-2,\\ \operatorname{Ins}(y)&=-2,\\ \operatorname{Sub}(x,y)&=\begin{cases}+2,&ifx=y\ -1,&ifx ≠ y.\end{cases} \end{align}
The optimal alignment is given by
W = AGTACGCA Z = --TATGC-
Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:
T A T G C 0 -2 -4 -6 -8 -10 A -2 -1 0 -2 -4 -6 G -4 -3 -2 -1 0 -2 T -6 -2 -4 0 -2 -1 A -8 -4 0 -2 -1 -3 C -10 -6 -2 -1 -3 1 G -12 -8 -4 -3 1 -1 C -14 -10 -6 -5 -1 3 A -16 -12 -8 -7 -3 1
One starts with the top level call to
\operatorname{Hirschberg}(AGTACGCA,TATGC)
X=AGTA+CGCA
\operatorname{NWScore}(AGTA,Y)
T A T G C 0 -2 -4 -6 -8 -10 A -2 -1 0 -2 -4 -6 G -4 -3 -2 -1 0 -2 T -6 -2 -4 0 -2 -1 A -8 -4 0 -2 -1 -3
Likewise,
\operatorname{NWScore}(\operatorname{rev}(CGCA),\operatorname{rev}(Y))
C G T A T 0 -2 -4 -6 -8 -10 A -2 -1 -3 -5 -4 -6 C -4 0 -2 -4 -6 -5 G -6 -2 2 0 -2 -4 C -8 -4 0 1 -1 -3
Their last lines (after reversing the latter) and sum of those are respectively
ScoreL = [-8 -4 0 -2 -1 -3 ] rev(ScoreR) = [-3 -1 1 0 -4 -8 ] Sum = [-11 -5 '''1''' -2 -5 -11]
The maximum (shown in bold) appears at ymid = 2
, producing the partition
Y=TA+TGC
The entire Hirschberg recursion (which we omit for brevity) produces the following tree:
(AGTACGCA,TATGC) / \ (AGTA,TA) (CGCA,TGC) / \ / \ (AG,) (TA,TA) (CG,TG) (CA,C) / \ / \ (T,T) (A,A) (C,T) (G,G)
The leaves of the tree contain the optimal alignment.