Probalign Explained

Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.[1] Base pair probabilities are estimated using an estimate similar to Boltzmann distribution. The partition function is calculated using a dynamic programming approach.

Algorithm

The following describes the algorithm used by probalign to determine the base pair probabilities.[2]

Alignment score

To score an alignment of two sequences two things are needed:

\sigma(x,y)

(e.g. PAM, BLOSUM,...)

g(k)=\alpha+\betak

The score

S(a)

of an alignment a is defined as:

S(a)=

\sum
xi-yj\ina

\sigma(xi,yj)+gapcost

Now the boltzmann weighted score of an alignment a is:

S(a)
T
e

=

\sum\sigma(xi,yj)+gapcost
xi-yj\ina
T
e

=\left(

\prod
xi-yi\ina
\sigma(xi,yj)
T
e

\right)

gapcost
T
e

Where

T

is a scaling factor.

The probability of an alignment assuming boltzmann distribution is given by

Pr[a|x,y]=

S(a)
T
e
Z

Where

Z

is the partition function, i.e. the sum of the boltzmann weights of all alignments.

Dynamic programming

Let

Zi,j

denote the partition function of the prefixes

x0,x1,...,xi

and

y0,y1,...,yj

. Three different cases are considered:
M
Z
i,j

:

the partition function of all alignments of the two prefixes that end in a match.
I
Z
i,j

:

the partition function of all alignments of the two prefixes that end in an insertion

(-,yj)

.
D
Z
i,j

:

the partition function of all alignments of the two prefixes that end in a deletion

(xi,-)

.Then we have:

Zi,j=

M
Z
i,j

+

D
Z
i,j

+

I
Z
i,j

Initialization

The matrixes are initialized as follows:

M
Z
0,j

=

M
Z
i,0

=0

M
Z
0,0

=1

D
Z
0,j

=0

I
Z
i,0

=0

Recursion

The partition function for the alignments of two sequences

x

and

y

is given by

Z|x|,|y|

, which can be recursively computed:
M
Z
i,j

=Zi-1,j-1

\sigma(xi,yj)
T
e
D
Z
i,j

=

D
Z
i-1,j

\beta
T
e

+

M
Z
i-1,j

g(1)
T
e

+

I
Z
i-1,j

g(1)
T
e
I
Z
i,j
analogously

Base pair probability

Finally the probability that positions

xi

and

yj

form a base pair is given by:

P(xi-yj|x,y)=

Z
\sigma(xi,yj)
T
e
Z'i',j'
i-1,j-1
Z|x|,|y|

Z',i',j'

are the respective values for the recalculated

Z

with inversed base pair strings.

See also

External links

Notes and References

  1. U. Roshan and D. R. Livesay, Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics, 22(22):2715-21, 2006 (PDF)
  2. http://www.bioinf.uni-freiburg.de//Lehre/Courses/2011_WS/V_BioinfoII/probalign-partition-func.pdf Lecture "Bioinformatics II" at University of Freiburg