Probalign Explained

Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.^[1] The base pair probabilities, i.e. the posterior probabilities that two positions are aligned with each other, are estimated from a distribution similar to the Boltzmann distribution. The partition function is calculated using a dynamic programming approach.

Algorithm

The following describes the algorithm used by Probalign to determine the base pair probabilities.^[2]

Alignment score

To score an alignment of two sequences, two things are needed:

a similarity function $\sigma(x,y)$ (e.g. PAM, BLOSUM, ...)

an affine gap penalty $g(k)=\alpha+\beta k$, where $k$ is the length of the gap

The score $S(a)$ of an alignment $a$ is defined as:

$$S(a)=\sum_{x_i-y_j\in a}\sigma(x_i,y_j)+\text{gapcost}$$
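As a rough illustration of the scoring rule above, the following Python sketch scores a gapped alignment given as two equal-length rows. The function names, the toy similarity function, and the parameter values are assumptions made for this example and are not taken from Probalign. Gap penalties are assumed to be negative numbers, so that adding the gap cost lowers the score.

```python
def gap_penalty(k, alpha, beta):
    """Affine gap penalty g(k) = alpha + beta * k for a gap of length k."""
    return alpha + beta * k


def toy_sigma(a, b):
    """Hypothetical similarity function: +1 for identical symbols, -1 otherwise."""
    return 1 if a == b else -1


def alignment_score(row_x, row_y, sigma, alpha, beta):
    """Score an alignment written as two rows of equal length, '-' marking a gap."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have the same length")
    score = 0.0
    gap_len = 0      # length of the gap run currently being read
    gap_row = None   # 'x' or 'y': the row that holds the '-' symbols of that run
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        current = "x" if a == "-" else ("y" if b == "-" else None)
        if current != gap_row and gap_len:
            # The previous gap run has ended: charge g(k) for its full length.
            score += gap_penalty(gap_len, alpha, beta)
            gap_len = 0
        if current is None:
            score += sigma(a, b)   # aligned pair x_i - y_j
        else:
            gap_len += 1
        gap_row = current
    if gap_len:
        score += gap_penalty(gap_len, alpha, beta)
    return score


# Two matches (+1 each), one gap of length 2 (-4) and one gap of length 1 (-3).
print(alignment_score("ACGT-", "A--TT", toy_sigma, alpha=-2, beta=-1))  # -5.0
```

Charging each gap run once with g(k) keeps the example close to the formula; an implementation could equally charge an opening cost on the first gap column and an extension cost on each later one.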

The Boltzmann-weighted score of an alignment $a$ is then:

$$e^{\frac{S(a)}{T}}=e^{\frac{\sum_{x_i-y_j\in a}\sigma(x_i,y_j)+\text{gapcost}}{T}}=\left(\prod_{x_i-y_j\in a}e^{\frac{\sigma(x_i,y_j)}{T}}\right)\cdot e^{\frac{\text{gapcost}}{T}}$$

where $T$ is a scaling factor.
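The factorisation of the Boltzmann weight into one factor per aligned pair and one factor for the gaps can be checked numerically. The values below are hypothetical and chosen only for illustration.

```python
import math


def boltzmann_weight(score, T):
    """Boltzmann weight e^(score / T) of a score under scaling factor T."""
    return math.exp(score / T)


T = 5.0                          # hypothetical scaling factor
pair_scores = [4.0, 6.0, 5.0]    # hypothetical sigma values of the aligned pairs
gap_cost = -3.0                  # hypothetical total gap cost (negative = penalty)

# Weight of the whole alignment score S(a).
whole = boltzmann_weight(sum(pair_scores) + gap_cost, T)

# Product of the per-pair weights times the weight of the gap cost.
factored = math.prod(boltzmann_weight(s, T) for s in pair_scores) * boltzmann_weight(gap_cost, T)

assert math.isclose(whole, factored)
```

This product form is what allows the weights to be accumulated column by column in a dynamic programming recursion.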

The probability of an alignment, assuming a Boltzmann distribution, is given by

$$Pr[a|x,y]=\frac{e^{\frac{S(a)}{T}}}{Z}$$

where $Z$ is the partition function, i.e. the sum of the Boltzmann weights of all alignments.
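For very short sequences the partition function can be obtained by brute force, which makes the definition concrete. The sketch below enumerates every alignment, scores it with an affine gap penalty, and normalises the Boltzmann weights. All names and parameter values are assumptions for illustration. The enumeration grows exponentially with sequence length, so it is only usable on toy inputs.

```python
import math


def enumerate_alignments(x, y):
    """Yield every alignment of x and y as a list of columns (a, b); '-' is a gap."""
    if not x and not y:
        yield []
        return
    if x and y:   # match column (x_i, y_j)
        for rest in enumerate_alignments(x[1:], y[1:]):
            yield [(x[0], y[0])] + rest
    if x:         # deletion column (x_i, -)
        for rest in enumerate_alignments(x[1:], y):
            yield [(x[0], "-")] + rest
    if y:         # insertion column (-, y_j)
        for rest in enumerate_alignments(x, y[1:]):
            yield [("-", y[0])] + rest


def score(columns, sigma, alpha, beta):
    """S(a): similarity of aligned pairs plus g(k) = alpha + beta * k per gap run."""
    total, prev = 0.0, None
    for a, b in columns:
        kind = "D" if b == "-" else ("I" if a == "-" else "M")
        if kind == "M":
            total += sigma(a, b)
        else:
            # First column of a gap run costs alpha + beta, each further one beta.
            total += beta if kind == prev else alpha + beta
        prev = kind
    return total


def alignment_probabilities(x, y, sigma, alpha, beta, T):
    """Return (Z, [(alignment, Pr[a | x, y]), ...]) by exhaustive enumeration."""
    alignments = list(enumerate_alignments(x, y))
    weights = [math.exp(score(a, sigma, alpha, beta) / T) for a in alignments]
    Z = sum(weights)
    return Z, [(a, w / Z) for a, w in zip(alignments, weights)]


def toy_sigma(a, b):
    """Hypothetical similarity function: 2 for identical symbols, -1 otherwise."""
    return 2.0 if a == b else -1.0


Z, probs = alignment_probabilities("AC", "AC", toy_sigma, alpha=-2.0, beta=-1.0, T=1.0)
best, p_best = max(probs, key=lambda item: item[1])
print(best, p_best)
```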

Dynamic programming

Let $Z_{i,j}$ denote the partition function of the prefixes $x_0,x_1,...,x_i$ and $y_0,y_1,...,y_j$. Three different cases are considered:

$Z^{M}_{i,j}$: the partition function of all alignments of the two prefixes that end in a match.

$Z^{I}_{i,j}$: the partition function of all alignments of the two prefixes that end in an insertion $(-,y_j)$.

$Z^{D}_{i,j}$: the partition function of all alignments of the two prefixes that end in a deletion $(x_i,-)$.

Then we have:

$$Z_{i,j}=Z^{M}_{i,j}+Z^{D}_{i,j}+Z^{I}_{i,j}$$

Initialization

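The matrices are initialized as follows:

$$Z^{M}_{0,j}=Z^{M}_{i,0}=0 \quad (i,j>0)$$

$$Z^{M}_{0,0}=1$$

$$Z^{D}_{0,j}=0$$

$$Z^{I}_{i,0}=0$$

Only the empty alignment of two empty prefixes has weight 1. The other entries are 0 because no alignment of the required form exists for them.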

Recursion

The partition function for the alignments of two sequences $x$ and $y$ is given by $Z_{|x|,|y|}$, which can be computed recursively:

$$Z^{M}_{i,j}=Z_{i-1,j-1}\cdot e^{\frac{\sigma(x_i,y_j)}{T}}$$

$$Z^{D}_{i,j}=Z^{D}_{i-1,j}\cdot e^{\frac{\beta}{T}}+Z^{M}_{i-1,j}\cdot e^{\frac{g(1)}{T}}+Z^{I}_{i-1,j}\cdot e^{\frac{g(1)}{T}}$$

$Z^{I}_{i,j}$ is computed analogously.
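The recursion can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Probalign's actual implementation: the function name, the toy similarity function, and the parameter values are hypothetical, and the insertion case is read as the mirror image of the deletion case. Raw exponentials are used for clarity; for long sequences the values may overflow or underflow, so a practical version would probably need scaling or log-space arithmetic.

```python
import math


def partition_function(x, y, sigma, alpha, beta, T):
    """Return Z_{|x|,|y|} using the three matrices Z^M, Z^D and Z^I."""
    n, m = len(x), len(y)
    ZM = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZD = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZI = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZM[0][0] = 1.0   # the empty alignment of two empty prefixes

    gap_open = math.exp((alpha + beta) / T)   # e^(g(1)/T): first column of a gap
    gap_ext = math.exp(beta / T)              # e^(beta/T): each further gap column

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Z^M_{i,j} = Z_{i-1,j-1} * e^(sigma(x_i, y_j)/T);
                # the Python index i - 1 addresses x_i, and j - 1 addresses y_j.
                prev_total = ZM[i - 1][j - 1] + ZD[i - 1][j - 1] + ZI[i - 1][j - 1]
                ZM[i][j] = prev_total * math.exp(sigma(x[i - 1], y[j - 1]) / T)
            if i > 0:
                # Deletion (x_i, -): extend a deletion or open one after M or I.
                ZD[i][j] = ZD[i - 1][j] * gap_ext + (ZM[i - 1][j] + ZI[i - 1][j]) * gap_open
            if j > 0:
                # Insertion (-, y_j): assumed to mirror the deletion case.
                ZI[i][j] = ZI[i][j - 1] * gap_ext + (ZM[i][j - 1] + ZD[i][j - 1]) * gap_open

    return ZM[n][m] + ZD[n][m] + ZI[n][m]


def toy_sigma(a, b):
    """Hypothetical similarity function: 2 for identical symbols, -1 otherwise."""
    return 2.0 if a == b else -1.0


print(partition_function("ACGT", "AGT", toy_sigma, alpha=-2.0, beta=-1.0, T=1.0))
```

The table is filled in O(|x|·|y|) time, in contrast to the exponential number of alignments that the partition function sums over.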

Base pair probability

Finally, the probability that positions $x_i$ and $y_j$ form a base pair is given by:

$$P(x_i-y_j|x,y)=\frac{Z_{i-1,j-1}\cdot e^{\frac{\sigma(x_i,y_j)}{T}}\cdot Z'_{i',j'}}{Z_{|x|,|y|}}$$

$Z'$, $i'$, $j'$ are the respective values for $Z$ recalculated with the reversed sequences.
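A possible way to evaluate this formula is to run the same dynamic program twice, once on the sequences and once on their reversals. The sketch below assumes that $i'=|x|-i$ and $j'=|y|-j$, so that $Z'_{i',j'}$ covers the parts of both sequences that follow the pair; this reading of the primed indices is an assumption, as are all function names and parameter values.

```python
import math


def forward_totals(x, y, sigma, alpha, beta, T):
    """Return the table Z[i][j] = Z^M + Z^D + Z^I for every pair of prefixes."""
    n, m = len(x), len(y)
    ZM = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZD = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZI = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZM[0][0] = 1.0
    gap_open = math.exp((alpha + beta) / T)   # e^(g(1)/T)
    gap_ext = math.exp(beta / T)              # e^(beta/T)
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                prev_total = ZM[i - 1][j - 1] + ZD[i - 1][j - 1] + ZI[i - 1][j - 1]
                ZM[i][j] = prev_total * math.exp(sigma(x[i - 1], y[j - 1]) / T)
            if i > 0:
                ZD[i][j] = ZD[i - 1][j] * gap_ext + (ZM[i - 1][j] + ZI[i - 1][j]) * gap_open
            if j > 0:
                ZI[i][j] = ZI[i][j - 1] * gap_ext + (ZM[i][j - 1] + ZD[i][j - 1]) * gap_open
    return [[ZM[i][j] + ZD[i][j] + ZI[i][j] for j in range(m + 1)] for i in range(n + 1)]


def match_posteriors(x, y, sigma, alpha, beta, T):
    """Return P with P[i-1][j-1] = P(x_i - y_j | x, y) for 1-based i and j."""
    n, m = len(x), len(y)
    Z = forward_totals(x, y, sigma, alpha, beta, T)
    # Z' is the same table computed on the reversed sequences.
    Zr = forward_totals(x[::-1], y[::-1], sigma, alpha, beta, T)
    P = [[0.0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair_weight = math.exp(sigma(x[i - 1], y[j - 1]) / T)
            # Assumed index mapping: i' = |x| - i and j' = |y| - j.
            P[i - 1][j - 1] = Z[i - 1][j - 1] * pair_weight * Zr[n - i][m - j] / Z[n][m]
    return P


def toy_sigma(a, b):
    """Hypothetical similarity function: 2 for identical symbols, -1 otherwise."""
    return 2.0 if a == b else -1.0


for row in match_posteriors("ACGT", "AGT", toy_sigma, alpha=-2.0, beta=-1.0, T=1.0):
    print(["%.3f" % p for p in row])
```

Each position is paired with at most one position of the other sequence in any alignment, so every row and every column of the resulting matrix sums to at most 1. This gives a simple sanity check for such an implementation.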

External links

Probalign Webservice

Notes and References

U. Roshan and D. R. Livesay, Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics, 22(22):2715-21, 2006.
http://www.bioinf.uni-freiburg.de//Lehre/Courses/2011_WS/V_BioinfoII/probalign-partition-func.pdf Lecture "Bioinformatics II" at University of Freiburg