ProbCons explained

ProbCons is an open-source program for probabilistic consistency-based multiple alignment of amino acid sequences. It is regarded as one of the most accurate protein multiple sequence alignment programs, having repeatedly demonstrated a statistically significant advantage in accuracy over similar tools, including Clustal and MAFFT.^[1] ^[2]

Algorithm

The following describes the basic outline of the ProbCons algorithm.^[3]

Step 1: Reliability of an alignment edge

For every pair of sequences x and y, compute the probability that letters x_i and y_j are paired in a^*, an alignment that is generated by the model.

\begin{align}
P(x_i \sim y_j \mid x, y) &\stackrel{def}{=} \Pr[x_i \sim y_j \text{ in some } a \mid x, y] \\
&= \sum_{\text{alignment } a \text{ with } x_i \sim y_j} \Pr[a \mid x, y] \\
&= \sum_{\text{alignment } a} \mathbf{1}\{x_i \sim y_j \in a\} \Pr[a \mid x, y]
\end{align}

(Where \mathbf{1}\{x_i \sim y_j \in a\} is equal to 1 if x_i and y_j are aligned to each other in a, and 0 otherwise.)
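The definition above can be illustrated with a toy enumeration. The following is a minimal sketch, assuming the distribution over alignments is given explicitly as a short list of (probability, aligned pairs) entries; the function name and the input format are hypothetical and chosen only for illustration. Real sequences admit far too many alignments to enumerate, so this shows the meaning of the sum and is not a practical method.

```python
from collections import defaultdict


def posterior_match_probabilities(alignments):
    """Sum Pr[a | x, y] over every alignment a that contains the pair (i, j).

    `alignments` is a list of (probability, pairs) tuples, where `pairs` is a
    set of 0-based (i, j) index pairs meaning "x_i is aligned to y_j".
    This explicit input format is an assumption made for the example.
    """
    posterior = defaultdict(float)
    for prob, pairs in alignments:
        for pair in pairs:
            # The indicator 1{x_i ~ y_j in a} is 1 exactly for the pairs in a.
            posterior[pair] += prob
    return dict(posterior)


# Toy distribution over three alignments of two length-2 sequences.
toy_alignments = [
    (0.5, {(0, 0), (1, 1)}),
    (0.3, {(0, 0)}),
    (0.2, {(0, 1)}),
]
toy_posterior = posterior_match_probabilities(toy_alignments)
```

Here the pair (0, 0) appears in the first two alignments, so its posterior probability is the sum of their probabilities.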

Step 2: Maximum expected accuracy

The accuracy of an alignment a^* with respect to another alignment a is defined as the number of common aligned pairs divided by the length of the shorter sequence.

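Calculate the expected accuracy of a candidate alignment a^* under the distribution over alignments a:

\begin{align}
E_{\Pr[a \mid x, y]}(acc(a^*, a)) &= \sum_a \Pr[a \mid x, y] \, acc(a^*, a) \\
&= \frac{1}{\min(|x|, |y|)} \cdot \sum_a \sum_{x_i \sim y_j \in a^*} \mathbf{1}\{x_i \sim y_j \in a\} \Pr[a \mid x, y] \\
&= \frac{1}{\min(|x|, |y|)} \cdot \sum_{x_i \sim y_j \in a^*} P(x_i \sim y_j \mid x, y)
\end{align}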

This yields a maximum expected accuracy (MEA) alignment:

E(x, y) = \arg\max_{a^*} E_{\Pr[a \mid x, y]}(acc(a^*, a))
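Because the expected accuracy is a sum of posterior probabilities over the aligned pairs of a^*, the maximisation can be carried out with a Needleman-Wunsch style dynamic program in which matches score their posterior probability and gaps score zero. The sketch below assumes the posteriors are supplied as a dictionary keyed by 0-based (i, j) pairs and that both sequences are non-empty; the function name and input format are hypothetical.

```python
def mea_alignment(posterior, len_x, len_y):
    """Return (expected_accuracy, pairs) of a maximum expected accuracy alignment.

    `posterior` maps 0-based (i, j) pairs to P(x_i ~ y_j | x, y); missing
    pairs count as probability 0. Assumes len_x > 0 and len_y > 0.
    """
    # best[i][j] is the largest summed posterior over alignments of the
    # first i letters of x with the first j letters of y.
    best = [[0.0] * (len_y + 1) for _ in range(len_x + 1)]
    for i in range(1, len_x + 1):
        for j in range(1, len_y + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + posterior.get((i - 1, j - 1), 0.0),
                best[i - 1][j],  # x_i left unaligned
                best[i][j - 1],  # y_j left unaligned
            )

    # Trace back to recover one optimal set of aligned pairs.
    pairs = []
    i, j = len_x, len_y
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()

    # Normalise by the length of the shorter sequence, as in the definition.
    return best[len_x][len_y] / min(len_x, len_y), pairs
```

The traceback prefers gap moves on ties, so when several alignments share the maximal score only one of them is returned.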

Step 3: Probabilistic Consistency Transformation

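The match probabilities for all pairs of sequences x, y from the set of all sequences \mathcal{S} are now re-estimated using all intermediate sequences z:

P'(x_i \sim y_j \mid x, y) = \frac{1}{|\mathcal{S}|} \sum_{z \in \mathcal{S}} \sum_{1 \le k \le |z|} P(x_i \sim z_k \mid x, z) \cdot P(z_k \sim y_j \mid z, y)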

This step can be iterated.
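For a fixed intermediate sequence z, the inner sum over positions of z is a matrix product of the x-z and z-y posterior matrices, so the transformation can be sketched with plain matrix arithmetic. The code below is a minimal illustration under stated assumptions: sequences are identified by integer ids, posterior matrices are stored as lists of lists for id pairs (x, y) with x < y, and when z equals x or y the corresponding matrix is taken to be the identity. The function names and the data layout are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def identity(n):
    """n-by-n identity matrix, used for a sequence paired with itself (assumption)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def consistency_transform(posteriors, lengths):
    """Apply one round of the consistency transformation.

    `posteriors[(x, y)]` (with x < y) is the |x|-by-|y| matrix of match
    probabilities; `lengths[s]` is the length of sequence s.
    """
    ids = sorted(lengths)

    def matrix(a, b):
        if a == b:
            return identity(lengths[a])
        if (a, b) in posteriors:
            return posteriors[(a, b)]
        return transpose(posteriors[(b, a)])

    updated = {}
    for x, y in posteriors:
        total = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in ids:
            # Sum over positions k of z is the matrix product P_xz * P_zy.
            product = mat_mul(matrix(x, z), matrix(z, y))
            for i in range(lengths[x]):
                for j in range(lengths[y]):
                    total[i][j] += product[i][j]
        # Average over all sequences in the set.
        updated[(x, y)] = [[v / len(ids) for v in row] for row in total]
    return updated
```

Iterating the step corresponds to feeding the returned dictionary back into `consistency_transform`.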

Step 4: Computation of guide tree

Construct a guide tree by hierarchical clustering, using the MEA score as the sequence similarity score. The similarity between clusters is defined as a weighted average of the pairwise sequence similarities.
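A greedy agglomerative clustering of this kind can be sketched as follows. This is an illustration only: the text does not specify the weights, so the sketch assumes that the similarity between a merged cluster and any other cluster is the average of the two constituent similarities weighted by cluster size. The function name, the nested-tuple tree representation, and the input format are likewise hypothetical.

```python
def build_guide_tree(names, similarity):
    """Greedily merge the most similar clusters into a nested-tuple tree.

    `names` is a list of sequence names; `similarity[(i, j)]` with i < j is
    the similarity score (for example the MEA score) of names[i] and names[j].
    """
    trees = {i: name for i, name in enumerate(names)}  # cluster id -> subtree
    sizes = {i: 1 for i in trees}                      # cluster id -> leaf count
    sims = dict(similarity)
    next_id = len(names)

    while len(trees) > 1:
        a, b = max(sims, key=sims.get)  # most similar pair of clusters
        new = next_id
        next_id += 1
        for c in trees:
            if c in (a, b):
                continue
            s_a = sims.pop((min(a, c), max(a, c)))
            s_b = sims.pop((min(b, c), max(b, c)))
            # Size-weighted average (an assumption of this sketch).
            sims[(c, new)] = (sizes[a] * s_a + sizes[b] * s_b) / (sizes[a] + sizes[b])
        del sims[(a, b)]
        trees[new] = (trees.pop(a), trees.pop(b))
        sizes[new] = sizes.pop(a) + sizes.pop(b)

    return next(iter(trees.values()))
```

The returned tree lists the order in which sequences and groups would be merged during progressive alignment, with the most similar sequences joined first.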

Step 5: Compute MSA

Finally, compute the MSA using progressive alignment or iterative alignment.

Notes and References

1. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005). "PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment". Genome Research. 15 (2): 330–340. doi:10.1101/gr.2821705. PMID 15687296. PMC 546535.
2. Roshan, Usman (2014). "Multiple Sequence Alignment Using Probcons and Probalign". In Russell, David J. (ed.). Multiple Sequence Alignment Methods. Methods in Molecular Biology. Vol. 1079. Humana Press. pp. 147–153. doi:10.1007/978-1-62703-646-7_9. ISBN 9781627036450. PMID 24170400.
3. Lecture "Bioinformatics II" at University of Freiburg. http://www.bioinf.uni-freiburg.de//Lehre/Courses/2011_WS/V_BioinfoII/slides_probcons.pdf