In statistical genetics, Felsenstein's tree-pruning algorithm (or Felsenstein's tree-peeling algorithm), attributed to Joseph Felsenstein, is an algorithm for efficiently computing the likelihood of an evolutionary tree from nucleic acid sequence data. [1] [2]
The algorithm is often used as a subroutine in a search for a maximum likelihood estimate for an evolutionary tree. Further, it can be used in a hypothesis test for whether evolutionary rates are constant (by using likelihood ratio tests). It can also be used to provide error estimates for the parameters describing an evolutionary tree.
The likelihood of a tree
T
D
D
n
s
P(D|T)
Here is an example of an evolutionary tree on arbitrary sequence data
D
This is a key value and is often quite complicated to compute. To ease the computations, Felsenstein and his colleagues used several assumptions that are still widely used today. The main assumption is that mutations between DNA sites are independent of each other. This permits to compute the likelihood as a simple product of probabilities. Now you can divide the data
D
Ds
s
D
P(D|T)=
n | |
\prod | |
s=1 |
{P(Ds|T)}
If I reuse the example above,
D1
The second assumption concerns the models of DNA sequence evolution. The computation of the likelihood needs the nucleotide frequencies as well as the transition probabilities (when a mutation occurs, probability of going from a nucleotide to another). The simplest model is the Jukes-Cantor model, assuming equal nucleotide frequencies
\left(\piA=\piG=\piC=\piT={1\over4}\right)
X
Y
pA=pA=pA=
1 | |
4 |
(1-e-\mu)
pA=e-\mu+
1 | |
4 |
(1-e-\mu)
\mu
Felsenstein proposed to decomposed computations even more by using "partial likelihoods" in the computation of
P(Ds|T)
Assume we are on a node
k
T
k
i
j
X=\{A,T,C,G\}
k
wk(X)
wk(X)=(\sumYpX\centerdotwi(Y))\centerdot(\sumZpX\centerdotwj(Z))
where
Y
Z
p
X
Y
pX
wi(Y)
i
Y
wj(Z)
Using this formula, one has to start from the tips of the tree
T
P(Ds|T)=\sumXpX\centerdotwr(X)
After doing so for every site
s