A phylogenetic tree, phylogeny or evolutionary tree is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time.[1] [2] In other words, it is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. In evolutionary biology, all life on Earth is theoretically part of a single phylogenetic tree, indicating common ancestry. Phylogenetics is the study of phylogenetic trees. The main challenge is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of species or taxa. Computational phylogenetics (also phylogeny inference) focuses on the algorithms involved in finding optimal phylogenetic tree in the phylogenetic landscape.
Phylogenetic trees may be rooted or unrooted. In a rooted phylogenetic tree, each node with descendants represents the inferred most recent common ancestor of those descendants,[3] and the edge lengths in some trees may be interpreted as time estimates. Each node is called a taxonomic unit. Internal nodes are generally called hypothetical taxonomic units, as they cannot be directly observed. Trees are useful in fields of biology such as bioinformatics, systematics, and phylogenetics. Unrooted trees illustrate only the relatedness of the leaf nodes and do not require the ancestral root to be known or inferred.
The idea of a tree of life arose from ancient notions of a ladder-like progression from lower into higher forms of life (such as in the Great Chain of Being). Early representations of "branching" phylogenetic trees include a "paleontological chart" showing the geological relationships among plants and animals in the book Elementary Geology, by Edward Hitchcock (first edition: 1840).
Charles Darwin featured a diagrammatic evolutionary "tree" in his 1859 book On the Origin of Species. Over a century later, evolutionary biologists still use tree diagrams to depict evolution because such diagrams effectively convey the concept that speciation occurs through the adaptive and semirandom splitting of lineages.
The term phylogenetic, or phylogeny, derives from the two ancient greek words, meaning "race, lineage", and, meaning "origin, source".[4] [5]
A rooted phylogenetic tree (see two graphics at top) is a directed tree with a unique node — the root — corresponding to the (usually imputed) most recent common ancestor of all the entities at the leaves of the tree. The root node does not have a parent node, but serves as the parent of all other nodes in the tree. The root is therefore a node of degree 2, while other internal nodes have a minimum degree of 3 (where "degree" here refers to the total number of incoming and outgoing edges).
The most common method for rooting trees is the use of an uncontroversial outgroup—close enough to allow inference from trait data or molecular sequencing, but far enough to be a clear outgroup. Another method is midpoint rooting, or a tree can also be rooted by using a non-stationary substitution model.[6]
Unrooted trees illustrate the relatedness of the leaf nodes without making assumptions about ancestry. They do not require the ancestral root to be known or inferred.[7] Unrooted trees can always be generated from rooted ones by simply omitting the root. By contrast, inferring the root of an unrooted tree requires some means of identifying ancestry. This is normally done by including an outgroup in the input data so that the root is necessarily between the outgroup and the rest of the taxa in the tree, or by introducing additional assumptions about the relative rates of evolution on each branch, such as an application of the molecular clock hypothesis.[8]
Both rooted and unrooted trees can be either bifurcating or multifurcating. A rooted bifurcating tree has exactly two descendants arising from each interior node (that is, it forms a binary tree), and an unrooted bifurcating tree takes the form of an unrooted binary tree, a free tree with exactly three neighbors at each internal node. In contrast, a rooted multifurcating tree may have more than two children at some nodes and an unrooted multifurcating tree may have more than three neighbors at some nodes.
Both rooted and unrooted trees can be either labeled or unlabeled. A labeled tree has specific values assigned to its leaves, while an unlabeled tree, sometimes called a tree shape, defines a topology only. Some sequence-based trees built from a small genomic locus, such as Phylotree,[9] feature internal nodes labeled with inferred ancestral haplotypes.
The number of possible trees for a given number of leaf nodes depends on the specific type of tree, but there are always more labeled than unlabeled trees, more multifurcating than bifurcating trees, and more rooted than unrooted trees. The last distinction is the most biologically relevant; it arises because there are many places on an unrooted tree to put the root. For bifurcating labeled trees, the total number of rooted trees is:
(2n-3)!!=
(2n-3)! | |
2n-2(n-2)! |
n\ge2
n
For bifurcating labeled trees, the total number of unrooted trees is:[10]
(2n-5)!!=
(2n-5)! | |
2n-3(n-3)! |
n\ge3
Among labeled bifurcating trees, the number of unrooted trees with
n
n-1
The number of rooted trees grows quickly as a function of the number of tips. For 10 tips, there are more than
34 x 106
1 | 1 | 1 | 0 | 1 |
2 | 1 | 1 | 0 | 1 |
3 | 1 | 3 | 1 | 4 |
4 | 3 | 15 | 11 | 26 |
5 | 15 | 105 | 131 | 236 |
6 | 105 | 945 | 1,807 | 2,752 |
7 | 945 | 10,395 | 28,813 | 39,208 |
8 | 10,395 | 135,135 | 524,897 | 660,032 |
9 | 135,135 | 2,027,025 | 10,791,887 | 12,818,912 |
10 | 2,027,025 | 34,459,425 | 247,678,399 | 282,137,824 |
A dendrogram is a general name for a tree, whether phylogenetic or not, and hence also for the diagrammatic representation of a phylogenetic tree.[12]
A cladogram only represents a branching pattern; i.e., its branch lengths do not represent time or relative amount of character change, and its internal nodes do not represent ancestors.[13]
A phylogram is a phylogenetic tree that has branch lengths proportional to the amount of character change.[14]
A chronogram is a phylogenetic tree that explicitly represents time through its branch lengths.[15]
A Dahlgrenogram is a diagram representing a cross section of a phylogenetic tree.
A phylogenetic network is not strictly speaking a tree, but rather a more general graph, or a directed acyclic graph in the case of rooted networks. They are used to overcome some of the limitations inherent to trees.
A spindle diagram, or bubble diagram, is often called a romerogram, after its popularisation by the American palaeontologist Alfred Romer.[16] It represents taxonomic diversity (horizontal width) against geological time (vertical axis) in order to reflect the variation of abundance of various taxa through time.A spindle diagram is not an evolutionary tree:[17] the taxonomic spindles obscure the actual relationships of the parent taxon to the daughter taxon[16] and have the disadvantage of involving the paraphyly of the parental group.[18] This type of diagram is no longer used in the form originally proposed.[18]
Darwin[19] also mentioned that the coral may be a more suitable metaphor than the tree. Indeed, phylogenetic corals are useful for portraying past and present life, and they have some advantages over trees (anastomoses allowed, etc.).[18]
See main article: Computational phylogenetics. Phylogenetic trees composed with a nontrivial number of input sequences are constructed using computational phylogenetics methods. Distance-matrix methods such as neighbor-joining or UPGMA, which calculate genetic distance from multiple sequence alignments, are simplest to implement, but do not invoke an evolutionary model. Many sequence alignment methods such as ClustalW also create trees by using the simpler algorithms (i.e. those based on distance) of tree construction. Maximum parsimony is another simple method of estimating phylogenetic trees, but implies an implicit model of evolution (i.e. parsimony). More advanced methods use the optimality criterion of maximum likelihood, often within a Bayesian framework, and apply an explicit model of evolution to phylogenetic tree estimation. Identifying the optimal tree using many of these techniques is NP-hard, so heuristic search and optimization methods are used in combination with tree-scoring functions to identify a reasonably good tree that fits the data.
Tree-building methods can be assessed on the basis of several criteria:[20]
Tree-building techniques have also gained the attention of mathematicians. Trees can also be built using T-theory.[21]
Trees can be encoded in a number of different formats, all of which must represent the nested structure of a tree. They may or may not encode branch lengths and other features. Standardized formats are critical for distributing and sharing trees without relying on graphics output that is hard to import into existing software. Commonly used formats are
Although phylogenetic trees produced on the basis of sequenced genes or genomic data in different species can provide evolutionary insight, these analyses have important limitations. Most importantly, the trees that they generate are not necessarily correct – they do not necessarily accurately represent the evolutionary history of the included taxa. As with any scientific result, they are subject to falsification by further study (e.g., gathering of additional data, analyzing the existing data with improved methods). The data on which they are based may be noisy;[22] the analysis can be confounded by genetic recombination,[23] horizontal gene transfer,[24] hybridisation between species that were not nearest neighbors on the tree before hybridisation takes place, and conserved sequences.
Also, there are problems in basing an analysis on a single type of character, such as a single gene or protein or only on morphological analysis, because such trees constructed from another unrelated data source often differ from the first, and therefore great care is needed in inferring phylogenetic relationships among species. This is most true of genetic material that is subject to lateral gene transfer and recombination, where different haplotype blocks can have different histories. In these types of analysis, the output tree of a phylogenetic analysis of a single gene is an estimate of the gene's phylogeny (i.e. a gene tree) and not the phylogeny of the taxa (i.e. species tree) from which these characters were sampled, though ideally, both should be very close. For this reason, serious phylogenetic studies generally use a combination of genes that come from different genomic sources (e.g., from mitochondrial or plastid vs. nuclear genomes),[25] or genes that would be expected to evolve under different selective regimes, so that homoplasy (false homology) would be unlikely to result from natural selection.
When extinct species are included as terminal nodes in an analysis (rather than, for example, to constrain internal nodes), they are considered not to represent direct ancestors of any extant species. Extinct species do not typically contain high-quality DNA.
The range of useful DNA materials has expanded with advances in extraction and sequencing technologies. Development of technologies able to infer sequences from smaller fragments, or from spatial patterns of DNA degradation products, would further expand the range of DNA considered useful.
Phylogenetic trees can also be inferred from a range of other data types, including morphology, the presence or absence of particular types of genes, insertion and deletion events – and any other observation thought to contain an evolutionary signal.
Phylogenetic networks are used when bifurcating trees are not suitable, due to these complications which suggest a more reticulate evolutionary history of the organisms sampled.