Genetic distance is a measure of the genetic divergence between species or between populations within a species, whether the distance measures time from common ancestor or degree of differentiation.[1] Populations with many similar alleles have small genetic distances. This indicates that they are closely related and have a recent common ancestor.
Genetic distance is useful for reconstructing the history of populations, such as the multiple human expansions out of Africa.[2] It is also used for understanding the origin of biodiversity. For example, the genetic distances between different breeds of domesticated animals are often investigated in order to determine which breeds should be protected to maintain genetic diversity.[3]
Life on Earth began with very simple unicellular organisms, which evolved into complex multicellular organisms over the course of more than three billion years.[4] A comprehensive tree of life representing all organisms that have ever lived is important for understanding how life evolved in response to the challenges organisms faced, and for anticipating how similar challenges might be met in the future. Evolutionary biologists have attempted to build evolutionary, or phylogenetic, trees encompassing as many organisms as the available resources allow. Fossil dating and molecular clocks are the two means of reconstructing the evolutionary history of living organisms. The fossil record is random and incomplete and does not provide a continuous chain of events, much as a movie with missing frames cannot tell its whole plot.[4]
Molecular clocks, on the other hand, are specific sequences of DNA, RNA or proteins (amino acids) that are used to determine, at the molecular level, the similarities and differences among species, to estimate the timeline of divergence,[5] and to trace back the common ancestor of species based on the mutation rates and sequence changes accumulated in those sequences.[5] The primary driver of evolution is mutation, or change in genes, and accounting for those changes over time gives the approximate genetic distance between species. These molecular clocks are fairly conserved across a range of species, mutate at an approximately constant rate, like a clock, and are calibrated against evolutionary events (fossil records). For example, the gene for alpha-globin (a constituent of hemoglobin) mutates at a rate of 0.56 per base pair per billion years.[5] The molecular clock can therefore fill the gaps left by missing fossil records.
In the genome of an organism, each gene is located at a specific place called the locus for that gene. Allelic variations at these loci cause phenotypic variation within species (e.g. hair colour, eye colour). However, most alleles do not have an observable impact on the phenotype. Within a population new alleles generated by mutation either die out or spread throughout the population. When a population is split into different isolated populations (by either geographical or ecological factors), mutations that occur after the split will be present only in the isolated population. Random fluctuation of allele frequencies also produces genetic differentiation between populations. This process is known as genetic drift. By examining the differences between allele frequencies between the populations and computing genetic distance, we can estimate how long ago the two populations were separated.[6]
Suppose a sequence of DNA, or a hypothetical gene, has a mutation rate of one base per 10 million years. Using this sequence, the divergence of two species, or the genetic distance between them, can be determined by counting the number of base pair differences between them. For example, in Figure 2 a difference of 4 bases in the hypothetical sequence between the two species would indicate, by the equation below, that they diverged 40 million years ago; if the 4 differences are instead split evenly between the two lineages (2 mutations each), their common ancestor would have lived at least 20 million years ago. Based on the molecular clock, the following equation can be used to calculate the time since divergence.
Number of mutations ÷ Mutations per year (rate of mutation) = Time since divergence
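The equation above can be evaluated directly. The following is a minimal sketch using the hypothetical clock from the text (one base change per 10 million years, 4 observed differences); the function name and parameter names are illustrative and not part of any library.

```python
def time_since_divergence(num_mutations: float, mutations_per_year: float) -> float:
    """Return years since divergence under a constant-rate molecular clock."""
    if mutations_per_year <= 0:
        raise ValueError("mutation rate must be positive")
    return num_mutations / mutations_per_year


# Hypothetical clock from the text: one base change per 10 million years.
rate_per_year = 1 / 10_000_000
years = time_since_divergence(4, rate_per_year)
print(f"Estimated time since divergence: {years:,.0f} years")
```

The sketch assumes a strictly constant rate; real molecular clocks are calibrated against fossil evidence, as described earlier.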
Recent advances in sequencing technology, together with the availability of comprehensive genomic databases and bioinformatics tools capable of storing and processing the colossal amounts of data that this technology generates, have tremendously improved evolutionary studies and the understanding of evolutionary relationships among species.[7] [8]
Different biomolecular markers, such as DNA, RNA and amino acid (protein) sequences, can be used to determine genetic distance.[9] [10]
The selection criteria[11] for an appropriate biomarker for genetic distance entail the following three steps:
The choice of variability depends on the intended outcome. For example, a very high level of variability is recommended for demographic studies and parentage analyses, medium to high variability for comparing distinct populations, and moderate to very low variability for phylogenetic studies. The genomic localization and ploidy of the marker are also important factors. For example, the gene copy numbe
The choice and examples of molecular markers for evolutionary biology studies.
Biodiversity level | Biological issue | Level of variability | Nature of information required | Examples of most used markers
Intra-population | Population structure, reproduction system | Medium to high | (N) codominant loci = (Multilocus) | Microsatellites, allozymes
Intra-population | Fingerprinting, parentage analysis | Very high | Codominant loci or numerous dominant loci | Microsatellites (RAPD, AFLP)
Intra-population | Demography | Medium to high | Allele frequency in samples taken at different times | Allozymes, microsatellites
Intra-population | Demographic history | Medium to high | Allele frequency + evolutionary relationships | Mt-DNA sequences
Inter-population | Phylogeography, definition of evolutionary significant units (population structure) | Medium to high | Allele frequency in each population | Allozymes, microsatellites (risk of size homoplasy)
Inter-population | Bio-conservation | Medium | Allele evolutionary relationships | Mt-DNA (if variable enough)
Inter-specific | Close species | ca. 1% per million years | No variability within species if possible | Sequences of Mt-DNA, ITS rDNA
Evolutionary forces such as mutation, genetic drift, natural selection, and gene flow drive the process of evolution and genetic diversity. All of these forces play a significant role in genetic distance within and among species.[17]
Different statistical measures exist that aim to quantify genetic divergence between populations or species. By using assumptions gained from experimental analysis of evolutionary forces, a model that more accurately suits a given experiment can be selected to study a genetic group. Additionally, comparing how well different metrics model certain population features, such as isolation, can identify metrics that are better suited to understanding newly studied groups.[18] The most commonly used genetic distance metrics are Nei's genetic distance,[6] the Cavalli-Sforza and Edwards measure,[19] and Reynolds, Weir and Cockerham's genetic distance.[20]
One of the most basic and straightforward distance measures is the Jukes-Cantor distance. This measure is constructed on the assumptions that no insertions or deletions occurred, that all substitutions are independent, and that each nucleotide change is equally likely. With these assumptions, we can obtain the following equation:[21]
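$$d_{AB} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} f_{AB}\right)$$
where $d_{AB}$ is the Jukes-Cantor distance between two sequences, A and B, and $f_{AB}$ is the dissimilarity between the two sequences.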
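The Jukes-Cantor formula can be sketched in a few lines. This example assumes two already aligned, gap-free sequences of equal length and takes the dissimilarity to be the fraction of sites that differ; the function name is illustrative.

```python
import math


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    # Dissimilarity: fraction of sites at which the two sequences differ.
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    f_ab = mismatches / len(seq_a)
    if f_ab >= 0.75:
        raise ValueError("dissimilarity of 0.75 or more: distance is undefined")
    return -0.75 * math.log(1 - (4 / 3) * f_ab)


# One difference in ten sites; the corrected distance is slightly above 0.1.
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
```

The corrected distance exceeds the raw proportion of differing sites because the model allows for multiple substitutions at the same site.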
In 1972, Masatoshi Nei published what came to be known as Nei's standard genetic distance. This distance has the nice property that if the rate of genetic change (amino acid substitution) is constant per year or generation then Nei's standard genetic distance (D) increases in proportion to divergence time. This measure assumes that genetic differences are caused by mutation and genetic drift.[6]
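$$D = -\ln \frac{\sum_\ell \sum_u X_u Y_u}{\sqrt{\left(\sum_\ell \sum_u X_u^2\right)\left(\sum_\ell \sum_u Y_u^2\right)}}$$
Here $X_u$ and $Y_u$ are the frequencies of allele $u$ in populations X and Y, respectively, and $\ell$ runs over the loci examined.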
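This distance can also be expressed in terms of the arithmetic mean of gene identity. Let $j_X$ be the probability that two members of population $X$ have the same allele at a particular locus, and let $j_Y$ be the corresponding probability in population $Y$. Also, let $j_{XY}$ be the probability that a member of $X$ and a member of $Y$ have the same allele. Now let $J_X$, $J_Y$ and $J_{XY}$ represent the arithmetic means of $j_X$, $j_Y$ and $j_{XY}$ over all loci, respectively. In other words,
$$J_X = \sum_\ell \sum_u \frac{X_u^2}{L}$$
$$J_Y = \sum_\ell \sum_u \frac{Y_u^2}{L}$$
$$J_{XY} = \sum_\ell \sum_u \frac{X_u Y_u}{L}$$
where $L$ is the total number of loci examined.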
Nei's standard distance can then be written as[6]
$$D = -\ln \frac{J_{XY}}{\sqrt{J_X J_Y}}$$
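As an illustration, Nei's standard distance can be computed from the mean gene identities. The sketch below assumes each population is given as a list with one dictionary per locus, mapping allele labels to frequencies; this data layout and the function name are assumptions made for the example.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's standard distance D; pop_x, pop_y hold one {allele: frequency} dict per locus."""
    if len(pop_x) != len(pop_y) or not pop_x:
        raise ValueError("populations must cover the same non-empty list of loci")
    n_loci = len(pop_x)
    # Mean gene identities within X, within Y, and between X and Y.
    j_x = sum(f * f for locus in pop_x for f in locus.values()) / n_loci
    j_y = sum(f * f for locus in pop_y for f in locus.values()) / n_loci
    j_xy = sum(
        f * locus_y.get(allele, 0.0)
        for locus_x, locus_y in zip(pop_x, pop_y)
        for allele, f in locus_x.items()
    ) / n_loci
    if j_xy == 0:
        return math.inf  # no shared alleles at any locus
    return -math.log(j_xy / math.sqrt(j_x * j_y))


x = [{"A": 0.5, "a": 0.5}]
y = [{"A": 1.0}]
print(nei_standard_distance(x, y))
```

Two populations with identical allele frequencies give a distance of zero, and populations with no shared alleles give an infinite distance.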
In 1967 Luigi Luca Cavalli-Sforza and A. W. F. Edwards published the chord distance that bears their names. It assumes that genetic differences arise from genetic drift only. One major advantage of this measure is that the populations are represented in a hypersphere, the scale of which is one unit per gene substitution. The chord distance in the hyperdimensional sphere is given by[1] [19]
$$D_{CH} = \frac{2}{\pi} \sqrt{2\left(1 - \sum_\ell \sum_u \sqrt{X_u Y_u}\right)}$$
Some authors drop the factor $\frac{2}{\pi}$.
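The chord distance can be sketched for a single locus as follows. The single-locus restriction, the dictionary layout (allele label to frequency) and the function name are assumptions made for this example.

```python
import math


def chord_distance(freqs_x, freqs_y):
    """Cavalli-Sforza chord distance for one locus; inputs map allele -> frequency."""
    overlap = sum(
        math.sqrt(f * freqs_y.get(allele, 0.0)) for allele, f in freqs_x.items()
    )
    # max() guards against tiny negative values caused by floating-point rounding.
    return (2 / math.pi) * math.sqrt(max(0.0, 2 * (1 - overlap)))


print(chord_distance({"A": 0.5, "a": 0.5}, {"A": 1.0}))
```

Populations with identical frequencies give zero, while populations that share no alleles give the maximum single-locus value of $2\sqrt{2}/\pi$.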
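Reynolds, Weir and Cockerham's genetic distance[20] estimates the coancestry coefficient $\Theta$, which provides a measure of genetic divergence:
$$\Theta_w = \sqrt{\frac{\sum_\ell \sum_u (X_u - Y_u)^2}{2 \sum_\ell \left(1 - \sum_u X_u Y_u\right)}}$$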
The Kimura two parameter model (K2P) was developed in 1980 by Japanese biologist Motoo Kimura. It is compatible with the neutral theory of evolution, which was also developed by the same author. As depicted in Figure 4, this measure of genetic distance accounts for the type of mutation occurring, namely whether it is a transition (i.e. purine to purine or pyrimidine to pyrimidine) or a transversion (i.e. purine to pyrimidine or vice versa). With this information, the following formula can be derived:
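$$K = -\frac{1}{2} \ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$$
where P is $\frac{n_1}{n}$ and Q is $\frac{n_2}{n}$, with $n_1$ being the number of transition type differences, $n_2$ the number of transversion type differences, and $n$ the number of nucleotide sites compared.
It is worth noting that when transition and transversion type substitutions have an equal chance of occurring, and $P$ is assumed to equal $\frac{Q}{2}$, the above formula reduces to the Jukes-Cantor model. In practice, however, $P$ is typically larger than $Q$.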
It has been shown that while K2P works well in classifying distantly-related species, it is not always the best choice for comparing closely-related species. In these cases, it may be better to use p-distance instead.[24]
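The K2P distance can be sketched by counting transitions and transversions between two aligned sequences. This example assumes gap-free sequences of equal length containing only the letters A, C, G and T; the function and constant names are illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_a)
    q = transversions / len(seq_a)
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        raise ValueError("sequences too divergent: distance is undefined")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition (A->G) and one transversion (A->C) in ten sites.
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAC"))
```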
The Kimura three parameter (K3P) model was first published in 1981. This measure assumes three rates of substitution when nucleotides mutate, which can be seen in Figure 5. There is one rate for transition type mutations, one rate for transversion type mutations to corresponding bases (e.g. G to C; transversion type 1 in the figure), and one rate for transversion type mutations to non-corresponding bases (e.g. G to T; transversion type 2 in the figure).
With these rates of substitution, the following formula can be derived:
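$$K = -\frac{1}{4} \ln\left[(1 - 2P - 2Q)(1 - 2P - 2R)(1 - 2Q - 2R)\right]$$
where $P$ is the probability of a transition type mutation, $Q$ is the probability of a transversion type mutation to a corresponding base, and $R$ is the probability of a transversion type mutation to a non-corresponding base. When $Q$ and $R$ are assumed to be equal, this reduces to the Kimura two parameter distance.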
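A minimal sketch of the K3P formula, taking the three proportions as inputs rather than deriving them from sequences (the function name is illustrative). The final lines check numerically that equal transversion rates reproduce the two-parameter formula, in which the single transversion proportion is the sum of the two used here.

```python
import math


def k3p_distance(p: float, q: float, r: float) -> float:
    """Kimura three-parameter distance from the three substitution proportions."""
    terms = (1 - 2 * p - 2 * q, 1 - 2 * p - 2 * r, 1 - 2 * q - 2 * r)
    if any(t <= 0 for t in terms):
        raise ValueError("proportions too large: distance is undefined")
    return -0.25 * math.log(terms[0] * terms[1] * terms[2])


# With equal transversion proportions, K3P matches K2P with Q = q + r.
p, q, r = 0.1, 0.05, 0.05
k2p = -0.5 * math.log((1 - 2 * p - (q + r)) * math.sqrt(1 - 2 * (q + r)))
print(k3p_distance(p, q, r), k2p)
```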
Many other measures of genetic distance have been proposed with varying success.
Nei's DA distance was created in 1983 by Masatoshi Nei, a Japanese-American biologist. This distance assumes that genetic differences arise from mutation and genetic drift, and it is known to give more reliable population trees than other distances, particularly for microsatellite DNA data. The method is not ideal in cases where natural selection plays a significant role in a population's genetics.[26] [27]
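$$D_A = 1 - \sum_\ell \sum_u \frac{\sqrt{X_u Y_u}}{L}$$
$D_A$: Nei's DA distance, the genetic distance between populations X and Y
$\ell$: A locus or gene studied, with $\sum_\ell$ being the sum over loci or genes
$X_u$ and $Y_u$: The frequencies of allele $u$ in populations X and Y, respectively
$L$: The total number of loci examined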
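A short sketch of the DA formula, assuming each population is a list with one dictionary per locus that maps allele labels to frequencies (the layout and function name are assumptions for illustration).

```python
import math


def nei_da_distance(pop_x, pop_y):
    """Nei's DA distance; pop_x, pop_y hold one {allele: frequency} dict per locus."""
    if len(pop_x) != len(pop_y) or not pop_x:
        raise ValueError("populations must cover the same non-empty list of loci")
    total = sum(
        math.sqrt(f * locus_y.get(allele, 0.0))
        for locus_x, locus_y in zip(pop_x, pop_y)
        for allele, f in locus_x.items()
    )
    return 1 - total / len(pop_x)


print(nei_da_distance([{"A": 0.5, "a": 0.5}], [{"A": 1.0}]))
```

The result ranges from 0 for identical frequencies to 1 when the populations share no alleles at any locus.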
Euclidean distance is a formula that derives from Euclid's Elements, a 13-book set detailing the foundations of Euclidean mathematics. The foundational principles outlined in these works are used not only in Euclidean spaces; they were also expanded upon by Isaac Newton and Gottfried Leibniz, working independently, to create calculus.[28] The Euclidean distance formula is used to convey, as simply as possible, the genetic dissimilarity between populations, with a larger distance indicating greater dissimilarity.[29] As seen in Figure 6, this method can be visualized graphically, thanks to the work of René Descartes, who created the fundamental principle of analytic geometry, the Cartesian coordinate system. In an interesting example of historical repetition, Descartes was not the only one to discover the fundamental principle of analytic geometry: it was also discovered independently by Pierre de Fermat, who left his work unpublished.[30] [31]
See main article: Euclidean distance.
$$D_{EU} = \sqrt{\sum_u (X_u - Y_u)^2}$$
$D_{EU}$: Euclidean genetic distance between populations X and Y
$X_u$ and $Y_u$: The frequencies of allele $u$ in populations X and Y, respectively
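The Euclidean genetic distance is straightforward to compute. This sketch assumes the two populations are given as equal-length lists of allele frequencies in the same allele order; the function name is illustrative.

```python
import math


def euclidean_genetic_distance(freqs_x, freqs_y):
    """Euclidean distance between two allele-frequency vectors."""
    if len(freqs_x) != len(freqs_y):
        raise ValueError("frequency vectors must list the same alleles")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(freqs_x, freqs_y)))


print(euclidean_genetic_distance([0.5, 0.5], [1.0, 0.0]))
```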
The Goldstein distance was specifically developed for microsatellite markers and is based on the stepwise-mutation model (SMM). The formula is constructed so that its expected value increases linearly with time, a property that is maintained even when the assumptions of single-step mutations and a symmetrical mutation rate are violated. Goldstein distance is derived from the average square distance model, to which Goldstein was also a contributor.[32]
$$(\delta\mu)^2 = \sum_\ell \frac{(\mu_X - \mu_Y)^2}{L}$$
$(\delta\mu)^2$: Goldstein genetic distance between populations X and Y
$\mu_X$ and $\mu_Y$: Mean allele sizes in populations X and Y
$L$: Total number of microsatellite loci examined
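A sketch of the Goldstein distance, assuming each population is given as one list of sampled allele sizes (repeat counts) per microsatellite locus; the data layout, the sample values and the function name are hypothetical.

```python
from statistics import mean


def goldstein_distance(sizes_x, sizes_y):
    """Goldstein distance; inputs hold one list of allele sizes per locus."""
    if len(sizes_x) != len(sizes_y) or not sizes_x:
        raise ValueError("populations must cover the same non-empty list of loci")
    # Squared difference in mean allele size, averaged over loci.
    squared_diffs = [(mean(x) - mean(y)) ** 2 for x, y in zip(sizes_x, sizes_y)]
    return sum(squared_diffs) / len(sizes_x)


# Two loci; the mean repeat counts differ by 2 at each locus.
pop_x = [[10, 12], [20, 20]]
pop_y = [[12, 14], [20, 24]]
print(goldstein_distance(pop_x, pop_y))
```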
Nei's minimum genetic distance represents the minimum number of codon differences per locus.[33] The measure is based on the assumption that genetic differences arise from mutation and genetic drift.[34]
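$$D_m = \frac{J_X + J_Y}{2} - J_{XY}$$
$D_m$: Minimum number of codon differences per locus
$J_X$: Average probability of two members of the X population having the same allele
$J_Y$: Average probability of two members of the Y population having the same allele
$J_{XY}$: Average probability of members of the X and Y populations having the same allele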
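The minimum distance can be sketched from the same gene identities used for Nei's standard distance. The example assumes one dictionary per locus mapping allele labels to frequencies; the layout and function name are illustrative.

```python
def nei_minimum_distance(pop_x, pop_y):
    """Nei's minimum distance; inputs hold one {allele: frequency} dict per locus."""
    if len(pop_x) != len(pop_y) or not pop_x:
        raise ValueError("populations must cover the same non-empty list of loci")
    n_loci = len(pop_x)
    j_x = sum(f * f for locus in pop_x for f in locus.values()) / n_loci
    j_y = sum(f * f for locus in pop_y for f in locus.values()) / n_loci
    j_xy = sum(
        f * locus_y.get(allele, 0.0)
        for locus_x, locus_y in zip(pop_x, pop_y)
        for allele, f in locus_x.items()
    ) / n_loci
    return (j_x + j_y) / 2 - j_xy


print(nei_minimum_distance([{"A": 0.5, "a": 0.5}], [{"A": 1.0}]))
```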
Similar to Euclidean distance, Czekanowski distance involves calculating the distance between points of allele frequency plotted on a graph. However, Czekanowski distance assumes that a direct path is not available and sums the sides of the triangle formed by the data points instead of finding the hypotenuse. The formula is nicknamed the Manhattan distance because its method resembles travel in the New York City borough of that name: Manhattan is mainly built on a grid system, requiring residents to make only 90-degree turns during travel, which parallels the thinking behind the formula.
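$$D_{Cz} = \frac{1}{2} |X_u - Y_u|$$
$$|X_u - Y_u| = |P_{Xx} - P_{Yx}| + |P_{Xy} - P_{Yy}|$$
$X_u$ and $Y_u$: The frequencies of allele $u$ in populations X and Y, respectively
$P_{Xx}$ and $P_{Yx}$: X-axis values of the plotted allele-frequency points for populations X and Y
$P_{Xy}$ and $P_{Yy}$: Y-axis values of the plotted allele-frequency points for populations X and Y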
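A sketch of the Czekanowski (Manhattan) distance, under the assumption that the absolute differences are summed over all alleles, as in the usual Manhattan distance, before being halved. The inputs are equal-length frequency lists in the same allele order, and the function name is illustrative.

```python
def czekanowski_distance(freqs_x, freqs_y):
    """Half the sum of absolute allele-frequency differences (Manhattan distance)."""
    if len(freqs_x) != len(freqs_y):
        raise ValueError("frequency vectors must list the same alleles")
    return 0.5 * sum(abs(x - y) for x, y in zip(freqs_x, freqs_y))


print(czekanowski_distance([0.5, 0.5], [1.0, 0.0]))
```

With frequencies that sum to 1 in each population, the halving keeps the result between 0 and 1.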
Similar to Czekanowski distance, Rogers' distance involves calculating the distance between points of allele frequency. However, this method takes the direct distance between the points.
$$D_R = \frac{1}{L} \sqrt{\frac{\sum_u (X_u - Y_u)^2}{2}}$$
$X_u$ and $Y_u$: The frequencies of allele $u$ in populations X and Y, respectively
$L$: Total number of microsatellite loci examined
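A sketch of Rogers' distance, assuming the square-root term is evaluated separately at each locus and then averaged over the L loci (for a single locus this coincides with the formula above). The per-locus dictionary layout and the function name are assumptions for illustration.

```python
import math


def rogers_distance(pop_x, pop_y):
    """Rogers' distance; inputs hold one {allele: frequency} dict per locus."""
    if len(pop_x) != len(pop_y) or not pop_x:
        raise ValueError("populations must cover the same non-empty list of loci")
    total = 0.0
    for locus_x, locus_y in zip(pop_x, pop_y):
        alleles = set(locus_x) | set(locus_y)
        squared = sum(
            (locus_x.get(a, 0.0) - locus_y.get(a, 0.0)) ** 2 for a in alleles
        )
        total += math.sqrt(squared / 2)
    return total / len(pop_x)


print(rogers_distance([{"A": 0.5, "a": 0.5}], [{"A": 1.0}]))
```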
While these formulas are quick and easy to calculate, they provide limited information. Their results do not account for the potential effects of the number of codon changes between populations, or of the separation time between populations.[36]
See main article: Fixation index.
A commonly used measure of genetic distance is the fixation index (FST), which varies between 0 and 1. A value of 0 indicates that two populations are genetically identical (minimal or no genetic differentiation between them), whereas a value of 1 indicates that two populations are genetically different (maximum genetic differentiation between them). No mutation is assumed. Large populations between which there is much migration, for example, tend to be little differentiated, whereas small populations between which there is little migration tend to be greatly differentiated. FST is a convenient measure of this differentiation, and as a result FST and related statistics are among the most widely used descriptive statistics in population and evolutionary genetics. But FST is more than a descriptive statistic and measure of genetic differentiation. FST is directly related to the variance in allele frequency among populations and, conversely, to the degree of resemblance among individuals within populations. If FST is small, the allele frequencies of the populations are very similar to one another; if it is large, they are very different.
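The text does not give a formula for FST. As one common illustration, the sketch below uses the heterozygosity-based form FST = (H_T - H_S) / H_T for a single biallelic locus in two equally sized populations. The formula choice, the equal-size assumption, and the names `h_s`, `h_t` and `fst_two_populations` are assumptions made for this example.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """FST for one biallelic locus; p1, p2 are the frequencies of one allele."""
    # Mean expected heterozygosity within the two populations.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population (equal sizes assumed).
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # both populations fixed for the same allele
    return (h_t - h_s) / h_t


print(fst_two_populations(0.2, 0.8))
```

Identical allele frequencies give 0, and populations fixed for different alleles give 1, matching the two extremes described above.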