Ancestral sequence reconstruction (ASR) – also known as ancestral gene/sequence reconstruction/resurrection – is a technique used in the study of molecular evolution. The method uses related sequences to reconstruct an "ancestral" gene from a multiple sequence alignment.
The method can be used to 'resurrect' ancestral proteins and was suggested in 1963 by Linus Pauling and Emile Zuckerkandl.[1] In the case of enzymes, this approach has been called paleoenzymology (British: palaeoenzymology). Some early efforts were made in the 1980s and 1990s, led by the laboratory of Steven A. Benner, showing the potential of this technique.[2] Thanks to improved algorithms and better sequencing and synthesis techniques, the method was developed further in the early 2000s to allow the resurrection of a greater variety of genes, including much more ancient ones.[3] Over the last decade, ancestral protein resurrection has developed as a strategy to reveal the mechanisms and dynamics of protein evolution.[4]
Conventional evolutionary and biochemical approaches study proteins by the so-called horizontal comparison of related protein homologues from different branch ends of the tree of life. ASR instead probes the statistically inferred ancestral proteins at the nodes of the tree, in a vertical manner (see diagram, right). This approach gives access to protein properties that may have transiently arisen over evolutionary time and has recently been used as a way to infer the potential selection pressures that resulted in present-day sequences. ASR has been used to probe the causative mutation that resulted in a protein's neofunctionalization after duplication: functional assays first establish that the mutation occurred between two successive ancestors (illustratively, between ancestors '5' and '4' on the diagram).[5] In the field of protein biophysics, ASR has also been used to study the development of a protein's thermodynamic and kinetic landscapes over evolutionary time, as well as protein folding pathways, in combination with modern analytical techniques such as hydrogen exchange mass spectrometry (HX/MS). These sorts of insights are typically inferred from several ancestors reconstructed along a phylogeny, that is, by studying nodes progressively higher in the tree of life (further and further back in evolutionary time).[6]
Most ASR studies are conducted in vitro and have revealed ancestral protein properties that seem to be evolutionarily desirable traits, such as increased thermostability, catalytic activity and catalytic promiscuity. These data have been attributed both to artifacts of the ASR algorithms and to genuine reflections of ancient Earth's environment; ASR research must therefore often be complemented with extensive controls (usually alternate ASR experiments) to mitigate algorithmic error. Not all studied ASR proteins exhibit this so-called 'ancestral superiority'.[7] The nascent field of 'evolutionary biochemistry' has been bolstered by the recent increase in ASR studies that use the ancestors to probe organismal fitness within certain cellular contexts, effectively testing ancestral proteins in vivo.[6] Very few in vivo ASR studies have been conducted because of inherent limitations: the lack of suitably ancient genomes into which to fit these ancestors, the small repertoire of well-characterized laboratory model systems, and the inability to mimic ancient cellular environments. Despite these obstacles, preliminary insights from a 2015 paper revealed that the 'ancestral superiority' observed in vitro for a given protein was not recapitulated in vivo.[8] ASR is one of the few ways to study the biochemistry of Precambrian life (>541 Ma) and is hence often used in 'paleogenetics'; indeed, Zuckerkandl and Pauling originally intended ASR to be the starting point of a field they termed 'Paleobiochemistry'.
Several related homologues of the protein of interest are selected and aligned in a multiple sequence alignment (MSA), and a 'phylogenetic tree' is constructed with statistically inferred sequences at the nodes of the branches. It is these sequences that are the so-called 'ancestors'; the process of synthesizing the corresponding DNA, transforming it into a cell and producing a protein is the so-called 'reconstruction'. Ancestral sequences are typically calculated by maximum likelihood, although Bayesian methods are also used. Because the ancestors are inferred from a phylogeny, the topology and composition of the phylogeny play a major role in the output ASR sequences. Given that there is much debate over how to construct phylogenies (for example, whether thermophilic bacteria are basal or derived in bacterial evolution), many ASR papers construct several phylogenies with differing topologies and hence differing ASR sequences. These sequences are then compared, and often several (~10) are expressed and studied per phylogenetic node. ASR does not claim to recreate the actual sequence of the ancient protein/DNA, but rather a sequence that is likely to be similar to the one that was indeed at the node. This is not considered a shortcoming of ASR, as it fits into the 'neutral network' model of protein evolution, whereby at evolutionary junctions (nodes) a population of genotypically different but phenotypically similar protein sequences existed in the organismal population of the time. Hence, it is possible that ASR would generate one of the sequences of a node's neutral network, and while it may not represent the genotype of the last common ancestor of the modern-day sequences, it does likely represent the phenotype.[6] This is supported by the modern-day observation that many mutations outside a protein's catalytic or functional site cause only minor changes in biophysical properties.
Hence, ASR allows one to probe the biophysical properties of past proteins and is indicative of ancient genetics.
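The statistical inference at a node can be illustrated for a single alignment column. The following is a minimal sketch only, assuming a nucleotide alphabet, the simple Jukes-Cantor substitution model and a made-up three-leaf tree with invented branch lengths; real studies use empirical amino acid models and dedicated phylogenetics software. The names `jc69_prob`, `conditional_likelihoods` and `root_posterior` are hypothetical.

```python
import math

ALPHABET = "ACGT"

def jc69_prob(a, b, t):
    """Jukes-Cantor probability that state a becomes b along a branch of
    length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def conditional_likelihoods(node):
    """Pruning step: probability of the leaves below `node` given each
    possible state at `node`."""
    if isinstance(node, str):  # a leaf holds its observed base
        return {s: 1.0 if s == node else 0.0 for s in ALPHABET}
    result = {s: 1.0 for s in ALPHABET}
    for child, branch_length in node:
        child_like = conditional_likelihoods(child)
        for s in ALPHABET:
            result[s] *= sum(
                jc69_prob(s, c, branch_length) * child_like[c] for c in ALPHABET
            )
    return result

def root_posterior(tree):
    """Posterior probability of each state at the root, with a uniform prior."""
    like = conditional_likelihoods(tree)
    total = sum(0.25 * like[s] for s in ALPHABET)
    return {s: 0.25 * like[s] / total for s in ALPHABET}

# Hypothetical tree for one column: an internal node is a list of
# (child, branch_length) pairs; a leaf is the base observed in an extant sequence.
tree = [("A", 0.1), ([("A", 0.1), ("G", 0.1)], 0.1)]
post = root_posterior(tree)
best = max(post, key=post.get)  # most probable ancestral state at the root
print(best, round(post[best], 3))
```

Repeating this for every column and taking the most probable state at each one gives the inferred ancestral sequence for that node; the per-site probabilities also show which positions are poorly resolved.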
Maximum likelihood (ML) methods work by generating a sequence in which the residue at each position is the one predicted to be most likely to occupy that position under the method of inference used; typically this is a scoring matrix (similar to those used in BLAST searches or MSAs) calculated from extant sequences. Alternative methods include maximum parsimony (MP), which constructs a sequence based on a simple model of sequence evolution: usually the idea that the minimum number of nucleotide sequence changes represents the most efficient route for evolution to take and, by Occam's razor, is the most likely. MP is often considered the least reliable method for reconstruction, as it arguably oversimplifies evolution to a degree that is not applicable on the billion-year scale.
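The parsimony idea of minimizing the number of changes can be sketched with the bottom-up pass of Fitch's algorithm for one alignment column. This is an illustration under stated assumptions: the tree is binary and written as nested tuples, the leaf states are invented, and the function name `fitch` is hypothetical.

```python
def fitch(node):
    """Bottom-up Fitch pass for one alignment column.

    A leaf is a one-letter string; an internal node is a (left, right) tuple.
    Returns the set of equally parsimonious states at the node and the
    minimum number of changes required in the subtree below it.
    """
    if isinstance(node, str):
        return {node}, 0
    left, right = node
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state, so no extra change is needed.
        return shared, left_changes + right_changes
    # The children disagree: keep every candidate and count one change.
    return left_states | right_states, left_changes + right_changes + 1

# Hypothetical column observed in five extant sequences.
tree = (("A", "A"), ("G", ("A", "T")))
states, changes = fitch(tree)
print(states, changes)
```

When the returned set contains more than one state, parsimony alone cannot decide between them, which is one reason the method is often regarded as less informative than probabilistic approaches.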
Another approach, the so-called Bayesian methods, takes residue uncertainty into account. This form of ASR is sometimes used to complement ML methods but typically produces more ambiguous sequences. In ASR, the term 'ambiguity' refers to residue positions where no single residue can be clearly predicted; in these cases, several ASR sequences that together encompass most of the ambiguities are often produced and compared with one another. ML ASR often needs complementary experiments to show that the derived sequences are more than just consensuses of the input sequences. This is particularly necessary when 'ancestral superiority' is observed.[9] For the trend of increasing thermostability, one explanation is that ML ASR creates a consensus sequence of several different, parallel mechanisms that evolved to confer minor protein thermostability throughout the phylogeny, leading to an additive effect and hence 'superior' ancestral thermostability.[10]
The expression of consensus sequences and parallel ASR via non-ML methods are often required to rule out this explanation for a given experiment. Another concern raised by ML methods is that the scoring matrices are derived from modern sequences, and the amino acid frequencies seen today may not be the same as those in Precambrian biology, resulting in skewed sequence inference. Several studies have attempted to construct ancient scoring matrices via various methodologies and have compared the resultant sequences and the biophysical properties of the proteins they encode. While these modified matrices result in somewhat different ASR sequences, the observed biophysical properties did not seem to vary beyond experimental error.[11] Because of the 'holistic' nature of ASR and the complexity that arises when one considers all the possible sources of experimental error, the experimental community considers the ultimate measure of ASR reliability to be the comparison of several alternate reconstructions of the same node and the identification of similar biophysical properties. While this approach does not offer a robust statistical measure of reliability, it builds on the fundamental idea used in ASR that individual amino acid substitutions do not cause significant changes in a protein's biophysical properties, a tenet that must hold true in order to overcome the effect of inference ambiguity.[12]
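The two controls discussed above, a consensus of the input sequences and alternate reconstructions of the same node, can be sketched as follows. Everything here is illustrative: the alignment and the per-site posterior probabilities are invented, the 0.2 cutoff for calling a site ambiguous is an arbitrary assumption, and the function names are hypothetical.

```python
from collections import Counter

def consensus(alignment):
    """Most common residue in each column of an equal-length alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))

def ml_and_alternative(posteriors, threshold=0.2):
    """Return the most probable sequence, an alternative that uses the
    runner-up residue wherever its posterior probability reaches `threshold`,
    and the (0-based) indices of those ambiguous sites."""
    ml_seq, alt_seq, ambiguous = [], [], []
    for index, site in enumerate(posteriors):
        ranked = sorted(site.items(), key=lambda item: item[1], reverse=True)
        best = ranked[0][0]
        ml_seq.append(best)
        if len(ranked) > 1 and ranked[1][1] >= threshold:
            alt_seq.append(ranked[1][0])
            ambiguous.append(index)
        else:
            alt_seq.append(best)
    return "".join(ml_seq), "".join(alt_seq), ambiguous

# Hypothetical aligned extant sequences and hypothetical posteriors for one node.
alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLG", "MRVMG"]
posteriors = [
    {"M": 0.99, "L": 0.01},
    {"K": 0.55, "R": 0.45},
    {"V": 0.70, "I": 0.25, "L": 0.05},
    {"L": 0.97, "M": 0.03},
    {"A": 0.60, "G": 0.40},
]
control = consensus(alignment)
ml_seq, alt_seq, ambiguous_sites = ml_and_alternative(posteriors)
print(control, ml_seq, alt_seq, ambiguous_sites)
```

In an experiment, the consensus protein and both versions of the ancestor would be expressed and characterized; similar biophysical properties across the alternate ancestors, and a difference from the consensus, would support the reconstruction.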
Candidates used for ASR are often selected based on the particular property of interest being studied, e.g. thermostability. By selecting sequences from either end of a property's range (e.g., psychrophilic proteins and thermophilic proteins) but within one protein family, ASR can be used to probe the specific sequence changes that conferred the observed biophysical effect, such as stabilising interactions. In the diagram, for example, if sequence 'A' encoded a protein that functioned optimally at neutral pH and 'D' one that functioned optimally in acidic conditions, the sequence changes between '5' and '2' may reveal the precise biophysical explanation for this difference. As ASR experiments can extract ancestors that are likely billions of years old, there are often tens if not hundreds of sequence changes between the ancestors themselves and between ancestors and extant sequences; because of this, such sequence-function evolutionary studies can take a lot of work and rational direction.[13][14]
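Listing the sequence changes between two reconstructed nodes is the usual starting point for such a study. The sketch below assumes two already-aligned sequences of equal length; the sequences given for nodes '5' and '2' are invented, and the helper name `substitutions` is hypothetical.

```python
def substitutions(ancestor, descendant):
    """List substitutions between two aligned sequences in the form
    <ancestral residue><1-based position><derived residue>, skipping gaps."""
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{a}{position}{d}"
        for position, (a, d) in enumerate(zip(ancestor, descendant), start=1)
        if a != d and "-" not in (a, d)
    ]

# Hypothetical reconstructed sequences for nodes '5' and '2'.
node_5 = "MKVLAHE"
node_2 = "MRVLGHD"
changes = substitutions(node_5, node_2)
print(changes)
```

Each listed change is a candidate for site-directed mutagenesis, and with tens or hundreds of candidates the number of combinations to test grows quickly, which is why these studies need rational direction.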
There are many examples of ancestral proteins that have been computationally reconstructed, expressed in living cell lines, and – in many cases – purified and biochemically studied.
Some other examples are ancestral visual pigments in vertebrates,[19] enzymes in yeast that break down sugars (800 Ma),[20] enzymes in bacteria that provide resistance to antibiotics (2–3 Ga),[21] the ribonucleases involved in ruminant digestion, the alcohol dehydrogenases (Adhs) involved in yeast fermentation (~85 Ma), and RuBisCO in Solanaceae.[22]
The 'age' of a reconstructed sequence is determined using a molecular clock model, and often several are employed.[23] This dating technique is often calibrated using geological time-points (such as ancient ocean constituents or banded iron formations, BIFs), and while these clocks offer the only method of inferring a very ancient protein's age, they have sweeping error margins and are difficult to defend against contrary data. To this end, ASR 'age' should really be used only as an indicative feature, and it is often set aside altogether in favour of the number of substitutions between the ancestral and the modern sequences (the basis on which the clock is calculated). That being said, the use of a clock allows one to compare observed biophysical data of an ASR protein to the geological or ecological environment at the time. For example, ASR studies on bacterial EF-Tus (proteins involved in translation that are likely rarely subject to horizontal gene transfer and typically exhibit melting temperatures, Tm, about 2 °C greater than the environmental temperature, Tenv) indicate a hotter Precambrian Earth, which fits very closely with geological data on ancient ocean temperatures based on oxygen-18 isotopic levels. ASR studies of yeast Adhs reveal that subfunctionalized Adhs for ethanol metabolism (not just waste excretion) arose at a time similar to the dawn of fleshy fruit in the Cretaceous Period, and that before this emergence Adh served to excrete ethanol as a byproduct of excess pyruvate. The use of a clock also perhaps indicates that the origin of life occurred earlier than the earliest molecular fossils indicate (>4.1 Ga), but given the debatable reliability of molecular clocks, such observations should be taken with caution.[24]
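The relationship between substitution counts and 'age' can be made concrete with the simplest possible case, a strict clock calibrated on one node of known age. This is a sketch under strong assumptions: a single constant rate across the whole tree, and invented distances and calibration age. The function names are hypothetical, and real analyses generally use more elaborate clock models with several calibration points.

```python
def calibrate_rate(distance, age_ma):
    """Substitutions per site per million years, from the distance between a
    calibration node of known age and the present."""
    if age_ma <= 0:
        raise ValueError("calibration age must be positive")
    return distance / age_ma

def estimate_age(distance, rate):
    """Age in million years (Ma) implied by a distance under a strict clock."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / rate

# Hypothetical values: a calibration node dated to 400 Ma lies 0.12
# substitutions per site from the present; the node of interest lies 0.45.
rate = calibrate_rate(0.12, 400.0)
node_age_ma = estimate_age(0.45, rate)
print(round(node_age_ma), "Ma")
```

With these made-up numbers the node comes out at about 1500 Ma. Because the estimate scales directly with the assumed rate, any error in the calibration propagates in full to the inferred age, which is why the substitution count itself is often reported instead.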
One example is the reconstruction of thioredoxin enzymes from organisms up to 4 billion years old.[25] Whereas the chemical activity of these reconstructed enzymes was remarkably similar to that of modern enzymes, their physical properties showed significantly elevated thermal and acidic stability. These results were interpreted as suggesting that ancient life may have evolved in oceans that were much hotter and more acidic than today.[25]
These experiments address various important questions in evolutionary biology: does evolution proceed in small steps or in large leaps; is evolution reversible; how does complexity evolve? It has been shown that slight mutations in the amino acid sequence of hormone receptors cause an important change in their preferences for hormones. These changes represent huge steps in the evolution of the endocrine system. Thus very small changes at the molecular level may have enormous consequences. By studying the glucocorticoid receptor, the Thornton lab has also been able to show that evolution can be irreversible. This receptor was changed by seven mutations into a cortisol receptor, but reversing these mutations did not give the original receptor back: other earlier neutral mutations acted as a ratchet and made the changes to the receptor irreversible.[26] This indicates that epistasis plays a major role in protein evolution, an observation that, in combination with several observed examples of parallel evolution, supports the neutral network model mentioned above. These different experiments on receptors show that, during their evolution, proteins become greatly differentiated, and this explains how complexity may evolve. A closer look at the different ancestral hormone receptors and the various hormones shows that differences in hormone preference arise from very small but specific changes at the level of interaction between single amino acid residues and chemical groups of the hormones. Knowledge about these changes may, for example, lead to the synthesis of hormonal equivalents capable of mimicking or inhibiting the action of a hormone, which might open possibilities for new therapies.
Given that ASR has revealed a tendency towards ancient thermostability and enzymatic promiscuity, ASR serves as a valuable tool for protein engineers, who often desire these traits (producing effects sometimes greater than current, rationally led tools). ASR also promises to 'resurrect' phenotypically similar 'ancient organisms', which in turn would allow evolutionary biochemists to probe the story of life. Proponents of ASR such as Benner state that through these and other experiments, the end of the current century will see a level of understanding in biology analogous to the one that arose in classical chemistry in the last century.