AlphaFold is an artificial intelligence (AI) program developed by DeepMind, a subsidiary of Alphabet, which performs predictions of protein structure.[1] The program is designed as a deep learning system.[2]
AlphaFold software has had three major versions. A team of researchers that used AlphaFold 1 (2018) placed first in the overall rankings of the 13th Critical Assessment of Structure Prediction (CASP) in December 2018. The program was particularly successful at predicting the most accurate structure for targets rated as the most difficult by the competition organisers, where no existing template structures were available from proteins with a partially similar sequence. A team that used AlphaFold 2 (2020) repeated the placement in the CASP14 competition in November 2020.[3] The team achieved a level of accuracy much higher than any other group.[4] It scored above 90 for around two-thirds of the proteins in CASP's global distance test (GDT), a test that measures how closely a computationally predicted structure matches the experimentally determined structure, with 100 being a complete match within the distance cutoff used for calculating GDT.[5]
AlphaFold 2's results at CASP14 were described as "astounding" and "transformational". Some researchers noted that the accuracy is not high enough for a third of its predictions, and that it does not reveal the mechanism or rules of protein folding for the protein folding problem to be considered solved.[6] [7] Nevertheless, there has been widespread respect for the technical achievement. On 15 July 2021 the AlphaFold 2 paper was published in Nature as an advance access publication alongside open source software and a searchable database of species proteomes.[8] [9] [10]
AlphaFold 3 was announced on 8 May 2024. It can predict the structure of complexes created by proteins with DNA, RNA, various ligands, and ions.[11]
See also: Protein structure prediction and De novo protein structure prediction.
Proteins consist of chains of amino acids which spontaneously fold to form the three dimensional (3-D) structures of the proteins. The 3-D structure is crucial to understanding the biological function of the protein.
Protein structures can be determined experimentally through techniques such as X-ray crystallography, cryo-electron microscopy and nuclear magnetic resonance, which are all expensive and time-consuming.[12] Such efforts, using the experimental methods, have identified the structures of about 170,000 proteins over the last 60 years, while there are over 200 million known proteins across all life forms.[5]
Over the years, researchers have applied numerous computational methods to predict the 3D structures of proteins from their amino acid sequences, but the accuracy of such methods had not come close to that of experimental techniques. CASP, which was launched in 1994 to challenge the scientific community to produce their best protein structure predictions, found that by 2016 GDT scores of only about 40 out of 100 could be achieved for the most difficult proteins.[5] AlphaFold started competing in the 2018 CASP using an artificial intelligence (AI) deep learning technique.
DeepMind is known to have trained the program on over 170,000 proteins from a public repository of protein sequences and structures. The program uses a form of attention network, a deep learning technique that focuses on having the AI identify parts of a larger problem, then piece them together to obtain the overall solution.[2] The overall training was conducted using between 100 and 200 GPUs.[2]
AlphaFold 1 (2018) was built on work developed by various teams in the 2010s. That work looked at the large databanks of related DNA sequences now available from many different organisms (most without known 3D structures) to try to find changes at different residues that appeared to be correlated, even though the residues were not consecutive in the main chain. Such correlations suggest that the residues may be close to each other physically, even though not close in the sequence, allowing a contact map to be estimated. Building on recent work prior to 2018, AlphaFold 1 extended this to estimate a probability distribution for how close the residues are likely to be, turning the contact map into a likely distance map. It also used more advanced learning methods than previously to develop the inference.[13] [14]
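The idea of correlated changes between alignment columns can be illustrated with a minimal sketch. It computes the mutual information between two columns of a toy multiple sequence alignment. This is only a textbook measure of covariation, not the inference method used by AlphaFold 1; the function name `column_mutual_information` and the sequences in `toy_msa` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `msa` is a list of equal-length aligned sequences. A high value means
    the residues at positions i and j tend to change together, which is the
    kind of signal used to suggest spatial contacts.
    """
    n = len(msa)
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 0 and 3 co-vary, column 1 is conserved.
toy_msa = ["AKLE", "AKLE", "GKVD", "GKID"]

covarying = column_mutual_information(toy_msa, 0, 3)   # 1.0 bit
conserved = column_mutual_information(toy_msa, 0, 1)   # 0.0 bits
```

In this toy case the perfectly co-varying pair carries one bit of mutual information, while a fully conserved column carries none, because a column that never changes cannot be correlated with anything.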
The 2020 version of the program (AlphaFold 2) is significantly different from the original version that won CASP13 in 2018, according to the team at DeepMind.[15]
The software design used in AlphaFold 1 contained a number of modules, each trained separately, that were used to produce the guide potential that was then combined with the physics-based energy potential. AlphaFold 2 replaced this with a system of sub-networks coupled together into a single differentiable end-to-end model, based entirely on pattern recognition, which was trained as a single integrated structure.[16] Local physics, in the form of energy refinement based on the AMBER model, is applied only as a final refinement step once the neural network prediction has converged, and only slightly adjusts the predicted structure.[17]
A key part of the 2020 system is a pair of modules, believed to be based on a transformer design, which are used to progressively refine a vector of information for each relationship (or "edge" in graph-theory terminology) between an amino acid residue of the protein and another amino acid residue (these relationships are represented by the array shown in green); and between each amino acid position and each different sequence in the input sequence alignment (these relationships are represented by the array shown in red). Internally these refinement transformations contain layers that have the effect of bringing relevant data together and filtering out irrelevant data (the "attention mechanism") for these relationships, in a context-dependent way, learnt from training data. These transformations are iterated, the updated information output by one step becoming the input of the next, with the sharpened residue/residue information feeding into the update of the residue/sequence information, and then the improved residue/sequence information feeding into the update of the residue/residue information. As the iteration progresses, according to one report, the "attention algorithm ... mimics the way a person might assemble a jigsaw puzzle: first connecting pieces in small clumps—in this case clusters of amino acids—and then searching for ways to join the clumps in a larger whole."
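The "bringing relevant data together and filtering out irrelevant data" behaviour of an attention layer can be sketched with generic scaled dot-product attention. This is the standard transformer building block, shown here in plain Python as an assumption-laden illustration; it is not the actual AlphaFold 2 module, and the function names `softmax` and `attention` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Generic scaled dot-product attention over lists of float vectors.

    Each query is compared with every key; the resulting weights decide how
    much of each value vector is mixed into that query's output.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs
```

When a query matches one key much more strongly than the others, almost all of the weight goes to that key's value, which is the filtering effect described above. When all keys score equally, the output is simply the average of the values.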
The output of these iterations then informs the final structure prediction module, which also uses transformers,[18] and is itself then iterated. In an example presented by DeepMind, the structure prediction module achieved a correct topology for the target protein on its first iteration, scored as having a GDT_TS of 78, but with a large number (90%) of stereochemical violations – i.e. unphysical bond angles or lengths. With subsequent iterations the number of stereochemical violations fell. By the third iteration the GDT_TS of the prediction was approaching 90, and by the eighth iteration the number of stereochemical violations was approaching zero.[19]
The training data was originally restricted to single peptide chains. However, the October 2021 update, named AlphaFold-Multimer, included protein complexes in its training data. DeepMind stated this update succeeded about 70% of the time at accurately predicting protein-protein interactions.[20]
Announced on 8 May 2024, AlphaFold 3 was co-developed by Google DeepMind and Isomorphic Labs, both subsidiaries of Alphabet. AlphaFold 3 is not limited to single-chain proteins, as it can also predict the structures of protein complexes with DNA, RNA, post-translational modifications and selected ligands and ions.[21]
AlphaFold 3 introduces the "Pairformer", a deep learning architecture inspired by the transformer, which is considered similar to but simpler than the Evoformer introduced with AlphaFold 2.[22] [23] The raw predictions from the Pairformer module are passed to a diffusion model, which starts with a cloud of atoms and uses these predictions to iteratively progress towards a 3D depiction of the molecular structure.
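The general shape of such an iterative process, starting from a random cloud and moving step by step towards a structure, can be conveyed with a toy loop. This is a deliberately simplified sketch and not the AlphaFold 3 diffusion module: a real diffusion model uses a trained network conditioned on the upstream predictions to estimate each denoising step, whereas here the known `target` coordinates stand in for that network. The function names, step size and noise schedule are all assumptions made for illustration.

```python
import math
import random


def toy_denoise(target, steps=50, step_size=0.2, seed=0):
    """Toy iterative refinement from a random cloud towards `target`.

    `target` is a list of (x, y, z) tuples. In a real diffusion model the
    direction of each update would come from a learned denoiser; here the
    target itself is used as a stand-in (an assumption for illustration).
    """
    rng = random.Random(seed)
    # Start from a broad random cloud of atom positions.
    coords = [[rng.gauss(0.0, 10.0) for _ in range(3)] for _ in target]

    for t in range(steps):
        # The injected noise shrinks linearly and reaches zero at the last step.
        noise_scale = 1.0 - (t + 1) / steps
        updated = []
        for atom, goal in zip(coords, target):
            updated.append([
                x + step_size * (g - x) + rng.gauss(0.0, noise_scale)
                for x, g in zip(atom, goal)
            ])
        coords = updated
    return coords


def max_deviation(coords, target):
    """Largest distance between any refined atom and its target position."""
    return max(math.dist(a, b) for a, b in zip(coords, target))
```

Calling `toy_denoise` on a few target positions and then `max_deviation` on the result shows the cloud collapsing onto the target as the noise is annealed away.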
The AlphaFold server was created to provide free access to AlphaFold 3 for non-commercial research.[24]
In December 2018, DeepMind's AlphaFold placed first in the overall rankings of the 13th Critical Assessment of Techniques for Protein Structure Prediction (CASP).[25]
The program was particularly successful at predicting the most accurate structure for targets rated as the most difficult by the competition organisers, where no existing template structures were available from proteins with a partially similar sequence. AlphaFold gave the best prediction for 25 out of 43 protein targets in this class,[26] [27] [28] achieving a median score of 58.9 on the CASP's global distance test (GDT) score, ahead of 52.5 and 52.4 by the two next best-placed teams,[29] who were also using deep learning to estimate contact distances.[30] [31] Overall, across all targets, the program achieved a GDT score of 68.5.[32]
In January 2020, implementations and illustrative code of AlphaFold 1 were released open-source on GitHub,[33] but, as stated in the "Read Me" file on that website: "This code can't be used to predict structure of an arbitrary protein sequence. It can be used to predict structure only on the CASP13 dataset (links below). The feature generation code is tightly coupled to our internal infrastructure as well as external tools, hence we are unable to open-source it." Therefore, in essence, the code deposited is not suitable for general use but only for the CASP13 proteins. The company has not announced plans to make their code publicly available as of 5 March 2021.
In November 2020, DeepMind's new version, AlphaFold 2, won CASP14.[34] [35] Overall, AlphaFold 2 made the best prediction for 88 out of the 97 targets.
On the competition's preferred global distance test (GDT) measure of accuracy, the program achieved a median score of 92.4 (out of 100), meaning that more than half of its predictions were scored at better than 92.4% for having their atoms in more-or-less the right place,[36] [37] a level of accuracy reported to be comparable to experimental techniques like X-ray crystallography.[38] In 2018 AlphaFold 1 had only reached this level of accuracy in two of all of its predictions. 88% of predictions in the 2020 competition had a GDT_TS score of more than 80. On the group of targets classed as the most difficult, AlphaFold 2 achieved a median score of 87.
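How a GDT_TS-style score is assembled can be shown with a minimal sketch. It averages, over the commonly used cutoffs of 1, 2, 4 and 8 Å, the fraction of alpha-carbon atoms whose predicted position lies within the cutoff of the reference position. The sketch assumes the two structures have already been superposed; the real GDT procedure additionally searches over many superpositions to maximise the count at each cutoff, so this function (a hypothetical `gdt_ts`) gives at best a lower-bound approximation.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS (0-100) for two pre-superposed C-alpha traces.

    `predicted` and `reference` are equal-length lists of (x, y, z) tuples
    in angstroms. No superposition search is performed (an assumption of
    this sketch), unlike the full GDT algorithm.
    """
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same number of atoms")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical example: four atoms displaced by 0.5, 1.5, 3.0 and 10.0 angstroms.
reference = [(0.0, 0.0, 0.0)] * 4
predicted = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
score = gdt_ts(predicted, reference)  # 56.25
```

In the example, one, two, three and three of the four atoms fall within the 1, 2, 4 and 8 Å cutoffs respectively, giving (25 + 50 + 75 + 75) / 4 = 56.25. A single badly placed atom lowers the score only modestly, which is why GDT is less sensitive to outliers than an RMS deviation.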
Measured by the root-mean-square deviation (RMSD) of the placement of the alpha-carbon atoms of the protein backbone chain, which tends to be dominated by the performance of the worst-fitted outliers, 88% of AlphaFold 2's predictions had an RMS deviation of less than 4 Å for the set of overlapped C-alpha atoms. 76% of predictions achieved better than 3 Å, and 46% had a C-alpha atom RMS accuracy better than 2 Å,[39] with a median RMS deviation in its predictions of 2.1 Å for a set of overlapped C-alpha atoms. AlphaFold 2 also achieved an accuracy in modelling surface side chains described as "really really extraordinary".
To additionally verify AlphaFold 2, the conference organisers approached four leading experimental groups for structures they were finding particularly challenging and had been unable to determine. In all four cases the three-dimensional models produced by AlphaFold 2 were sufficiently accurate to determine structures of these proteins by molecular replacement. These included target T1100 (Af1503), a small membrane protein studied by experimentalists for ten years.[5]
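The sensitivity of the RMS deviation to poorly fitted outliers can be seen in a short sketch. It assumes the predicted and reference alpha-carbon coordinates are already optimally superposed (a real comparison would first align them, for example with the Kabsch algorithm); the function name `ca_rmsd` and the example coordinates are hypothetical.

```python
import math


def ca_rmsd(predicted, reference):
    """Root-mean-square deviation (angstroms) between two C-alpha traces.

    Both inputs are equal-length lists of (x, y, z) tuples that are assumed
    to be already superposed; no alignment is performed here.
    """
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same number of atoms")
    squared = [math.dist(p, r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical example: nine atoms are off by 1 angstrom, one atom by 10.
reference = [(0.0, 0.0, 0.0)] * 10
predicted = [(1.0, 0.0, 0.0)] * 9 + [(10.0, 0.0, 0.0)]
rmsd_with_outlier = ca_rmsd(predicted, reference)            # about 3.3
rmsd_without_outlier = ca_rmsd(predicted[:9], reference[:9])  # 1.0
```

Because deviations are squared before averaging, the single atom that is 10 Å out raises the RMSD from 1.0 Å to roughly 3.3 Å, even though nine of the ten atoms are placed within 1 Å.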
Of the three structures that AlphaFold 2 had the least success in predicting, two had been obtained by protein NMR methods, which define protein structure directly in aqueous solution, whereas AlphaFold was mostly trained on protein structures in crystals. The third exists in nature as a multidomain complex consisting of 52 identical copies of the same domain, a situation AlphaFold was not programmed to consider. For all targets with a single domain, excluding only one very large protein and the two structures determined by NMR, AlphaFold 2 achieved a GDT_TS score of over 80.
In 2022 DeepMind did not enter CASP15, but most of the entrants used AlphaFold or tools incorporating AlphaFold.[40]
AlphaFold 2 scoring more than 90 in CASP's global distance test (GDT) is considered a significant achievement in computational biology[5] and great progress towards a decades-old grand challenge of biology. Nobel Prize winner and structural biologist Venki Ramakrishnan called the result "a stunning advance on the protein folding problem",[5] adding that "It has occurred decades before many people in the field would have predicted. It will be exciting to see the many ways in which it will fundamentally change biological research."
Propelled by press releases from CASP and DeepMind,[41] AlphaFold 2's success received wide media attention.[42] As well as news pieces in the specialist science press, such as Nature, Science,[5] MIT Technology Review,[2] and New Scientist,[43] [44] the story was widely covered by major national newspapers.[45] [46] [47] [48] A frequent theme was that the ability to predict protein structures accurately from the constituent amino acid sequence is expected to have a wide variety of benefits in the life sciences, including accelerating advanced drug discovery and enabling better understanding of diseases.[49] [50] Some have noted that even a perfect answer to the protein prediction problem would still leave questions about the protein folding problem: understanding in detail how the folding process actually occurs in nature (and how proteins can sometimes also misfold).[51]
In 2023, Demis Hassabis and John Jumper won the Breakthrough Prize in Life Sciences[52] as well as the Albert Lasker Award for Basic Medical Research for their management of the AlphaFold project.[53]
DeepMind has provided open access to the source code of several AlphaFold versions (excluding AlphaFold 3) after requests from the scientific community.[54] [55] [56] The full source code of AlphaFold 3 is expected to be made openly available by the end of 2024.[57] [58]
AlphaFold Protein Structure Database
Scope: protein structure prediction
Organism: all UniProt proteomes
Center: EMBL-EBI
URL: https://www.alphafold.ebi.ac.uk/
Download: yes
Webapp: yes
License: CC-BY 4.0
Curation: automatic
The AlphaFold Protein Structure Database was launched on July 22, 2021, as a joint effort between DeepMind and EMBL-EBI. At launch the database contained AlphaFold-predicted models of protein structures of nearly the full UniProt proteome of humans and 20 model organisms, amounting to over 365,000 proteins. The database does not include proteins with fewer than 16 or more than 2700 amino acid residues,[59] but for humans they are available in the whole batch file.[60] The project planned to add more sequences to the collection, the initial goal (as of the beginning of 2022) being to cover most of the UniRef90 set of more than 100 million proteins. As of May 15, 2022, 992,316 predictions were available.[61]
In July 2021, UniProt-KB and InterPro[62] were updated to show AlphaFold predictions when available.[63]
On July 28, 2022, the team uploaded to the database the structures of around 200 million proteins from 1 million species, covering nearly every known protein on the planet.[64]
AlphaFold has various limitations:
AlphaFold has been used to predict structures of proteins of SARS-CoV-2, the causative agent of COVID-19. The structures of these proteins were pending experimental determination in early 2020.[73] Results were examined by scientists at the Francis Crick Institute in the United Kingdom before release into the larger research community. The team also confirmed accurate prediction against the experimentally determined SARS-CoV-2 spike protein that was shared in the Protein Data Bank, an international open-access database, before releasing the computationally determined structures of the under-studied protein molecules.[74] The team acknowledged that although these protein structures might not be the subject of ongoing therapeutic research efforts, they will add to the community's understanding of the SARS-CoV-2 virus. Specifically, AlphaFold 2's prediction of the structure of the ORF3a protein was very similar to the structure determined by researchers at the University of California, Berkeley using cryo-electron microscopy. This specific protein is believed to assist the virus in breaking out of the host cell once it replicates. It is also believed to play a role in triggering the inflammatory response to the infection.[75]