Exon shuffling is a molecular mechanism for the formation of new genes. It is a process through which two or more exons from different genes can be brought together ectopically, or the same exon can be duplicated, to create a new exon-intron structure.[1] Exon shuffling occurs through several mechanisms: transposon-mediated exon shuffling, crossover during sexual recombination of parental genomes, and illegitimate recombination.
Exon shuffling follows certain splice frame rules. Introns can interrupt the reading frame of a gene by inserting a sequence between two consecutive codons (phase 0 introns), between the first and second nucleotide of a codon (phase 1 introns), or between the second and third nucleotide of a codon (phase 2 introns). Additionally, exons can be classified into nine different groups based on the phase of the flanking introns (symmetrical: 0-0, 1-1, 2-2; asymmetrical: 0-1, 0-2, 1-0, 1-2, 2-0, 2-1). Symmetric exons are the only ones that can be inserted into introns, undergo duplication, or be deleted without changing the reading frame.[2]
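The splice frame rules can be illustrated with a minimal sketch. It assumes that an exon is described only by the phases of its two flanking introns; the function names `exon_class`, `is_symmetric` and `frame_shift` are hypothetical and chosen for illustration. Because the coding length of an exon determines how the phase changes from its 5' intron to its 3' intron, the shift that an inserted, duplicated or deleted exon imposes on the downstream reading frame is the phase difference modulo 3.

```python
from itertools import product

def exon_class(phase_5prime: int, phase_3prime: int) -> str:
    """Label an exon by the phases of its upstream and downstream introns."""
    return f"{phase_5prime}-{phase_3prime}"

def is_symmetric(phase_5prime: int, phase_3prime: int) -> bool:
    """A symmetric exon is flanked by two introns of the same phase."""
    return phase_5prime == phase_3prime

def frame_shift(phase_5prime: int, phase_3prime: int) -> int:
    """Shift (0, 1 or 2 nucleotides) imposed on the downstream reading
    frame when this exon is inserted, duplicated or deleted.
    The exon's coding length is congruent to the phase difference mod 3."""
    return (phase_3prime - phase_5prime) % 3

# Enumerate the nine exon classes and pick out the frame-preserving ones.
all_classes = [exon_class(a, b) for a, b in product(range(3), repeat=2)]
frame_preserving = [
    exon_class(a, b)
    for a, b in product(range(3), repeat=2)
    if frame_shift(a, b) == 0
]

print(all_classes)
print(frame_preserving)
```

In this model the frame-preserving classes coincide with the symmetric ones, which is the rule stated above. The sketch ignores the additional requirement that a symmetric exon must land in an intron of the matching phase.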
Exon shuffling was first proposed in 1978, when Walter Gilbert suggested that the existence of introns could play a major role in the evolution of proteins.[3] It was noted that recombination within introns could help assort exons independently and that repetitive segments in the middle of introns could create hotspots for recombination to shuffle the exonic sequences. However, the presence of introns in eukaryotes and their absence in prokaryotes created a debate about when introns first appeared. Two theories arose: the "introns early" theory and the "introns late" theory. Supporters of the "introns early" theory believed that introns and RNA splicing were relics of the RNA world and that both prokaryotes and eukaryotes therefore had introns in the beginning. In this view, prokaryotes eliminated their introns to gain efficiency, while eukaryotes retained the introns and the genetic plasticity of their ancestors. Supporters of the "introns late" theory, on the other hand, believed that prokaryotic genes resemble the ancestral genes and that introns were inserted later into the genes of eukaryotes. What is clear now is that the eukaryotic exon-intron structure is not static: introns are continually inserted into and removed from genes, and the evolution of introns runs parallel to exon shuffling.
Exon shuffling could begin to play a major role in protein evolution only after spliceosomal introns had appeared. The self-splicing introns of the RNA world were unsuitable for exon shuffling by intronic recombination: these introns had an essential function and therefore could not be recombined. Additionally, there is strong evidence that spliceosomal introns evolved fairly recently and are restricted in their evolutionary distribution. Exon shuffling therefore came to play a major role in the construction of younger proteins.
Moreover, to define more precisely when exon shuffling became significant in eukaryotes, the evolutionary distribution of modular proteins that evolved through this mechanism was examined in different organisms such as Escherichia coli, Saccharomyces cerevisiae, and Arabidopsis thaliana. These studies suggested that there was an inverse relationship between genome compactness and the proportion of intronic and repetitive sequences, and that exon shuffling became significant after the metazoan radiation.[4]
Evolution of eukaryotes is mediated by sexual recombination of parental genomes, and since introns are longer than exons, most crossovers occur in noncoding regions. These introns contain large numbers of transposable elements and repeated sequences, which promote recombination between nonhomologous genes. In addition, it has been shown that mosaic proteins are composed of mobile domains which have spread to different genes during evolution and which are capable of folding independently.
The modularization hypothesis describes a mechanism for the formation and shuffling of these domains. The mechanism is divided into three stages. The first stage is the insertion of introns at positions that correspond to the boundaries of a protein domain. The second stage is when the "protomodule" undergoes tandem duplications by recombination within the inserted introns. The third stage is when one or more protomodules are transferred to a different nonhomologous gene by intronic recombination. All stages of modularization have been observed in different domains, such as those of hemostatic proteins.[2]
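The three stages can be traced with a toy model. This is only an illustrative sketch under simplifying assumptions: a gene is represented as an ordered list of exons, each exon is a tuple of domain labels, and introns are implied between neighbouring exons. The domain labels ("A", "B", "X", "Y") and all function names are hypothetical.

```python
Exon = tuple[str, ...]   # domain labels encoded by one exon
Gene = list[Exon]        # ordered exons; introns lie between neighbours

def insert_introns_at_domain_boundaries(gene: Gene) -> Gene:
    """Stage 1: split exons so that every domain sits in its own exon."""
    return [(domain,) for exon in gene for domain in exon]

def tandem_duplicate(gene: Gene, index: int) -> Gene:
    """Stage 2: duplicate the protomodule at `index` next to itself."""
    return gene[: index + 1] + [gene[index]] + gene[index + 1 :]

def transfer(donor: Gene, index: int, recipient: Gene, position: int) -> Gene:
    """Stage 3: copy a protomodule from `donor` into `recipient`."""
    return recipient[:position] + [donor[index]] + recipient[position:]

ancestral: Gene = [("A", "B")]           # one exon encoding two domains
stage1 = insert_introns_at_domain_boundaries(ancestral)
stage2 = tandem_duplicate(stage1, 0)
unrelated: Gene = [("X",), ("Y",)]       # a nonhomologous recipient gene
stage3 = transfer(stage2, 0, unrelated, 1)

print(stage1)  # [('A',), ('B',)]
print(stage2)  # [('A',), ('A',), ('B',)]
print(stage3)  # [('X',), ('A',), ('Y',)]
```

The model deliberately omits intron phase, so it does not check whether a transferred protomodule preserves the reading frame of the recipient gene.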
A potential mechanism for exon shuffling is 3' transduction mediated by long interspersed element-1 (LINE-1, or L1). Long interspersed elements (LINEs) are a group of genetic elements that are found in abundant quantities in eukaryotic genomes.[5] LINE-1 is the most common LINE found in humans. It is transcribed by RNA polymerase II to give an mRNA that codes for two proteins, ORF1 and ORF2, which are necessary for transposition.[6]
Upon transposition, L1 associates with 3' flanking DNA and carries the non-L1 sequence to a new genomic location. This new location does not have to be in a homologous sequence or in close proximity to the donor DNA sequence. The donor DNA sequence remains unchanged throughout this process because L1 transposes in a copy-and-paste manner via an RNA intermediate; however, only regions located 3' of the L1 element have been proven to be targeted for duplication.
Nevertheless, there is reason to believe that this may not hold true every time, as shown by the following example. The human ATM gene is responsible for the human autosomal-recessive disorder ataxia-telangiectasia and is located on chromosome 11. However, a partial ATM sequence is found on chromosome 7. Molecular features suggest that this duplication was mediated by L1 retrotransposition: the derived sequence was flanked by 15 bp target site duplications (TSDs), the sequence around the 5' end matched the consensus sequence for the L1 endonuclease cleavage site, and a poly(A) tail preceded the 3' TSD. But since the L1 element was present in neither the retrotransposed segment nor the original sequence, the mobilization of the segment cannot be explained by 3' transduction. Additional information has led to the belief that trans-mobilization of the DNA sequence is another mechanism by which L1 shuffles exons, but more research on the subject is needed.[7]
Another mechanism through which exon shuffling occurs involves helitrons. Helitron transposons were first discovered during studies of repetitive DNA segments in the rice, worm and thale cress genomes. Helitrons have been identified in all eukaryotic kingdoms, but the number of copies varies from species to species.
Helitron-encoded proteins are composed of a rolling-circle (RC) replication initiator (Rep) domain and a DNA helicase (Hel) domain. The Rep domain is involved in the catalytic reactions for endonucleolytic cleavage, DNA transfer and ligation. In addition, this domain contains three motifs. The first motif is necessary for DNA binding. The second motif has two histidines and is involved in metal ion binding. Lastly, the third motif has two tyrosines and catalyzes DNA cleavage and ligation.
There are three models of gene capture by helitrons: the "read-through" model 1 (RTM1), the "read-through" model 2 (RTM2) and a filler DNA model (FDNA). According to the RTM1 model, an accidental "malfunction" of the replication terminator at the 3' end of the helitron leads to transposition of genomic DNA. The transposed unit is composed of the read-through helitron element and its downstream genomic regions, flanked by a random DNA site that serves as a "de novo" RC terminator. According to the RTM2 model, the 3' terminus of another helitron serves as the RC terminator of transposition; this occurs after a malfunction of the original RC terminator. Lastly, in the FDNA model, portions of genes or non-coding regions can accidentally serve as templates during repair of double-stranded DNA breaks occurring in helitrons.[8] Even though helitrons have been proven to be a very important evolutionary tool, the specific details of their mechanisms of transposition are yet to be defined.
An example of evolution driven by helitrons is the diversity commonly found in maize. Helitrons in maize cause constant change in genic and nongenic regions, leading to diversity among different maize lines.
Long-terminal repeat (LTR) retrotransposons are part of another mechanism through which exon shuffling takes place. They usually encode two open reading frames (ORFs). The first ORF, named gag, is related to viral structural proteins. The second ORF, named pol, encodes a polyprotein composed of an aspartic protease (AP), which cleaves the polyprotein; an RNase H (RH), which cleaves the RNA strand of the DNA-RNA hybrid; a reverse transcriptase (RT), which produces a cDNA copy of the transposon's RNA; and a DDE integrase, which inserts the cDNA into the host's genome. Additionally, LTR retrotransposons are classified into five subfamilies: Ty1/copia, Ty3/gypsy, Bel/Pao, retroviruses and endogenous retroviruses.[9]
LTR retrotransposons require an RNA intermediate in their transposition cycle. Retrotransposons synthesize a cDNA copy from the RNA strand using a reverse transcriptase related to retroviral RT. The cDNA copy is then inserted into a new genomic position to form a retrogene.[10] This mechanism has been shown to be important in the gene evolution of rice and other grass species through exon shuffling.
DNA transposons with terminal inverted repeats (TIRs) can also contribute to gene shuffling. In plants, some non-autonomous elements called Pack-TYPE can capture gene fragments during their mobilization.[11] This process appears to be mediated by the acquisition of genic DNA residing between neighbouring Pack-TYPE transposons and its subsequent mobilization.[12]
Lastly, illegitimate recombination (IR) is another of the mechanisms through which exon shuffling occurs. IR is the recombination between short homologous sequences or nonhomologous sequences.[13]
There are two classes of IR. The first corresponds to errors of enzymes which cut and join DNA (i.e., DNases). This process is initiated by a replication protein which helps generate a primer for DNA synthesis. While one DNA strand is being synthesized, the other is being displaced. The process ends when the displaced strand is joined at its ends by the same replication protein. The second class of IR corresponds to the recombination of short homologous sequences which are not recognized by the previously mentioned enzymes. However, they can be recognized by non-specific enzymes which introduce cuts between the repeats. The ends are then removed by an exonuclease to expose the repeats. The repeats then anneal, and the resulting molecule is repaired using polymerase and ligase.[14]
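The outcome of the second class of IR can be sketched on a plain sequence string. This is a simplified illustration, not a model of the enzymology: it assumes that annealing of two short direct repeats leaves a single repeat copy and removes the sequence that lay between them. The function name `repeat_mediated_deletion` and the example sequence are hypothetical.

```python
def repeat_mediated_deletion(seq: str, repeat: str) -> str:
    """Return `seq` after recombination between the first two
    non-overlapping copies of `repeat`: the intervening sequence and one
    repeat copy are lost. If fewer than two copies exist, `seq` is
    returned unchanged."""
    first = seq.find(repeat)
    if first == -1:
        return seq
    second = seq.find(repeat, first + len(repeat))
    if second == -1:
        return seq
    # Keep everything before the first repeat, then resume at the second
    # repeat, so that exactly one copy of the repeat remains.
    return seq[:first] + seq[second:]

original = "AAAGCTATTTTTGCTACCC"   # two GCTA repeats flanking TTTTT
product = repeat_mediated_deletion(original, "GCTA")
print(product)  # AAAGCTACCC
```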