Molecular Evolutionary Genetics Analysis should not be confused with Mega2, the Manipulation Environment for Genetic Analysis.
Molecular Evolutionary Genetics Analysis | |
Logo Alt: | White sans-serif capital letters spelling MEGA, each in a separate square, colored one red and three black, respectively. |
Author: | Masatoshi Nei, Sudhir Kumar, Koichiro Tamura, Glen Stecher, Daniel Peterson, Nicholas Peterson |
Developer: | Pennsylvania State University |
Latest Release Version: | 11.0.13 |
Latest Release Date: | [1] |
Operating System: | Windows, OS X, Linux |
Platform: | x86, x86-64 |
Language: | English |
Genre: | Bioinformatics |
License: | Proprietary freeware |
Molecular Evolutionary Genetics Analysis (MEGA) is computer software for conducting statistical analysis of molecular evolution and for constructing phylogenetic trees. It includes many sophisticated methods and tools for phylogenomics and phylomedicine. It is licensed as proprietary freeware. The project to develop the software was initiated under the leadership of Masatoshi Nei in his laboratory at the Pennsylvania State University, in collaboration with his graduate student Sudhir Kumar and postdoctoral fellow Koichiro Tamura.[2] Nei wrote a monograph (130 pp.) outlining the scope of the software and presenting new statistical methods that were included in MEGA. The entire set of computer programs was written by Kumar and Tamura. Personal computers of the time could not send the monograph and software electronically, so they were delivered by postal mail. From the start, MEGA was intended to be easy to use and to include only solid statistical methods.
MEGA version 2 (MEGA2), coauthored by an additional investigator, Ingrid Jakobson, was released in 2001.[3] Thanks to advances in computer technology, all the computer programs and readme files of this version could be sent electronically. Around this time, the leadership of the MEGA project was taken over by Kumar (now at Temple University) and Tamura (now at Tokyo Metropolitan University). The monograph Molecular Evolutionary Genetics Analysis was often used as a textbook for new ways to study molecular evolution.
MEGA has been updated and expanded several times, and all of these versions are currently available from the MEGA website. MEGA7 was optimized for use on 64-bit computing systems. MEGA is available in two versions: a graphical user interface, offered as a native Microsoft Windows program, and a command-line version, MEGA-Computing Core (MEGA-CC), offered for native cross-platform operation. The software is widely used and cited. With millions of downloads across the releases, MEGA is cited in more than 85,000 papers. The 5th version has been cited over 25,000 times in 4 years.[4]
Alignment Editor ― Within MEGA, the Alignment Editor is a tool for building and editing multiple sequence alignments. It integrates both the ClustalW and MUSCLE programs. All actions take place in the Analysis Explorer, which can be found in the main menu of MEGA. When a new alignment is being created, the user is presented with three options: create a new alignment, open a saved alignment session, or retrieve sequences from a file (importing sequences from NCBI). Once an option is selected, the user can choose either ClustalW or MUSCLE from the Alignment tab located at the top of the page. Parameters for the selected alignment program can then be specified, and a progress bar appears while the alignment is being computed. Aligned sequences replace unaligned ones in the main section of the Alignment Editor. To perform further analysis in MEGA, it is advisable to save the alignment session in either MEGA or FASTA format.[5]
Trace Data File Viewer/Editor ― The Trace Data File Viewer/Editor has many functionalities in the following three menus. All the commands are used to help specialize searches and alignments in MEGA.
Integrated web browser, sequence fetching ― MEGA comes with a built-in web browser that allows users to access GenBank sequence data from the NCBI website. The integrated web browser can be accessed when creating a new alignment in the Alignment Editor. To successfully use sequences from NCBI, it is advised to change the searches to FASTA format and use the “Add to Alignment” button. Once completed, all the sequences will be imported into the MEGA application.[7]
One of the challenges associated with evolutionary genetic analysis is the presence of ambiguous states such as R, Y, and T. These states often arise from sequence errors or incomplete datasets. However, MEGA offers several resources to handle ambiguous states, including the deletion of sites that have an ambiguity score higher than the Site Coverage Cutoff parameter.[8]
MEGA's extended format allows users to save all data attributes, such as sequence length, nucleotide positions, gaps, and ambiguous states.[9] Additionally, MEGA supports data import from other formats, such as Clustal, which ensures a seamless transition between popular file types.[10]
After a dataset is imported, MEGA provides several data viewer options. For example, users can view statistical attributes and select subsets in the Sequence Data Explorer or use the Distance Data Explorer to inspect pairwise distance data.[11] Another feature of MEGA is the visual specification of domain groups. This allows users to group sequences by a specific characteristic and view the resulting phylogenetic trees.
MEGA offers support for modifying the genetic code used for translating DNA sequences. By default, MEGA has 23 built-in genetic code variations including the standard code, vertebrate mitochondrial code, Drosophila mitochondrial code, and yeast mitochondrial code.[12] Users may add, remove, or edit any genetic code table.
Standard | |
Vertebrate Mitochondrial | |
Invertebrate Mitochondrial | |
Yeast Mitochondrial | |
Mold Mitochondrial | |
Protozoan Mitochondrial | |
Coelenterate Mitochondrial | |
Mycoplasma | |
Spiroplasma | |
Ciliate Nuclear | |
Dasycladacean Nuclear | |
Hexamita Nuclear | |
Echinoderm Mitochondrial | |
Euplotid Nuclear | |
Bacterial Plastid | |
Plant Plastid | |
Alternative Yeast Nuclear | |
Ascidian Mitochondrial | |
Flatworm Mitochondrial | |
Blepharisma Mitochondrial | |
Chlorophycean Mitochondrial | |
Trematode Mitochondrial | |
Scenedesmus obliquus Mitochondrial | |
Thraustochytrium Mitochondrial |
In addition, MEGA can compute the degeneracy of each codon position in a genetic code table, as well as the number of synonymous and nonsynonymous sites using the Nei-Gojobori method.[13]
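The idea of per-position degeneracy can be illustrated with a minimal Python sketch. It is not MEGA's implementation: the function names are hypothetical, only the standard genetic code is included, and changes to stop codons are counted as nonsynonymous, which is an assumption made here for simplicity.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (TTT, TTC, TTA, TTG, TCT, ...); '*' is a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_alternatives(codon, code=STANDARD_CODE):
    """For each codon position, count the alternative bases (0 to 3) that keep the amino acid."""
    codon = codon.upper()
    counts = []
    for pos in range(3):
        same = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            # Assumption: a change to a stop codon counts as nonsynonymous.
            if code[mutant] == code[codon]:
                same += 1
        counts.append(same)
    return tuple(counts)


def synonymous_sites(codon, code=STANDARD_CODE):
    """Sum over the three positions of (synonymous changes / 3 possible changes)."""
    return sum(n / 3 for n in synonymous_alternatives(codon, code))


print(synonymous_alternatives("CTA"))  # first and third positions are degenerate
print(synonymous_sites("GGG"))
```

A position with three synonymous alternatives is fourfold degenerate, and a position with none is nondegenerate. Swapping in another table from the list above only requires a different `code` dictionary.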
The Caption Expert is a part of MEGA that generates detailed, publication-style captions based on the properties of analysis results. It can be used for distance matrices, phylogenies, tests, and other analyses within MEGA.[14]
MEGA's integrated text file editor enables users to edit text files without the need for another program. Features like columnar block selection-editing aid in the performance of bulk operations, like changing letter case or font size. Additionally, the editor includes line numbers to assist with the navigation of large files and identifying areas of interest.[15]
MEGA also provides several tools to format sequences. For example, the built-in reverse complement utility reverses the order of characters and replaces each with its complement.[16]
The screenshots demonstrate the use of MEGA's reverse complement tool. The original sequence was reversed and each nucleotide was replaced with its complement to produce the reverse complement.
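The same operation can be sketched in a few lines of Python. This is an illustrative equivalent, not MEGA's code; the function name is hypothetical, and only the four unambiguous nucleotides are handled (any other character is left unchanged).

```python
# Translation table mapping each nucleotide to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Reverse the sequence, then replace each nucleotide with its complement."""
    return sequence[::-1].translate(_COMPLEMENT)


print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original sequence, which is a convenient sanity check.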
MEGA provides a graphical interface for displaying and manipulating aligned nucleotide and protein sequences.[17] The Sequence Data Explorer has multiple menu functionalities to help with exporting data, searching alignments, changing display features, highlighting sites, and computing statistics.
Substitution models in MEGA offer options with different attributes for both DNA and protein sequences. The user may choose among substitution types, models, and other settings to best fit the chosen data. The three main substitution model options are the 4x4 Rate Matrix, the Transition-Transversion Rate Ratios (k1, k2), and the Transition-Transversion Rate Bias (R).
Transition-Transversion Rate Ratios (k1, k2) – The transition-transversion rate ratio is the ratio of the transition rate (a) to the transversion rate (b), given by the formula k = a/b.[23]
Transition-Transversion Rate Bias (R) – The transition-transversion rate bias R in MEGA is the ratio of the number of transitions to the number of transversions between a pair of sequences. MEGA allows a user to analyze the data with a specified value of R. When R equals 0.5, there is no bias towards either transitional or transversional substitution, because each nucleotide has one possible transition and two possible transversions.[23]
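As an illustration, the observed ratio for one pair of aligned sequences can be counted directly. The sketch below is a simplified, hypothetical helper rather than MEGA's estimator: it uses raw counts with no correction for multiple substitutions and skips any site containing a gap or ambiguous character.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_transitions_transversions(seq1, seq2):
    """Return (transitions, transversions) between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Skip identical sites and sites with gaps or ambiguous characters.
        if a not in "ACGT" or b not in "ACGT" or a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions


def observed_ratio(seq1, seq2):
    """Observed transition/transversion ratio R for one sequence pair."""
    transitions, transversions = count_transitions_transversions(seq1, seq2)
    if transversions == 0:
        raise ValueError("ratio is undefined when there are no transversions")
    return transitions / transversions


print(observed_ratio("AGCT", "GGTA"))  # 2 transitions, 1 transversion -> 2.0
```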
MEGA offers several approaches for testing substitution pattern homogeneity, such as composition distance, disparity index, and Monte Carlo tests. These methods are used to determine if different genetic regions evolved under the same selective pressure.
Composition distance measures the variation in nucleotide composition between two sequences. MEGA computes this figure per site and excludes any gaps or missing data. A larger distance suggests that the regions evolved under different selective pressures.[24]
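A minimal sketch of such a per-site measure follows. It assumes the distance is one half of the sum of squared differences in base counts, divided by the number of sites compared; that definition, the function name, and the restriction to unambiguous nucleotides are assumptions of this example and should be checked against the MEGA documentation.

```python
from collections import Counter


def composition_distance_per_site(seq1, seq2, alphabet="ACGT"):
    """Half the sum of squared base-count differences, divided by the sites compared."""
    # Keep only sites where both sequences have an unambiguous character.
    pairs = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a in alphabet and b in alphabet
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    counts1 = Counter(a for a, _ in pairs)
    counts2 = Counter(b for _, b in pairs)
    total = 0.5 * sum((counts1[s] - counts2[s]) ** 2 for s in alphabet)
    return total / len(pairs)


print(composition_distance_per_site("AACC", "AAGG"))  # 1.0
```

Two sequences with identical base counts give a distance of zero even if the bases occur in a different order, since only composition is compared.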
The disparity index evaluates the difference in substitution patterns for a given pair of sequences. This value is calculated per site and is thought to be more dynamic than the chi-square test. A large difference implies that the pattern of substitution was not the same for the given pair of sequences.
The Monte Carlo test is another approach to test substitution pattern homogeneity that involves running a null distribution simulation. MEGA requires the user to specify the number of replicates and a starting seed. For a significant result, many simulations must be performed. Therefore, it is essential to consider the computational cost of the algorithm.
MC + exact simulation | $\Theta(N^{2\alpha}+N)$ |
MC + tau-leaping | $\Theta(N^{3\alpha-1}+N^{\alpha})$ |
MC + midpt. or trap. tau-leaping | $\Theta(N^{2.5\alpha-1}+N^{\alpha/2})$ |
MC + Euler for diff. approx. | $\Theta(N^{3\alpha-1}+N^{\alpha})$ |
Symbols: $N$, $\alpha$, $\Theta(N^{2})$.
MEGA offers a wide variety of options for calculating the evolutionary distance between a pair of nucleotide or amino acid sequences, with or without standard errors.[26] Distance methods are divided into three categories: nucleotide, synonymous-nonsynonymous, and amino acid.
After a distance method is selected, a subset of attributes becomes visible when applicable. The attributes are Substitutions to Include, Transition/Transversion Ratio, Pattern among Lineages, and Rates among Sites. For example, if a model has rate variation, the gamma parameter becomes visible. In addition, every distance method provides options for handling gaps and missing data, and codon positions if applicable.[29]
Every substitution model has its own use case. One of the simplest is the Jukes-Cantor model, which assumes equal mutation rates among all nucleotides. The Kimura 2-parameter model extends it by distinguishing the rate of transitions ($A \leftrightarrow G$, $C \leftrightarrow T$) from the rate of transversions ($A \leftrightarrow T$, $C \leftrightarrow G$, $A \leftrightarrow C$, $G \leftrightarrow T$).
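The standard closed-form distance estimates for these two models can be written as short Python functions. This is a sketch of the textbook formulas, not MEGA's code: `p` is the proportion of differing sites, and `P` and `Q` are the proportions of sites showing transitions and transversions respectively; the function names are hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - 4p/3)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p_distance(P, Q):
    """Kimura 2-parameter distance: d = -(1/2) * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    term1 = 1 - 2 * P - Q
    term2 = 1 - 2 * Q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("distance is undefined for these proportions")
    return -0.5 * math.log(term1 * math.sqrt(term2))


# With one third of the differences being transitions, the two estimates agree.
print(jukes_cantor_distance(0.1))
print(kimura_2p_distance(0.1 / 3, 0.2 / 3))
```

When transitions make up more than one third of the observed differences, the Kimura estimate exceeds the Jukes-Cantor estimate for the same total proportion of differences.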
Large sample Z-test The Z-test is used to compare the relative abundance of synonymous and nonsynonymous substitutions within a gene sequence, with the main objective of detecting positive selection. The test requires estimates of the number of synonymous substitutions per synonymous site (dS) and of nonsynonymous substitutions per nonsynonymous site (dN), along with their variances, Var(dS) and Var(dN). The formula used for the Z-test is:
Z = (dN - dS) / SQRT(Var(dS) + Var(dN))
If dN is greater than dS, the result indicates positive selection; if dN is less than dS, it indicates purifying selection. The sign of Z therefore shows the direction of selection, while its magnitude, which depends on the variances of dS and dN, determines whether the difference is statistically significant. In MEGA, these variances can be estimated using analytical formulas or bootstrap resampling.[30]
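The statistic can be computed directly from the four estimates. The sketch below is illustrative only: the function name and the input values are hypothetical, and the one-tailed P-value assumes Z follows a standard normal distribution under the null hypothesis dN = dS.

```python
import math


def selection_z_test(d_n, d_s, var_d_n, var_d_s):
    """Return (Z, one-tailed P-value for the alternative dN > dS)."""
    denominator = math.sqrt(var_d_s + var_d_n)
    if denominator == 0:
        raise ValueError("the variances must not both be zero")
    z = (d_n - d_s) / denominator
    # Upper-tail probability of the standard normal distribution.
    p_positive = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_positive


# Hypothetical estimates: dN = 0.3, dS = 0.1, each with variance 0.005.
z, p = selection_z_test(0.3, 0.1, 0.005, 0.005)
print(round(z, 3), round(p, 4))
```

A negative Z with a large magnitude would instead point towards purifying selection; testing that alternative uses the lower tail of the same distribution.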
Fisher's exact test – Fisher's exact test examines synonymous and nonsynonymous substitutions in sequences and is applied as a one-tailed test when analyzing small samples for positive selection. The null hypothesis of neutrality is rejected when the P-value is less than 0.05. If the differences per synonymous site exceed those per nonsynonymous site, MEGA assigns a P-value of 1, indicating purifying selection rather than positive selection.[31] The test is based on the hypergeometric distribution, whose probabilities are expressed in terms of factorials (Hoffman).[32]
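A generic one-tailed Fisher's exact test on a 2x2 table can be sketched with binomial coefficients from the hypergeometric distribution. How MEGA fills the table from synonymous and nonsynonymous counts is not shown here; the layout, the function name, and the example counts are assumptions for illustration.

```python
from math import comb


def fisher_exact_one_tailed(a, b, c, d):
    """P(top-left cell >= a) for the table [[a, b], [c, d]] with fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denominator = comb(total, col1)
    k_max = min(row1, col1)
    # Sum hypergeometric probabilities of all tables at least as extreme as observed.
    numerator = sum(
        comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1)
    )
    return numerator / denominator


# Hypothetical table: rows could be nonsynonymous/synonymous, columns changed/unchanged.
print(fisher_exact_one_tailed(3, 1, 1, 3))  # 17/70, about 0.243
```

Because each term is a ratio of binomial coefficients, the cost grows with the number of tables summed rather than with the size of the factorials themselves.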
Tajima's Neutrality Test — The purpose of Tajima's Neutrality Test is to assess the relationship between the number of segregating sites per site and nucleotide diversity. When alleles are selectively neutral the product 4Nv can be estimated in two ways. N represents the effective population size and v is the mutation rate per site. By calculating the difference between these estimates, one can determine if there is evidence of non-neutral evolution.[33]
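The comparison of the two estimates is usually summarized as Tajima's D statistic. The following sketch uses the standard published constants and is not MEGA's implementation: it works on total counts rather than per-site values, requires at least four sequences, and assumes an alignment with no gaps or ambiguous characters. The function name is hypothetical.

```python
import math
from itertools import combinations


def tajimas_d(sequences):
    """Tajima's D from aligned sequences of equal length (no gaps assumed)."""
    n = len(sequences)
    if n < 4:
        raise ValueError("at least four sequences are required")
    # S: number of segregating (polymorphic) sites.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("D is undefined without segregating sites")
    # pi: mean number of pairwise differences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimates, scaled by its standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


print(tajimas_d(["AAAA", "TAAA", "ATAA", "AATA"]))  # negative: only singleton variants
```

Values near zero are consistent with neutrality; an excess of rare variants, as in the example, pushes D below zero.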
The molecular clock hypothesis suggests that all sequences have evolved at a constant rate over time. Therefore, the molecular clock test evaluates this statement in conjunction with the data provided by the user. In MEGA, this test is performed by applying a maximum likelihood test to a given tree topology and sequence alignment. This produces two log-likelihood values, one with the clock hypothesis and one without.[34] Another approach offered by MEGA is Tajima's relative rate test. This method compares the number of substitutions per site between different sequences. If the resulting numbers differ by a large factor, the molecular clock hypothesis may not be valid for the given data set.[35]
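Tajima's relative rate test can be sketched as a chi-square test with one degree of freedom on the counts of sites unique to each of two ingroup sequences, using a third sequence as the outgroup. The function name and example sequences below are hypothetical, and the sketch assumes gap-free aligned sequences.

```python
import math


def tajima_relative_rate_test(seq1, seq2, outgroup):
    """Return (m1, m2, chi-square, P-value) for two sequences and an outgroup."""
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a != b and b == o:
            m1 += 1  # site changed only in the lineage of seq1
        elif a != b and a == o:
            m2 += 1  # site changed only in the lineage of seq2
    if m1 + m2 == 0:
        raise ValueError("no informative sites")
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p_value


print(tajima_relative_rate_test("CCCCCCAAAA", "AAAAAAAAAC", "AAAAAAAAAA"))
```

Under the molecular clock hypothesis, m1 and m2 are expected to be equal, so a small P-value suggests that the two lineages evolved at different rates.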
MEGA offers five methods for building a phylogenetic tree.
Each method allows for a bootstrap phylogeny test with any number of replications. Neighbor joining and minimum evolution also allow an interior-branch test instead. The substitution models and parameters are the same as those of the distance estimation methods.
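The resampling step behind a bootstrap test can be sketched as drawing alignment columns with replacement. This covers only the generation of pseudo-replicate alignments, not tree building or the computation of support values; the function names and the seed parameter are hypothetical.

```python
import random


def bootstrap_alignment(sequences, rng):
    """Resample alignment columns with replacement, keeping rows (taxa) intact."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in columns) for seq in sequences]


def bootstrap_replicates(sequences, n_replicates, seed=0):
    """Generate n_replicates pseudo-replicate alignments from a seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(sequences, rng) for _ in range(n_replicates)]


alignment = ["ACGT", "ACGA", "TCGA"]
for replicate in bootstrap_replicates(alignment, 3, seed=1):
    print(replicate)
```

A tree would then be inferred from each replicate, and the support for a branch is the fraction of replicate trees that contain it.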
MEGA provides a graphical interface for displaying a phylogenetic tree based on a variety of options. In the view menu, the tree can be displayed in three different styles: traditional, radiation, or circle. Traditional trees have three different branch styles: rectangular, straight, or curved. The view menu also offers toggling topology scaling, changing font type and size, arranging taxa, showing/hiding various details, and a general option for more control over the tree drawing aspects.[36]
The subtree menu provides options for manipulating the tree, such as swapping branches, flipping lineage order, compressing or expanding subtrees, and moving the tree's root. A subtree can also be displayed in its own tree explorer with all the same features and options.[37] The compute menu provides options for computing a condensed tree, a consensus tree, or a timetree with or without a molecular clock.[38] The file menu provides options for saving, exporting, printing, and exiting. The tree topology can be exported to a file in MEGA tree format or, for timetrees, exported in a tabular format with the relevant information used when constructing the timetree. Other export options include the current timetree calibrations, analysis summary, partition list, and pairwise distances.[39] The tree explorer also provides options, under the image menu, to save the current tree display in an image format or to the clipboard. The supported image formats are BMP, PNG, PDF, SVG, TIFF, and EMF.[40] If the user chooses to build the tree with bootstrap replication, the tree explorer has two tabs, one with the original tree and one with the bootstrap consensus tree.