MAFFT explained
MAFFT
Developer: Kazutaka Katoh
Latest Release Version: 7.475
Operating System: UNIX, Linux, Mac, MS-Windows
Programming Language: C
Genre: Bioinformatics tool
Licence: BSD[1]
In bioinformatics, MAFFT (for multiple alignment using fast Fourier transform) is a program used to create multiple sequence alignments of amino acid or nucleotide sequences. Published in 2002, the first version of MAFFT used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the fast Fourier transform.[2] Subsequent versions of MAFFT have added other algorithms and modes of operation,[3] including options for faster alignment of large numbers of sequences,[4] higher accuracy alignments,[5] alignment of non-coding RNA sequences,[6] and the addition of new sequences to existing alignments.[7]
History
There have been many variations of the MAFFT software, some of which are listed below:
- MAFFT: The first version of MAFFT, created by Kazutaka Katoh in 2002, used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the fast Fourier transform.
- MAFFT v5: The second generation of the MAFFT software, released in 2005, was a rewrite of the original program. It introduced a simplified scoring system that reduces CPU time and increases alignment accuracy, even for sequences with large insertions or extensions and for distantly related sequences of similar length.
- MAFFT v6: The third generation, released in 2006, improved on the previous versions. It implemented group-to-group alignment and an approximate but faster O(N log N) guide-tree-building algorithm, which made it applicable to larger datasets of ~50,000 sequences.
- MAFFT v7: The fourth generation, released in 2012, improved the speed and accuracy of MAFFT substantially.
- MAFFT v7.511: The most recent version of MAFFT, released in December 2022, improved upon MAFFT v7 with various bug fixes. One of the most notable changes was an overhaul of the --merge option, which now supports iterative refinement, the creation of a single MSA from multiple sub-MSAs, and the combination of --merge and --seed. There were also several minor enhancements to the speed and accuracy of MAFFT v7.
Algorithm
The MAFFT algorithm works in five steps: pairwise alignment, distance calculation, guide tree construction, progressive alignment, and iterative refinement.[8]
- Pairwise Alignment: This step identifies the regions that are similar between the input sequences. The algorithm starts by performing pairwise alignments across all pairs of input sequences. The time complexity of aligning one pair is O(L^2), where L is the sequence length.[9]
- Distance Calculation: Using the pairwise alignments, a distance matrix is computed that quantifies the dissimilarity between sequences based on their alignment scores.[9] This step helps organize the sequences by similarity. Its time complexity is O(N^2L^2),[9] where N is the number of sequences and L is the sequence length, because each of the roughly N^2/2 sequence pairs requires a comparison of every position of both sequences.
- Guide Tree Construction: From the distance matrix, a guide tree is built as a hierarchical representation of the clusters: each node is a cluster, and the branches represent the distances between clusters. The time complexity of guide tree construction is O(N^2L),[10] where N is the number of sequences and L is the sequence length.
- Progressive Alignment: Following the guide tree, progressive alignment proceeds from the leaves to the root. At each internal node, the child nodes are aligned to produce a consensus alignment for the parent node. This continues until the entire tree has been traversed, yielding the final multiple sequence alignment. The time complexity of the progressive method is O(N^2L) + O(NL^2): the first term corresponds to the guide tree calculation described above, and the second to group-to-group alignment.
- Iterative Refinement: The iterative refinement step repeats the process with adjustments to the positions of gaps and insertions to improve alignment accuracy. Its time complexity depends on the number of iterations, but is generally O(N^2L) + O(NL^2), where N is the number of sequences and L is the sequence length.
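The first three steps can be illustrated with a small, self-contained Python sketch. It is a toy under stated assumptions, not MAFFT's implementation: the pairwise step uses plain Needleman-Wunsch with a linear gap cost, the distance is simply one minus the fraction of identical columns, and the guide tree is built with UPGMA. The function names (`nw_align`, `distance`, `guide_tree`) and the scoring values are hypothetical choices made for this example.

```python
from itertools import combinations

def nw_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment (Needleman-Wunsch), O(len(a) * len(b))."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def distance(a, b):
    """Toy distance: 1 - fraction of identical (non-gap) alignment columns."""
    x, y = nw_align(a, b)
    same = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 1.0 - same / len(x)

def guide_tree(seqs):
    """Build a UPGMA guide tree (nested tuples) from a dict name -> sequence."""
    clusters = {name: 1 for name in seqs}  # cluster -> number of leaves
    dist = {frozenset(p): distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)     # closest pair of clusters
        a, b = sorted(pair, key=str)       # fixed order for reproducibility
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        del dist[pair]
        merged = (a, b)
        for c in clusters:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # UPGMA: size-weighted average of the two old distances.
            dist[frozenset((merged, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b))
        clusters[merged] = size_a + size_b
    return next(iter(clusters))
```

Calling `guide_tree` on a dictionary of named sequences returns nested tuples whose innermost pairs are the most similar sequences; a progressive aligner would then align those pairs first and work outward toward the root.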
Input/Output
Web Form
Input
This program can take in multiple sequences as input, which can be entered in two ways:
Sequence Input Window
The user can directly enter three or more sequences in the input window in any of the following formats: GCG, FASTA, EMBL (nucleotide only), GenBank, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot (protein only). It is important to note that partially formatted sequences are not accepted, and adding a return to the end of the sequence may help certain applications understand the input. It is also advised to avoid using data from word processors as hidden/control characters may be present.[11]
Sequence File Upload
The user can upload a file containing three or more valid sequences in any format mentioned above. Word processor files may yield unpredictable results due to the presence of hidden/control characters, so it is best to save files with the Unix format option to avoid hidden Windows characters. Once the file is uploaded, it can be used as input for multiple sequence alignment.
Text files saved in DOS/Windows format have different line endings from those saved on Unix/Linux. DOS/Windows uses a combination of carriage return and line feed characters ("\r\n") to indicate the end of a line, while Unix/Linux systems use only a line feed character ("\n").[12]
When transferring files between Windows and Unix-based systems, it's important to be aware of these differences to ensure that the line endings are correctly translated. Otherwise, the hidden carriage return characters in the Windows-formatted files may cause issues when viewed or edited on Unix-based systems, and vice versa.
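As a minimal sketch of how such a file could be cleaned before upload, the Python snippet below rewrites Windows ("\r\n") and lone carriage-return line endings as Unix line feeds. The function names and the file paths passed to `convert_file` are hypothetical; the web form itself does not require this script.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def convert_file(src_path: str, dst_path: str) -> None:
    """Read a sequence file in binary mode and write a Unix-format copy."""
    with open(src_path, "rb") as src:
        cleaned = to_unix_newlines(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
```

Working in binary mode avoids any automatic newline translation by the runtime, so the output contains exactly the bytes produced by `to_unix_newlines`.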
Output
The user will have the option to request the Multiple Sequence Alignment (MSA) to be generated in one of the two available formats:
| Output Format | Description | Abbreviation |
| Pearson/FASTA | Pearson or FASTA sequence format | fasta |
| ClustalW | ClustalW alignment format without base/residue numbering | clustalw |
Default value is: Pearson/FASTA [fasta]
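An alignment returned in Pearson/FASTA format can be read with a few lines of Python. The sketch below assumes the usual layout (a ">" header line followed by one or more sequence lines, with "-" marking gaps); `read_fasta` is a hypothetical helper written for this example, not part of MAFFT.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping record name -> sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()       # header line starts a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)    # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}

example = ">seq1\nAC-GT\nAA\n>seq2\nACCGT\nA-\n"
alignment = read_fasta(example)
# In a valid alignment every row has the same length, gaps included.
assert len({len(row) for row in alignment.values()}) == 1
```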
Understanding ClustalW output:
| Symbol | Definition | Meaning |
| * | asterisk | Conserved sequence (identical) |
| : | colon | Conservative mutation |
| . | period | Semi-conservative mutation |
|   | blank | Non-conservative mutation |
| - | dash | Gap |
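To make the simplest of these symbols concrete, the following Python sketch marks fully conserved columns of an alignment with an asterisk. It deliberately implements only the "identical" rule: the colon and period symbols depend on residue similarity groups that are not reproduced here, so those columns are left blank. The function name `identity_line` is a hypothetical choice for this example.

```python
def identity_line(rows):
    """Return a string with '*' under columns where all rows share one residue."""
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        residues = set(column)
        # A column is conserved only if it holds a single, non-gap residue.
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)

rows = ["AC-GT", "ACCGT", "ACAGA"]
print(identity_line(rows))
```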
Settings
There are many settings that affect how the MAFFT algorithm works. Adjusting the settings to your needs is the best way to get accurate and meaningful results. The most important settings to understand are: the Scoring Matrix, Gap Open Penalty, and Gap Extension Penalty.
- Scoring Matrix: "Protein sequence similarity searching programs like BLASTP, SSEARCH (UNIT 3.10), and FASTA use scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for BLAST, BLOSUM50 for SEARCH and FASTA). Different similarity scoring matrices are most effective at different evolutionary distances. “Deep” scoring matrices like BLOSUM62 and BLOSUM50 target alignments with 20 – 30% identity, while “shallow” scoring matrices (e.g. VTML10 – VTML80), target alignments that share 90 – 50% identity, reflecting much less evolutionary change."[13] In original MAFFT the scoring equation is shown below.
- Gap Open Penalty: A gap penalty is a negative score assigned to a gap in an alignment. It can be constant, where a fixed cost is charged for the gap, or linear, where a fixed cost is charged for each symbol inserted or deleted. An affine gap penalty combines the two by charging a constant penalty for the first symbol of a gap and another constant penalty for each additional symbol inserted or deleted.[14]
- Gap Extension Penalty: Gap extension penalty is a cost score assigned for each additional gap symbol in a gap region in sequence alignment. It is used to discourage the formation of long gap regions. It is typically smaller than the gap opening penalty.[15]
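The affine scheme described above can be written as a one-line formula: a gap of n symbols costs the opening penalty plus (n - 1) times the extension penalty. The Python sketch below uses that definition; the penalty values in the comparison are arbitrary illustrations, not MAFFT defaults.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap: gap_open for the first symbol, gap_extend for each further one."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

# With open=10 and extend=1 (illustrative values only), one long gap is much
# cheaper than several short ones covering the same number of symbols.
one_gap_of_four = affine_gap_cost(4, 10, 1)
four_gaps_of_one = 4 * affine_gap_cost(1, 10, 1)
assert one_gap_of_four < four_gaps_of_one
```

Because the extension penalty is typically smaller than the opening penalty, an aligner using this scheme tends to group gap symbols into fewer, longer gaps rather than scattering them.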
Accuracy and Results
MAFFT is widely considered one of the most accurate and versatile tools for multiple sequence alignment in bioinformatics. Studies have shown that MAFFT performs well compared with other popular programs such as ClustalW and T-Coffee, particularly for larger datasets and highly divergent sequences.[16] For example, in a study comparing the performance of various alignment programs on increasing sequence lengths, MAFFT's FFT-NS-2 method was the fastest program for all tested sequence sizes. This speed is attributed to its use of the fast Fourier transform (FFT), which enables rapid and accurate alignment of even highly divergent sequences. Because of the FFT, the algorithm runs in either O(n^2) or O(n) time, depending on the data set. MAFFT takes less CPU time than other methods of the same or similar accuracy, notably T-Coffee, ClustalW, and Needleman-Wunsch.
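The general idea behind FFT-based matching can be sketched in plain Python: encode each sequence numerically, compute the cross-correlation of the two encodings with an FFT, and read the peak as the offset at which the sequences agree best. The code below is an illustration under simplifying assumptions, not MAFFT's implementation: it uses one 0/1 indicator track per nucleotide, a textbook recursive radix-2 FFT, and hypothetical function names (`fft`, `match_counts`, `best_lag`).

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    The inverse transform is left unnormalized (divide by len(x) afterwards)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def match_counts(seq_a, seq_b, alphabet="ACGT"):
    """c[k] = number of positions n with seq_a[n] == seq_b[n + k].
    Negative lags k are stored at index size + k."""
    size = 1
    while size < len(seq_a) + len(seq_b):   # zero-pad to avoid wrap-around
        size *= 2
    total = [0.0] * size
    for letter in alphabet:
        ind_a = [1.0 if ch == letter else 0.0 for ch in seq_a]
        ind_b = [1.0 if ch == letter else 0.0 for ch in seq_b]
        fa = fft(ind_a + [0.0] * (size - len(ind_a)))
        fb = fft(ind_b + [0.0] * (size - len(ind_b)))
        # Correlation theorem: corr = IFFT(conj(FFT(a)) * FFT(b)).
        prod = [p.conjugate() * q for p, q in zip(fa, fb)]
        corr = fft(prod, inverse=True)
        for k in range(size):
            total[k] += corr[k].real / size
    return total

def best_lag(seq_a, seq_b):
    """Offset of seq_b relative to seq_a that gives the most matching positions."""
    counts = match_counts(seq_a, seq_b)
    size = len(counts)
    k = max(range(size), key=lambda i: counts[i])
    return k if k < size // 2 else k - size
```

The correlation for all offsets is obtained in O(n log n) time instead of the O(n^2) needed to test each offset directly, which is why an FFT can quickly point an aligner at the offsets worth examining in detail.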
Subsequent versions of MAFFT have added other algorithms and modes of operation, including options for faster alignment of large numbers of sequences,[9] higher accuracy alignments,[17] alignment of non-coding RNA sequences,[18] and the addition of new sequences to existing alignments.[19]
MAFFT stands out among other popular algorithms such as ClustalW and T-Coffee due to its high accuracy, versatility, and range of features. It offers various alignment methods and strategies, including iterative refinement and consistency-based approaches, that further enhance the accuracy and robustness of the alignments. As a result, MAFFT is widely recognized as a powerful tool for multiple sequence alignment and is highly appreciated by the scientific community.[20]
Notes and References
- The base MAFFT software is distributed under the BSD license, while versions for Microsoft Windows are licensed under the GNU General Public License. Some distributions of MAFFT contain software licensed under other licenses. https://mafft.cbrc.jp/alignment/software/
- Katoh, Kazutaka; Misawa, Kazuharu; Kuma, Kei-ichi; Miyata, Takashi (2002). "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform". Nucleic Acids Research. 30 (14): 3059–66. doi:10.1093/nar/gkf436. PMID 12136088. PMC 135756.
- "MAFFT ver.7 - a multiple sequence alignment program". mafft.cbrc.jp. 28 April 2021.
- Katoh, K; Toh, H (2006). "PartTree: An algorithm to build an approximate tree from a large number of unaligned sequences". Bioinformatics. 23 (3): 372–4. doi:10.1093/bioinformatics/btl592. PMID 17118958.
- Katoh, K; Kuma, K; Miyata, T; Toh, H (2005). "Improvement in the accuracy of multiple sequence alignment program MAFFT". Genome Informatics. International Conference on Genome Informatics. 16 (1): 22–33. PMID 16362903.
- Katoh, Kazutaka; Toh, Hiroyuki (2008). "Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework". BMC Bioinformatics. 9: 212. doi:10.1186/1471-2105-9-212. PMID 18439255. PMC 2387179.
- Katoh, Kazutaka; Frith, Martin C (2012). "Adding unaligned sequences into an existing alignment using MAFFT and LAST". Bioinformatics. 28 (23): 3144–6. doi:10.1093/bioinformatics/bts578. PMID 23023983. PMC 3516148.
- The base MAFFT software is distributed under the BSD license, while versions for Microsoft Windows are licensed under the GNU General Public License. Some distributions of MAFFT contain software licensed under other licenses. https://mafft.cbrc.jp/alignment/software/
- Katoh, K.; Standley, D. M. (April 2013). "MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability". Molecular Biology and Evolution. 30 (4): 772–780. doi:10.1093/molbev/mst010. PMID 23329690. PMC 3603318.
- Katoh, Kazutaka; Toh, Hiroyuki (July 2008). "Recent developments in the MAFFT multiple sequence alignment program". Briefings in Bioinformatics. 9 (4): 286–298. doi:10.1093/bib/bbn013. PMID 18372315.
- "MAFFT Help and Documentation - Job Dispatcher Sequence Analysis Tools - EMBL-EBI". www.ebi.ac.uk. 2023-04-24.
- "Windows vs. Unix Line Endings". www.cs.toronto.edu. 2023-04-27.
- Pearson, William R. (October 2013). "Selecting the Right Similarity‐Scoring Matrix". Current Protocols in Bioinformatics. 43 (1): 3.5.1–3.5.9. doi:10.1002/0471250953.bi0305s43. PMID 24509512. PMC 3848038.
- "ROSALIND | Glossary | Gap penalty".
- Carroll, Hyrum; Clement, Mark; Ridge, Perry; Snell, Quinn (October 2006). "Effects of Gap Open and Gap Extension Penalties". Faculty Publications.
- Edgar, Robert; Batzoglou, Serafim (June 2006). "Multiple sequence alignment". Current Opinion in Structural Biology. 16 (3): 368–373. doi:10.1016/j.sbi.2006.04.004. PMID 16679011.
- Katoh, Kazutaka (2010-04-28). "Parallelization of the MAFFT multiple sequence alignment program". Bioinformatics. 26 (15): 1899–1900. doi:10.1093/bioinformatics/btq224. PMID 20427515. PMC 2905546.
- Yamada, Kazunori (4 July 2016). "Application of the MAFFT sequence alignment program to large data—reexamination of the usefulness of chained guide trees". Bioinformatics. 32 (21): 3246–3251. doi:10.1093/bioinformatics/btw412. PMID 27378296. PMC 5079479.
- Katoh, Kazutaka (27 September 2012). "Adding unaligned sequences into an existing alignment using MAFFT and LAST". Bioinformatics. 28 (23): 3144–3146. doi:10.1093/bioinformatics/bts578. PMID 23023983. PMC 3516148.
- Edgar, R. C. (8 March 2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput". Nucleic Acids Research. 32 (5): 1792–1797. doi:10.1093/nar/gkh340. PMID 15034147. PMC 390337.