GeneMark | |
Author: | Bioinformatics group of Mark Borodovsky |
Developer: | Georgia Institute of Technology |
Released: | 1993 |
Operating System: | Linux, Windows, and Mac OS |
License: | Free binary-only for academic, non-profit or U.S. Government use |
GeneMark is a generic name for a family of ab initio gene prediction algorithms and software programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, the original GeneMark was used in 1995 as a primary gene prediction tool for annotation of the first completely sequenced bacterial genome, that of Haemophilus influenzae, and in 1996 for the first archaeal genome, that of Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in two DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type (protein-coding and non-coding). The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of the fragment being "protein-coding" (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or being "non-coding". The original GeneMark (developed before the advent of HMM applications in bioinformatics) was an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known in HMM theory, for an appropriately defined HMM of the DNA sequence.
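The Bayesian step described above can be illustrated with a minimal Python sketch. It is not GeneMark's implementation: it uses a zero-order three-periodic model (one base distribution per codon position) rather than higher-order Markov chains, and the tables `CODING` and `NONCODING`, the prior `prior_coding`, and the function names are hypothetical values and names chosen only for illustration.

```python
import math

# Hypothetical toy parameters (NOT trained GeneMark values).
# CODING[k][b]: probability of base b at codon position k (k = 0, 1, 2).
CODING = [
    {"A": 0.28, "C": 0.20, "G": 0.36, "T": 0.16},
    {"A": 0.30, "C": 0.24, "G": 0.18, "T": 0.28},
    {"A": 0.16, "C": 0.32, "G": 0.34, "T": 0.18},
]
# Homogeneous (position-independent) model of non-coding sequence.
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def coding_log_likelihood(seq, frame):
    """Log-likelihood of seq if its first base sits at codon position `frame`."""
    return sum(math.log(CODING[(frame + i) % 3][b]) for i, b in enumerate(seq))


def frame_posteriors(seq, prior_coding=0.5):
    """Posterior probability of seven hypotheses: six frames plus non-coding."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    log_scores = {
        "non-coding": math.log(1 - prior_coding)
        + sum(math.log(NONCODING[b]) for b in seq)
    }
    # Assumption: the coding prior is split evenly over the six frames.
    log_prior_frame = math.log(prior_coding / 6)
    for f in range(3):
        log_scores[f"+{f + 1}"] = log_prior_frame + coding_log_likelihood(seq, f)
        log_scores[f"-{f + 1}"] = log_prior_frame + coding_log_likelihood(rc, f)
    # Normalize in a numerically stable way (subtract the maximum log score).
    top = max(log_scores.values())
    weights = {h: math.exp(s - top) for h, s in log_scores.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}
```

Calling `frame_posteriors("ATGGCGAAAGCTCTGACC")` returns a dictionary with seven entries (`"+1"` to `"+3"`, `"-1"` to `"-3"`, and `"non-coding"`) whose values sum to one. In practice such a computation would be applied to a window sliding along the genome, with parameters estimated from training sequences.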
The GeneMark.hmm algorithm (1998) was designed to improve the accuracy of prediction of short genes and gene starts. The idea was to use the inhomogeneous Markov chain models introduced in GeneMark for computing likelihoods of the sequences emitted by the states of a hidden Markov model (more precisely, a semi-Markov HMM, or generalized HMM, GHMM) describing the genomic sequence. The borders between coding and non-coding regions were formally interpreted as transitions between hidden states. Additionally, a ribosome binding site model was added to the GHMM to improve the accuracy of gene start prediction. The next important step in the algorithm development was the introduction of self-training, or unsupervised training, of the model parameters in the new gene prediction tool GeneMarkS (2001). The rapid accumulation of prokaryotic genomes in the following years showed that the structure of sequence patterns related to gene expression regulation signals near gene starts may vary. It was also observed that prokaryotic genomes may exhibit GC content variability due to lateral gene transfer. The new algorithm, GeneMarkS-2, was designed to adjust automatically to the types of gene expression patterns and to GC content changes along the genomic sequence. GeneMarkS and, later, GeneMarkS-2 have been used in the NCBI pipeline for prokaryotic genome annotation (PGAP).
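The interpretation of coding/non-coding borders as transitions between hidden states can be sketched with the standard Viterbi algorithm, which finds the most likely state path of an HMM. The example below is a deliberately simplified assumption: a plain two-state HMM with single-base emissions, without the state durations, strand-specific states, or ribosome binding site model of a GHMM. All probabilities in `START`, `TRANS`, and `EMIT` are hypothetical.

```python
import math

STATES = ("noncoding", "coding")
# Hypothetical toy parameters, chosen only to make the example readable.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most likely hidden-state path for seq (one state per base)."""
    if not seq:
        return []
    # Log-score of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

For a sequence such as ten `A` bases followed by ten `G` bases, the decoded path switches from `"noncoding"` to `"coding"` at the first `G`; that switch is the predicted border. A generalized HMM would instead emit whole segments per state, with explicit length distributions.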
Accurate identification of the species-specific parameters of a gene finding algorithm is a necessary condition for making accurate gene predictions. However, in studies of viral genomes one needs to estimate parameters from a rather short sequence that has no large genomic context. Importantly, starting in 2004, the same question had to be addressed for gene prediction in short metagenomic sequences. A surprisingly accurate answer was found by introducing parameter generating functions that depend on a single variable, the sequence G+C content (the "heuristic method", 1999). Subsequently, analysis of several hundred prokaryotic genomes led to the development of a more advanced heuristic method in 2010 (implemented in MetaGeneMark). Later, the need to predict genes in RNA transcripts led to the development of GeneMarkS-T (2015), a tool that identifies intron-less genes in long transcript sequences assembled from RNA-Seq reads.
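The idea of generating model parameters from G+C content alone can be sketched as follows. The linear form and the coefficients in `GC_LINES` are hypothetical placeholders, not the published heuristic model; they only illustrate how a single number measured on a short sequence can be mapped to a full set of codon-position base frequencies.

```python
# Hypothetical (intercept, slope) pairs giving the G+C frequency at each of
# the three codon positions as a linear function of overall G+C content.
# These numbers are illustrative assumptions, not fitted values.
GC_LINES = [(0.35, 0.45), (0.25, 0.30), (-0.35, 1.70)]


def gc_content(seq):
    """Fraction of G and C bases in seq."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_position_frequencies(gc):
    """Generate base frequencies for codon positions 1-3 from G+C content."""
    tables = []
    for intercept, slope in GC_LINES:
        # Keep the generated G+C frequency inside a valid probability range.
        gc_k = min(0.98, max(0.02, intercept + slope * gc))
        # Simplifying assumption: G = C and A = T at every codon position.
        tables.append({
            "A": (1 - gc_k) / 2,
            "T": (1 - gc_k) / 2,
            "G": gc_k / 2,
            "C": gc_k / 2,
        })
    return tables
```

For example, `codon_position_frequencies(gc_content(fragment))` yields three frequency tables that could parameterize a three-periodic coding model for a short `fragment`, with no training set required.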
In eukaryotic genomes, modeling of exon borders with introns and intergenic regions presents a major challenge. The GHMM architecture of eukaryotic GeneMark.hmm includes hidden states for initial, internal, and terminal exons, introns, intergenic regions, and single-exon genes located in both DNA strands. The initial version of the eukaryotic GeneMark.hmm needed manual compilation of training sets of protein-coding sequences for estimation of the algorithm parameters. However, in 2005, the first self-training eukaryotic gene finder, GeneMark-ES, was developed. A fungal version of GeneMark-ES, developed in 2008, features a more complex intron model and a hierarchical strategy of self-training. In 2014, in GeneMark-ET, the self-training of parameters was aided by extrinsic hints generated by mapping short RNA-Seq reads to the genome. Extrinsic evidence is not limited to the 'native' RNA sequences. The cross-species proteins collected in the vast protein databases can be a source of external hints, if homologous relationships are established between the already known proteins and the proteins encoded by yet unknown genes in the novel genome. This task was solved with the development of a new algorithm, GeneMark-EP+ (2020). Integration of the RNA and protein sources of extrinsic hints was done in GeneMark-ETP (2023). The versatility and accuracy of the eukaryotic gene finders of the GeneMark family have led to their incorporation into a number of genome annotation pipelines. Also, since 2016, the pipelines BRAKER1, BRAKER2, and BRAKER3 have been developed to combine the strongest features of GeneMark and AUGUSTUS.
Notably, gene prediction in eukaryotic transcripts can be done with the algorithm GeneMarkS-T (2015).