GeneMark explained

GeneMark
Author:	Bioinformatics group of Mark Borodovsky
Developer:	Georgia Institute of Technology
Released:	1993
Operating System:	Linux, Windows, and Mac OS
License:	Free binary-only for academic, non-profit or U.S. Government use

GeneMark is a generic name for a family of ab initio gene prediction algorithms and software programs developed at the Georgia Institute of Technology in Atlanta. The original GeneMark, developed in 1993, was used in 1995 as the primary gene prediction tool for annotation of the first completely sequenced bacterial genome, Haemophilus influenzae, and in 1996 for the first archaeal genome, Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in two DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type (protein-coding and non-coding). The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of its being "protein-coding" (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or being "non-coding". The original GeneMark, developed before the advent of HMM applications in bioinformatics, was an HMM-like algorithm: it can be viewed as an approximation to the posterior decoding algorithm known from HMM theory, applied to an appropriately defined HMM of the DNA sequence.
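The Bayesian step described above can be illustrated with a minimal Python sketch. It is not the GeneMark implementation: it assumes a zero-order three-periodic model (one nucleotide distribution per codon position) in place of higher-order Markov chains, and the frequency tables, the uniform non-coding model, the prior, and all function names are made-up placeholders chosen only for illustration.

```python
import math

# Toy parameters (assumptions for illustration, not trained values):
# nucleotide frequencies at codon positions 1, 2 and 3 of coding sequence.
CODING = [
    {"A": 0.28, "C": 0.20, "G": 0.36, "T": 0.16},
    {"A": 0.30, "C": 0.23, "G": 0.18, "T": 0.29},
    {"A": 0.18, "C": 0.30, "G": 0.30, "T": 0.22},
]
# Homogeneous (position-independent) model of non-coding sequence.
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    """Log-likelihood under the three-periodic model.

    `frame` (0, 1 or 2) is the codon position of the first nucleotide.
    """
    return sum(math.log(CODING[(i + frame) % 3][nt]) for i, nt in enumerate(seq))


def log_lik_noncoding(seq):
    return sum(math.log(NONCODING[nt]) for nt in seq)


def posteriors(seq, prior_coding=0.5):
    """Posterior probability of each of the seven hypotheses for `seq`:
    coding in one of six frames (three per strand) or non-coding."""
    rc = reverse_complement(seq)
    log_joint = {}
    for f in range(3):
        # The coding prior is split evenly over the six frames (an assumption).
        log_joint[f"+{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(seq, f)
        log_joint[f"-{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(rc, f)
    log_joint["non-coding"] = math.log(1 - prior_coding) + log_lik_noncoding(seq)
    # Normalise in log space to avoid underflow on long fragments.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return {k: math.exp(v - m) / total for k, v in log_joint.items()}
```

Calling `posteriors(fragment)` on a string over the alphabet A, C, G, T returns a dictionary with seven entries that sum to one; the frames on the complementary strand are evaluated by scoring the reverse complement of the fragment.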

Further improvements in the algorithms for gene prediction in prokaryotic genomes

The GeneMark.hmm algorithm (1998) was designed to improve the accuracy of prediction of short genes and gene starts. The idea was to use the inhomogeneous Markov chain models introduced in GeneMark to compute likelihoods of the sequences emitted by the states of a hidden Markov model, more precisely a semi-Markov HMM, or generalized HMM (GHMM), describing the genomic sequence. The borders between coding and non-coding regions were formally interpreted as transitions between hidden states. Additionally, a ribosome binding site model was added to the GHMM to improve the accuracy of gene start prediction. The next important step in the algorithm development was the introduction of self-training, or unsupervised training, of the model parameters in the gene prediction tool GeneMarkS (2001). The rapid accumulation of prokaryotic genomes in the following years showed that the structure of sequence patterns related to gene expression regulation signals near gene starts may vary. It was also observed that a prokaryotic genome may exhibit GC content variability due to lateral gene transfer. The new algorithm, GeneMarkS-2, was designed to adjust automatically to the types of gene expression patterns and to the GC content changes along the genomic sequence. GeneMarkS and, later, GeneMarkS-2 have been used in the NCBI pipeline for prokaryotic genome annotation (PGAP).
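The generalized HMM idea, in which each hidden state emits a whole segment and region borders are state transitions, can be sketched with a Viterbi-style dynamic program, one standard way to decode such a model. The sketch below is a heavily simplified assumption-laden illustration, not GeneMark.hmm itself: it uses only two states on one strand, no start or stop codons, no ribosome binding site model, zero-order emission tables, and geometric length distributions whose means are arbitrary placeholders.

```python
import math

# Toy emission parameters (assumptions, not trained values).
CODING = [
    {"A": 0.28, "C": 0.20, "G": 0.36, "T": 0.16},
    {"A": 0.30, "C": 0.23, "G": 0.18, "T": 0.29},
    {"A": 0.18, "C": 0.30, "G": 0.30, "T": 0.22},
]
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def duration_logp(state, d):
    """Hypothetical geometric length distribution for each state."""
    p = 1.0 / (30.0 if state == "C" else 20.0)
    return math.log(p) + (d - 1) * math.log(1.0 - p)


def segment(seq, max_len=300):
    """Most likely split of `seq` into alternating non-coding ("N") and
    coding ("C") segments; returns a list of (start, end, state)."""
    n = len(seq)
    # Prefix sums of log emission probabilities, one table per codon phase.
    nc = [0.0]
    for nt in seq:
        nc.append(nc[-1] + math.log(NONCODING[nt]))
    cd = []
    for f in range(3):
        row = [0.0]
        for j, nt in enumerate(seq):
            row.append(row[-1] + math.log(CODING[(j + f) % 3][nt]))
        cd.append(row)

    neg = float("-inf")
    best = {"N": [neg] * (n + 1), "C": [neg] * (n + 1)}
    back = {"N": [0] * (n + 1), "C": [0] * (n + 1)}
    for t in range(1, n + 1):
        for s, other in (("N", "C"), ("C", "N")):
            for d in range(1, min(t, max_len) + 1):
                if s == "C" and d % 3:
                    continue  # coding segments consist of whole codons
                a = t - d
                prev = 0.0 if a == 0 else best[other][a]
                if prev == neg:
                    continue
                if s == "N":
                    emit = nc[t] - nc[a]
                else:
                    f = (-a) % 3  # phase that puts codon position 1 at index a
                    emit = cd[f][t] - cd[f][a]
                score = prev + duration_logp(s, d) + emit
                if score > best[s][t]:
                    best[s][t], back[s][t] = score, a

    # Trace back from the better-scoring final state.
    s = max("NC", key=lambda k: best[k][n])
    t, segments = n, []
    while t > 0:
        a = back[s][t]
        segments.append((a, t, s))
        t, s = a, ("C" if s == "N" else "N")
    return segments[::-1]
```

For example, `segment("T" * 60 + "GAC" * 30 + "T" * 60)` returns a list of half-open intervals that tile the input. Because segment lengths are scored explicitly by `duration_logp`, any other length distribution can be substituted without changing the recursion, which is the practical appeal of the semi-Markov formulation.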

Heuristic Models and Gene Prediction in Metagenomes and Metatranscriptomes

Accurate identification of species-specific parameters of a gene finding algorithm is a necessary condition for making accurate gene predictions. However, in studies of viral genomes one needs to estimate parameters from a rather short sequence that has no large genomic context. Importantly, starting in 2004, the same question had to be addressed for gene prediction in short metagenomic sequences. A surprisingly accurate answer was found by introducing parameter-generating functions that depend on a single variable, the sequence G+C content (the "heuristic method", 1999). Subsequently, analysis of several hundred prokaryotic genomes led to the development of a more advanced heuristic method in 2010 (implemented in MetaGeneMark). Later, the need to predict genes in RNA transcripts led to the development of GeneMarkS-T (2015), a tool that identifies intron-less genes in long transcript sequences assembled from RNA-Seq reads.
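The idea of generating model parameters from G+C content alone can be sketched as follows. This is only an illustration of the principle: the linear form, the intercepts and slopes in `LINEAR`, the clamping bounds, and the even split within G/C and A/T are hypothetical placeholders, not the functions fitted in the published heuristic method.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous nucleotides of `seq`."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


# Hypothetical (intercept, slope) pairs, one per codon position. The values
# are placeholders; they only mimic the qualitative pattern that the third
# codon position follows genomic G+C most steeply.
LINEAR = [(0.30, 0.55), (0.25, 0.30), (-0.35, 1.70)]


def heuristic_model(gc):
    """Build a toy three-periodic model from a single number, the G+C content.

    Returns one nucleotide distribution per codon position.
    """
    model = []
    for intercept, slope in LINEAR:
        # Clamp so that every nucleotide keeps a non-zero probability.
        gc_k = min(max(intercept + slope * gc, 0.02), 0.98)
        model.append({
            "G": gc_k / 2, "C": gc_k / 2,
            "A": (1 - gc_k) / 2, "T": (1 - gc_k) / 2,
        })
    return model
```

A short fragment would be handled as `heuristic_model(gc_content(fragment))`, so no training set is needed; the resulting tables could then be passed to any scoring routine that expects per-codon-position frequencies.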

Eukaryotic gene prediction

In eukaryotic genomes, modeling the borders of exons with introns and intergenic regions presents a major challenge. The GHMM architecture of eukaryotic GeneMark.hmm includes hidden states for initial, internal, and terminal exons, introns, intergenic regions, and single-exon genes located in both DNA strands. The initial version of the eukaryotic GeneMark.hmm needed manual compilation of training sets of protein-coding sequences for estimation of the algorithm parameters. However, in 2005, the first self-training eukaryotic gene finder, GeneMark-ES, was developed. A fungal version of GeneMark-ES, developed in 2008, features a more complex intron model and a hierarchical strategy of self-training. In 2014, in GeneMark-ET, the self-training of parameters was aided by extrinsic hints generated by mapping short RNA-Seq reads to the genome. Extrinsic evidence is not limited to the 'native' RNA sequences. The cross-species proteins collected in the vast protein databases can be a source of external hints if the homologous relationships between the already known proteins and the proteins encoded by yet unknown genes in the novel genome are established. This task was solved by the new algorithm GeneMark-EP+ (2020). Integration of the RNA and protein sources of the extrinsic hints was done in GeneMark-ETP (2023). The versatility and accuracy of the eukaryotic gene finders of the GeneMark family have led to their incorporation into a number of genome annotation pipelines. Also, since 2016, the pipelines BRAKER1, BRAKER2, and BRAKER3 have been developed to combine the strongest features of GeneMark and AUGUSTUS.
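The hidden states listed above imply a grammar of allowed gene structures. A minimal Python sketch of such a transition table for a single strand is given below; the state names and the table are a simplified assumption for illustration (the actual model also covers the complementary strand and further details not described here).

```python
# Allowed transitions between hidden states on one strand (simplified).
# In a generalized HMM every state emits a whole segment, so a state
# never transitions to itself.
ALLOWED = {
    "intergenic": {"initial_exon", "single_exon_gene"},
    "initial_exon": {"intron"},
    "intron": {"internal_exon", "terminal_exon"},
    "internal_exon": {"intron"},
    "terminal_exon": {"intergenic"},
    "single_exon_gene": {"intergenic"},
}


def is_valid_path(states):
    """Check that a sequence of hidden states follows the transition table."""
    if not states or any(s not in ALLOWED for s in states):
        return False
    return all(b in ALLOWED[a] for a, b in zip(states, states[1:]))
```

For instance, a two-exon gene corresponds to the path intergenic, initial_exon, intron, terminal_exon, intergenic, whereas a path that moves from an initial exon straight to intergenic sequence is rejected.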

Notably, gene prediction in eukaryotic transcripts can be done with GeneMarkS-T (2015).

GeneMark Family of Gene Prediction Programs

Bacteria, Archaea

GeneMark
GeneMarkS
GeneMarkS-2

Metagenomes and Metatranscriptomes

MetaGeneMark
GeneMarkS-T

Eukaryotes

GeneMark
GeneMark.hmm ^[1]
GeneMark-ES: ab initio gene finding algorithm for eukaryotic genomes with automatic (unsupervised) training.^[2]
GeneMark-ET: augments GeneMark-ES by integrating RNA-Seq read alignments into the self-training procedure.^[3]
GeneMark-EP+: augments GeneMark-ES by iteratively finding genes in a novel genome, detecting similarities of predicted genes to known proteins, splice-aligning the known proteins to the genome, generating hints for the next round of prediction, and correcting predictions based on the external evidence.
GeneMark-ETP: integrates genomic, transcript, and protein evidence into gene prediction.

Viruses, phages and plasmids

Heuristic models

Transcripts assembled from RNA-Seq reads

GeneMarkS-T

References

Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry (1993) 17 (2): 123–133.
Lukashin A. and Borodovsky M. "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Research (1998) 26 (4): 1107–1115.
Besemer J. and Borodovsky M. "Heuristic approach to deriving models for gene finding." Nucleic Acids Research (1999) 27 (19): 3911–3920.
Besemer J., Lomsadze A., and Borodovsky M. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research (2001) 29 (12): 2607–2618.
Mills R., Rozanov M., Lomsadze A., Tatusova T., and Borodovsky M. "Improving gene annotation in complete viral genomes." Nucleic Acids Research (2003) 31 (23): 7041–7055.
Besemer J. and Borodovsky M. "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses." Nucleic Acids Research (2005) 33 (Web Server Issue): W451–W454.
Lomsadze A., Ter-Hovhannisyan V., Chernoff Y., and Borodovsky M. "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research (2005) 33 (20): 6494–6506.
Ter-Hovhannisyan V., Lomsadze A., Chernoff Y., and Borodovsky M. "Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training." Genome Research (2008) 18 (12): 1979–1990.
Zhu W., Lomsadze A., and Borodovsky M. "Ab initio gene identification in metagenomic sequences." Nucleic Acids Research (2010) 38 (12): e132.
Lomsadze A., Burns P.D., and Borodovsky M. "Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm." Nucleic Acids Research (2014) 42 (15): e119.
Tang S., Lomsadze A., and Borodovsky M. "Identification of protein coding regions in RNA transcripts." Nucleic Acids Research (2015) 43 (12): e78.
Tatusova T., DiCuccio M., Badretdin A., Chetvernin V., Nawrocki E., Zaslavsky L., Lomsadze A., Pruitt K., Borodovsky M., and Ostell J. "NCBI prokaryotic genome annotation pipeline." Nucleic Acids Research (2016) 44 (14): 6614–6624.
Hoff K., Lange S., Lomsadze A., Borodovsky M., and Stanke M. "BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS." Bioinformatics (2016) 32 (5): 767–769.
Lomsadze A., Gemayel K., Tang S., and Borodovsky M. "Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes." Genome Research (2018) 28 (7): 1079–1089.
Bruna T., Hoff K., Lomsadze A., Stanke M., and Borodovsky M. "BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database." NAR Genomics and Bioinformatics (2021) 3 (1): lqaa108.
Bruna T., Lomsadze A., and Borodovsky M. "GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins." NAR Genomics and Bioinformatics (2020) 2 (2): lqaa026.
Bruna T., Lomsadze A., and Borodovsky M. "GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistence with Extrinsic Data." bioRxiv (Jan 5, 2023).
Gabriel L., Brůna T., Hoff K., Ebel M., Lomsadze A., Borodovsky M., and Stanke M. "BRAKER3: Fully automated genome annotation using RNA-Seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA." bioRxiv (Nov 27, 2023).

Notes and References

Web site: GeneMark.HMM eukaryotic.
Web site: GeneMark-ES.
Web site: GeneMark-ET – gene finding algorithm for eukaryotic genomes | RNA-Seq Blog. 9 July 2014.