Superfamily database explained
SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes.[1] [2] [3] [4] [5] [6] [7] It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies.[8] [9] Domains are the functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins for which structural evidence supports a common evolutionary ancestor, even though sequence homology may not be detectable.[10]
Annotations
The SUPERFAMILY annotation is based on a collection of hidden Markov models (HMMs), which represent structural protein domains at the SCOP superfamily level.[11] [12] A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from completely sequenced genomes against the hidden Markov models.
For each protein you can:
- Submit sequences for SCOP classification
- View domain organisation, sequence alignments and protein sequence details
For each genome you can:
- Examine superfamily assignments, phylogenetic trees, domain organisation lists and networks
- Check for over- and under-represented superfamilies within a genome
For each superfamily you can:
- Inspect SCOP classification, functional annotation, Gene Ontology annotation,[6] [13] InterPro abstract and genome assignments
- Explore taxonomic distribution of a superfamily across the tree of life
All annotation, models and the database dump are freely available for anyone to download.
Features
Sequence Search
Submit a protein or DNA sequence for SCOP superfamily and family level classification using the SUPERFAMILY HMMs. Sequences can be submitted either as raw input or by uploading a file, but all must be in FASTA format. A sequence can be an amino acid sequence, a fixed-frame nucleotide sequence, or a nucleotide sequence to be searched in all frames. Up to 1000 sequences can be run at a time.
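Because submissions must be in FASTA format and are limited to 1000 sequences per run, a larger collection has to be split before it is submitted. The following is a minimal sketch of that preparation step only; the function names `read_fasta` and `batch_records` are hypothetical, and the code does not contact the SUPERFAMILY server.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_PER_RUN = 1000  # per-run limit stated in the text above

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def batch_records(records: Iterable[Record],
                  size: int = MAX_PER_RUN) -> List[List[Record]]:
    """Split records into consecutive batches of at most `size` sequences."""
    records = list(records)
    return [records[i:i + size] for i in range(0, len(records), size)]
```

Each batch can then be written back out as FASTA text and submitted as a separate run.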
Keyword Search
Search the database using a superfamily, family, or species name, or a sequence, SCOP, PDB, or HMM ID. A successful search yields the classes, folds, superfamilies, families, and individual proteins matching the query.
Domain Assignments
The database has domain assignments, alignments, and architectures for completely sequenced eukaryotic and prokaryotic organisms, plus sequence collections.
Comparative Genomics Tools
Browse unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks, and domain distribution across taxonomic kingdoms for each organism.
Genome Statistics
For each genome: number of sequences, number of sequences with assignment, percentage of sequences with assignment, percentage total sequence coverage, number of domains assigned, number of superfamilies assigned, number of families assigned, average superfamily size, percentage produced by duplication, average sequence length, average length matched, number of domain pairs, and number of unique domain architectures.
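Several of these statistics follow directly from the domain assignments. The sketch below shows one plausible way to compute a few of them; the input layout (`lengths`, `assignments`), the function name `genome_stats`, and the exact definitions used (for example, coverage as the share of residues falling inside any assigned domain) are assumptions for illustration, not SUPERFAMILY's documented formulas.

```python
from typing import Dict, List, Tuple

# (sequence id, domain start, domain end, superfamily id); coordinates 1-based, inclusive
Assignment = Tuple[str, int, int, str]


def genome_stats(lengths: Dict[str, int],
                 assignments: List[Assignment]) -> Dict[str, float]:
    """Summarise domain assignments for one genome (assumed definitions)."""
    covered: Dict[str, set] = {}
    for seq_id, start, end, _superfamily in assignments:
        # A set of positions avoids double-counting overlapping domains.
        covered.setdefault(seq_id, set()).update(range(start, end + 1))

    n_sequences = len(lengths)
    n_assigned = len(covered)
    total_residues = sum(lengths.values())
    covered_residues = sum(len(positions) for positions in covered.values())

    return {
        "sequences": n_sequences,
        "sequences_with_assignment": n_assigned,
        "percent_sequences_with_assignment":
            100.0 * n_assigned / n_sequences if n_sequences else 0.0,
        "percent_sequence_coverage":
            100.0 * covered_residues / total_residues if total_residues else 0.0,
        "domains_assigned": len(assignments),
        "superfamilies_assigned": len({a[3] for a in assignments}),
    }


# Toy genome: three proteins, two of which carry assigned domains.
example = genome_stats(
    {"p1": 100, "p2": 200, "p3": 100},
    [("p1", 1, 50, "sfA"), ("p2", 1, 100, "sfA"), ("p2", 101, 150, "sfB")],
)
```

In the toy genome, 200 of the 400 residues lie inside assigned domains, so the coverage is 50 percent.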
Gene Ontology
Automatically generated, domain-centric Gene Ontology (GO) annotation.
Because the gap between sequenced proteins and proteins of known function keeps growing, automated methods for functional annotation are becoming increasingly important, especially for proteins with known domains. SUPERFAMILY uses protein-level GO annotations taken from the Gene Ontology Annotation (GOA) project, which offers high-quality GO annotations directly associated with proteins in UniProtKB over a wide spectrum of species.[14] From these, SUPERFAMILY has generated GO annotations for evolutionarily close domains (at the SCOP family level) and distant domains (at the SCOP superfamily level).
Phenotype Ontology
Domain-centric phenotype/anatomy ontology including Disease Ontology, Human Phenotype, Mouse Phenotype, Worm Phenotype, Yeast Phenotype, Fly Phenotype, Fly Anatomy, Zebrafish Anatomy, Xenopus Anatomy, and Arabidopsis Plant.
Superfamily Annotation
InterPro abstracts for over 1,000 superfamilies, and Gene Ontology (GO) annotation for over 700 superfamilies. This feature allows for the direct annotation of key features, functions, and structures of a superfamily.
Functional Annotation
Functional annotation of SCOP 1.73 superfamilies.
The SUPERFAMILY database uses a scheme of 50 detailed function categories which map to 7 general function categories, similar to the scheme used in the COG database.[15] The general function assigned to a superfamily reflects the major function of that superfamily. The general categories of function are:
- Information: storage and maintenance of the genetic code; DNA replication and repair; general transcription and translation.
- Regulation: regulation of gene expression and protein activity; information processing in response to environmental input; signal transduction; general regulatory or receptor activity.
- Metabolism: anabolic and catabolic processes; cell maintenance and homeostasis; secondary metabolism.
- Intra-cellular processes: cell motility and division; cell death; intra-cellular transport; secretion.
- Extra-cellular processes: inter- and extra-cellular processes such as cell adhesion; organismal processes such as blood clotting or the immune system.
- General: general and multiple functions; interactions with proteins, lipids, small molecules, and ions.
- Other/Unknown: an unknown function, viral proteins, or toxins.
Each domain superfamily in SCOP classes a to g was manually annotated using this scheme,[16] [17] [18] drawing on information provided by SCOP,[19] InterPro,[20] [21] Pfam,[22] Swiss-Prot,[23] and various literature sources.
Phylogenetic Trees
Create custom phylogenetic trees by selecting 3 or more available genomes on the SUPERFAMILY site. Trees are generated using heuristic parsimony methods, and are based on protein domain architecture data for all genomes in SUPERFAMILY. Genome combinations, or specific clades, can be displayed as individual trees.
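To illustrate what a parsimony criterion measures on this kind of data, the sketch below scores a fixed tree with the standard Fitch algorithm, treating each domain architecture as a binary character (present or absent in a genome). This is an assumption-laden illustration: the source does not say which parsimony variant or search heuristic SUPERFAMILY uses, the search over tree topologies is omitted, and the genome and architecture names are made up.

```python
from typing import Dict, Set, Tuple, Union

# A rooted binary tree: either a genome name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, states: Dict[str, int]) -> Tuple[Set[int], int]:
    """Return (possible ancestral states, minimum number of changes)
    for one character on `tree`, using Fitch's small-parsimony rule."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint child sets: one state change is needed on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


def tree_score(tree: Tree, presence: Dict[str, Dict[str, int]]) -> int:
    """Sum the Fitch cost over all architectures (lower is more parsimonious)."""
    return sum(fitch(tree, column)[1] for column in presence.values())


# Hypothetical presence/absence of three architectures in four genomes.
presence = {
    "arch1": {"A": 1, "B": 1, "C": 0, "D": 0},
    "arch2": {"A": 1, "B": 0, "C": 1, "D": 0},
    "arch3": {"A": 1, "B": 1, "C": 0, "D": 0},
}
tree_ab = (("A", "B"), ("C", "D"))
tree_ac = (("A", "C"), ("B", "D"))
scores = {"AB|CD": tree_score(tree_ab, presence),
          "AC|BD": tree_score(tree_ac, presence)}
```

Here the tree grouping A with B needs 4 changes and the one grouping A with C needs 5, so a parsimony search would prefer the former. A heuristic method would explore many candidate topologies and keep the lowest-scoring ones.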
Similar Domain Architectures
This feature allows the user to find the 10 domain architectures which are most similar to the domain architecture of interest.
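The source does not state how similarity between architectures is measured. As a rough sketch, an architecture can be represented as the ordered tuple of its superfamily identifiers and ranked with a generic sequence-similarity ratio from Python's `difflib`; both the representation and the measure are assumptions made here for illustration, and the identifiers are placeholders.

```python
import heapq
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple

Architecture = Tuple[str, ...]  # superfamily ids in N- to C-terminal order


def architecture_similarity(a: Architecture, b: Architecture) -> float:
    """Similarity in [0, 1] based on matching runs of domains (assumed measure)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def most_similar(query: Architecture,
                 candidates: Iterable[Architecture],
                 k: int = 10) -> List[Architecture]:
    """Return the k candidates most similar to the query, best first."""
    return heapq.nlargest(
        k, candidates, key=lambda arch: architecture_similarity(query, arch)
    )


query = ("sfA", "sfB", "sfC")
candidates = [("sfX", "sfY"), ("sfA", "sfB"), ("sfA", "sfB", "sfC"), ("sfC",)]
top_two = most_similar(query, candidates, k=2)
```

With `k` left at its default of 10, the function mirrors the "10 most similar" behaviour described above.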
Hidden Markov Models
Produce SCOP domain assignments for a sequence using the SUPERFAMILY hidden Markov models.
Profile Comparison
Find remote domain matches when the HMM search fails to find a significant match. Profile Comparer (PRC),[24] a program for aligning and scoring two profile HMMs, is used.
Web Services
Distributed Annotation Server and linking to SUPERFAMILY.
Downloads
Sequences, assignments, models, MySQL database, and scripts - updated weekly.
Use in Research
The SUPERFAMILY database has numerous research applications and has been used by many research groups. It can serve either as a source of proteins that the user wishes to examine with other methods, or as a means of assigning a function and structure to a novel or uncharacterized protein. One study found SUPERFAMILY to be very adept at correctly assigning an appropriate function and structure to a large number of domains of unknown function by comparing them to the database's hidden Markov models.[25] Another study used SUPERFAMILY to generate a data set of 1,733 fold superfamily (FSF) domains, which it used in a comparison of proteomes and functionomes to identify the origin of cellular diversification.[26]
Notes and References
- Wilson, D.; Pethica, R.; Zhou, Y.; Talbot, C.; Vogel, C.; Madera, M.; Chothia, C.; Gough, J.
- Madera, Martin; Vogel, Christine; Kummerfeld, Sarah K.; Chothia, Cyrus; Gough, Julian (2004). The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Research. 32 (suppl 1): D235–D239. doi:10.1093/nar/gkh117. ISSN 0305-1048. PMC 308851. PMID 14681402.
- Wilson, D.; Madera, M.; Vogel, C.; Chothia, C.; Gough, J. (2007). The SUPERFAMILY database in 2007: Families and functions. Nucleic Acids Research. 35 (Database issue): D308–D313. doi:10.1093/nar/gkl910. PMID 17098927. PMC 1669749.
- Gough, J. (2002). The SUPERFAMILY database in structural genomics. Acta Crystallographica Section D. 58 (Pt 11): 1897–1900. doi:10.1107/s0907444902015160. PMID 12393919.
- Gough, J.; Chothia, C. (2002). SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Research. 30 (1): 268–272. doi:10.1093/nar/30.1.268. PMID 11752312. PMC 99153.
- De Lima Morais, D. A.; Fang, H.; Rackham, O. J. L.; Wilson, D.; Pethica, R.; Chothia, C.; Gough, J. (2010). SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Research. 39 (Database issue): D427–D434. doi:10.1093/nar/gkq1130. PMID 21062816. PMC 3013712.
- Oates, M. E.; Stahlhacke, J.; Vavoulis, D. V.; Smithers, B.; Rackham, O. J.; Sardar, A. J.; Zaucha, J.; Thurlby, N.; Fang, H.; Gough, J. (2015). The SUPERFAMILY 1.75 database in 2014: A doubling of data. Nucleic Acids Research. 43 (Database issue): D227–33. doi:10.1093/nar/gku1041. PMID 25414345. PMC 4383889.
- Hubbard, T. J.; Ailey, B.; Brenner, S. E.; Murzin, A. G.; Chothia, C. (1999). SCOP: A Structural Classification of Proteins database. Nucleic Acids Research. 27 (1): 254–256. doi:10.1093/nar/27.1.254. PMID 9847194. PMC 148149.
- Lo Conte, L.; Ailey, B.; Hubbard, T. J.; Brenner, S. E.; Murzin, A. G.; Chothia, C. (2000). SCOP: A Structural Classification of Proteins database. Nucleic Acids Research. 28 (1): 257–259. doi:10.1093/nar/28.1.257. PMID 10592240. PMC 102479.
- Dayhoff, M. O.; McLaughlin, P. J.; Barker, W. C.; Hunt, L. T. (1975). Evolution of sequences within protein superfamilies. Naturwissenschaften. 62 (4): 154–161. doi:10.1007/BF00608697. ISSN 0028-1042. Bibcode 1975NW.....62..154D. S2CID 40304076.
- Gough, J.; Karplus, K.; Hughey, R.; Chothia, C. (2001). Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology. 313 (4): 903–919. doi:10.1006/jmbi.2001.5080. PMID 11697912. CiteSeerX 10.1.1.144.6577.
- Karplus, K.; Barrett, C.; Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics. 14 (10): 846–856. doi:10.1093/bioinformatics/14.10.846. ISSN 1367-4803. PMID 9927713.
- Botstein, D.; Cherry, J. M.; Ashburner, M.; Ball, C. A.; Blake, J. A.; Butler, H.; Davis, A. P.; Dolinski, K.; Dwight, S. S.; Eppig, J. T.; Harris, M. A.; Hill, D. P.; Issel-Tarver, L.; Kasarskis, A.; Lewis, S.; Matese, J. C.; Richardson, J. E.; Ringwald, M.; Rubin, G. M.; Sherlock, G. (2000). Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. 25 (1): 25–29. doi:10.1038/75556. PMID 10802651. PMC 3037419.
- Barrell, Daniel; Dimmer, Emily; Huntley, Rachael P.; Binns, David; O’Donovan, Claire; Apweiler, Rolf (2009). The GOA database in 2009—an integrated Gene Ontology Annotation resource. Nucleic Acids Research. 37 (suppl 1): D396–D403. doi:10.1093/nar/gkn803. ISSN 0305-1048. PMC 2686469. PMID 18957448.
- Tatusov, Roman L.; Fedorova, Natalie D.; Jackson, John D.; Jacobs, Aviva R.; Kiryutin, Boris; Koonin, Eugene V.; Krylov, Dmitri M.; Mazumder, Raja; Mekhedov, Sergei L. (2003). The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 4: 41. doi:10.1186/1471-2105-4-41. ISSN 1471-2105. PMC 222959. PMID 12969510.
- Vogel, Christine; Berzuini, Carlo; Bashton, Matthew; Gough, Julian; Teichmann, Sarah A. (2004). Supra-domains: evolutionary units larger than single protein domains. Journal of Molecular Biology. 336 (3): 809–823. doi:10.1016/j.jmb.2003.12.026. ISSN 0022-2836. PMID 15095989. CiteSeerX 10.1.1.116.6568.
- Vogel, Christine; Teichmann, Sarah A.; Pereira-Leal, Jose (2005). The relationship between domain duplication and recombination. Journal of Molecular Biology. 346 (1): 355–365. doi:10.1016/j.jmb.2004.11.050. ISSN 0022-2836. PMID 15663950.
- Vogel, Christine; Chothia, Cyrus (2006). Protein Family Expansions and Biological Complexity. PLOS Computational Biology. 2 (5): e48. doi:10.1371/journal.pcbi.0020048. ISSN 1553-734X. PMC 1464810. PMID 16733546. Bibcode 2006PLSCB...2...48V.
- Andreeva, Antonina; Howorth, Dave; Brenner, Steven E.; Hubbard, Tim J. P.; Chothia, Cyrus; Murzin, Alexey G. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Research. 32 (Database issue): D226–D229. doi:10.1093/nar/gkh039. ISSN 0305-1048. PMC 308773. PMID 14681400.
- Mulder, Nicola J.; Apweiler, Rolf; Attwood, Teresa K.; Bairoch, Amos; Barrell, Daniel; Bateman, Alex; Binns, David; Biswas, Margaret; Bradley, Paul (2003). The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Research. 31 (1): 315–318. doi:10.1093/nar/gkg046. ISSN 0305-1048. PMC 165493. PMID 12520011.
- Mulder, Nicola J.; Apweiler, Rolf; Attwood, Teresa K.; Bairoch, Amos; Bateman, Alex; Binns, David; Bradley, Paul; Bork, Peer; Bucher, Phillip (2005). InterPro, progress and status in 2005. Nucleic Acids Research. 33 (Database issue): D201–D205. doi:10.1093/nar/gki106. ISSN 0305-1048. PMC 540060. PMID 15608177.
- Finn, Robert D.; Mistry, Jaina; Schuster-Böckler, Benjamin; Griffiths-Jones, Sam; Hollich, Volker; Lassmann, Timo; Moxon, Simon; Marshall, Mhairi; Khanna, Ajay (2006). Pfam: clans, web tools and services. Nucleic Acids Research. 34 (Database issue): D247–D251. doi:10.1093/nar/gkj149. ISSN 0305-1048. PMC 1347511. PMID 16381856.
- Boeckmann, Brigitte; Blatter, Marie-Claude; Famiglietti, Livia; Hinz, Ursula; Lane, Lydie; Roechert, Bernd; Bairoch, Amos (2005). Protein variety and functional diversity: Swiss-Prot annotation in its biological context. Comptes Rendus Biologies. 328 (10–11): 882–899. doi:10.1016/j.crvi.2005.06.001. ISSN 1631-0691. PMID 16286078.
- Madera, Martin (2008). Profile Comparer: a program for scoring and aligning profile hidden Markov models. Bioinformatics. 24 (22): 2630–2631. doi:10.1093/bioinformatics/btn504. ISSN 1367-4803. PMC 2579712. PMID 18845584.
- Mudgal, Richa; Sandhya, Sankaran; Chandra, Nagasuma; Srinivasan, Narayanaswamy (2015). De-DUFing the DUFs: Deciphering distant evolutionary relationships of Domains of Unknown Function using sensitive homology detection methods. Biology Direct. 10 (1): 38. doi:10.1186/s13062-015-0069-2. PMC 4520260. PMID 26228684.
- Nasir, Arshan; Caetano-Anollés, Gustavo (2013). Comparative Analysis of Proteomes and Functionomes Provides Insights into Origins of Cellular Diversification. Archaea. 2013: 648746. doi:10.1155/2013/648746. PMID 24492748. PMC 3892558.