C2orf74, also known as LOC339804, is a protein-coding gene located on the short arm of chromosome 2 near position 15 (2p15).[1] The gene is 19,713 base pairs long. C2orf74 has orthologs in 135 different species, primarily placental mammals and some marsupials.
The protein encoded by the C2orf74 gene has two isoforms, the longest of which (isoform 1) is 187 amino acids in length.[2] This protein is linked to the development of autoimmune disorders such as ankylosing spondylitis[3] and diseases affecting the colon.[4] [5] [6]
C2orf74 is a gene located on the plus strand at 2p15 in humans. It is 19,713 base pairs in length beginning at 61,145,116 and ending at 61,164,828 and includes 8 exons. Other genes within its neighborhood include KIAA841, LOC105374759, LOC105374758, LOC339803, AHSA2P, USP34, and SNORA70B.
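The stated gene length can be checked against the quoted coordinates. The following is a minimal sketch that assumes the coordinates are 1-based and inclusive (a common convention, though not stated here); the function name `span_length` is illustrative only.

```python
# Coordinates quoted in the text for C2orf74 on chromosome 2 (plus strand).
# Assumption: both positions are 1-based and inclusive.
GENE_START = 61_145_116
GENE_END = 61_164_828


def span_length(start: int, end: int) -> int:
    """Return the length in bp of a 1-based, inclusive interval."""
    if end < start:
        raise ValueError("end must not precede start")
    return end - start + 1


gene_length = span_length(GENE_START, GENE_END)
print(f"{gene_length:,} bp")
```

Under that assumption the interval length agrees with the 19,713 bp figure given for the gene.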
C2orf74 has 6 validated mRNA products, created via alternative splicing, that give rise to two different isoforms. An extended version of isoform 1 has also been sequenced, utilizing a 5' in-frame start codon, though this protein product is not formally acknowledged as a separate isoform by NCBI.[7]
Transcript | Accession number | mRNA length | Exons | Protein length | Isoform
Transcript variant 1 | NM_001143959.4 | 1097 bp | 5 | 187 aa | 1
Transcript variant 2 | NM_001143960.3 | 851 bp | 4 | 115 aa | 2
Transcript variant 3 | NM_001316317.2 | 737 bp | 3 | 115 aa | 2
Transcript variant 4 | NM_001367069.1 | 1002 bp | 5 | 115 aa | 2
Transcript variant 5 | NM_001367070.1 | 1124 bp | 6 | 115 aa | 2
Transcript variant 6 | NM_001367071.1 | 973 bp | 5 | 115 aa | 2
Transcript variant 1 extension | A8MZ97 | 1097 bp | 5 | 194 aa | 1+
There are two known isoforms of the C2orf74-encoded protein. Isoform 1 is derived from transcript variant 1 and is 187 amino acids in length. There is a putative N-terminal extension of this isoform that utilizes a 5' start codon and adds 7 amino acids to the start of isoform 1, bringing the length of the protein up to 194 amino acids. Isoform 2 is derived from any one of transcript variants 2, 3, 4, 5, or 6. It is created using an alternative promoter, features a different 5' UTR, and has a shorter N-terminal end that excludes the first 3 exons, which encode the N-terminal end of isoform 1. The result is a shorter protein, 115 amino acids in length, that lacks the highly conserved transmembrane domain found at the N-terminal end of isoform 1.
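The relationship between transcript variants and isoforms described above can be sketched as a small grouping step. This is only an illustration: the values are copied from the transcript table above, while the tuple layout and the helper name `variants_by_isoform` are assumptions made for the example.

```python
from collections import defaultdict

# (transcript name, protein length in aa, isoform) for the 6 validated
# transcript variants, copied from the transcript table above.
TRANSCRIPTS = [
    ("Transcript variant 1", 187, "1"),
    ("Transcript variant 2", 115, "2"),
    ("Transcript variant 3", 115, "2"),
    ("Transcript variant 4", 115, "2"),
    ("Transcript variant 5", 115, "2"),
    ("Transcript variant 6", 115, "2"),
]

# The putative N-terminal extension adds 7 residues to isoform 1.
EXTENSION_RESIDUES = 7


def variants_by_isoform(rows):
    """Group (name, protein length) pairs under their isoform label."""
    groups = defaultdict(list)
    for name, protein_length, isoform in rows:
        groups[isoform].append((name, protein_length))
    return dict(groups)


groups = variants_by_isoform(TRANSCRIPTS)
extended_length = groups["1"][0][1] + EXTENSION_RESIDUES

for isoform, members in groups.items():
    lengths = sorted({length for _, length in members})
    print(f"isoform {isoform}: {len(members)} transcript(s), lengths {lengths} aa")
print(f"isoform 1 with extension: {extended_length} aa")
```

Grouping this way makes the many-to-one relationship explicit: five transcripts differ in their untranslated regions but encode the same 115 aa product.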
Isoform | Transcript variants | Protein length | Domains
Isoform 1 extension | 1 extension | 194 aa | TMEM, DUF
Isoform 1 | 1 | 187 aa | TMEM, DUF
Isoform 2 | 2,3,4,5,6 | 115 aa | DUF
The above figure depicts a conceptual translation of isoform 1 of C2orf74 made using SixFrame.[8] Exon boundaries are depicted in blue font. The 5' UTR of the transcript is shown to have an upstream in-frame stop codon (red) and an upstream in-frame start codon (green). The putative N-terminal extension is depicted in light gray. The N-terminal transmembrane domain is highlighted in lavender. Regions conserved among orthologs are highlighted in cyan, while regions prone to deletion are highlighted in gray. Phosphorylation sites are highlighted in red with the phosphorylated amino acid underlined. Significant SNPs are highlighted in pink, with a key pictured to the right detailing the type of change and reason for inclusion. Polyadenylation signals in the 3' UTR are highlighted in orange.
Isoform 1 of the C2orf74 protein has a calculated molecular weight of approximately 21 kDa and a pI of 5.74.[9] [10] It does not display any unique amino acid composition, cysteine spacing, number of multiplets, or periodicity.[11] This protein isoform has a putative 7 aa N-terminal extension. It contains a 21 aa transmembrane region at position 7.
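As a rough plausibility check on the molecular weight, a protein's mass can be approximated from its length alone. The sketch below assumes an average residue mass of about 110 Da, a common rule of thumb that is not taken from the cited tools; a sequence-based calculation would be more accurate, and the function name `approx_mass_kda` is illustrative.

```python
# Assumption: ~110 Da per residue is a rule-of-thumb average, not a value
# from the cited analysis; real tools sum the masses of the actual residues.
AVERAGE_RESIDUE_MASS_DA = 110.0


def approx_mass_kda(n_residues: int,
                    residue_mass_da: float = AVERAGE_RESIDUE_MASS_DA) -> float:
    """Estimate protein mass in kDa from the number of residues."""
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    return n_residues * residue_mass_da / 1000.0


# Protein lengths quoted in the text.
for label, length in [("isoform 1", 187),
                      ("isoform 1 with extension", 194),
                      ("isoform 2", 115)]:
    print(f"{label}: about {approx_mass_kda(length):.1f} kDa")
```

For the 187 aa isoform this estimate rounds to 21 kDa, in line with the calculated value quoted above.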
The transmembrane region begins 7 amino acids from the N-terminal end of the protein and ends at the 29th amino acid in humans. This region is annotated by NCBI and is also supported by biochemical analysis. The biochemical qualities characterizing this region as a transmembrane region include a neutral charge cluster and a high-scoring hydrophobic segment, as well as alpha-helical secondary structure.[12] This region is also highly conserved among all orthologs, suggesting that it is functionally significant.[13]
The region downstream of the transmembrane region is considered a domain of unknown function (DUF) within pfam 15484. Approximately 52% of this portion of the protein is considered to be disordered, making it difficult to predict the function of the domain with confidence.[14] However, the C-terminal end is highly conserved among all orthologs.
C2orf74 isoform 1 is predicted to be dominated by helical secondary structure, with only short regions predicted to adopt beta sheet conformations. Predictions of tertiary structure tend to show a globular DUF at the end of a helical transmembrane domain.[15] Structural predictions of isoform 2, which includes only the DUF, also appear to be strictly globular in conformation.
The presence of a transmembrane domain indicates that Isoform 1 of the C2orf74 product is found within a membranous cellular structure. Analysis of likely subcellular localization among orthologs indicates the C2orf74 product is most likely found in the nuclear membrane, mitochondria, or endoplasmic reticulum.[16] Immunocytochemical imaging shows C2orf74 to be localized to the centromere, while immunohistochemical imaging shows it to be centralized in the cytosol.
C2orf74 has 3 possible promoters that produce complete protein isoforms. Isoform 1 could be made by either GXP_6040264 or GXP_2056207, though GXP_6040264 shows the most promise, as it has a higher number of CAGE tags (249) than GXP_2056207 (133), and is conserved among several orthologs. Isoform 2 is made by the promoter GXP_649849.[17]
GXP_6040264 contains over 300 transcription factor binding sites, with a fork head domain factor (V$FKHD), a bromodomain and PHD domain transcription factor (V$BPTF), and a sex/testes determining and related HMG box factor (V$SORY) being the most conserved regions among mammals.
C2orf74 is expressed at minimal levels in several cell types. Due to the low levels of expression, meaningful trends in localization are difficult to discern. In situ hybridization of C2orf74 and some RNA sequencing assays indicate potential for localization in the cerebellum.[18] Microarray data from NCBI GEO indicates lower levels of C2orf74 expression in individuals with colorectal tumors such as adenomas or cancerous colorectal tumors when compared to normal mucosa or tumors of non-colorectal origin such as carcinomas.[19]
The 5' UTR of transcript variant 1 is 232 bp in length and features an upstream in-frame stop codon as well as an upstream in-frame start codon.[20] When used, this start codon would add a 7 aa N-terminal extension to isoform 1. Analysis of the potential structure of the 5' UTR of transcript variant 1 shows the presence of 2 hairpin structures. The 5' UTR of transcript variants 2 through 6 differs from that of transcript variant 1. The 5' UTR also differs to a great degree between orthologs, indicating that it may not be a region of great importance in terms of transcriptional regulation.
The 3' UTR is conserved among all human transcript variants, though it does not show significant conservation among mammalian species. It is 301 bp in length and contains two polyadenylation signals, at 981 bp and 1071 bp respectively. It also contains two partially conserved miRNA binding sites at 73 bp (hsa-mir-241) and 270 bp (hsa-miR-23),[21] though neither of the miRNAs predicted to bind appears to be present in the human transcriptome.[22] The human 3' UTR is found to be rich in stem-loop structures.
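The lengths quoted for transcript variant 1 can be checked for internal consistency. The sketch below assumes that the coding sequence is counted with its stop codon and that the polyadenylation signal positions are 1-based coordinates on the transcript; neither assumption is stated in the text, and the variable names are illustrative.

```python
# Figures quoted in the text for transcript variant 1.
TRANSCRIPT_LENGTH = 1097      # bp
UTR5_LENGTH = 232             # bp
UTR3_LENGTH = 301             # bp
PROTEIN_LENGTH = 187          # aa (isoform 1)
POLYA_SIGNALS = (981, 1071)   # assumed 1-based transcript positions

# Assumption: the coding sequence includes one 3 bp stop codon.
cds_length = PROTEIN_LENGTH * 3 + 3

# 1-based, inclusive boundaries of each region along the transcript.
cds_start = UTR5_LENGTH + 1
cds_end = UTR5_LENGTH + cds_length
utr3_start = cds_end + 1
utr3_end = cds_end + UTR3_LENGTH

regions_sum = UTR5_LENGTH + cds_length + UTR3_LENGTH
signals_in_utr3 = all(utr3_start <= pos <= utr3_end for pos in POLYA_SIGNALS)

print(f"CDS: {cds_start}-{cds_end} ({cds_length} bp)")
print(f"3' UTR: {utr3_start}-{utr3_end}")
print(f"regions sum to transcript length: {regions_sum == TRANSCRIPT_LENGTH}")
print(f"polyadenylation signals inside 3' UTR: {signals_in_utr3}")
```

Under these assumptions the three regions add up to the full 1097 bp transcript, and both polyadenylation signals fall within the 3' UTR.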
C2orf74 is predicted to have 4 CK2 phosphorylation sites, as well as 3 PKC phosphorylation sites.[23] CK2 and PKC phosphorylation sites are common among many orthologs. Myristoylation sites are also common among C2orf74 orthologs, though they are less conserved.[24]
Casein kinase 2 is a serine/threonine-specific protein kinase that plays a significant role in cell signaling pathways related to cell cycling, regulation, and development. Association with C2orf74 may implicate it as a member of an intracellular phosphorylation chain governing cell development, and explain its association with conditions such as cancer and autoimmunity.
Protein kinase C is a family of protein kinases that are serine and threonine specific and play a role in regulating a broad range of cellular functions, particularly those involving phosphorylation cascades. As with CK2, C2orf74's association with PKC may implicate it as a signaling molecule involved in a phosphorylation cascade. This may provide context as to the nature of C2orf74's relationship to autoimmune disease and cancer.
C2orf74 first appeared in mammals and is found in animals as distantly related to humans as marsupials.[25] The table below highlights 20 selected orthologs from various mammalian clades arranged by date of divergence from the human lineage. Red tiles indicate high similarity to the human sequence and blue tiles indicate low similarity. In general, the samples follow the pattern in which more recent evolutionary divergence results in more similar sequences. Notable exceptions, however, include the galago, mouse, and manatee.
The figures below show the evolutionary history of C2orf74 in more detail. To the right is a comparison of the divergence rate of C2orf74 with those of cytochrome c and fibrinogen alpha. Given that fibrinogen alpha serves in this figure as a standard example of a rapidly changing protein, one can see that C2orf74 is evolving quite quickly.
Three proteins have been predicted to interact with C2orf74: POT1, SMAGP, and SRPK1.
POT1 is a telomere end binding protein. It is as of yet unclear how this relates to the predicted function of C2orf74 given previous research and predictions of subcellular localization.
SMAGP is a small transmembrane and glycosylated protein.[26] Association with SMAGP makes sense given the subcellular localization of both proteins to the nuclear membrane. It is possible that association with SMAGP may aid C2orf74 as part of a protein complex associated with intracellular signaling pathways.
SRPK1 is a protein kinase localized to the nucleus and cytoplasm. Association with SRPK1 also makes sense for C2orf74 given the subcellular localization of both proteins and implication in phosphorylative processes.
Several studies have been able to link differential C2orf74 functionality to bowel disease. Two separate studies have identified C2orf74 as a potential susceptibility locus for Crohn's disease. Furthermore, various studies reported in NCBI GEO show differential expression of C2orf74 in benign and cancerous colorectal tumor tissues.[27]
Aside from Crohn's disease, C2orf74 has also been found to be a susceptibility locus for ankylosing spondylitis, and more generally for other unspecified autoimmune conditions. The SNP believed to play a role in C2orf74's relationship to ankylosing spondylitis is found within the coding region of the gene, and is denoted in the conceptual translation found in the Protein section above.
At amino acid 36 there is a missense SNP that yields either a tyrosine (Tyr, Y) or an aspartate (Asp, D). This SNP, which is associated with ankylosing spondylitis, is found at 319 bp on transcript variant 1.