An epigenome-wide association study (EWAS) is an examination of a genome-wide set of quantifiable epigenetic marks, such as DNA methylation, in different individuals to derive associations between epigenetic variation and a particular identifiable phenotype/trait. When patterns change such as DNA methylation at specific loci, discriminating the phenotypically affected cases from control individuals, this is considered an indication that epigenetic perturbation has taken place that is associated, causally or consequentially, with the phenotype.[1]
The epigenome is governed by both genetic and environmental factors, causing it to be highly dynamic and complex. Epigenetic information exists in the cell as DNA and histone marks, as well as non-coding RNAs. DNA methylation (DNAm) patterns change over time, and vary between developmental stage and tissue type. The main type of DNAm is at cytosines within CpG dinucleotides which is known to be involved in gene expression regulation. DNAm pattern changes have been extensively studied in complex diseases such as cancer and diabetes. In a normal cell, the bulk genome is highly methylated at CpGs, whereas CpG islands (CPI) at gene promoter regions remain highly unmethylated. Aberrant DNAm is the most common type of molecular abnormality in cancer cells, where the bulk genome becomes globally ‘hypomethylated’ and CPIs in promoter regions become ‘hypermethylated’, usually leading to silencing of tumour suppressor genes.[2] More recently, studies on diabetes have uncovered further evidence to support an epigenetic component of diseases, including differences in disease-associated epigenetic marks between monozygotic twins, the rising incidence of type 1 diabetes in the general population, and developmental reprogramming events in which in utero or childhood environments can influence disease outcome in adulthood.
Post-translational histone modifications include, but are not limited to, methylation, acetylation and phosphorylation on the core histone tails. These post-translational modifications are read by proteins that can then modify the chromatin state at that locus.Epigenetic variation arises in three distinct ways; it can be inherited and be therefore present in all cells of the adult including the germline (a process known as transgenerational epigenetic inheritance; a controversial phenomenon that has not yet been observed in humans); it can occur randomly and be present in a subset of cells in the adult, the amount of which depending on how early in development the variation occurs; or it can be induced as a result of behavioural or environmental factors. EWAS has previously associated changes in methylation with several diseases and complex conditions which do not have a known epidemiology and therefore are crucial for the identification of epigenetic factors that contribute to or are a consequence of pathogenesis of these diseases.[3]
Retrospective studies compare unrelated individuals who fall into two categories, controls without the disease or phenotype of interest, and cases who have the phenotype of interest. An advantage of such studies is that many cohorts of case-control samples already exist with available genotype and expression data that can be integrated with epigenome data. A downside, however, is that they cannot determine whether epigenetic differences are a result of disease-associated genetic differences, post-disease processes or disease-associated drug interventions.
Useful to study transgenerational inheritance patterns of epigenetic marks. A main limitation of EWAS is deciphering if a phenotype is associated with epigenetic changes as a result of a variable in question or a result of previous genomic variants leading to epigenetic alterations. Comparisons between parent and offspring genomic and epigenomic data allows one to rule out the possibility that a disease or phenotype is due to genomic variation. A limitation of this study design is that very few cohorts which are large enough exist.
Monozygotic twins carry identical genomic information. Therefore, if they are discordant for a particular disease or phenotype it is likely a result of epigenetic differences. However, unless the twins are studied longitudinally it is impossible to determine if epigenetic variation is the cause of or consequence of disease. Another limitation is recruiting a large enough cohort of discordant monozygotic twins with the disease of interest.
Longitudinal studies follow a cohort of individuals over an extended period of time, usually from birth or before disease onset. Samples are taken and records are kept over many years, making these studies extremely useful to determine causality of particular phenotypes. Since the same individuals are followed at time points before and after disease onset, it removes the confounding effects of differences between cases and controls. Longitudinal studies are not only useful for risk studies (using DNA samples prior to disease onset), but also in intervention studies using pre- and posttreatment with specific exposures to investigate environmental impacts on the epigenome.[4] A major disadvantage is the long timeline of the studies as well as the expense. Longitudinal studies using disease-discordant monozygotic twins gives the added benefit of ruling out genetic influences on epigenetic variation.
The tissue specificity of epigenomic marks create another challenge when designing an EWAS. Tissue choice is limited by both accessibility and stability of epigenetic patterning. It is crucial to choose a tissue in which epigenetic marks are variable in the population yet stable over time. If this isn't possible, it would be required to use multiple serially collected samples from the same individuals to report robust associations with a particular phenotype. EWAS for diseases are often measured using DNA methylation in blood samples because disease-relevant tissues are difficult to obtain. In some cases, the pattern of methylation is not necessarily biologically relevant to the proposed phenotype. The choice of blood also requires stringent analysis and careful interpretation due to variable cell type composition. Choosing a surrogate tissue therefore requires that the interindividual differences correlate between the tissue of interest and the surrogate, but also for the exposure to induce similar changes in both tissues. To date, an underlying issue is that there is no clear evidence that, in general, epigenetic marks respond to environmental exposures in a similar way across tissues.[5]
The platform for epigenome-wide DNAm quantification utilizes the high throughput technology Illumina Methylation Assay. In the past, the 27k Illumina array covered on average two CpG sites in the promoter regions of approximately 14,000 genes and represented less than 0.1% of the 28 million CpG sites in the human genome. This falls short of being representative of the entire human epigenome. None of the early EWAS using this array [6] [7] used independent validation to verify the associated probes. An interesting observation was a bias in the differences between cases and controls towards non-CpG island probes (which were significantly underrepresented in this array design), arguing strongly for the use of the latterly designed 450k array which does cover non-CpG islands with a higher density of probes. Presently, the Illumina 450k array is the most widely used platform in the last two years for studies reporting EWAS. The array still only covers less than 2% of the CpG sites in the genome, but does attempt to cover all known genes with a high density of probes in the promoters (including CpG islands and surrounding sequences), but also covers with a lower density across the gene bodies, 3′ untranslated regions, and other intergenic sequences.[8]
DNA methylation is typically quantified on a scale of 0–1, as the methylation array measures the proportion of DNA molecules that are methylated at a particular CpG site. The initial analyses performed are univariate tests of association to identify sites where DNA methylation varies with exposure and/or phenotype. This is followed by multiple testing corrections and utilizing an analytical strategy to reduce batch effects and other technical confounding effects in the quantification of DNA methylation. The potential confounding effects arising from alterations in tissue composition is also taken into account. Additionally, adjusting for confounding factors such as age, gender and behaviours that may influence the methylation status as covariates is conducted. The association results are also corrected for the genomic control inflation factor in order to account for the population stratification.
Generally, mean levels of CpG methylation are compared across categories using linear regression[9] which allows for the adjustment of confounders and batch effects.[10] A P-value threshold of P < 1e-7[11] is generally used to identify CpGs associated with the tested phenotype/stimulus. These CpGs are considered to reach epigenome-wide significance. An effect size is also calculated at this significance level, indicating the difference in methylation when comparing two qualitative groups, or different quantitative values depending on your phenotype. CpG sites significantly associated with the phenotype and/or treatment/environmental stimulus are typically represented in a manhattan plot.[12]
Single CpG sites are prone to single site natural variation effects and technical variation such as bad microarray probes and outliers. To make more robust associations and take into account such variation, using adjacent measurements can help increase power.[13] [14] In previous studies, functionally relevant findings have been associated with genomic regions as opposed to single CpGs. Therefore, looking at the regional level can help identify associated regions with more confidence, guiding downstream functional studies.
Another method of analysis is using unsupervised clustering to create classes of CpG sites based on similarity of methylation variation across samples. The average methylation values within each class is used to construct data sets of reduced dimensionality, facilitating efficient tests of association between DNA methylation and phenotypes of interest.[15] This is used to reduce the dimensionality of large data sets and take advantage of substantial biologically induced correlation. This method is useful for identifying gross patterns of methylation associated with the tested variable, but may miss specific CpG sites of interest. Besides differences in mean methylation levels, differences in variation of DNA methylation across samples may also be biologically meaningful, motivating scans for differential variability between groups.
The location of the associated CpG sites or islands/regions can then be analyzed in silico to imply possible functional relevance. For example, considering whether the associated CpGs are within a promoter region or determining distance from the transcription start site that may be relevant, especially when we assume that DNA methylation associated with a phenotype acts by regulating gene transcription. Many other inferences based on past biological knowledge can be inferred if that particular region of CpGs have been studied and associated with changes in transcription. This can be used as an additional filter for identifying regions to pursue for functional validation. Several bioinformatic tools that have been developed for functional enrichment analysis can be applied to differentially methylated regions by first mapping these regions to genes. This is done by mapping the distance between the CpGs and a gene promoter that is potentially regulated by this region. Enrichment analysis based on the genomic region has thus been suggested as a complementary approach and confers substantial interpretive potential. Differentially methylated regions can then be compared to a catalog of genomic regions including, for example, sites enriched for specific chromatin modifications or transcription factor binding sites.
A methylation odds ratio can be calculated if we consider the mean methylation rate at a site in cases (or controls) to represent the methylation probability for a randomly chosen DNA strand in the case (or control) tissue samples. The methylation odds ratio is the odds for a random DNA strand in the tissue sample from a random case to be methylated, divided by the same odds for controls. This provides a measure of effect size that incorporates relative magnitudes, but also does not allow for the difference between cases and controls of features of the methylation spectrum, such as variance. The methylation odds ratio is also comparable across prospective and retrospective studies and its value only measures association and does not imply causation. Methylation risk scores have also been calculated which can integrate information across CpG sites by calculating a weighted methylation risk score as the sum of methylation values at each of the markers associated with the phenotype, weighted by marker-specific effect size[16]
Replication using an independent cohort is required to rule out false positives identified in the initial study. This can be done in a human cohort or in a more focused manner in animal models. It is important that, when selecting the replication cohort, the individuals are reflective of the initial cohort and that the same confounding variables are taken into account. Replication, however, can be limited due to the availability of individuals and samples.
Variations in the epigenome can cause disease but can also arise as a consequence of disease, and distinguishing between the two is a major limitation in EWAS. A way to circumvent this is to determine whether the epigenetic variation is present before any symptoms of disease, preferably via longitudinal studies following the same cohort of people over many years (this in itself has its own setbacks of expense and study time frame). Also needed to be taken into consideration is the possibility that epigenetic variation which arises before disease onset does not necessarily constitute causation for disease.
The most commonly used tissue in EWAS is blood. However, blood samples contain multiple different cell types each of which have a unique epigenetic signature. In this way, it is extremely difficult to determine if the sample you have taken is homogeneous and is therefore difficult to determine if the variation in epigenetic marks are due to the differences in phenotype/stimulus or due to the sample heterogeneity.
Currently many EWAS use blood as a surrogate tissue due to its availability and ease of collection. However, epigenetic changes in the blood may not be associated with the changes in the particular tissue associated with the disease. Many intriguing disorders that could have epigenetic causative factors affect tissues such as brain, lung, heart, etc. However, when studying human patients it is not an option to take these tissues for sampling, and they are therefore left unstudied.
EWASdb[17] (http://www.bioapp.org/ewasdb/) is the first epigenome-wide association database (first online at 2015, and first published on Nucleic Acids Res. 2018 Oct 13) which stores the results of 1319 EWAS studies associated with 302 diseases/phenotypes (p<1e-7). Three types of EWAS results were stored in EWASdb: EWAS for single epi-marker; EWAS for KEGG pathway and EWAS for GO (Gene Ontology) categories.
EWAS Atlas[18] (http://bigd.big.ac.cn/ewas) is a curated knowledgebase of EWAS that provides a comprehensive collection of EWAS knowledge. Unlike extant data-oriented epigenetic resources, EWAS Atlas features manual curation of EWAS knowledge from extensive publications. In the current implementation, EWAS Atlas focuses on DNA methylation—one of the key epigenetic marks; it integrates a large number of 388,851 high-quality EWAS associations, involving 126 tissues/cell lines and covering 351 traits, 2,230 cohorts and 390 ontology entities, which are completely based on manual curation from 649 studies reported in 495 publications. In addition, it is equipped with a powerful trait enrichment analysis tool, which is capable of profiling trait-trait and trait-epigenome relationships. Future developments include regular curation of recent EWAS publications, incorporation of more epigenetic marks and possible integration of EWAS with GWAS. Collectively, EWAS Atlas is dedicated to the curation, integration and standardization of EWAS knowledge and has the great potential to help researchers dissect molecular mechanisms of epigenetic modifications associated with biological traits.
EWAS Data Hub[19] (https://bigd.big.ac.cn/ewas/datahub) is a resource for collecting and normalizing DNA methylation array data as well as archiving associated metadata. The current release of EWAS Data Hub integrates a comprehensive collection of DNA methylation array data from 75 344 samples and employs an effective normalization method to remove batch effects among different datasets. Accordingly, taking advantages of both massive high-quality DNA methylation data and standardized metadata, EWAS Data Hub provides reference DNA methylation profiles under different contexts, involving 81 tissues/cell types (that contain 25 brain parts and 25 blood cell types), six ancestry categories, and 67 diseases (including 39 cancers). In summary, EWAS Data Hub bears great promise to aid the retrieval and discovery of methylation-based biomarkers for phenotype characterization, clinical treatment and health care.