Cosegregation is the transmission to the next generation of two or more genes located close together on the same chromosome. Their proximity means that they are genetically linked.[1] The term may also refer to an estimate of the probability of interaction between any number of loci.
Interaction probability is determined using specified parts of a target gene (loci) and a group of nuclear profiles (NPs).[2] The picture to the right illustrates how a slice (NP) is taken from the nucleus and how loci are searched for within the NP. Cosegregation, used within other mathematical models (SLICE[3] and normalized linkage disequilibrium), assists in rendering 3D visualizations as one step of genome architecture mapping (GAM). These renderings help determine genomic density and radial position.
In genome architecture mapping (GAM), cosegregation is used to identify the compaction and adjacency of genomic windows. In a 2017 study, cosegregation was used within the larger GAM process to understand the gene-expression-specific contacts that organize the genome in mammalian nuclei.[3] The study produced complex 3D structures that displayed interactions among certain regions of chromatin contacts, and it showed that GAM is a useful tool in the genome biologist's skill set, one that expands the ability to finely dissect 3D chromatin structures across cell types and valuable human samples. A study in 2021 "discovered extensive 'melting' of long genes when they are highly expressed and/or have high chromatin accessibility. The contacts most specific of neuron subtypes contain genes associated with specialized processes, such as addiction and synaptic plasticity, which harbour putative binding sites for neuronal transcription factors within accessible chromatin regions."[6] Both of these studies used mice as models because of their anatomical, physiological, and genetic similarity to humans.[7]
Some of the earliest known studies that used cosegregation date back to the early 1980s. Around this time, scientists were conducting experiments on vegetative organisms to see whether there are unique sequences of chloroplast DNA. The experiment tracked the chloroplast gene in each generation by clustering the genes in nucleoids to reduce the number of segregated units. This study was done in the Zoology Department at Duke University,[8] where Karen P. VanWinkle-Swift used pedigree diagrams to show how the traits and sequences were passed down from parent to child.
Cosegregation is best suited for cases where the interactions of multiple factors are under consideration. It can show how different factors are linked and highlight their interactions and connections. For example, if a genetic disorder is known to be related to a certain gene but is not always present when that gene is, a cosegregation analysis could help identify other genes that interact with the suspect gene more often than normal. This could lead researchers to discover the combination of genes that manifests the genetic disorder. Cosegregation is actively used in medical fields such as cancer research, where it can highlight the strongest connections between genes in cases where cancer develops. This is useful because there is often no single gene causing cancer. Rather, cancer can be caused by a multitude of gene combinations, and cosegregation helps to show links between genes that could be forming these combinations.[3]
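An example of an application using cosegregation is finding the normalized linkage disequilibrium (NL) between two loci. Given a 2D dataset (row = genomic window, column = nuclear profile (NP)), an entry is "1" if the window was detected in that NP and "0" otherwise. From this data, the NL can be found by computing the base linkage D and dividing it by its theoretical maximum d_max for two loci A and B with detection frequencies f_a, f_b, and f_ab:

$$D = f_{ab} - f_a f_b, \qquad NL = \frac{D}{d_{max}}$$

Here f_a and f_b are the fractions of NPs in which A and B are detected, and f_ab is the fraction of NPs in which both are detected together.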
This formula is straightforward to program, as seen in the pseudo-code in the figure to the right, which was written to satisfy the example described above.
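The normalized linkage calculation can also be sketched in Python. This is a minimal illustration, not the code from the figure: the function names are hypothetical, each window is assumed to be a list of 0/1 detections with one entry per NP, and the definition of d_max used here is the standard one from linkage disequilibrium (an assumption, since the text does not spell it out).

```python
def detection_frequency(row):
    """Fraction of nuclear profiles (NPs) in which a window is detected."""
    return sum(row) / len(row)


def co_detection_frequency(row_a, row_b):
    """Fraction of NPs in which both windows are detected together."""
    return sum(a & b for a, b in zip(row_a, row_b)) / len(row_a)


def normalized_linkage(row_a, row_b):
    """Normalized linkage disequilibrium between two windows, in [-1, 1]."""
    f_a = detection_frequency(row_a)
    f_b = detection_frequency(row_b)
    f_ab = co_detection_frequency(row_a, row_b)

    # Base linkage: observed co-detection minus the value expected
    # if the two windows were detected independently.
    d = f_ab - f_a * f_b

    # Theoretical maximum of |d| given f_a and f_b (assumed standard form).
    if d < 0:
        d_max = min(f_a * f_b, (1 - f_a) * (1 - f_b))
    else:
        d_max = min(f_a * (1 - f_b), f_b * (1 - f_a))

    # A window detected in none or all of the NPs carries no linkage signal.
    if d_max == 0:
        return 0.0
    return d / d_max


# Hypothetical data: 8 NPs; window B is only ever detected alongside window A.
window_a = [1, 1, 0, 0, 1, 0, 1, 0]
window_b = [1, 1, 0, 0, 0, 0, 1, 0]
print(normalized_linkage(window_a, window_b))
```

Dividing by d_max rescales the linkage so that windows with very different detection frequencies can be compared on the same scale.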
Given a large dataset of nuclear profiles, cosegregation scales easily because its mathematical formulas are simple. The larger the dataset provided, the more accurate the resulting estimates will be. As depicted in the photo below, adding data to the equation adds only linear time to the original computation. Cosegregation also accepts as many loci of focus as are required to determine the interaction probability. Because each added locus adds a single computation to the equation, the time complexity is linear in the number of loci. The picture below shows how the number of loci affects the detection frequency equation. Finally, the resulting numerical value can assist in drawing multiple conclusions, including radial position, compaction, and the most influential contacts.
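The scaling behaviour can be illustrated with a short Python sketch. It assumes the segregation data is stored as a dictionary mapping each locus name to a list of 0/1 detections across NPs; the function name and the data are hypothetical. Each additional locus adds one pass over the NPs, so the work grows linearly with both the number of loci and the number of NPs.

```python
def cosegregation_frequency(table, loci):
    """Fraction of NPs in which every listed locus is detected together."""
    n_profiles = len(table[loci[0]])

    # Start by assuming all loci are present in every NP, then
    # rule out NPs one locus at a time.
    together = [1] * n_profiles
    for locus in loci:  # one pass per locus: linear in the number of loci
        together = [t & r for t, r in zip(together, table[locus])]

    return sum(together) / n_profiles


# Hypothetical segregation table: 3 loci, 6 NPs.
table = {
    "A": [1, 1, 0, 1, 1, 0],
    "B": [1, 0, 0, 1, 1, 1],
    "C": [1, 1, 0, 0, 1, 0],
}
print(cosegregation_frequency(table, ["A", "B"]))       # pairwise
print(cosegregation_frequency(table, ["A", "B", "C"]))  # three loci
```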
Effective cosegregation analysis depends largely on having a strong supporting dataset, because even small inaccuracies can be compounded by cosegregation. A thorough understanding of the material is also necessary, as cosegregation only provides connections between data points; the interpretation of those connections must be done through another method. For example, locus cosegregation can give a score for genes that commonly interact with each other, but no matter how strong those relationships are, the results of quantitative cosegregation can seem to support a correlated, anti-correlated, or independent relationship. It is important to be aware of this and to follow up cosegregation analysis with another form of analysis, such as normalized linkage disequilibrium, to correct for the compounding effect cosegregation can have on negligible variations in the detection frequency of the data.
For example, imagine a simple form of cancer that is triggered by a small number of genes. Here we are examining a suspect gene and three other genes that are suspected to be involved in the process. This chart shows a hypothetical dataset of 10 people, their cancer status, and whether they possess each of the four genes of interest. Looking at the chart, there is a clear connection between the suspect gene and Gene A. There is also a less obvious interaction between the suspect gene and Gene C that only takes place when Gene B is absent. It is entirely possible that cosegregation would have a hard time determining that relationship. Gene B is commonly present with Gene A, and that combination does result in cancer. In a real dataset with hundreds or even thousands of genes being examined, one could erroneously conclude that Gene B contributes to the cancer when, in reality, it does not and can actually prevent it.
Another limitation of this technique is that many mapping tools measure not only specific physical interactions between genes but also random contacts. Random contacts are much more common between genes separated by a smaller linear genomic distance, which can lead to inflated cosegregation scores. GAM has helped to resolve this issue because, in GAM, the detection of a genomic window is independent of any interactions with other regions. This allows an expected interaction value to be calculated, and combining it with the cosegregation results filters out the noise of random contacts and provides a cleaner result.[3]
A matrix is a rectangular array of numbers (entries) that can be combined with other matrices using standard operations such as addition, subtraction, and multiplication. In the case of cosegregation, graph theory is used to see whether a variable shares an edge with another variable on a network of nodes. Graph theory is the mathematical study of objects through their pairwise relations, represented as nodes (vertices) that are connected to one another by edges.
The image above depicts the conversion from a cosegregation matrix to an adjacency matrix. This is one use of a matrix in genome architecture mapping, where scientists use cryosectioning to find colocalization between DNA regions, genomes, and/or alleles. In that example, cosegregation describes how data points are linked to each other in terms of the distance between specific windows in a genome. The values in the cosegregation matrix were found using the formula above: for each pair of windows, the formula finds the intersection of the nuclear profiles in which the respective windows appear. The genomic windows are the nodes, and the adjacency matrix is the matrix depiction of the edges connecting the nodes.
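The conversion from a cosegregation matrix to an adjacency matrix can be sketched as follows. This is an illustration under stated assumptions rather than the method behind the image: the cosegregation value is taken to be the fraction of NPs shared by two windows, an edge is drawn when that value exceeds a cut-off, and both the cut-off value and the function names are hypothetical.

```python
def cosegregation_matrix(rows):
    """Pairwise fraction of NPs in which two windows are detected together."""
    n_windows = len(rows)
    n_profiles = len(rows[0])
    return [
        [
            sum(a & b for a, b in zip(rows[i], rows[j])) / n_profiles
            for j in range(n_windows)
        ]
        for i in range(n_windows)
    ]


def adjacency_matrix(coseg, threshold):
    """1 where two distinct windows cosegregate above the threshold, else 0."""
    n = len(coseg)
    return [
        [1 if i != j and coseg[i][j] > threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical data: 3 genomic windows (rows) across 4 NPs (columns).
rows = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
coseg = cosegregation_matrix(rows)
adjacency = adjacency_matrix(coseg, threshold=0.25)  # assumed cut-off
for line in adjacency:
    print(line)
```

The diagonal is set to 0 so that a window is not treated as its own neighbour; the choice of threshold determines how dense the resulting graph is.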
A heat map is a visual representation of a matrix that can show different phenomena on a two-dimensional scale. Heat maps use a range of color intensities based on the values and scale of the data. In code, heat maps can be created using libraries such as plotly.express in Python. With cosegregation, heat maps are used to visualize a matrix whose values are either 1 or 0, showing the commonalities between two or more variables. "The primary benefit of using heat maps is that they make otherwise dull or impenetrable data understandable. Many people understand heat maps intuitively, without even needing to be told that those warmer colors indicate a denser focus of interactions."[9]
In the limitation section, there are two heat maps (also shown below for easy viewing) depicting the difference between normalized and un-normalized data. Showing the difference between the graphs helps the researcher identify different patterns based on the intensity of the color gradients as well as the clustering of data points. Cosegregation results, as seen above, can take different forms, and visualizing them as heat maps, like matrices, can aid researchers in understanding which genomic regions are connected.
One limitation of heat maps is that some software does not allow specific points on the graph to be located, especially if there are many variables. Coding libraries such as plotly.express can create interactive heat maps in which the user can hover over a specified point on the graph and read the exact value of the dependent variable. Another limitation is that heat maps do not represent real-time data. Because heat maps work by aggregating data over time, they do not show recent changes in behavior compared to the more dominant patterns already present.[9]
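An interactive heat map of this kind can be sketched with plotly.express. The example assumes plotly is installed; the function name, window labels, and matrix values are hypothetical. The px.imshow call draws the matrix as a heat map, and hovering over a cell shows its row, column, and value.

```python
def cosegregation_heatmap(matrix, window_labels):
    """Build an interactive heat map figure from a square matrix."""
    # Check the shape first so a malformed matrix fails with a clear message.
    n = len(window_labels)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square and match the number of labels")

    # Imported here so the shape check above works even without plotly.
    import plotly.express as px

    return px.imshow(
        matrix,
        x=window_labels,
        y=window_labels,
        zmin=0,
        zmax=1,
        color_continuous_scale="Viridis",
        labels={"x": "Genomic window", "y": "Genomic window", "color": "Cosegregation"},
    )


# Hypothetical cosegregation values for three windows.
example_matrix = [
    [0.75, 0.50, 0.00],
    [0.50, 0.50, 0.00],
    [0.00, 0.00, 0.25],
]
example_labels = ["w1", "w2", "w3"]
# fig = cosegregation_heatmap(example_matrix, example_labels)
# fig.show()  # opens the interactive figure in a browser or notebook
```

Fixing the color range with zmin and zmax keeps colors comparable between a normalized and an un-normalized matrix plotted side by side.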