In genomics, the three-base periodicity property is a characteristic of protein-coding DNA sequences. It can be demonstrated by performing Fourier analysis on numerical signals derived from segments of DNA sequence. Because of its predictive power, it has been used as a preliminary indicator in gene prediction.
DNA sequences can be treated as signals because they are functions of an independent variable, position along the sequence. Signal processing methods can therefore be applied to them once the symbolic string has been mapped to one or more numerical sequences. The periodicity arises from the biased distribution of codon triplets, which is a consequence of the degeneracy of the genetic code; non-coding segments, in contrast, behave approximately like uniformly random sequences and produce no significant peak in the frequency domain.
This property has been examined, tested and derived in a series of papers from different groups. The initial discovery was made in 1980 by Trifonov and Sussman, who observed periodicity in DNA sequences by applying the autocorrelation function to chromatin DNA. Silverman and Linsker defined the Fourier transform of a sequence of bases, described how to Fourier analyze it, and proposed sample applications of the technique. Tsonis, Elsner and Tsonis performed Fourier analysis of coding, non-coding and random sequences and proposed a reason for the 3-periodicity property found in coding sequences. Dodin proposed a method for analyzing the periodicity of DNA sequences based on the correlation function of the symbolic sequence. Tiwari, Ramachandran, Bhattacharya and Ramaswamy examined the signal-to-noise ratio of the period-3 peak within a sliding window over a sequence to identify likely coding regions.
DNA stores the information required to assemble, maintain and reproduce every living organism. A protein is a large molecule (a "macromolecule") made up of smaller subunits, amino acids. Protein-coding DNA is read as codons, stretches of three nucleotides that each correspond to a specific amino acid. DNA is transcribed into RNA, which is then translated into protein. Thus, coding DNA is defined as the sections of the genome that are ultimately translated into the amino acid sequences of proteins, while noncoding DNA consists of the sections that do not code for proteins. Identifying coding regions is important because this information can be used in gene identification and, more generally, in full-genome annotation.
The 3-periodicity property states that the spectral energy
$$|S[k]|^2 = |A[k]|^2 + |T[k]|^2 + |C[k]|^2 + |G[k]|^2$$
of a protein-coding sequence of length $N$ has a peak at $k = N/3$.
Formally, a DNA sequence $s[i]$ is a string over the alphabet $(A, C, T, G)$, for example
$$s[i] = \{A, T, G, C, A, G, C\}.$$
Define a binary signal for each nucleotide, $u_\alpha[i]$, where $\alpha \in \{A, C, T, G\}$:
$$u_\alpha[i] = \begin{cases} 1 & \text{if } s[i] = \alpha \\ 0 & \text{otherwise} \end{cases}$$
This creates four signals that encode the positions of the four nucleotides in the sequence. For the example above, ATGCAGC, the projected signals are
$u_A = 1000100$
$u_C = 0001001$
$u_T = 0100000$
$u_G = 0010010$
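The binary projection described above can be sketched in a few lines of Python. This is only an illustration: the function name `indicator_sequences` and its `alphabet` parameter are hypothetical and do not come from the cited papers.

```python
def indicator_sequences(seq, alphabet="ACGT"):
    """Return a dict mapping each base to its 0/1 indicator list.

    Hypothetical helper: position i of the list for base b is 1 when
    seq[i] == b and 0 otherwise.
    """
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in alphabet}


signals = indicator_sequences("ATGCAGC")
for base, u in signals.items():
    print(base, u)
# A [1, 0, 0, 0, 1, 0, 0]
# C [0, 0, 0, 1, 0, 0, 1]
# G [0, 0, 1, 0, 0, 1, 0]
# T [0, 1, 0, 0, 0, 0, 0]
```

Exactly one of the four signals is 1 at every position, so the four lists always sum to the all-ones sequence.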
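Let $E(i,j)$ be 1 if $s[i] = s[j]$ and 0 otherwise. With positions indexed from 0, the autocorrelation of the symbolic sequence at lag $t$ is
$$c[t] = \sum_{i=t}^{N-1} E(i, i-t).$$
Note that $c[0] = N$, since every position matches itself. For the example sequence ATGCAGC:
$c[0] = 7$
$c[1] = 0$
$c[2] = 0$
$c[3] = 2$
$c[4] = 1$
$c[5] = 0$
$c[6] = 0$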
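As a check on the values above, the symbolic autocorrelation can be computed directly from the string. This is a minimal sketch assuming 0-based positions; the function name `symbolic_autocorrelation` is hypothetical.

```python
def symbolic_autocorrelation(seq):
    """Count, for each lag t, the positions i >= t with seq[i] == seq[i - t]."""
    n = len(seq)
    return [
        sum(1 for i in range(t, n) if seq[i] == seq[i - t])
        for t in range(n)
    ]


c = symbolic_autocorrelation("ATGCAGC")
print(c)  # [7, 0, 0, 2, 1, 0, 0]
```

Because the comparison only asks whether two symbols are equal, no numerical value has to be assigned to the bases.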
For mathematical purposes, a gene sequence can also be viewed as a signal by mapping each nucleotide to an integer in the range $[1,4]$. For the example above and the map
$$\{A \to 1,\ T \to 2,\ C \to 3,\ G \to 4\},$$
the sequence ATGCAGC becomes 1243143.
This process generates a single signal for the sequence, but it raises the questions of which of the 24 (4!) possible maps should be used and what effect the choice has on the analysis. Unlike the other two mapping methods, this one is not invariant to relabeling the bases while keeping the same structure (e.g. AACA -> TTGT), which could be detrimental for some applications.
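The dependence on labeling can be illustrated with a short Python sketch. The helper names `integer_signal` and `symbolic_autocorrelation` are hypothetical; the map is the one used in the example above.

```python
def integer_signal(seq, mapping):
    """Map each base to its integer label."""
    return [mapping[s] for s in seq]


def symbolic_autocorrelation(seq):
    """Count matches seq[i] == seq[i - t] for each lag t (label-free)."""
    n = len(seq)
    return [sum(1 for i in range(t, n) if seq[i] == seq[i - t]) for t in range(n)]


mapping = {"A": 1, "T": 2, "C": 3, "G": 4}

x1 = integer_signal("AACA", mapping)
x2 = integer_signal("TTGT", mapping)
print(x1, x2)  # [1, 1, 3, 1] [2, 2, 4, 2]

# The two strings have the same structure, so the label-free
# autocorrelation is identical even though the integer signals differ.
print(symbolic_autocorrelation("AACA"))  # [4, 1, 1, 1]
print(symbolic_autocorrelation("TTGT"))  # [4, 1, 1, 1]
```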
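Once the DNA sequence has been converted into a numerical sequence, spectral analysis can be performed on that sequence. Recall that the DFT is defined by the analysis equation
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N},$$
and produces an $N$-long complex signal $X[k]$, where the index $k$ corresponds to the angular frequency $\omega = 2\pi k / N$. The index $k = N/3$ therefore corresponds to a period of three samples.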
Recall that the power spectrum of a sequence is equal to the squared magnitude of its DFT,
$$P[k] = |X[k]|^2.$$
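The DFT and power spectrum defined above can be sketched with the Python standard library alone. The direct $O(N^2)$ sum is used here for clarity rather than speed, and the function names are hypothetical. The toy input has a 1 at every third position, so all of its non-DC energy should fall at $k = N/3$ and its harmonic $k = 2N/3$.

```python
import cmath


def dft(x):
    """Direct evaluation of X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    n_total = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_total) for n in range(n_total))
        for k in range(n_total)
    ]


def power_spectrum(x):
    """P[k] = |X[k]|^2."""
    return [abs(value) ** 2 for value in dft(x)]


# Toy indicator signal with perfect period 3 and N = 12, so N/3 = 4.
x = [1, 0, 0] * 4
p = power_spectrum(x)
print([round(value, 6) for value in p])
```

For this input the power is 16 at $k = 0$, $4$ and $8$ and numerically zero elsewhere.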
Tiwari et al. applied the DFT to the analysis of DNA sequences using the binary projection operator. They calculated the spectra of the four projected nucleotide sequences using the DFT. Call them
$$C_A[k],\ C_C[k],\ C_T[k],\ C_G[k],$$
and define the total spectrum as
$$C[k] = |C_A[k]|^2 + |C_C[k]|^2 + |C_T[k]|^2 + |C_G[k]|^2.$$
They then calculated the signal-to-noise ratio of the peak in this spectrum and applied a threshold test to that value to decide whether or not the stretch of DNA is coding.
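A rough Python sketch of this procedure follows. The text does not give the exact definition of the signal-to-noise ratio or the threshold the authors used, so both are assumptions here: the ratio is taken to be the total spectrum at $k = N/3$ divided by its mean over all $k \neq 0$, and `threshold` is a free parameter. All function names are hypothetical.

```python
import cmath


def dft_at(x, k):
    """Single DFT coefficient X[k] of the sequence x."""
    n_total = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_total) for n in range(n_total))


def total_spectrum(seq, alphabet="ACGT"):
    """C[k]: sum over the four bases of the indicator power spectra."""
    indicators = [[1 if s == base else 0 for s in seq] for base in alphabet]
    return [sum(abs(dft_at(u, k)) ** 2 for u in indicators) for k in range(len(seq))]


def period3_snr(seq):
    """ASSUMED definition: C[N/3] divided by the mean of C[k] over k != 0."""
    n_total = len(seq)
    if n_total % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    c = total_spectrum(seq)
    noise = sum(c[1:]) / (n_total - 1)
    if noise == 0:
        return 0.0
    return c[n_total // 3] / noise


def is_probably_coding(seq, threshold):
    """Threshold test; the threshold value is the caller's assumption."""
    return period3_snr(seq) >= threshold


# Toy, perfectly periodic input (not a real gene).
print(period3_snr("ATG" * 10))
```

For the toy input all non-DC energy sits at $k = N/3$ and $k = 2N/3$, which gives a ratio of 14.5 under the assumed definition.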
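Instead of computing a Discrete Fourier Transform on different segments of the signal, the analysis can be performed in the time domain through the use of an anti-notch filter at frequency $N/3$. Such a filter passes the component at $N/3$ and suppresses the content at all other frequencies. The impulse response of an ideal anti-notch filter at $N/3$ is
$$h[n] = e^{j 2\pi n / 3},$$
and thus the frequency content of the output of an anti-notch filtered signal is
$$X[k] = A[N/3] \text{ for } k = N/3, \text{ and zero otherwise},$$
so that its power is $|A[N/3]|^2$.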
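In practice the ideal impulse response has to be truncated. The sketch below assumes a finite filter with `taps` coefficients $h[n] = e^{j 2\pi n/3}$ (a simplification, not necessarily the filter design used in the literature) and returns the output power at each position where the filter fully overlaps the signal. When `taps` is a multiple of 3, the output power equals the period-3 DFT power of the window ending at that position.

```python
import cmath


def antinotch_power(x, taps):
    """Output power |y[m]|^2 of a truncated anti-notch FIR filter.

    y[m] = sum_{n=0}^{taps-1} h[n] * x[m - n], with h[n] = exp(j*2*pi*n/3).
    Only positions m >= taps - 1 are returned (full overlap).
    """
    h = [cmath.exp(2j * cmath.pi * n / 3) for n in range(taps)]
    powers = []
    for m in range(taps - 1, len(x)):
        y = sum(h[n] * x[m - n] for n in range(taps))
        powers.append(abs(y) ** 2)
    return powers


# Toy signal: a period-3 stretch followed by a constant stretch.
x = [1, 0, 0] * 4 + [1] * 12
out = antinotch_power(x, taps=6)
print([round(value, 6) for value in out])
```

The output is high over the periodic stretch and drops to zero once the window lies entirely inside the constant stretch.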
Spectrograms are a good way to view how the frequency content of a signal changes over time (here, along the sequence). The most common way to compute a spectrogram is to compute a Fourier transform over different segments of the signal, convert each frequency magnitude plot into an image, and concatenate those images. This is a useful way to visually identify coding and non-coding regions of DNA and to inspect other patterns that might exist.
Identifying genes in a DNA sequence is harder than just finding which segments are coding, and is nearly impossible to do by visually inspecting spectrograms. Genes are made up of both coding and non-coding regions, called exons and introns respectively. Thus, the transitions between coding and non-coding regions must be examined and analyzed properly to identify genes. Computing the "level" of 3-periodicity over different (possibly overlapping) windows of the sequence generates a plot of 3-periodicity along the sequence.
Long stretches of coding and non-coding sequence can then be classified as exons or introns, and the entire segment heuristically labeled as gene or non-gene.
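The windowed computation can be sketched as follows. The window length, step and threshold are hypothetical parameters, the labels are only illustrative, and grouping labeled windows into exons, introns and genes is beyond this sketch. The input is a toy string, not real genomic data.

```python
import cmath


def period3_power(window, alphabet="ACGT"):
    """Total period-3 power of a window, summed over the four indicator signals."""
    w = cmath.exp(-2j * cmath.pi / 3)
    total = 0.0
    for base in alphabet:
        # Sum of unit phasors, one per occurrence of this base.
        acc = sum(w ** (i % 3) for i, s in enumerate(window) if s == base)
        total += abs(acc) ** 2
    return total


def sliding_period3(seq, window, step):
    """List of (start, power) pairs for each full window."""
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def label_windows(scores, threshold):
    """Label each window by comparing its power with an assumed threshold."""
    return [
        (start, "coding-like" if power >= threshold else "noncoding-like")
        for start, power in scores
    ]


toy = "ATG" * 8 + "A" * 24
scores = sliding_period3(toy, window=12, step=12)
labels = label_windows(scores, threshold=24.0)
print(labels)
```

Because the raw power grows with the square of the window length, any threshold has to be chosen relative to the window size.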
A binary signal can be summarized by its position count function (PCF), which counts the number of ones at phase $s$ for a period $\omega$. For a binary signal $A$ of length $N$ (with $N$ a multiple of $\omega$),
$$C_\omega^A(s) = \sum_{i=0}^{N/\omega - 1} A[\omega i + s].$$
For $\omega = 3$, the three values
$$\left(C_3^A(0),\ C_3^A(1),\ C_3^A(2)\right)$$
count the ones at positions $\{0,3,6,\dots\}$, $\{1,4,7,\dots\}$ and $\{2,5,8,\dots\}$ respectively.
The spectral power of $A$ at $k = N/3$ can be written in terms of the PCFs as
$$|\tilde{A}[N/3]|^2 = \frac{1}{2}\left[\left(C_3^A(0) - C_3^A(1)\right)^2 + \left(C_3^A(1) - C_3^A(2)\right)^2 + \left(C_3^A(2) - C_3^A(0)\right)^2\right].$$
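In other words, the spectral power at $N/3$ measures how much the three PCFs differ from one another: it is zero when they are all equal, and the peak at $N/3$ grows as the differences between them grow.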
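The relationship between the PCFs and the power at $N/3$ can be checked numerically. The sketch below compares the PCF expression with a direct evaluation of the DFT coefficient; the indicator signal `a` is an arbitrary made-up example whose length is a multiple of 3, and the function names are hypothetical.

```python
import cmath


def position_counts(a, period=3):
    """PCF: number of ones at each phase s = 0, ..., period - 1."""
    return [sum(a[s::period]) for s in range(period)]


def power_from_pcf(a):
    """Period-3 power computed from the three position counts."""
    c0, c1, c2 = position_counts(a, 3)
    return 0.5 * ((c0 - c1) ** 2 + (c1 - c2) ** 2 + (c2 - c0) ** 2)


def power_from_dft(a):
    """|A[N/3]|^2 computed directly from the DFT definition."""
    n_total = len(a)
    k = n_total // 3
    coeff = sum(a[n] * cmath.exp(-2j * cmath.pi * k * n / n_total) for n in range(n_total))
    return abs(coeff) ** 2


a = [1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0]  # made-up indicator signal, N = 12
print(position_counts(a))  # [2, 2, 1]
print(power_from_pcf(a), power_from_dft(a))
```

Both routes give a power of 1 for this signal (up to floating-point rounding in the DFT), and a signal with equal counts at all three phases gives 0.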
If the codons are sampled uniformly at random, as they would be in a noncoding segment of DNA, there is a high chance that the PCFs would not differ by a significant amount and the power at that frequency will be low.
However, a protein-coding sequence is a string of codons, each corresponding to an amino acid. Because the genetic code is degenerate (more than one codon maps to a single amino acid) and the sequence is effectively sampled from the amino acids rather than from the codons, the codons are not sampled uniformly, which leads to differences in the PCFs.
There is also empirical evidence that this method works: across multiple studies it has been able to discriminate coding from non-coding DNA segments. These studies are discussed in the next section.
This method has been applied to sequence data from a number of organisms, details of which can be found in the references section. A few will be summarized here.
Tiwari et al., who first applied the DFT to analyzing the periodicity of DNA sequences, applied this method to S. cerevisiae and H. influenzae. For S. cerevisiae, they located 413 out of 483 probable genes (ORFs). For H. influenzae, they located 167 out of 194 identified genes. In both studies, the false-positive rate was zero.
Datta and Asif analyzed the algorithm's ability to identify coding regions of different lengths in chromosome III of C. elegans. Longer coding sequences are detected with higher probability. This seems to be a consequence of the uncertainty principle (signals of shorter duration are more spread out in frequency) and of the fact that shorter sequences provide fewer codons.
The method can be run on any DNA sequence, whereas similarity-based methods such as BLAST, FASTA and Smith-Waterman require previously characterized sequences to compare against.
All four bases are weighted equally, because the total spectral power is
$$|S[k]|^2 = |A[k]|^2 + |T[k]|^2 + |C[k]|^2 + |G[k]|^2,$$
and no one base contributes more than another.
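The measure also does not depend on how the sequence is shifted. This is due to the property that a (circular) shift of a sequence does not change the magnitude of its Discrete Fourier Transform:
$$x[n - n_0] \leftrightarrow X[k]\, e^{-j 2\pi k n_0 / N},$$
$$\left|X[k]\, e^{-j 2\pi k n_0 / N}\right| = |X[k]|.$$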
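The shift property can be verified numerically. Note that for the DFT it holds exactly for circular shifts, which is what the sketch below uses; the signal `x` is an arbitrary made-up example and the function name is hypothetical.

```python
import cmath


def dft_magnitudes(x):
    """|X[k]| for every k, using the direct DFT sum."""
    n_total = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_total) for n in range(n_total)))
        for k in range(n_total)
    ]


x = [1, 0, 0, 1, 1, 0, 0, 0, 1]  # made-up indicator signal
shifted = x[-2:] + x[:-2]          # circular shift by n0 = 2

mags = dft_magnitudes(x)
mags_shifted = dft_magnitudes(shifted)
print(max(abs(a - b) for a, b in zip(mags, mags_shifted)))
```

The printed difference is at the level of floating-point rounding, so the period-3 peak has the same height whichever of the three phases the window happens to start on.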