DNA code construction refers to the application of coding theory to the design of nucleic acid systems for the field of DNA–based computation.
DNA sequences are known to appear in the form of double helices in living cells, in which one DNA strand is hybridized to its complementary strand through a series of hydrogen bonds. For the purpose of this entry, we shall focus on only oligonucleotides. DNA computing involves allowing synthetic oligonucleotide strands to hybridize in such a way as to perform computation. DNA computing requires that the self-assembly of the oligonucleotide strands happen in such a way that hybridization should occur in a manner compatible with the goals of computation.
The field of DNA computing was established in Leonard M. Adelman's seminal paper.[1] His work is significant for a number of reasons:
This capability for massively parallel computation in DNA computing can be exploited in solving many computational problems on an enormously large scale such as cell-based computational systems for cancer diagnostics and treatment, and ultra-high density storage media.[2]
This selection of codewords (sequences of DNA oligonucleotides) is a major hurdle in itself due to the phenomenon of secondary structure formation (in which DNA strands tend to fold onto themselves during hybridization and hence rendering themselves useless in further computations. This is also known as self-hybridization). The Nussinov-Jacobson[3] algorithm is used to predict secondary structures and also to identify certain design criteria that reduce the possibility of secondary structure formation in a codeword. In essence this algorithm shows how the presence of a cyclic structure in a DNA code reduces the complexity of the problem of testing the codewords for secondary structures.
Novel constructions of such codes include using cyclic reversible extended generalized Hadamard matrices, and a binary approach. Before diving into these constructions, we shall revisit certain fundamental genetic terminology. The motivation for the theorems presented in this article, is that they concur with the Nussinov - Jacobson algorithm, in that the existence of cyclic structure helps in reducing complexity and thus prevents secondary structure formation. i.e. these algorithms satisfy some or all the design requirements for DNA oligonucleotides at the time of hybridization (which is the core of the DNA computing process) and hence do not suffer from the problems of self - hybridization.
A DNA code is simply a set of sequences over the alphabet
l{Q}=\{A,T,C,G\}
Each purine base is the Watson-Crick complement of a unique pyrimidine base (and vice versa) – adenine and thymine form a complementary pair, as do guanine and cytosine. This pairing can be described as follows –
\bar{A}=T,\bar{T}=A,\bar{C}=G,\bar{G}=C
Such pairing is chemically very stable and strong. However, pairing of mismatching bases does occur at times due to biological mutations.
Most of the focus on DNA coding has been on constructing large sets of DNA codewords with prescribed minimum distance properties.For this purpose let us lay down the required groundwork to proceed further.
Let
q=q1q2...qn
n
l{Q}
1\leqslanti\leqslantj\leqslantn
q[i,j]
qiqi+1...qj
q
qR
qRC=\bar{qn}\bar{qn-1
\bar{qi}
qi
For any pair of length-
n
p
q
l{Q}
dH(p,q)
i
pi ≠ qi
R(p,q) | |
dH |
=
R) | |
d | |
H(p,q |
RC | |
d | |
H |
(p,q)=
RC | |
d | |
H(p,q |
)
RC
Another important code design consideration linked to the process of oligonucleotide hybridization pertains to the GC content of sequences in a DNA code. The GC-content,
wGC(q)
q=q1q2...qn
i
qi\in\{G,C\}
w
A generalized Hadamard matrix
H\equivH(n,Cm)
n
x
n
m
Cm=\{e-2\pi\midl=0,...,m-1\}
HH*
nI
I
n
m=p
p
H(n,Cp)
{p}|{n}
E(n,Zp)
H(n,Cp)
n x n
Zp=\{0,1,2,...,p-1\}
(e-2\piil/m)
H(n,Cp)
l
GF(p)
Here, the elements of
E
GF(p)
By definition, a generalized Hadamard matrix
H
(n-1) x (n-1)
H
H
E
Also, the rows of such an exponent matrix satisfy the following two properties: (i) in each of the nonzero rows of the exponent matrix, each element of
Zp
n/p
n(p-1)/p
Let
Cp |
=\left\{1,x,x2,\ldots,x\right\}
x
x=\exp(2\piij/p)
p
p>2
A=
ai | |
(x |
)
B=
bi | |
(x |
)
Cp
N=pt
t
Q=\{ai-bi\modp:i=1,2,\ldots,N\}
nq |
q
GF(p)
Q
Vector
Q
q
GF(p)
Q
t
nq |
=t,q=0,1,\ldots,p-1
The following lemma is of fundamental importance in constructing generalized Hadamard codes.
Lemma. Orthogonality of vectors over
Cp |
p
A,B
N=pt
Cp |
Q
Q
\modp
A,B
Let
V
N
GF(p)
p
V
a(V)
N
N
N
V
V*
a(V*)
a(V)
The goal here is to find cyclic matrix
E=
Ec |
GF(p)
N=pn-1
E
K
g(x)
GF(p)
xN-1 |
K
N
g(x)
N-n
g(x)
N
E
N
xN-1=g(x)h(x)
h(x)
[0,g(x)]
GF(p)
0,1,\ldots,p-1
Consider a monic irreducible polynomial
h(x)
GF(p)
n
g(x)
N-n
g(x)h(x)=xN-1
[0,g(x)]
GF(p)
h(x)|xN-1
g(x)\mod(xN-1)
K
N
H(p,pn)
H(3,9)
h(x)=x2+x+2
g(x)=x6+2x5+2x4+2x2+x+1
g
\{0,1,6\}
\mod8
Let
p
N+1=pn
g(x)
N-n
C=[c0,c1,\ldots,cN-1]
GF(p)
C=[c0,c1,...,cN-1]
g(x)h(x)=xN-1
h(x)
n
Then there exists a p-ary linear cyclic code
\bar{K}
N
K=[0,\bar{K}]
H(p,pn)=xK
x=e2
H
Proof:
First note that
g(x)
xN-1
N-n
Ec
H
Given that
C
GF(p)
C
{Ec}
{Ec}
C
We also see that augmentation of each codeword of
{Ec}
\modp
K
xK
H
Thus from the above property, we see that the core of
E
N=pk-1
Zp
E
(N+1)/p=pk-1
(N+1)(p-1)/p=(p-1)pk-1
N
E
N
N
Zp
Zp
(p-1)pk-1
The following can be inferred from the theorem as explained above. (For more detailed reading, the reader is referred to the paper by Heng and Cooke.[4]) Let
N=pk-1
{p}
{k}\inZ+
g(x)=c0+c1x+c2x2+...+cN-kxN-k
Zp
g(x)h(x)=xN-1
Zp
h(x)\inZp[x]
({c}0,{c}1,\ldots,{c}N-k,{c}N-k+1,\ldots,{c}N-1)
ci=0
Zp
N
g=(c0,c1,\ldots,cN-1)
DNA codes with constant GC-content can obviously be constructed from constant-composition codes (A constant composition code over a k-ary alphabet has the property that the numbers of occurrences of the k symbols within a codeword is the same for each codeword) over
Zp
Zp
l{Q}=\{A,T,C,G\}
3k-1
Z3
0
A
1
T
2
G
l{D}
3k-1
3k
dH |
=2.3k
\bar{G
l{D}
C
RC | |
d | |
H |
(l{D})\geq3k
For any
k\inZ+
D
{3}k-1
{3}k-1
{3}k-1
RC | |
d | |
H |
(D)\geq{3}k-1
g
Each of the following vectors generates a cyclic core of a Hadamard matrix
H(p,pn)
N+1=
pn |
n=3
g(1)=(22201221202001110211210200)
g(2)=(20212210222001012112011100)
Where,
{g(x)}=a0+a1x+...+
n | |
a | |
nx |
Thus, we see how DNA codes can be obtained from such generators by mapping
{0,1,2}
{A,T,G}
We see that all such mappings yield codes with essentially the same parameters. However the actual choice of mapping has a strong influence on the secondary structure of the codewords. For example, the codeword illustrated was obtained from
{g(1)
0-A;1-T;2-G
{g(2)
{g(1)
0-G;1-T;2-A
Perhaps a simpler approach to building/designing DNA codewords is by having a binary mapping by looking at the design problem as that of constructing the codewords as binary codes. i.e. map the DNA codeword alphabet
l{Q}
A\to00
T\to01
C\to10
G\to11
As we can see, the first bit of a binary image clearly determines which complementary pair it belongs to.
Let
q
{b(q)}
q
q
Now, let
b(q)=b0b1b2...b2n-1
Now, let the subsequence
e(q)=b0b2...b2n-2
{b(q)}
o(q)=b1b3b5\ldotsb2n-1
{b(q)}
Thus, for example, for
q=ACGTCC
b(q)=001011011010
Then
e(q)=011011
o(q)=001100
Let us define an even component as
l{E}(l{C})=\{e(x):x\inl{C}\}
l{O}(l{C})=\{o(x):x\inl{C}\}
From this choice of binary mapping, the GC-content of DNA sequence
q
{e(q)}
Hence, a DNA code
l{C}
l{E}(l{C})
Let
l{B}
M
n
{dmin
c\inl{B}
\bar{c
For
w>0
l{Bw
{wH( ⋅ )}
w>0
n\geq2w+\lceil
dmin/2 |
\rceil
l{C}w
l{E}=\left\{a\bar{b}:a,b\inl{B}w\right\}
l{O}=\left\{abRC:a,b\inl{B},a<lexb\right\}
Where
<lex
a<lexb
l{O}
abRC\inl{O}
baRC\notinl{O}
l{O}
The code
l{E}w
{\left\vertl{B}w\right\vert}2
2n
n
Furthermore,
dH(l{E} | |
w |
\geq
dmin) |
R | |
dH |
(l{E}w\geq
dmin) |
l{B}w
l{B}
Also,
dH(a |
\bar{b},dRCcR)=
RC | |
dH(a,d |
)+
dH(\bar{b}, |
cR)=
dH(a, |
dRC)+
dH(c, |
bRC)
Note that
b
d
w
bRC
dRC
n-w
And due to the weight constraint on
w
a,b,c,d\inl{B}w
dH(a |
\bar{b},dRCcR)\geq2\lceil
dmin |
/2\rceil\geq
dmin |
Thus, the code
l{O}
M(M-1)/2
2n
From this, we see that
{dH}((lO))\geq{dmin
l(O)
l{B}
Similarly,
RC | |
{d | |
H |
Therefore, the DNA code
l{C}=
wmax | |
cup | |
w=dmin |
l{C}w
{wmax
1 | |
2 |
M(M-1)
wmax | |
\sum | |
w=dmin |
\left\vert
2 | |
{A | |
w} |
\right\vert
2n
dH(l{B})\geq |
dmin |
RC | |
dH |
(l{B})\geq
dmin |
From the examples listed above, one can wonder what could be the future potential of DNA-based computers?
Despite its enormous potential, this method is highly unlikely to be implemented in home computers or even computers at offices, etc. because of the sheer flexibility and speed as well as cost factors that favor silicon chip based devices used for the computers today.[2]
However, such a method could be used in situations where the only available method is this and requires the accuracy associated with the DNA hybridization mechanism; applications which require operations to be performed with a high degree of reliability.
Currently, there are several software packages, such as the Vienna package,[7] which can predict secondary structure formations in single stranded DNAs (i.e. oligonucleotides) or RNA sequences.