Duplex sequencing is a library preparation and analysis method for next-generation sequencing (NGS) platforms that employs random tagging of double-stranded DNA to detect mutations with higher accuracy and lower error rates.
This method uses degenerate molecular tags in addition to sequencing adapters to recognize reads originating from each strand of DNA. As the two strands are complementary, true mutations are found at the same position in both strands. In contrast, PCR or sequencing errors result in mutations in only one strand and can thus be discounted as technical error. Duplex sequencing can theoretically detect mutations with frequencies as low as 5 × 10⁻⁸, which is more than 10,000 times more accurate than conventional next-generation sequencing methods.[1] [2]
The estimated error rate of standard next-generation sequencing platforms is 10⁻² to 10⁻³ per base call. At this error rate, the billions of base calls produced by an NGS run will contain millions of errors. Errors are introduced during sample preparation and sequencing, for example during polymerase chain reaction (PCR) amplification, the sequencing reaction, and image analysis. While this error rate is acceptable in some applications, such as detection of clonal variants, it is a major limitation for applications that require higher accuracy to detect low-frequency variants, such as detection of intra-organismal mosaicism, subclonal variants in genetically heterogeneous cancers, or circulating tumor DNA.[3] [4] [5]
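The scale of the problem can be checked with simple arithmetic. The sketch below assumes a hypothetical run of one billion base calls; the run size and the function name are illustrative and do not come from any particular platform.

```python
def expected_errors(base_calls: int, error_rate: float) -> float:
    """Expected number of erroneous base calls at a given per-base error rate."""
    return base_calls * error_rate


# Hypothetical run size, chosen only for illustration.
run_size = 1_000_000_000

# Error-rate range quoted in the text: 10^-3 to 10^-2 per base call.
low = expected_errors(run_size, 1e-3)
high = expected_errors(run_size, 1e-2)
print(f"{low:,.0f} to {high:,.0f} expected erroneous base calls")
```

Under these assumptions the expected number of errors falls in the millions, which is far larger than the number of true low-frequency variants in most samples.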
Several library preparation strategies have been developed to increase the accuracy of NGS platforms, such as molecular barcoding and circular consensus sequencing.[6] [7] [8] [9] As with standard NGS, the data generated by these methods originates from a single strand of DNA, and therefore errors introduced during PCR amplification, tissue processing, DNA extraction, hybridization capture (where used), or DNA sequencing itself can still be mistaken for true variants. The duplex sequencing method addresses this problem by taking advantage of the complementary nature of the two strands of DNA and confirming only variants that are present in both strands. Because the probability of two complementary errors arising at the same location in both strands is exceedingly low, duplex sequencing increases the accuracy of sequencing significantly.[10]
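A rough feel for why coincident errors are rare can be had from a toy model. The sketch below is not the published error model: it assumes that errors on the two strands are independent, that both strands share the same per-base error rate, and that an erroneous base is equally likely to be any of the three alternative bases. The function name and the example rate are illustrative.

```python
def complementary_error_probability(per_strand_error: float) -> float:
    """Toy estimate of the chance that both strands carry matching errors.

    Assumptions (not from the source): strand errors are independent, and a
    wrong base is uniformly one of the three alternatives, so the second
    strand has a 1-in-3 chance of showing the complementary substitution.
    """
    both_wrong = per_strand_error * per_strand_error
    return both_wrong / 3.0


# Example with an illustrative per-strand error rate of 1 in 1,000.
single = 1e-3
paired = complementary_error_probability(single)
print(f"single-strand error rate: {single:.1e}")
print(f"matching error on both strands: {paired:.1e}")
```

Real error processes are not independent (for example, damage before amplification can affect later steps), so this only illustrates the direction and rough size of the effect.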
Duplex sequencing tagged adapters can be used in combination with the majority of NGS adapters. In the figures and workflow section of this article, Illumina sequencing adapters are used as an example following the original published protocol.
Two oligonucleotides are used for this step (Figure 1: Adapter oligos). One of the oligonucleotides contains a 12-nucleotide single-stranded random tag sequence followed by a fixed 5-nucleotide sequence (black sequence in Figure 1). In this step, the oligonucleotides are annealed at their complementary region by incubation under the required time and temperature conditions.
The adapters that annealed successfully are extended and synthesized by a DNA polymerase to complete a double-stranded adapter containing complementary tags (Figure 1).
The extended double-stranded adapters are cleaved by HpyCH4III at a specific restriction site located on the 3’ side of the tag sequence, producing a 3’-dT overhang that is ligated to the 3’-dA overhang on the DNA library fragments in the adapter ligation step (Figure 1).
Double-stranded DNA is sheared using one of these methods: sonication, enzymatic digestion, or nebulization. Fragments are size selected using Ampure XP beads. Gel-based size selection is not recommended, since it can cause melting of DNA double strands and DNA damage due to UV exposure. The size-selected DNA fragments are then subjected to 3’-end dA-tailing.
In this step, two tagged adapters are ligated through their 3’-dT tails to the 3’-dA tails on both sides of the double-stranded DNA library fragments. This process results in double-stranded library fragments that contain two random tags (α and β) on each side that are the reverse complement of each other (Figure 1 and 2). The DNA-to-adapter ratio is crucial in determining the success of ligation.
In the last step of duplex sequencing library preparation, Illumina sequencing adapters are added to the tagged double-stranded libraries by PCR amplification using primers that contain the sequencing adapters. During PCR amplification, both complementary strands of DNA are amplified, generating two types of PCR products. Product 1 derives from strand 1 and has a unique tag sequence (called α in Figure 2) next to Illumina adapter 1, whereas product 2 has a unique tag (called β in Figure 2) next to Illumina adapter 1. (In each strand, tag α is the reverse complement of tag β and vice versa.) The libraries containing duplex tags and Illumina adapters are sequenced using the Illumina TruSeq system. Reads originating from each single strand of DNA form a group of reads (a tag family) that share the same tag. The detected families of reads are used in the next step for analyzing the sequencing data.
Adapter ligation efficiency is critical to successful duplex sequencing. An excess of library DNA or of adapters upsets the DNA-to-adapter balance, resulting in inefficient ligation or an excess of primer dimers, respectively. Therefore, it is important to keep the molar ratio of DNA to adapter at the optimal value (0.05).
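The ratio can be turned into a quantity of adapter with a short calculation. The sketch below reads the figure of 0.05 as moles of DNA divided by moles of adapter, and it converts DNA mass to moles using an average mass of about 660 g/mol per base pair, which is a common approximation and an assumption here rather than a value from the protocol. All names and the example input are illustrative.

```python
# Approximate average molar mass of one double-stranded base pair (assumption).
AVG_BP_MASS_G_PER_MOL = 660.0


def dsdna_pmol(mass_ng: float, length_bp: float) -> float:
    """Convert a mass of double-stranded DNA (ng) to picomoles of fragments."""
    return mass_ng * 1000.0 / (length_bp * AVG_BP_MASS_G_PER_MOL)


def adapter_pmol_needed(dna_pmol: float, dna_to_adapter_ratio: float = 0.05) -> float:
    """Picomoles of adapter that give the requested DNA:adapter molar ratio."""
    return dna_pmol / dna_to_adapter_ratio


# Hypothetical input: 100 ng of fragments averaging 300 bp.
library_pmol = dsdna_pmol(mass_ng=100.0, length_bp=300.0)
adapters = adapter_pmol_needed(library_pmol)
print(f"library: {library_pmol:.3f} pmol, adapter needed: {adapters:.2f} pmol")
```

With a ratio of 0.05 the adapter is in 20-fold molar excess over the library fragments.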
The efficiency of duplex sequencing depends on the final number of duplex consensus sequences (DCSs), which is directly related to the number of reads in each family (family size). If the family size is too small, the DCS cannot be assembled; if too many reads share the same tag, the data yield will be low. Family size is determined by the amount of DNA template used for PCR amplification and the fraction of a sequencing lane dedicated to the library. The optimal tag family size is between 6 and 12 members. To obtain the optimal family size, the amount of DNA template and the dedicated sequencing lane fraction need to be adjusted. The following formula takes into account the most important variables that affect depth of coverage: N = 40DG ÷ R, where N is the number of reads, D is the desired depth of coverage, G is the size of the DNA target in base pairs, and R is the final read length.
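The read-count formula is easy to evaluate directly. The following sketch implements N = 40DG ÷ R as written above; the function name and the example inputs are illustrative assumptions.

```python
def reads_required(depth: float, target_bp: float, read_length: float) -> float:
    """Number of reads N from the formula N = 40 * D * G / R.

    depth       -- D, desired depth of coverage
    target_bp   -- G, size of the DNA target in base pairs
    read_length -- R, final read length
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return 40.0 * depth * target_bp / read_length


# Hypothetical example: 1,000x depth over a 10 kb target with 100-base reads.
n_reads = reads_required(depth=1000, target_bp=10_000, read_length=100)
print(f"{n_reads:,.0f} reads")
```

Because N scales linearly with both depth and target size, doubling either one doubles the sequencing required, which is why the method is mostly applied to small targets.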
Each duplex sequencing read contains a fixed 5-nucleotide sequence (shown in the figures in black) adjacent to the 12-nucleotide tag sequence. Reads are discarded if they do not have the expected 5-nucleotide sequence or if they have more than nine identical or ambiguous bases within a tag. The two 12-nucleotide tags at each end of the reads are combined and moved to the read header. Two families of reads are formed that originate from the two strands of DNA. One family contains reads with an αβ header originating from strand 1, and the second contains reads with a βα header originating from strand 2 (Figure 2). The reads are then trimmed by removing the fixed 5-nucleotide sequence and 4 error-prone nucleotides located at the sites of ligation and end repair. The remaining reads are assembled into consensus sequences using single-strand consensus sequence (SSCS) and duplex consensus sequence (DCS) assemblies.
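The tag handling described above can be sketched in a few lines. This is a simplified illustration, not the published pipeline. It assumes that each read begins with the 12-nucleotide tag followed directly by the fixed sequence, that the fixed sequence is the placeholder value `TGACT`, and that the tag filter counts occurrences of any one symbol (including `N`) rather than consecutive runs. All function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Tuple

TAG_LEN = 12          # length of the random tag
FIXED_SEQ = "TGACT"   # placeholder fixed sequence (assumption)
END_TRIM = 4          # error-prone bases removed after the fixed sequence
MAX_REPEAT = 9        # reject tags with more than nine copies of one symbol


def split_read(seq: str, fixed_seq: str = FIXED_SEQ) -> Optional[Tuple[str, str]]:
    """Return (tag, trimmed_insert), or None if the read fails the filters."""
    tag = seq[:TAG_LEN]
    spacer = seq[TAG_LEN:TAG_LEN + len(fixed_seq)]
    if len(tag) < TAG_LEN or spacer != fixed_seq:
        return None  # fixed sequence missing or read too short
    if max(Counter(tag).values()) > MAX_REPEAT:
        return None  # low-complexity or mostly ambiguous tag
    insert = seq[TAG_LEN + len(fixed_seq) + END_TRIM:]
    return tag, insert


def process_pair(read1: str, read2: str) -> Optional[Tuple[str, str, str]]:
    """Combine the tags of a read pair into one 24-nucleotide header tag."""
    first = split_read(read1)
    second = split_read(read2)
    if first is None or second is None:
        return None
    (tag1, insert1), (tag2, insert2) = first, second
    return tag1 + tag2, insert1, insert2


# Hypothetical read pair: tag + fixed sequence + 4 trimmed bases + insert.
r1 = "ACGTACGTACGT" + "TGACT" + "GGGG" + "TTTTCCCC"
r2 = "TTGCATGCAAGC" + "TGACT" + "AAAA" + "GGGGAAAA"
print(process_pair(r1, r2))
```

A pair whose mate is read in the opposite order would receive the swapped header (tag 2 followed by tag 1), which is what later allows the two strands of one molecule to be matched.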
Trimmed sequences from the previous step are aligned to the reference genome using the Burrows–Wheeler Aligner (BWA), and unmapped reads are removed. Aligned reads that have the same 24-nucleotide tag sequence and genomic region are detected and grouped (family αβ and βα in Figure 2). Each group represents a “tag family.” Tag families with fewer than three members are not analyzed. To remove errors that arise during PCR amplification or sequencing, mutations that are supported by fewer than 70% of the members (reads) are filtered out of the analysis. A consensus sequence is then generated for each family using the bases that agree at each position of the remaining reads. This consensus sequence is called the single-strand consensus sequence (SSCS). It improves on NGS accuracy by about 20-fold; however, it relies on sequencing information from a single strand of DNA and is therefore sensitive to errors introduced before PCR amplification or during its first round.
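The SSCS step can be illustrated with a minimal consensus caller. The sketch assumes that reads in a family have already been aligned and trimmed to the same length, and it writes `N` at any position where no base reaches the 70% threshold; masking with `N` is a simplification chosen for this example. Function names and the input layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

MIN_FAMILY_SIZE = 3   # families with fewer reads are not analyzed
MIN_AGREEMENT = 0.7   # fraction of reads that must support a base


def single_strand_consensus(reads: List[str]) -> Optional[str]:
    """Collapse one tag family into a consensus, or None if it is too small."""
    if len(reads) < MIN_FAMILY_SIZE:
        return None
    consensus = []
    for column in zip(*reads):  # assumes equal-length, aligned reads
        base, count = Counter(column).most_common(1)[0]
        supported = count / len(reads) >= MIN_AGREEMENT
        consensus.append(base if supported else "N")
    return "".join(consensus)


def build_sscs(
    tagged_reads: Iterable[Tuple[str, int, str]]
) -> Dict[Tuple[str, int], str]:
    """Group (tag, position, sequence) records into families and collapse them."""
    families: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for tag, position, seq in tagged_reads:
        families[(tag, position)].append(seq)
    result = {}
    for key, seqs in families.items():
        consensus = single_strand_consensus(seqs)
        if consensus is not None:
            result[key] = consensus
    return result


# One read in four disagrees at the third base; 75% support keeps the G.
print(single_strand_consensus(["ACGT", "ACGT", "ACGT", "ACTT"]))
```

An error that occurs late in amplification appears in a minority of the family and is voted out, whereas an error in the first PCR cycle can be inherited by most of the family and survive this step.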
The consensus reads from the previous step are realigned to the reference genome. In this step, SSCS family pairs that have complementary tags are grouped (family αβ and βα in Figure 2). These reads originate from the two complementary strands of one DNA molecule. High-confidence sequences are selected based on perfectly matched base calls between the two families. The final sequence is called the duplex consensus sequence (DCS). True mutations are those that match perfectly between the complementary SSCSs. This step filters out the remaining errors that arose during the first round of PCR amplification or during sample preparation.
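The pairing of complementary families can be sketched as follows. The example assumes that SSCSs are stored by (24-nucleotide tag, position), that both members of a pair are already in reference orientation after alignment, and that positions where the two strands disagree are masked with `N`. These choices and all names are assumptions made for illustration.

```python
from typing import Dict, Tuple

TAG_LEN = 12  # length of each of the two tags in the 24-nucleotide header


def swap_tag(tag: str) -> str:
    """Turn an alpha-beta header tag into the matching beta-alpha tag."""
    return tag[TAG_LEN:] + tag[:TAG_LEN]


def duplex_consensus(sscs_a: str, sscs_b: str) -> str:
    """Keep bases on which both strand consensuses agree; mask the rest."""
    return "".join(a if a == b else "N" for a, b in zip(sscs_a, sscs_b))


def build_dcs(sscs_by_key: Dict[Tuple[str, int], str]) -> Dict[Tuple[str, int], str]:
    """Pair each SSCS with its complementary-tag partner at the same position."""
    result = {}
    for (tag, position), seq in sscs_by_key.items():
        partner_tag = swap_tag(tag)
        partner = sscs_by_key.get((partner_tag, position))
        # Process each pair once, keyed by the alphabetically smaller tag.
        if partner is not None and tag < partner_tag:
            result[(tag, position)] = duplex_consensus(seq, partner)
    return result


alpha, beta = "AAAACCCCGGGG", "TTTTGGGGCCCC"
sscs = {
    (alpha + beta, 100): "ACGTA",   # strand 1 family
    (beta + alpha, 100): "ACTTA",   # strand 2 family, differs at one base
    ("CCCCCCCCCCCCGGGGGGGGGGGG", 200): "ACGT",  # no partner, so no DCS
}
print(build_dcs(sscs))
```

In this sketch a family whose two tags are identical cannot be paired, because the two strands would be indistinguishable by tag alone.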
The high error rate (10⁻² to 10⁻³) of standard NGS platforms, introduced during sample preparation or sequencing, is a major limitation for the detection of variants present in a small fraction of cells. Through the duplex tagging system and the use of information from both strands of DNA, duplex sequencing reduces the sequencing error rate by about 10 million-fold when both the SSCS and DCS methods are applied.
It is challenging to identify rare variants accurately using standard NGS methods, which have an error rate of 10⁻² to 10⁻³. Errors that occur early during sample preparation can be detected as rare variants. An example of such errors is the C>A/G>T transversion, which is detected at low frequencies in deep sequencing or targeted capture data and arises from DNA oxidation during sample preparation.[11] These false-positive variants are filtered out by the duplex sequencing method, since a mutation must match in both strands of DNA to be validated as a true mutation. Duplex sequencing can theoretically detect mutations with frequencies as low as 10⁻⁸, compared to the 10⁻² rate of standard NGS methods.
Another advantage of duplex sequencing is that it can be used in combination with the majority of NGS platforms without making significant changes to the standard protocols.
Because duplex sequencing uses information from both strands of DNA to achieve its higher accuracy, it needs a much higher sequencing depth and is therefore a costly approach. At present, the expense limits its application to targeted and amplicon sequencing and makes it impractical for whole-genome sequencing. However, applying duplex sequencing to larger DNA targets will become more feasible as the cost of NGS decreases.
Duplex sequencing is a relatively new method, and its efficiency has been studied only in limited applications, such as detecting point mutations using targeted capture sequencing.[12] More studies are needed to extend the application and feasibility of duplex sequencing to more complex samples with larger numbers of mutations, indels, and copy number variations.
The significant increase in sequencing accuracy provided by duplex sequencing has had an important impact on applications such as detection of rare human genetic variants, detection of subclonal mutations involved in mechanisms of resistance to therapy in genetically heterogeneous cancers, screening of variants in circulating tumor DNA as a non-invasive biomarker, and prenatal screening for genetic abnormalities in a fetus.
Another application for duplex sequencing is in the detection of DNA/RNA copy numbers by estimating the relative frequency of variants. A method for counting PCR template molecules with application to next-generation sequencing is an example.
A list of required tools and packages for SSCS and DCS analysis can be found online.