(19)
(11)EP 2 986 741 B1

(12)EUROPEAN PATENT SPECIFICATION

(45)Mention of the grant of the patent:
24.07.2019 Bulletin 2019/30

(21)Application number: 14785836.9

(22)Date of filing:  17.04.2014
(51)International Patent Classification (IPC): 
C12Q 1/6869(2018.01)
C12N 15/10(2006.01)
(86)International application number:
PCT/SG2014/000172
(87)International publication number:
WO 2014/171898 (23.10.2014 Gazette  2014/43)

(54)

METHOD FOR GENERATING EXTENDED SEQUENCE READS

VERFAHREN ZUR ERZEUGUNG ERWEITERTER SEQUENZAUSLESUNGEN

MÉTHODE DE GÉNÉRATION DE LECTURES DE SÉQUENCES ÉTENDUES


(84)Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

(30)Priority: 17.04.2013 SG 201302940

(43)Date of publication of application:
24.02.2016 Bulletin 2016/08

(60)Divisional application:
19179386.8

(73)Proprietor: Agency for Science, Technology and Research
Singapore 138632 (SG)

(72)Inventors:
  • QUAKE, Stephen R.
    Stanford California 94305 (US)
  • BURKHOLDER, William F.
    Singapore 138673 (SG)
  • HONG, Lewis Z.
    Singapore 138673 (SG)

(74)Representative: Mewburn Ellis LLP 
City Tower 40 Basinghall Street
London EC2V 5DE (GB)


(56)References cited:
WO-A1-2011/074960
WO-A1-2012/061832
WO-A2-2011/143231
US-A1- 2012 283 145
WO-A1-2012/000445
WO-A1-2013/036929
AU-A1- 2011 274 090
  
  • DANIEL N FRANK: "BARCRAWL and BARTAB: software tools for the design and implementation of barcoded primers for highly multiplexed DNA sequencing", BMC BIOINFORMATICS, vol. 10, no. 1, 29 October 2009 (2009-10-29), pages 1-13, XP055289231, GB ISSN: 1471-2105, DOI: 10.1186/1471-2105-10-362
  • ASAN ET AL: "Paired-End Sequencing of Long-Range DNA Fragments for De Novo Assembly of Large, Complex Mammalian Genomes by Direct Intra-Molecule Ligation", PLOS ONE, vol. 7, no. 9, 27 September 2012 (2012-09-27), page e46211, XP055210637, DOI: 10.1371/journal.pone.0046211
  • LEWIS Z HONG ET AL: "BAsE-Seq: a method for obtaining long viral haplotypes from short sequence reads", GENOME BIOLOGY, BIOMED CENTRAL LTD., LONDON, GB, vol. 15, no. 11, 19 November 2014 (2014-11-19), page 517, XP021207721, ISSN: 1465-6906, DOI: 10.1186/S13059-014-0517-9
  • BERGLUND EC ET AL.: 'Next-generation sequencing technologies and applications for human genetic history and forensics' INVESTIGATIVE GENETICS vol. 2, no. 1, 2011, page 23, XP021094353 DOI: 10.1186/2041-2223-2-23
  • FRANK DN: 'BARCRAWL and BARTAB: software tools for the design and implementation of barcoded primers for highly multiplexed DNA sequencing' BMC BIOINFORMATICS vol. 10, no. 1, 29 October 2009, pages 1 - 13, XP055289231 DOI: 10.1186/1471-2105-10-362
  • ASAN ET AL.: 'Paired-end sequencing of long-range DNA fragments for de novo assembly of large, complex Mammalian genomes by direct intra-molecule ligation' PLOS ONE vol. 7, no. 9, 27 September 2012, page E46211, XP055210637 DOI: 10.1371/JOURNAL.PONE.0046211
  • LI R ET AL.: 'SOAP: short oligonucleotide alignment program' BIOINFORMATICS vol. 24, no. 5, 01 January 2008, pages 713 - 714, XP001503358 DOI: 10.1093/BIOINFORMATICS/BTN025
  • COSTEA PI ET AL.: 'TagGD: fast and accurate software for DNA Tag generation and demultiplexing' PLOS ONE vol. 8, no. 3, 04 March 2013, page E57521, XP055289247 DOI: 10.1371/JOURNAL.PONE.0057521
  • 'CASAVA 1.8.2 Quick Reference Guide' ILLUMINA 01 October 2011, pages 1 - 28, XP055289257 Retrieved from the Internet: <URL:http://support.illumina.com/content/dam/illumina-support/documents/myillumina/212b4ea1-8658-4505-9b42-008eb0a8b300/casava_qrg_15011197c.pdf> [retrieved on 2014-09-23]
  • LIU Y ET AL.: 'Long read alignment based on maximal exact match seeds' BIOINFORMATICS vol. 28, no. 18, 15 September 2012, pages I318 - I324, XP055289262 DOI: 10.1093/BIOINFORMATICS/BTS414
  • ETTER PD ET AL.: 'Local de novo assembly of RAD paired-end contigs using short sequencing reads' PLOS ONE vol. 6, no. 4, 12 April 2011, page E18561, XP055289266 DOI: 10.1371/JOURNAL.PONE.0018561
  
Note: Within nine months from the publication of the mention of the grant of the European patent, any person may give notice to the European Patent Office of opposition to the European patent granted. Notice of opposition shall be filed in a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art. 99(1) European Patent Convention).


Description

FIELD OF THE INVENTION



[0001] The present invention relates to a method for generating extended sequence reads and in particular, but not exclusively, to a method for generating extended sequence reads of large genomes.

BACKGROUND



[0002] The current maximum read length of next-generation sequencing technologies, such as those developed by Illumina® and Life Technologies™, is around 250 bases. The read length is one of the main factors that determine the quality of a genome assembly. In general, longer reads make better assemblies because they span more repeats.

[0003] Furthermore, increasing the read length of next-generation sequencing technologies enables broader applications, such as sequencing larger genomes, generating extended sequence reads, and performing long-range haplotype analysis on targeted genomic regions.

[0004] Therefore, there is a need for an approach to increase the read length of these commercially available sequencing platforms to several kilobases.

[0005] WO2012/061832 discloses artificial transposon sequences having code tags and target nucleic acids containing such sequences.

[0006] AU 2011 274 090 describes a PCR sequencing method comprising 1) providing the sample; 2) amplifying; 3) mixing; 4) breaking; 5) sequencing; 6) splicing. It provides primer tags used in the method as well as the use of the method in genotyping, especially in HLA analysis.

[0007] Berglund et al., (2011) Investigative Genetics 2:23 describe next-generation sequencing technologies and applications for human genetic history and forensics.

SUMMARY OF THE INVENTION



[0008] The invention is as defined in the appended claims.

[0009] In accordance with a first method, there is provided a method for generating extended sequence reads of long DNA molecules in a sample, comprising the steps of:
  (i) assigning a specific barcode sequence to each template DNA molecule in a sample to obtain barcode-tagged molecules;
  (ii) amplifying the barcode-tagged molecules;
  (iii) fragmenting the amplified barcode-tagged molecules to obtain barcode-containing fragments;
  (iv) juxtaposing the barcode-containing fragments to random short segments of the original DNA template molecule during the process of generating a sequencing library to obtain demultiplexed reads; and
  (v) assembling the demultiplexed reads to obtain extended sequence reads for each DNA template molecule.


[0010] Preferably, the method further comprises the step of labelling the amplified barcode-tagged molecules with biotin.

[0011] Preferably, the step of fragmenting the amplified barcode-tagged molecules comprises the step of subjecting the amplified barcode-tagged molecules to unidirectional deletion from the barcode-distal end of the barcode-tagged molecules to obtain barcode-containing fragments.

[0012] Preferably, the step of fragmenting the amplified barcode-tagged molecules comprises the steps of:
  (i) creating a nick at the barcode-distal end of the amplified barcode-tagged molecules;
  (ii) performing a nick translation towards the barcode-proximal end; and
  (iii) treating the resulting molecules with endonuclease to generate blunt ends, to obtain barcode-containing fragments.


[0013] Preferably, the step of fragmenting the amplified barcode-tagged molecules comprises the step of performing random fragmentation by a mechanical method or an enzymatic method to obtain barcode-containing fragments.

[0014] Preferably, the barcode-containing fragments have lengths ranging from about 300 base pairs to N base pairs, wherein N equals the length of the original DNA template molecule.

[0015] Preferably, the method further comprises the step of purifying the barcode-containing fragments using streptavidin-coated paramagnetic beads.

[0016] Preferably, the step of purifying the barcode-containing fragments comprises dissociating the biotin-labelled molecules from the streptavidin-coated paramagnetic beads.

[0017] Preferably, where the step of purifying the barcode-containing fragments comprises dissociating the biotin-labelled molecules from the streptavidin-coated paramagnetic beads, the method further comprises the step of circularizing the purified barcode-containing fragments by intramolecular ligation.

[0018] Preferably, the method further comprises the step of ligating sequencing adaptors onto the ends of the barcode-containing fragments prior to the step of juxtaposing the barcode-containing fragments to random short segments of the original DNA template molecule.

[0019] Preferably, the specific barcode sequence is assigned by linker ligation to each template DNA molecule.

[0020] Preferably, the step of amplifying the barcode-tagged molecules is performed by circularizing the barcode-tagged molecules and performing rolling circle amplification.

[0021] Preferably, the extended sequence reads are compatible with sequencing on sequencing platforms.

[0022] In accordance with a second method, there is provided a system for obtaining extended sequence reads from template molecules of a DNA sequence, comprising:
  (i) a quality filtering module for filtering raw paired-end sequence reads from a sequencer by removing read-pairs with low quality scores, removing read-pairs with missing barcode sequences and trimming platform-specific adaptor sequences;
  (ii) a barcode analysis module for identifying highly-represented barcodes and re-assigning sequences associated with poorly-represented barcodes;
  (iii) a demultiplexing module for using barcode sequences as identifiers to obtain reads associated with individual template molecules and removing duplicate read-pairs; and
  (iv) an assembly module for assembling demultiplexed reads to obtain extended sequence reads for each template molecule.


[0023] The system in accordance with the second method wherein the DNA sequence is a known sequence, further comprising:
  (i) a sequence alignment module for performing paired-end alignment to a reference sequence and removing discordant alignments;
  (ii) a demultiplexing module for using barcode sequences as identifiers to obtain alignments to individual template molecules and removing duplicate read-pairs in place of the demultiplexing module according to the second aspect of the present invention; and
  (iii) a haplotyping module for obtaining pileup of aligned reads at each position along the reference sequence, determining consensus base-call at each position and assembling base-calls to obtain extended sequence reads for each template molecule in place of the assembly module according to the second aspect of the present invention.


[0024] In accordance with a third method, there is provided a computer-readable medium with an executable programme stored thereon, the programme comprising instructions for obtaining extended sequence reads from template molecules of a DNA sequence, wherein the programme instructs a microprocessor to perform the following steps:
  (i) filtering raw paired-end sequence reads from a sequencer by removing read-pairs with low quality scores, removing read-pairs with missing barcode sequences and trimming platform-specific adaptor sequences;
  (ii) identifying highly-represented barcodes and re-assigning sequences associated with poorly-represented barcodes;
  (iii) using barcode sequences as identifiers to obtain reads associated with individual template molecules and removing duplicate read-pairs; and
  (iv) assembling demultiplexed reads to obtain extended sequence reads for each template molecule.


[0025] The computer-readable medium in accordance with the third method wherein the DNA sequence is a known sequence, wherein the programme instructs the microprocessor to further perform the following steps:
  (i) performing paired-end alignment to a reference sequence and removing discordant alignments at the step of identifying highly-represented barcodes and re-assigning sequences associated with poorly-represented barcodes;
  (ii) replacing the step of using barcode sequences as identifiers to obtain reads associated with individual template molecules and removing duplicate read-pairs with the step of using barcode sequences as identifiers to obtain alignments to individual template molecules and removing duplicate read-pairs; and
  (iii) replacing the step of assembling demultiplexed reads with the step of obtaining pileup of aligned reads at each position along the reference sequence, determining consensus base-call at each position and assembling base-calls to obtain extended sequence reads for each template molecule.


[0026] The present disclosure provides an approach that can: 1) increase the effective read length of these commercially available sequencing platforms to several kilobases and 2) be broadly applied to obtain long sequence reads from mixed template populations.

[0027] The present disclosure applies the concept of barcoding to generate long sequence reads by providing a technical advance in juxtaposing the assigned barcode to random overlapping segments of the original template.

[0028] The present disclosure relies on assigning barcodes to individual template molecules, allowing for unambiguous assembly of template sequences even for molecules with high sequence similarity. This also means that the present invention will work for sequencing targeted genomic regions or viral genomes.

[0029] Accordingly, disclosed herein are:
  a) A method to assign unique DNA barcodes, i.e., a random string of DNA nucleotides, to individual long (>3 kilobases (kb)) template molecules.
  b) A method to juxtapose the assigned DNA barcode to random short segments of the original template molecule during the process of generating a sequencing library.
  c) A method of using the DNA barcode associated with each molecule of a sequencing library to identify the template of origin.
  d) A method of using DNA barcodes to substantially reduce the error rate of massively parallel sequencing.
  e) A method for barcode-directed assembly of short sequence reads to obtain individual template sequences.


[0030] Other aspects and advantages of the present disclosure will become apparent to those skilled in the art from a review of the ensuing description, which proceeds with reference to the following illustrative drawings of preferred methods.

BRIEF DESCRIPTION OF THE DRAWINGS



[0031] The disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1: Barcode assignment and mutation identification. (A) Each template molecule is assigned to a random barcode, which will uniquely identify each template molecule from a heterogeneous sample. Barcodes also perform another function of removing the vast majority of errors introduced by next-generation sequencing. Briefly, barcode assignment primers carrying universal sequences (green rectangle; Uni-A or Uni-B) on their 5'-end will anneal to opposite ends of template. One of the primers will also carry a barcode (blue rectangle). After two rounds of linear amplification, each template molecule will be tagged with a unique barcode. Next, barcode assignment primers are removed by exonuclease digestion, followed by up to 40 cycles of PCR using universal primers to generate whole-genome amplicons. After PCR amplification, mutations that pre-existed on the template (red cross) or errors introduced during barcode assignment cycle #1 or #2 (blue cross) will be found in all molecules associated with a particular barcode (hereafter referred to as "daughter" molecules). In contrast, errors introduced in subsequent steps of library preparation, sequencing, or base-calling (green cross) can be easily identified because they will only be present in a minority of daughter molecules. (B) To generate uniform sequence coverage across each template molecule, each barcode-tagged genome will be subjected to a series of molecular biologic steps that will ultimately ligate sequencing adaptors (red rectangles) onto the ends of overlapping fragments of the genome. The resulting sequence reads will be assigned to the template molecule identified by the barcode.

Figure 2: Methodology to generate library of overlapping fragments tagged with the same barcode. After PCR amplification with universal primers, the PCR amplicons are subjected to unidirectional deletion from the barcode-distal end to achieve a broad size distribution of fragments ranging from ∼300 bp to N bp, where N equals the length of the original template molecule. Barcode-containing fragments are purified using streptavidin-coated paramagnetic beads. Next, the biotinylated DNA is dissociated from the bead/DNA complex and circularized by intramolecular ligation. After circularization, different regions of the template will be juxtaposed to its barcode. The double-stranded DNA circles are subjected to library preparation by adapting a commonly used transposome-based method (Illumina®) to generate molecules that are compatible for sequencing on the Illumina® platform.

Figure 3: Software package for obtaining extra-long sequence reads by reference-assisted assembly.

Figure 4: Software package for obtaining extra-long sequence reads from template molecules of unknown sequence.

Figure 5: Average coverage depth per HBV genome. Average coverage depth per base position across the HBV genome. Data shown is the average coverage from 4,294 unique viral genomes.

Figure 6: Genome coverage per HBV genome. Percentage of each genome covered by at least 5 unique reads. 2,717 high coverage genomes, defined as ≥5 reads across ≥85% of the genome, were recovered.

Figure 7: Allele frequency of mixed clones. Two HBV clones with 17 known SNPs between them were mixed at (A) 1:99 or (B) 1:9 ratios prior to barcoding. BAsE-Seq was performed for each mixed-clone pool. Barcodes were removed from each read-pair prior to alignment and the resulting data was treated as a "bulk" sequencing experiment to determine overall allele frequencies at the SNP positions. The allele frequencies in both libraries were very close to the mixing ratio (indicating that pipetting error and PCR bias were negligible), and both libraries were used as controls to test the sensitivity and accuracy of our methodology.

Figure 8: Predicted SNV frequencies and background error rate. Single nucleotide variants (SNVs) in Lib_1:99 (control library containing minor haplotype present at 1% frequency) were estimated by calculating the frequency of Clone-1 genotype calls present in 2881 high coverage genome sequences. In this control library containing a minor haplotype present at 1% frequency, true-positive SNVs were significantly separated from background error (p < 0.0001).


DETAILED DESCRIPTION



[0032] The present disclosure applies the concept of barcoding to generate long sequence reads by providing a technical advance in juxtaposing the assigned barcode to random overlapping segments of the original template.

[0033] The present disclosure relies on assigning barcodes to individual template molecules, allowing for unambiguous assembly of template sequences even for molecules with high sequence similarity. This also means that the methods will work for sequencing targeted genomic regions or viral genomes.

[0034] The current maximum read length of next-generation sequencing technologies, such as those developed by Illumina® and Life Technologies™, is around 250 bases. The methods, also known as "Barcode-directed Assembly for Extra-long Sequences (BAsE-Seq)", provide an approach that can: 1) increase the effective read length of these commercially available sequencing platforms to several kilobases and 2) be broadly applied to obtain long sequence reads from mixed template populations. In brief, our method relies on assigning random DNA barcodes to long template molecules (Figure 1A), followed by a library preparation protocol that juxtaposes the assigned barcode to random short segments of the original template (Figures 1B and 2). The resulting molecules are ligated with platform-specific adaptors for next-generation sequencing. Sequence reads are de-multiplexed using the barcode sequence and used to assemble long-range haplotypes that were present on the original template. In practice, we have applied this technology to perform single virion sequencing on the Hepatitis B virus, a DNA virus with a 3.2 kb genome. In general, we anticipate that this technology can be broadly applied to generate extended sequence reads and will be useful for long-range haplotype analysis on targeted genomic regions, or for improving de novo genome and transcriptome assemblies. A detailed description of our protocol is given in the following paragraphs.

[0035] There is described a method for generating extended sequence reads of long DNA molecules (>3 kb), in a sample. The method comprises the steps of: (i) assigning a specific barcode sequence to each template DNA molecule in a sample to obtain barcode-tagged molecules; (ii) amplifying the barcode-tagged molecules; (iii) fragmenting the amplified barcode-tagged molecules to obtain barcode-containing fragments; (iv) juxtaposing the barcode-containing fragments to random short segments of the original DNA template molecule during the process of generating a sequencing library to obtain demultiplexed reads; and (v) assembling the demultiplexed reads to obtain extended sequence reads for each DNA template molecule.
  a) Barcode Assignment. In the first step, each template molecule is assigned a unique DNA barcode. In our example, two rounds of PCR amplification are performed using primers with template-specific sequence from opposite ends of the molecule (Figure 1A). This will generate uniquely tagged template molecules for preparing libraries and can be broadly applied for assigning barcodes to targeted genomic regions. Both primers contain a universal sequence on their 5'-ends and one of them contains a barcode, i.e., a string of 20 random nucleotides (encoding >10^12 sequences). To ensure that each template molecule is uniquely assigned, the template should be diluted to obtain a relatively small number of genomes (<10^9) compared to unique barcode sequences.
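The barcode arithmetic in this step can be checked with a short calculation. The following is a minimal sketch, assuming barcodes are drawn uniformly and independently from all sequences of the stated length (an idealisation, since synthesised random oligonucleotides are not perfectly uniform). The function names are illustrative and do not come from the specification.

```python
import math

def barcode_space(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct barcodes of the given length."""
    return alphabet_size ** length

def shared_barcode_probability(n_templates: int, n_barcodes: int) -> float:
    """Approximate probability that a given template shares its barcode
    with at least one other template, assuming uniform random draws:
    1 - exp(-(n - 1) / N)."""
    return -math.expm1(-(n_templates - 1) / n_barcodes)

space = barcode_space(20)  # 20 random nucleotides
print(f"barcode space: {space:,}")
for n in (20_000, 10**9):
    p = shared_barcode_probability(n, space)
    print(f"{n:>13,} templates: P(barcode shared) ~ {p:.2e}")
```

Under this model the chance of two templates receiving the same barcode stays small as long as the number of input genomes is well below the size of the barcode space, which is the reason for the dilution step.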


[0036] Subsequently, barcode-tagged molecules can be clonally amplified by PCR using universal primers and the PCR product can be used to prepare sequencing libraries. In other manifestations where the template sequence is unknown, the barcode can be assigned by ligation of double- or single-stranded DNA linkers carrying a random string of nucleotides flanked by universal sequences.

[0037] The use of unique barcodes to tag individual template molecules has been shown to greatly reduce the error rate of massively parallel sequencing. Using this strategy, mutations that pre-existed on the template and errors introduced during barcode assignment will be found in all daughter molecules.

[0038] In contrast, errors introduced in subsequent steps of library preparation, sequencing, or base-calling can be easily removed because they will only be present in a minority of daughter molecules (Figure 1A). Based on the published error rate of the DNA polymerase used in our protocol, this translates to one error in every 50 template sequences for template molecules that are 3 kb in size. Furthermore, by using barcodes as unique identifiers for individual genomes, sequences associated with each barcode can be assembled into a complete template sequence.
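The error-removal principle described above can be illustrated with a simple majority vote over the daughter reads that share one barcode. This is a sketch under stated assumptions: the reads are already aligned and of equal length, and the support threshold (more than half of the reads) is chosen for illustration and is not specified in the source.

```python
from collections import Counter

def consensus_base(bases, min_fraction=0.5):
    """Return the base supported by more than min_fraction of the daughter
    reads at one position, or 'N' if no base reaches that support."""
    if not bases:
        return "N"
    base, count = Counter(bases).most_common(1)[0]
    return base if count / len(bases) > min_fraction else "N"

def consensus_sequence(daughter_reads):
    """Column-wise consensus over equal-length daughter reads."""
    return "".join(consensus_base(col) for col in zip(*daughter_reads))

# Five toy daughter reads from one barcode. The last position carries an
# error in a single read, so it is outvoted by the other four reads.
daughters = ["ACTTGA", "ACTTGA", "ACTTGC", "ACTTGA", "ACTTGA"]
print(consensus_sequence(daughters))
```

A variant that pre-existed on the template would appear in every daughter read and therefore survive the vote, whereas a late error appears in a minority of reads and is discarded.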
b) Library Preparation. The goal of library preparation is to tag overlapping fragments of each template molecule with its assigned barcode in order to obtain uniform sequence coverage. This concept is illustrated in Figure 1B and a detailed outline of the protocol is shown in Figure 2.

[0039] Firstly, clonally amplified barcode-tagged molecules are deleted from the barcode-distal end to achieve a broad size distribution of fragments ranging from ∼300 bp to N bp, where N equals the length of the template molecule. Unidirectional deletion can be achieved by protecting the barcode-proximal end with nuclease-resistant nucleotides or a 3'-protruding overhang, and performing time-dependent digestion from the barcode-distal end using a 3' to 5' exonuclease (such as Exonuclease III), followed by treatment with an endonuclease (such as S1 Nuclease or Mung Bean Nuclease) to generate blunt-ends.

[0040] Barcode-containing fragments are purified using streptavidin-coated beads, and these biotinylated fragments will be dissociated and subjected to end repair, such that both ends of the molecules are blunt and 5'-phosphorylated. The end-repaired molecules are circularized by intramolecular ligation using a DNA ligase (such as T4 DNA ligase). Uncircularized molecules will be removed by nuclease treatment (such as a combination of Exonuclease I and Lambda Exonuclease).

[0041] After circularization, different regions from the original template will be juxtaposed to its barcode. The circularized molecules will be used as template for random fragmentation and adaptor tagging using a transposome-based method, such as the Nextera XT kit (Illumina®).
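A toy model may help to show why deletion followed by circularization places different template regions next to the barcode. The sketch below assumes that the barcode sits at position 0, that deletion endpoints are uniformly distributed between the minimum fragment length and the template length (real exonuclease digestion is time-dependent and unlikely to be uniform), and that a fixed-length segment next to the junction is what gets sequenced. All names and parameter values are illustrative.

```python
import random

def deletion_endpoints(template_len, n_molecules, min_len=300, seed=0):
    """Sample fragment lengths left after deletion from the barcode-distal end."""
    rng = random.Random(seed)
    return [rng.randint(min_len, template_len) for _ in range(n_molecules)]

def junction_segment(template, fragment_len, read_len=150):
    """Template segment that lies next to the barcode once the fragment
    template[0:fragment_len] is circularized (barcode at position 0)."""
    start = max(0, fragment_len - read_len)
    return start, template[start:fragment_len]

rng = random.Random(1)
template = "".join(rng.choice("ACGT") for _ in range(3200))  # toy 3.2 kb template
ends = deletion_endpoints(len(template), n_molecules=1000)

covered = [0] * len(template)
for k in ends:
    start, _segment = junction_segment(template, k)
    for pos in range(start, k):
        covered[pos] += 1
sampled = sum(c > 0 for c in covered)
print(f"positions sampled next to the barcode: {sampled} of {len(template)}")
```

In this model, each daughter molecule of a given barcode contributes a different junction position, so the pooled barcode-adjacent segments tile most of the template.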

[0042] Importantly, the primers used during PCR enrichment of the sequencing library will be designed such that the second sequencing read will be anchored by the barcode sequence. Thus, this PCR generates double-stranded DNA molecules that are "sequencing-ready". Finally, the PCR products are subjected to size selection before sequencing. A custom sequencing primer that anneals to the forward priming sequence is used for the second sequencing read.

[0043] There are several alternative approaches to generate a broad distribution of barcode-tagged molecules before circularization. One approach involves creating a nick at the barcode-distal end using a nicking endonuclease, nick translation towards the barcode-proximal end using DNA polymerase I, followed by treatment with endonuclease to generate a blunt end. Another approach involves performing random fragmentation using a mechanical method, such as using the Covaris instrument for focused-ultrasonication, or an enzymatic method, such as using the NEBNext dsDNA Fragmentase, followed by purification of barcode-containing fragments using streptavidin-coated paramagnetic beads.

[0044] An alternative, PCR-free approach to clonal amplification is contemplated, such as circularizing the barcoded template and performing rolling circle amplification using phi29 polymerase.

[0045] Barcodes can be assigned by linker ligation. Both linkers will contain universal sequences on their 5'-end to facilitate clonal amplification in the next step. The barcode-containing linker will also contain a unique universal sequence on its 3'-end for primer annealing during the PCR step at the end of the protocol.

[0046] Software packages for obtaining extended or extra-long sequence reads by reference-assisted assembly and for obtaining extended or extra-long sequence reads from template molecules of an unknown sequence are illustrated in Figures 3 and 4, respectively.

[0047] There is described hereinafter a system for obtaining extended sequence reads from template molecules of a DNA sequence. The system comprises (i) a quality filtering module for filtering raw paired-end sequence reads from a sequencer by removing read-pairs with low quality scores, removing read-pairs with missing barcode sequences and trimming platform-specific adaptor sequences; (ii) a barcode analysis module for identifying highly-represented barcodes and re-assigning sequences associated with poorly-represented barcodes; (iii) a demultiplexing module for using barcode sequences as identifiers to obtain reads associated with individual template molecules and removing duplicate read-pairs; and (iv) an assembly module for assembling demultiplexed reads to obtain extended sequence reads for each template molecule (Figure 4). The template molecules are long, preferably >3 kb.
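The barcode analysis and demultiplexing modules can be sketched as follows. This is an illustration only: read-pairs are represented as (barcode, read 1, read 2) tuples, "highly represented" is taken to mean at least `min_pairs` read-pairs, and a poorly represented barcode is re-assigned when it lies within one mismatch of exactly one highly represented barcode. The threshold and the re-assignment rule are assumptions, since the source does not state how re-assignment is performed.

```python
from collections import Counter, defaultdict

def hamming(a, b):
    """Number of mismatches between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

def analyse_barcodes(read_pairs, min_pairs=2, max_dist=1):
    """Map each observed barcode to a highly represented barcode, or None."""
    counts = Counter(bc for bc, _, _ in read_pairs)
    major = [bc for bc, n in counts.items() if n >= min_pairs]
    mapping = {}
    for bc in counts:
        if bc in major:
            mapping[bc] = bc
            continue
        near = [m for m in major
                if len(m) == len(bc) and hamming(m, bc) <= max_dist]
        # Re-assign only when the nearest major barcode is unambiguous.
        mapping[bc] = near[0] if len(near) == 1 else None
    return mapping

def demultiplex(read_pairs, mapping):
    """Group read-pairs by assigned barcode and drop duplicate read-pairs."""
    groups = defaultdict(set)
    for bc, r1, r2 in read_pairs:
        target = mapping.get(bc)
        if target is not None:
            groups[target].add((r1, r2))
    return {bc: sorted(pairs) for bc, pairs in groups.items()}

pairs = [
    ("AAAA", "ACGT", "TTGA"), ("AAAA", "ACGT", "TTGA"),  # duplicate read-pair
    ("AAAA", "GGCA", "TTGA"),
    ("AAAT", "CCGA", "TTGA"),  # one mismatch away from AAAA
    ("CCCC", "ACGT", "GGGA"),  # poorly represented, no close neighbour
]
groups = demultiplex(pairs, analyse_barcodes(pairs))
print(groups)
```

Each resulting group holds the unique read-pairs attributed to one template molecule and would be the input to the assembly module.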

[0048] Where the DNA sequence is a known sequence, the system further comprises (i) a sequence alignment module for performing paired-end alignment to a reference sequence and removing discordant alignments; (ii) a demultiplexing module for using barcode sequences as identifiers to obtain alignments to individual template molecules and removing duplicate read-pairs in place of the demultiplexing module shown in Figure 4; and (iii) a haplotyping module for obtaining pileup of aligned reads at each position along the reference sequence, determining consensus base-call at each position and assembling base-calls to obtain extended sequence reads for each template molecule in place of the assembly module shown in Figure 4 (Figure 3).
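The pileup and consensus steps of the haplotyping module can be sketched in a few lines. The sketch assumes ungapped alignments given as (0-based start position, read sequence) for a single barcode, and a simple most-common-base rule with a minimum depth. These representations and the `min_depth` parameter are illustrative assumptions, not the implementation used by the inventors.

```python
from collections import Counter

def pileup(alignments, ref_len):
    """Collect the bases observed at each reference position."""
    columns = [[] for _ in range(ref_len)]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_len:
                columns[pos].append(base)
    return columns

def call_consensus(columns, min_depth=1):
    """Most common base per position; 'N' where depth is below min_depth."""
    calls = []
    for col in columns:
        if len(col) < min_depth:
            calls.append("N")
        else:
            calls.append(Counter(col).most_common(1)[0][0])
    return "".join(calls)

# Toy alignments for one barcode against a 10-base reference.
alignments = [(0, "ACGTA"), (2, "GTACC"), (3, "TACCG"), (5, "CCGTT")]
columns = pileup(alignments, ref_len=10)
print(call_consensus(columns))
print(call_consensus(columns, min_depth=2))
```

Raising `min_depth` masks positions supported by too few reads, which trades completeness of the extended read for confidence in each base-call.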

[0049] There is also disclosed a computer-readable medium with an executable programme stored thereon, the programme comprising instructions for obtaining extended sequence reads from template molecules of a DNA sequence, wherein the programme instructs a microprocessor to perform the following steps of (i) filtering raw paired-end sequence reads from a sequencer by removing read-pairs with low quality scores, removing read-pairs with missing barcode sequences and trimming platform-specific adaptor sequences; (ii) identifying highly-represented barcodes and re-assigning sequences associated with poorly-represented barcodes; (iii) using barcode sequences as identifiers to obtain reads associated with individual template molecules and removing duplicate read-pairs; and (iv) assembling demultiplexed reads to obtain extended sequence reads for each template molecule.
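Step (i), the filtering of raw read-pairs, might look like the sketch below. It assumes that quality scores are available as lists of Phred integers, that the barcode is expected in read 2 directly after a known universal sequence, and that adaptors are trimmed by exact string match. The read layout, the quality threshold and the function names are hypothetical. The 15 bp minimum length mirrors the cut-off mentioned in the Examples.

```python
def mean_quality(quals):
    """Mean Phred score of a read; 0.0 for an empty read."""
    return sum(quals) / len(quals) if quals else 0.0

def extract_barcode(read2, universal, barcode_len):
    """Return (barcode, remainder) if read2 starts with the universal
    sequence followed by a full-length barcode, else None."""
    if not read2.startswith(universal):
        return None
    start = len(universal)
    barcode = read2[start:start + barcode_len]
    if len(barcode) < barcode_len or "N" in barcode:
        return None
    return barcode, read2[start + barcode_len:]

def trim_adaptor(seq, adaptor):
    """Cut the read at the first exact occurrence of the adaptor."""
    idx = seq.find(adaptor)
    return seq if idx == -1 else seq[:idx]

def filter_pair(read1, qual1, read2, qual2, universal, barcode_len,
                adaptor, min_mean_q=20, min_len=15):
    """Return (barcode, trimmed_read1, trimmed_read2), or None if rejected."""
    if min(mean_quality(qual1), mean_quality(qual2)) < min_mean_q:
        return None  # low-quality read-pair
    found = extract_barcode(read2, universal, barcode_len)
    if found is None:
        return None  # barcode missing or not in the expected position
    barcode, rest2 = found
    r1 = trim_adaptor(read1, adaptor)
    r2 = trim_adaptor(rest2, adaptor)
    if len(r1) < min_len or len(r2) < min_len:
        return None  # too short after trimming
    return barcode, r1, r2
```

A read-pair that passes returns its barcode together with the trimmed reads, so the barcode can be used as the identifier in the later demultiplexing step.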

[0050] Where the DNA sequence is a known sequence, the programme instructs the microprocessor to further perform the following steps of (i) performing paired-end alignment to a reference sequence and removing discordant alignments at the step of identifying highly-represented barcodes and re-assigning sequences associated with poorly-represented barcodes; (ii) replacing the step of using barcode sequences as identifiers to obtain reads associated with individual template molecules and removing duplicate read-pairs described above with the step of using barcode sequences as identifiers to obtain alignments to individual template molecules and removing duplicate read-pairs; and (iii) replacing the step of assembling demultiplexed reads described above with the step of obtaining pileup of aligned reads at each position along the reference sequence, determining consensus base-call at each position and assembling base-calls to obtain extended sequence reads for each template molecule.

EXAMPLES



[0051] Hepatitis B virus (HBV), which contains a 3.2 kb dsDNA genome, was used as a template for methodology development and generating proof-of-concept data. The results presented below demonstrate the use of BAsE-Seq to obtain long (∼3.2 kb) sequence reads from individual template molecules, thereby achieving single virion sequencing of HBV.

[0052] HBV DNA was isolated from a chronically infected patient, PCR-amplified to obtain full-length viral genomes, and cloned into a TOPO pCR2.1 vector (Life Technologies™). Sanger sequencing was performed across each clone to obtain full-length sequences, and two clones (Clone-1 and Clone-2) with 17 single nucleotide polymorphisms (SNPs) between them were used as input for barcode assignment. In the results presented hereafter, barcode-tagged whole-genome amplicons from 20,000 template molecules (HBV genomes) were used as input for library preparation using the BAsE-Seq protocol described above.

[0053] Summary statistics from a typical single virion sequencing experiment of HBV are shown in Table 1, and coverage data per template molecule are illustrated in Figures 5 and 6. In this library, 18,143,186 read-pairs were obtained from the MiSeq sequencer (Illumina®), of which 12,004,237 read-pairs contained the barcode in the expected orientation. After trimming for adaptor, barcode tag and universal sequences, and removing reads shorter than 15 bp, 7,336,915 pass-filter read-pairs were used for alignment to an HBV reference genome. Of these read-pairs, 97% aligned concordantly and were distributed across 4,294 individual template molecules, 2,717 of which were identified as "high coverage" and were used for constructing long reads.
Table 1. Summary statistics from a BAsE-Seq run of HBV.

                                        MiSeq run (2 x 150 bp)
Genomes as input                        20,000
Sequencing read-pairs                   18,143,186
Barcode-associated read-pairs¹          12,004,237 (66%)
Pass-filter read-pairs²                 7,366,915
Concordantly aligned to HBV genome      7,151,142 (97% of pass-filter)
Unique barcodes observed³               4,294
High coverage HBV genomes⁴              2,717

¹ Read-pairs that contain the barcode in the expected orientation
² Read-pairs that are ≥15 bp after removal of adaptor and universal sequences
³ Barcodes associated with at least 50 read-pairs
⁴ ≥5 unique reads per base position across ≥85% of the genome
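The "high coverage" criterion stated in the table footnotes (≥5 unique reads per base position across ≥85% of the genome) can be expressed as a short check. The sketch below is illustrative; the function name and the representation of coverage as a list of per-position depths are assumptions.

```python
def is_high_coverage(depths, min_depth=5, min_fraction=0.85):
    """Return True if enough positions of one template molecule are covered.

    depths: per-position count of unique reads for a single barcode.
    """
    if not depths:
        return False
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths) >= min_fraction
```

For a 3.2 kb genome, this corresponds to requiring at least 5 unique reads at 2,720 or more of 3,200 positions.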


[0054] To test the sensitivity and accuracy of our methodology in generating long sequence reads, Clone-1 and Clone-2 were mixed at different ratios to generate a mixed template population where Clone-1 is present at approximately 1% or 10% frequency in the sample. BAsE-Seq was performed on each mixed-template pool. First, barcodes were removed from each read-pair prior to alignment and the resulting data were treated as a "bulk" sequencing experiment to determine overall allele frequencies at the SNP positions. The minor allele frequencies in both libraries were close to the mixing ratio (0.98% for the "1% pool", Lib_1:99, and 13.44% for the "10% pool", Lib_1:9), indicating that the mixed template pool was generated correctly and PCR bias was negligible (Table 2 and Figure 7). Subsequently, the "bulk" sequence data were de-multiplexed using barcode sequences and sequence reads from individual template molecules were analyzed to obtain ∼3.2 kb reads. Using the long sequence reads, 17-SNP haplotypes were generated for each template molecule. In Lib_1:9, 240 molecules carried a Clone-1 haplotype and 1,639 molecules carried a Clone-2 haplotype, corresponding to a 12.77% minor haplotype frequency. In Lib_1:99, 20 molecules carried a Clone-1 haplotype and 1,912 molecules carried a Clone-2 haplotype, corresponding to a 1.04% minor haplotype frequency. Importantly, chimeric sequences where Clone-1 and Clone-2 SNPs were found on the same molecule were present at ≤0.1% frequency. Furthermore, the use of barcodes to correct for sequencing errors resulted in a very low error rate for BAsE-Seq, allowing for significant separation of true sequence variants from background noise in Lib_1:99 (Table 2 and Figure 8).
Table 2. Detection of low frequency haplotypes by BAsE-Seq

                                                        Lib_1:99            Lib_1:9
Mixing ratio (Clone-1 vs. Clone-2)                      1:99                10:90
Expected minor clone frequency¹                         0.98%               13.44%
Observed minor clone haplotypes (Clone-1/Clone-2)²      1.04% (20/1912)     12.77% (240/1639)
Chimeric haplotypes                                     0.10% (2/1912)      0.06% (1/1639)

¹ Based on average allele frequency of Clone-1 SNPs from "bulk" sequencing analysis
² Based on 17-SNP haplotypes observed in the data from individual template molecules
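The haplotype frequencies reported above follow from the molecule counts, with the minor frequency taken over the sum of Clone-1 and Clone-2 molecules. The sketch below reproduces that arithmetic and shows one possible way of classifying a 17-SNP haplotype. Representing a haplotype as a string of alleles at the SNP positions, and the function names, are assumptions made for illustration.

```python
def classify_haplotype(haplotype, clone1, clone2):
    """Classify a SNP haplotype as 'clone1', 'clone2', 'chimeric' or 'other'.

    All three arguments are equal-length strings of alleles at the SNP sites.
    """
    if haplotype == clone1:
        return "clone1"
    if haplotype == clone2:
        return "clone2"
    # A chimera carries only parental alleles, but from both clones.
    if all(h in (a, b) for h, a, b in zip(haplotype, clone1, clone2)):
        return "chimeric"
    return "other"


def minor_frequency(n_minor, n_major):
    """Minor haplotype frequency, in percent of all assigned molecules."""
    return 100.0 * n_minor / (n_minor + n_major)


print(round(minor_frequency(240, 1639), 2))  # Lib_1:9  -> 12.77
print(round(minor_frequency(20, 1912), 2))   # Lib_1:99 -> 1.04
```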


[0055] Those skilled in the art will appreciate that the invention described herein is susceptible to variations and modifications other than those specifically described. The invention includes all such variations and modifications. The invention also includes all of the steps, features, formulations and compounds referred to or indicated in the specification, individually or collectively, and any and all combinations of any two or more of the steps or features.

[0056] The present disclosure is not to be limited in scope by any of the specific methods described herein. These methods are intended for the purpose of exemplification only. Functionally equivalent products, formulations and methods are clearly within the scope of the disclosure as described herein.

[0057] The method described herein may include one or more ranges of values (e.g. size, concentration etc.). A range of values will be understood to include all values within the range, including the values defining the range, and values adjacent to the range which lead to the same or substantially the same outcome as the values immediately adjacent to that value which defines the boundary to the range.

[0058] Throughout this specification, unless the context requires otherwise, the word "comprise" or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers. It is also noted that in this disclosure and particularly in the claims and/or paragraphs, terms such as "comprises", "comprised", "comprising" and the like can have the meaning attributed to them in U.S. Patent law; e.g., they can mean "includes", "included", "including", and the like; and that terms such as "consisting essentially of" and "consists essentially of" have the meaning ascribed to them in U.S. Patent law, e.g., they allow for elements not explicitly recited, but exclude elements that are found in the prior art or that affect a basic or novel characteristic of the disclosure.

[0059] Other definitions for selected terms used herein may be found within the detailed description of the invention and apply throughout. Unless otherwise defined, all other scientific and technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention belongs.


Claims

1. A method for generating extended sequence reads of long DNA molecules in a sample, comprising the steps of:

(i) assigning a specific barcode sequence to each template DNA molecule in a sample wherein barcode-tagged molecules are obtained by PCR amplification using primers comprising template-specific sequence from opposite ends of the template molecule and a universal sequence at the 5'-end, wherein one primer additionally comprises a barcode;

(ii) clonally amplifying the barcode-tagged molecules;

(iii) labelling the amplified barcode-tagged molecules with biotin;

(iv) fragmenting the amplified barcode-tagged molecules to obtain barcode-tagged molecules by subjecting the amplified barcode-tagged molecules to unidirectional deletion from the barcode-distal end;

(v) purifying the barcode-containing fragments using streptavidin-coated paramagnetic beads;

(vi) dissociating the biotin-labelled molecules from the streptavidin-coated paramagnetic beads;

(vii) juxtaposing the barcode-containing fragments to random short segments of the original DNA template molecule and circularizing the barcode-containing fragments by intramolecular ligation, thereby generating a sequencing library of overlapping fragments of each template molecule with its assigned barcode; and

(viii) obtaining demultiplexed reads from the sequencing library;

(ix) assembling the demultiplexed reads to obtain extended sequence reads for each DNA template molecule.


 
2. The method according to claim 1, wherein the barcode-containing fragments have lengths ranging from about 300 base pairs to N base pairs, wherein N equals the length of the DNA template molecule.
 
3. The method of any preceding claim, further comprising the step of ligating sequencing adaptors onto the ends of the barcode-containing fragments prior to the step of juxtaposing the barcode-containing fragments to random short segments of the original DNA template molecule.
 
4. The method of any preceding claim, wherein the step of amplifying the barcode-tagged molecules is by circularizing the barcode-tagged molecules and performing rolling circle amplification.
 
5. The method of any preceding claim, wherein the extended sequence reads are compatible for sequencing on sequencing platforms.
 


 


 




Cited references

REFERENCES CITED IN THE DESCRIPTION



This list of references cited by the applicant is for the reader's convenience only. It does not form part of the European patent document. Even though great care has been taken in compiling the references, errors or omissions cannot be excluded and the EPO disclaims all liability in this regard.

Patent documents cited in the description