Canu Quick Start
Canu specializes in assembling PacBio or Oxford Nanopore sequences. Canu will correct the reads, trim suspicious regions (such as remaining SMRTbell adapter), and then assemble the corrected and cleaned reads into contigs and unitigs.
For eukaryotic genomes, coverage above 20x is enough to outperform current hybrid methods, but 30x to 60x coverage is the recommended minimum. More coverage lets Canu use longer reads for assembly, which results in better assemblies.
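To see where a dataset falls relative to these coverage thresholds, divide the total number of sequenced bases by the expected genome size. The helper below is a minimal sketch of that arithmetic, not part of Canu: `estimate_coverage` is a hypothetical name, and it assumes a standard 4-line-per-record FASTQ file (no wrapped sequence lines), optionally gzip-compressed.

```shell
# estimate_coverage READS GENOME_SIZE_BP
# Hypothetical helper (not a Canu command): prints total bases / genome size.
# Assumes 4-line FASTQ records, so the sequence is line 2 of every 4 lines.
estimate_coverage() {
  # $1 = FASTQ file (plain or .gz), $2 = genome size in bp
  case $1 in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac |
    awk -v g="$2" 'NR % 4 == 2 { bases += length($0) }
                   END { printf "%.1f\n", bases / g }'
}

# Example (file name and genome size taken from the E. coli example in this guide):
#   estimate_coverage p6.25x.fastq 4800000
```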
Input sequences can be FASTA or FASTQ format, uncompressed or compressed with gzip (.gz), bzip2 (.bz2) or xz (.xz). Zip files (.zip) are not supported.
Canu will auto-detect your resources and scale itself to fit, using all of the resources available (depending on the size of your assembly). You can limit memory and processors used with parameters maxMemory and maxThreads.
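As an illustration of capping resources, the sketch below passes `maxMemory` and `maxThreads` on the command line in the same `key=value` form used for `genomeSize`. The wrapper name `run_canu_limited` and the `CANU` override variable are conventions of this example only, and the memory value is assumed to be interpreted in gigabytes.

```shell
# run_canu_limited READS MAX_MEMORY_GB MAX_THREADS
# Hypothetical wrapper around the canu command line shown in this guide.
# Setting CANU=echo prints the command instead of running it (example-only convention).
run_canu_limited() {
  # $1 = reads file, $2 = memory cap (assumed gigabytes), $3 = thread cap
  "${CANU:-canu}" \
    -p ecoli -d ecoli-auto \
    genomeSize=4.8m \
    maxMemory="$2" maxThreads="$3" \
    -pacbio-raw "$1"
}

# Example: limit the run to 16 GB of memory and 8 threads.
#   run_canu_limited p6.25x.fastq 16 8
```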
Canu will take full advantage of any LSF/PBS/PBSPro/Torque/Slurm/SGE grid available, and do so automagically, even submitting itself for execution. For details, refer to the section on Execution Configuration.
Assembling PacBio data
We made a 25X subset FASTQ available here (223MB), which can be downloaded with:
curl -L -o p6.25x.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq
Correct, Trim and Assemble
By default, Canu will correct the reads, then trim the reads, then assemble the reads into unitigs.
canu \
  -p ecoli -d ecoli-auto \
  genomeSize=4.8m \
  -pacbio-raw p6.25x.fastq
This will use the prefix ‘ecoli’ to name files, compute the correction task in directory ‘ecoli-auto/correction’, the trimming task in directory ‘ecoli-auto/trimming’, and the unitig construction stage in ‘ecoli-auto’ itself. Output files are described in the next section.
Find the Output
The canu progress chatter records statistics such as an input read histogram, corrected read histogram, and overlap types. Outputs from the assembly tasks are in:
- Corrected reads: the sequences after correction, trimmed and split based on consensus evidence. Identity is typically >99% for PacBio and >98% for Nanopore, but it can vary with the quality of your input sequencing.
- Trimmed reads: the sequences after correction and final trimming. The corrected sequences above are overlapped again to identify any missed hairpin adapters or bad sequence that could not be detected in the raw sequences.
- The layout provides information on where each read ended up in the final assembly, including contig and positions. It also includes the consensus sequence for each contig.
- The GFA is the assembly graph generated by Canu. Currently this includes the contigs, associated bubbles, and any overlaps which were not used by the assembly.
The fasta output is split into three types:
Contigs: everything which could be assembled and is part of the primary assembly, including both unique and repetitive elements. Each contig has several flags included on the fasta def line:
>tig######## len=<integer> reads=<integer> covStat=<float> gappedBases=<yes|no> class=<contig|bubble|unassm> suggestRepeat=<yes|no> suggestCircular=<yes|no>
- len: Length of the sequence, in bp.
- reads: Number of reads used to form the contig.
- covStat: The log of the ratio of the contig being unique versus being two-copy, based on the read arrival rate. Positive values indicate more likely to be unique, while negative values indicate more likely to be repetitive. See Footnote 24 in Myers et al., A Whole-Genome Assembly of Drosophila.
- gappedBases: If yes, the sequence includes all gaps in the multialignment.
- class: Type of sequence. Unassembled sequences are primarily low-coverage sequences spanned by a single read.
- suggestRepeat: If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.
- suggestCircular: If yes, sequence is likely circular. Not implemented.
- Bubbles: alternate paths in the graph which could not be merged into the primary assembly.
- Unassembled: reads which could not be incorporated into the primary or bubble assemblies.
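Because the def-line flags are plain `key=value` pairs, they are easy to tabulate. The following is a small sketch, assuming the def-line layout shown above; `list_tigs` is a hypothetical helper name and `contigs.fasta` in the usage line is a placeholder for whichever fasta output you want to inspect.

```shell
# list_tigs FASTA
# Hypothetical helper: prints name, len, reads and suggestRepeat
# (tab-separated) for every def line in a Canu fasta output.
list_tigs() {
  awk '/^>/ {
         name = substr($1, 2)            # drop the leading ">"
         len = reads = rep = "NA"        # defaults if a flag is missing
         for (i = 2; i <= NF; i++) {
           split($i, kv, "=")
           if (kv[1] == "len") len = kv[2]
           else if (kv[1] == "reads") reads = kv[2]
           else if (kv[1] == "suggestRepeat") rep = kv[2]
         }
         printf "%s\t%s\t%s\t%s\n", name, len, reads, rep
       }' "$1"
}

# Example (placeholder file name):
#   list_tigs contigs.fasta | sort -k2,2nr | head
```

Sorting on the second column, as in the usage line, lists the longest sequences first.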
Correct, Trim and Assemble, Manually
Sometimes, however, it makes sense to do the three top-level tasks by hand. This would allow trying multiple unitig construction parameters on the same set of corrected and trimmed reads.
First, correct the raw reads:
canu -correct \
  -p ecoli -d ecoli \
  genomeSize=4.8m \
  -pacbio-raw p6.25x.fastq
Then, trim the output of the correction:
canu -trim \
  -p ecoli -d ecoli \
  genomeSize=4.8m \
  -pacbio-corrected ecoli/correction/ecoli.correctedReads.fasta.gz
And finally, assemble the output of trimming, twice:
canu -assemble \
  -p ecoli -d ecoli-erate-0.039 \
  genomeSize=4.8m \
  correctedErrorRate=0.039 \
  -pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz

canu -assemble \
  -p ecoli -d ecoli-erate-0.075 \
  genomeSize=4.8m \
  correctedErrorRate=0.075 \
  -pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz
The directory layout for correction and trimming is exactly the same as when we ran all tasks in the same command. Each unitig construction task needs its own private work space, and in there the ‘correction’ and ‘trimming’ directories are empty. The error rate always specifies the error in the corrected reads, which is typically <1% for PacBio data and <2% for Nanopore data (<1% on the newest chemistries).
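When trying more than two error rates, the repeated assemble step can be written as a loop that gives each run its own work directory. This is only a sketch built from the commands above: `assemble_at_rates` is a hypothetical function name, the directory naming scheme is a choice made for this example, and the `CANU` override variable exists only so the commands can be previewed with `CANU=echo`.

```shell
# assemble_at_rates TRIMMED_READS RATE [RATE...]
# Hypothetical helper: runs one "canu -assemble" per error rate,
# each in its own directory named after the rate.
assemble_at_rates() {
  reads=$1
  shift
  for rate in "$@"; do
    "${CANU:-canu}" -assemble \
      -p ecoli -d "ecoli-erate-$rate" \
      genomeSize=4.8m \
      correctedErrorRate="$rate" \
      -pacbio-corrected "$reads"
  done
}

# Example:
#   assemble_at_rates ecoli/trimming/ecoli.trimmedReads.fasta.gz 0.039 0.075
```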
Assembling Oxford Nanopore data
Download an example dataset with the following curl command:
curl -L -o oxford.fasta http://nanopore.s3.climb.ac.uk/MAP006-PCR-1_2D_pass.fasta
Canu assembles any of the four available datasets into a single contig but we picked one dataset to use in this tutorial. Then, assemble the data as before:
canu \
  -p ecoli -d ecoli-oxford \
  genomeSize=4.8m \
  -nanopore-raw oxford.fasta
The assembled identity is >99% before polishing.
Assembling With Multiple Technologies/Files
Canu takes an arbitrary number of input files/formats. We made a mixed dataset of about 10X of a PacBio P6 and 10X of an Oxford Nanopore run available here
or use the following curl command:
curl -L -o mix.tar.gz http://gembox.cbcb.umd.edu/mhap/raw/ecoliP6Oxford.tar.gz
tar xvzf mix.tar.gz
Now you can assemble all the data:
canu \
  -p ecoli -d ecoli-mix \
  genomeSize=4.8m \
  -pacbio-raw pacbio*fastq.gz \
  -nanopore-raw oxford.fasta.gz
Assembling Low Coverage Datasets
When you have 30X or less coverage, it helps to adjust the Canu assembly parameters. Typically, assembling 20X of single-molecule data outperforms hybrid methods with higher coverage. You can download a 20X subset of S. cerevisiae
or use the following curl command:
curl -L -o yeast.20x.fastq.gz http://gembox.cbcb.umd.edu/mhap/raw/yeast_filtered.20x.fastq.gz
and run the assembler adding sensitive parameters (correctedErrorRate=0.105):
canu \
  -p asm -d yeast \
  genomeSize=12.1m \
  correctedErrorRate=0.105 \
  -pacbio-raw yeast.20x.fastq.gz
After the run completes, we can check the assembly statistics:
tgStoreDump -sizes -s 12100000 -T yeast/unitigging/asm.ctgStore 2 -G yeast/unitigging/asm.gkpStore
lenSuggestRepeat sum 160297 (genomeSize 12100000)
lenSuggestRepeat num 12
lenSuggestRepeat ave 13358
lenUnassembled ng10 13491 bp lg10 77 sum 1214310 bp
lenUnassembled ng20 11230 bp lg20 176 sum 2424556 bp
lenUnassembled ng30 9960 bp lg30 290 sum 3632411 bp
lenUnassembled ng40 8986 bp lg40 418 sum 4841978 bp
lenUnassembled ng50 8018 bp lg50 561 sum 6054460 bp
lenUnassembled ng60 7040 bp lg60 723 sum 7266816 bp
lenUnassembled ng70 6169 bp lg70 906 sum 8474192 bp
lenUnassembled ng80 5479 bp lg80 1114 sum 9684981 bp
lenUnassembled ng90 4787 bp lg90 1348 sum 10890099 bp
lenUnassembled ng100 4043 bp lg100 1624 sum 12103239 bp
lenUnassembled ng110 3323 bp lg110 1952 sum 13310167 bp
lenUnassembled ng120 2499 bp lg120 2370 sum 14520362 bp
lenUnassembled ng130 1435 bp lg130 2997 sum 15731198 bp
lenUnassembled sum 16139888 (genomeSize 12100000)
lenUnassembled num 3332
lenUnassembled ave 4843
lenContig ng10 770772 bp lg10 2 sum 1566457 bp
lenContig ng20 710140 bp lg20 4 sum 3000257 bp
lenContig ng30 669248 bp lg30 5 sum 3669505 bp
lenContig ng40 604859 bp lg40 7 sum 4884914 bp
lenContig ng50 552911 bp lg50 10 sum 6571204 bp
lenContig ng60 390415 bp lg60 12 sum 7407061 bp
lenContig ng70 236725 bp lg70 16 sum 8521520 bp
lenContig ng80 142854 bp lg80 23 sum 9768299 bp
lenContig ng90 94308 bp lg90 33 sum 10927790 bp
lenContig sum 12059140 (genomeSize 12100000)
lenContig num 56
lenContig ave 215341
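The ngX/lgX columns can be read as follows: sort the sequences from longest to shortest and accumulate their lengths until the running sum reaches X% of the genome size; ngX is the length of the sequence at which that happens and lgX is how many sequences were needed. The sketch below reproduces that calculation from a plain list of lengths. It is an assumption about how the numbers are derived, inferred from the listing above rather than taken from the tgStoreDump source, and `ngx` is a hypothetical helper name.

```shell
# ngx LENGTHS_FILE GENOME_SIZE [X]
# Hypothetical helper: LENGTHS_FILE holds one sequence length per line.
# Prints "ngX <length> lgX <count>" (X defaults to 50), or nothing if the
# sequences never add up to X% of the genome size.
ngx() {
  sort -rn "$1" |
    awk -v g="$2" -v x="${3:-50}" '
      { sum += $1; n++ }                      # running total, longest first
      !hit && sum >= g * x / 100 {
        printf "ng%d %d lg%d %d\n", x, $1, x, n
        hit = 1                               # report only the first crossing
      }'
}

# Example (placeholder file name):
#   ngx contig_lengths.txt 12100000 50
```

Under this reading, a missing row is expected whenever the total length is below X% of the genome size, which matches the lenContig rows above stopping at ng90 with a total of 12059140 against a genome size of 12100000.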
Although Canu corrects the reads and its assemblies have 99% identity or greater for both PacBio and Nanopore data, for the best accuracy we recommend polishing with a sequence-specific tool: Quiver for PacBio data and Nanopolish for Oxford Nanopore data.
If you have Illumina sequences available, Pilon can also be used to polish either PacBio or Oxford Nanopore assemblies.