Canu Quick Start

Canu specializes in assembling PacBio or Oxford Nanopore sequences. Canu will correct the reads, trim suspicious regions (such as remaining SMRTbell adapter), and then assemble the corrected and cleaned reads into contigs and unitigs.

For eukaryotic genomes, more than 20x coverage is enough to outperform current hybrid methods, but between 30x and 60x coverage is the recommended minimum. More coverage lets Canu use longer reads for assembly, which results in better assemblies.

Input sequences can be FASTA or FASTQ format, uncompressed or compressed with gzip (.gz), bzip2 (.bz2) or xz (.xz). Zip files (.zip) are not supported.
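As a quick pre-flight check, the list of supported compression formats above can be turned into a small shell helper. This is only a sketch: the function name input_compression is hypothetical (it is not part of Canu), and it looks at file names only, not at file contents.

```shell
# Hypothetical helper (not part of Canu): classify an input file by its
# compression suffix, following the supported formats listed above.
input_compression() {
  case "$1" in
    *.gz)  echo gzip ;;
    *.bz2) echo bzip2 ;;
    *.xz)  echo xz ;;
    *.zip) echo unsupported ;;   # zip archives are not accepted
    *)     echo none ;;          # treated as uncompressed FASTA/FASTQ
  esac
}

# Example file names are placeholders.
for f in reads.fastq reads.fastq.gz reads.fasta.bz2 reads.fasta.xz reads.zip; do
  printf '%s\t%s\n' "$f" "$(input_compression "$f")"
done
```

A file reported as unsupported would need to be unpacked, or recompressed with one of the supported tools, before being passed to Canu.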

Canu will auto-detect your resources and scale itself to fit, using all of the resources available (depending on the size of your assembly). You can limit memory and processors used with parameters maxMemory and maxThreads.

Canu will take full advantage of any LSF/PBS/PBSPro/Torque/Slurm/SGE grid available, and do so automagically, even submitting itself for execution. For details, refer to the section on Execution Configuration.

Assembling PacBio data

Pacific Biosciences released P6-C4 chemistry reads for Escherichia coli K12. You can download them here, but note that you must have the SMRTpipe software installed to extract the reads as FASTQ.

We made a 25X subset FASTQ available here (223MB), which can be downloaded with:

curl -L -o p6.25x.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq
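Before assembling, it can be useful to confirm that the download is roughly the expected coverage. The sketch below estimates coverage as total bases divided by genome size. The helper name estimate_coverage is hypothetical, and it assumes a plain FASTQ with exactly four lines per record and the 4.8 Mbp genome size used later in this tutorial.

```shell
# Hypothetical helper: estimate coverage as (total bases) / (genome size).
# Assumes an uncompressed FASTQ with four lines per record, so the sequence
# is line 2 of every group of 4.
estimate_coverage() {
  # $1 = FASTQ file, $2 = genome size in bp
  awk -v g="$2" 'NR % 4 == 2 { bases += length($0) }
                 END { printf "%.1f\n", bases / g }' "$1"
}

# Only run on the tutorial file if it has actually been downloaded.
if [ -f p6.25x.fastq ]; then
  estimate_coverage p6.25x.fastq 4800000
fi
```

For a compressed file, the same awk program could read from a decompressor's output instead of a file name.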

Correct, Trim and Assemble

By default, Canu will correct the reads, then trim them, then assemble them into unitigs.

canu \
 -p ecoli -d ecoli-auto \
 genomeSize=4.8m \
 -pacbio-raw p6.25x.fastq

This will use the prefix ‘ecoli’ to name files, compute the correction task in directory ‘ecoli-auto/correction’, the trimming task in directory ‘ecoli-auto/trimming’, and the unitig construction stage in ‘ecoli-auto’ itself. Output files are described in the next section.
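The maxMemory and maxThreads parameters mentioned earlier can be added to the same command when you want to cap resource use. The sketch below only builds and prints the command so it can be inspected first. The limit values and the ‘ecoli-limited’ directory name are illustrative assumptions, and maxMemory is assumed here to be given in gigabytes.

```shell
# Illustrative limits only; choose values that fit your machine.
# Assumption: maxMemory is expressed in gigabytes.
MAX_MEMORY_GB=16
MAX_THREADS=8

# Build the command as a string so it can be reviewed before running.
# The directory name "ecoli-limited" is a placeholder.
CANU_CMD="canu -p ecoli -d ecoli-limited genomeSize=4.8m maxMemory=${MAX_MEMORY_GB} maxThreads=${MAX_THREADS} -pacbio-raw p6.25x.fastq"

echo "$CANU_CMD"
```

Once the printed command looks right, it can be pasted into the shell to start the run.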

Find the Output

The canu progress chatter records statistics such as an input read histogram, corrected read histogram, and overlap types. Outputs from the assembly tasks are in:

ecoli*/ecoli.correctedReads.fasta.gz
The sequences after correction, trimmed and split based on consensus evidence. Identity is typically >99% for PacBio and >98% for Nanopore, but it can vary based on your input sequencing quality.
ecoli*/ecoli.trimmedReads.fasta.gz
The sequences after correction and final trimming. The corrected sequences above are overlapped again to identify any missed hairpin adapters or bad sequence that could not be detected in the raw sequences.
ecoli*/ecoli.layout
The layout provides information on where each read ended up in the final assembly, including contig and positions. It also includes the consensus sequence for each contig.
ecoli*/ecoli.gfa
The GFA is the assembly graph generated by Canu. Currently this includes the contigs, associated bubbles, and any overlaps which were not used by the assembly.

The fasta output is split into three types:

ecoli*/ecoli.contigs.fasta

Everything which could be assembled and is part of the primary assembly, including both unique and repetitive elements. Each contig has several flags included on the fasta def line:

>tig######## len=<integer> reads=<integer> covStat=<float> gappedBases=<yes|no> class=<contig|bubble|unassm> suggestRepeat=<yes|no> suggestCircular=<yes|no>
len
Length of the sequence, in bp.
reads
Number of reads used to form the contig.
covStat
The log of the ratio of the contig being unique versus being two-copy, based on the read arrival rate. Positive values indicate more likely to be unique, while negative values indicate more likely to be repetitive. See Footnote 24 in Myers et al., A Whole-Genome Assembly of Drosophila.
gappedBases
If yes, the sequence includes all gaps in the multialignment.
class
Type of sequence. Unassembled sequences are primarily low-coverage sequences spanned by a single read.
suggestRepeat
If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.
suggestCircular
If yes, sequence is likely circular. Not implemented.
ecoli*/ecoli.bubbles.fasta
Alternate paths in the graph which could not be merged into the primary assembly.
ecoli*/ecoli.unassembled.fasta
Reads which could not be incorporated into the primary or bubble assemblies.
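
Because the flags on the fasta def line are plain key=value pairs, they are easy to pull out with standard tools. The sketch below prints the name, length, and suggestRepeat flag of each sequence. The helper name list_contig_flags is hypothetical, and it assumes the def line follows the layout shown above, with fields separated by spaces.

```shell
# Hypothetical helper: print "name <TAB> len <TAB> suggestRepeat" for every
# sequence in a FASTA file, reading the key=value flags on each def line.
list_contig_flags() {
  # $1 = path to a contigs, bubbles or unassembled FASTA file
  awk '/^>/ {
         name = substr($1, 2); len = ""; rep = ""
         for (i = 2; i <= NF; i++) {
           split($i, kv, "=")
           if (kv[1] == "len")           len = kv[2]
           if (kv[1] == "suggestRepeat") rep = kv[2]
         }
         printf "%s\t%s\t%s\n", name, len, rep
       }' "$1"
}
```

Call it with the path to one of the fasta outputs listed above; piping the result through a filter on the third column would keep only the sequences flagged as repeats.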

Correct, Trim and Assemble, Manually

Sometimes, however, it makes sense to do the three top-level tasks by hand. This would allow trying multiple unitig construction parameters on the same set of corrected and trimmed reads.

First, correct the raw reads:

canu -correct \
  -p ecoli -d ecoli \
  genomeSize=4.8m \
  -pacbio-raw  p6.25x.fastq

Then, trim the output of the correction:

canu -trim \
  -p ecoli -d ecoli \
  genomeSize=4.8m \
  -pacbio-corrected ecoli/correction/ecoli.correctedReads.fasta.gz

And finally, assemble the output of trimming, twice:

canu -assemble \
  -p ecoli -d ecoli-erate-0.039 \
  genomeSize=4.8m \
  correctedErrorRate=0.039 \
  -pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz

canu -assemble \
  -p ecoli -d ecoli-erate-0.075 \
  genomeSize=4.8m \
  correctedErrorRate=0.075 \
  -pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz

The directory layout for correction and trimming is exactly the same as when we ran all tasks in the same command. Each unitig construction task needs its own private workspace, and in there the ‘correction’ and ‘trimming’ directories are empty. The error rate always specifies the error in the corrected reads, which is typically <1% for PacBio data and <2% for Nanopore data (<1% on newest chemistries).
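
When trying more than two settings, the repeated assemble step can be generated in a loop. The sketch below is a dry run that only prints one command per error rate. The helper name assemble_cmds and the choice to name each directory after its error rate are illustrative assumptions; the rates are the two used above.

```shell
# Dry run: print one "canu -assemble" command per error rate.
# The helper name and the per-rate directory naming are assumptions made
# for illustration; each assembly gets its own private directory.
assemble_cmds() {
  for rate in 0.039 0.075; do
    printf 'canu -assemble -p ecoli -d ecoli-erate-%s genomeSize=4.8m correctedErrorRate=%s -pacbio-corrected ecoli/trimming/ecoli.trimmedReads.fasta.gz\n' "$rate" "$rate"
  done
}

assemble_cmds
```

Each printed line can be run as is, and all of them reuse the same trimmed reads, so correction and trimming are not repeated.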

Assembling Oxford Nanopore data

A set of E. coli runs was released by the Loman lab. You can download one directly or any of them from the original page.

Alternatively, use the following curl command:

curl -L -o oxford.fasta http://nanopore.s3.climb.ac.uk/MAP006-PCR-1_2D_pass.fasta

Canu assembles any of the four available datasets into a single contig, but we picked one dataset to use in this tutorial. Assemble the data as before:

canu \
 -p ecoli -d ecoli-oxford \
 genomeSize=4.8m \
 -nanopore-raw oxford.fasta

The assembled identity is >99% before polishing.

Assembling With Multiple Technologies/Files

Canu takes an arbitrary number of input files/formats. We made a mixed dataset of about 10X of a PacBio P6 run and 10X of an Oxford Nanopore run available here.

Alternatively, use the following curl command:

curl -L -o mix.tar.gz http://gembox.cbcb.umd.edu/mhap/raw/ecoliP6Oxford.tar.gz
tar xvzf mix.tar.gz

Now you can assemble all the data:

canu \
 -p ecoli -d ecoli-mix \
 genomeSize=4.8m \
 -pacbio-raw pacbio*fastq.gz \
 -nanopore-raw oxford.fasta.gz

Assembling Low Coverage Datasets

When you have 30X or less coverage, it helps to adjust the Canu assembly parameters. Typically, assembling 20X of single-molecule data outperforms hybrid methods with higher coverage. You can download a 20X subset of S. cerevisiae.

Alternatively, use the following curl command:

curl -L -o yeast.20x.fastq.gz http://gembox.cbcb.umd.edu/mhap/raw/yeast_filtered.20x.fastq.gz

Then run the assembler with more sensitive parameters (correctedErrorRate=0.105):

canu \
 -p asm -d yeast \
 genomeSize=12.1m \
 correctedErrorRate=0.105 \
 -pacbio-raw yeast.20x.fastq.gz

After the run completes, we can check the assembly statistics:

tgStoreDump -sizes -s 12100000 -T yeast/unitigging/asm.ctgStore 2 -G yeast/unitigging/asm.gkpStore
lenSuggestRepeat sum     160297 (genomeSize 12100000)
lenSuggestRepeat num         12
lenSuggestRepeat ave      13358
lenUnassembled ng10       13491 bp   lg10      77   sum    1214310 bp
lenUnassembled ng20       11230 bp   lg20     176   sum    2424556 bp
lenUnassembled ng30        9960 bp   lg30     290   sum    3632411 bp
lenUnassembled ng40        8986 bp   lg40     418   sum    4841978 bp
lenUnassembled ng50        8018 bp   lg50     561   sum    6054460 bp
lenUnassembled ng60        7040 bp   lg60     723   sum    7266816 bp
lenUnassembled ng70        6169 bp   lg70     906   sum    8474192 bp
lenUnassembled ng80        5479 bp   lg80    1114   sum    9684981 bp
lenUnassembled ng90        4787 bp   lg90    1348   sum   10890099 bp
lenUnassembled ng100       4043 bp   lg100   1624   sum   12103239 bp
lenUnassembled ng110       3323 bp   lg110   1952   sum   13310167 bp
lenUnassembled ng120       2499 bp   lg120   2370   sum   14520362 bp
lenUnassembled ng130       1435 bp   lg130   2997   sum   15731198 bp
lenUnassembled sum   16139888 (genomeSize 12100000)
lenUnassembled num       3332
lenUnassembled ave       4843
lenContig ng10      770772 bp   lg10       2   sum    1566457 bp
lenContig ng20      710140 bp   lg20       4   sum    3000257 bp
lenContig ng30      669248 bp   lg30       5   sum    3669505 bp
lenContig ng40      604859 bp   lg40       7   sum    4884914 bp
lenContig ng50      552911 bp   lg50      10   sum    6571204 bp
lenContig ng60      390415 bp   lg60      12   sum    7407061 bp
lenContig ng70      236725 bp   lg70      16   sum    8521520 bp
lenContig ng80      142854 bp   lg80      23   sum    9768299 bp
lenContig ng90       94308 bp   lg90      33   sum   10927790 bp
lenContig sum   12059140 (genomeSize 12100000)
lenContig num         56
lenContig ave     215341
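
The ng and lg columns above can be reproduced from a plain list of sequence lengths. The sketch below uses the usual definition: sort lengths from longest to shortest, accumulate them, and report the length (ngX) and count (lgX) at which the running sum first reaches X% of the genome size. The helper name ngx is hypothetical, and whether tgStoreDump uses exactly this comparison is an assumption.

```shell
# Hypothetical helper: compute NGx and LGx from a list of sequence lengths.
# $1 = file with one length per line
# $2 = genome size in bp
# $3 = x, the percentage of the genome size to reach (for example 50)
ngx() {
  sort -rn "$1" | awk -v g="$2" -v x="$3" '
    { sum += $1
      if (sum >= g * x / 100) {
        printf "ng%d %d lg%d %d\n", x, $1, x, NR
        found = 1
        exit
      }
    }
    END { if (!found) printf "ng%d not reached\n", x }'
}
```

When the lengths sum to less than the requested fraction of the genome size, no value is reported. This is consistent with the lenContig rows above stopping at ng90, since the contig sum (12059140) is below the genome size (12100000).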

Consensus Accuracy

Canu corrects sequences, and the resulting assemblies typically have 99% identity or greater for PacBio or Nanopore data. For the best accuracy, we recommend polishing with a sequence-specific tool: Quiver for PacBio data and Nanopolish for Oxford Nanopore data.

If you have Illumina sequences available, Pilon can also be used to polish either PacBio or Oxford Nanopore assemblies.

Further Reading

See the FAQ page for commonly asked questions and the release notes page for information on what’s changed and known issues.