Published: Jul 5, 2022 DOI: 10.21769/BioProtoc.4456
Edited by: Sanzhen Liu Reviewed by: Weijia Su, Yong-Xin Liu
Abstract
Assembly of high-quality genomes is critical for the characterization of structural variations (SV), for the development of a high-resolution map of polymorphisms, and to serve as the foundation for gene annotation. In recent years, the advent of high-quality, long-read sequencing has enabled the affordable generation of highly contiguous de novo assemblies, as evidenced by the release of many reference genomes, including from species with large and complex genomes. Long-read sequencing technology is instrumental in accurately profiling highly abundant repetitive sequences, which otherwise challenge sequence alignment and assembly programs in eukaryotic genomes. In this protocol, we describe a step-by-step pipeline to assemble a maize genome with PacBio long reads using Canu, and to polish the genome using Arrow and ntEdit. Our protocol provides one option for genome assembly, and can be adapted to other plant species.
Keywords: Long-read sequencing
Background
Maize is one of the most important crops in the world, and has a long history of serving as a classical model organism in genetic studies. Maize is a diploid with 10 chromosomes, and approximately 85% of its genome is composed of transposable elements (TEs) (Schnable et al., 2009). Such abundant, repetitive, and mobile sequences pose computational challenges for accurately assembling the maize genome sequence. The first draft of the maize genome, released in 2009, was sequenced based on Sanger sequencing of bacterial artificial chromosomes and fosmids (Schnable et al., 2009). Since then, long-read sequencing technologies, such as PacBio and Oxford Nanopore, have greatly contributed to improving maize genome assemblies. These technologies generate reads tens of kilobases long, making them suitable for improving genome contiguity, closing gaps in current reference genomes, and identifying structural variations between genomes. In recent years, high-quality genome assemblies of more than thirty maize inbred lines, mostly based on PacBio sequencing, have been released (Jiao et al., 2017; Sun et al., 2018; Springer et al., 2018; Yang et al., 2019; Haberer et al., 2020; Hufford et al., 2021; Hu et al., 2021; Lin et al., 2021).
Compared to Illumina short reads, PacBio long reads usually have a relatively higher error rate, although recent improvements in chemistry and base-calling algorithms have significantly improved long-read sequencing quality. Furthermore, a large proportion of sequencing errors tend to be randomly distributed (Korlach et al., 2013), and can be corrected by increasing the sequencing coverage or by polishing the assembly with higher-accuracy Illumina short reads. Several assembly tools have been designed for PacBio long reads, including Canu (Koren et al., 2017), Falcon (Chin et al., 2016), and WTDBG2 (Ruan and Li, 2020). This protocol describes a step-by-step pipeline to assemble a maize genome with PacBio long reads using Canu version 1.8, and to polish the genome using Arrow and ntEdit (Warren et al., 2019). This approach was previously used for the sweet corn genome assembly (Hu et al., 2021). Other protocols in the literature have also resulted in high-quality assemblies, such as using Falcon to correct PacBio subreads, Canu version 1.8 for trimming and assembly, and Pilon (Walker et al., 2014) for genome polishing (Hufford et al., 2021). Note that this protocol is specifically applicable to genome assembly from traditional PacBio long reads, rather than PacBio HiFi reads.
Software
All the software can be downloaded/used from the following locations:
SMRT Tools (version 10.1.0; https://www.pacb.com/support/software-downloads/)
SequelTools (Hufnagel et al., 2020; version 1.1.0; https://github.com/ISUgenomics/SequelTools)
Canu (Koren et al., 2017; version 1.8; https://canu.readthedocs.io/en/latest/)
BUSCO (Simão et al., 2015; version 3.0.2; https://busco.ezlab.org/)
Pbalign (version 0.3.2; https://smrt.lbi.iq.usp.br/smrtanalysis/doc/bioinformatics-tools/pbalign/doc/howto.html)
Sambamba (Tarasov et al., 2015; version 0.6.9; https://github.com/biod/sambamba)
Samtools (Li et al., 2009; version 1.12; https://github.com/samtools/samtools)
Arrow (version 2.3.3; https://smrt.lbi.iq.usp.br/smrtanalysis/doc/bioinformatics-tools/GenomicConsensus/doc/index.html )
ntHits (version 1.2.1; https://github.com/bcgsc/ntHits)
ntEdit (Warren et al., 2019; version 1.2.1; https://github.com/bcgsc/ntEdit)
Case study
A workflow of the pipeline is shown in Figure 1. The raw PacBio BAM files are subjected to three major steps: (i) Pre-processing of the PacBio raw reads: first, convert the PacBio raw BAM files to fastq files using bam2fastq; then, check the quality metrics of the fastq files using SequelTools. (ii) Genome assembly, which includes the three phases of Canu (correction, trimming, and assembly). (iii) Genome assembly polishing using Arrow and ntEdit, and quality assessment using BUSCO. The protocol provides general instructions for each software tool, along with useful tips for genome assembly and polishing. The running time of each Canu step will depend on the user's dataset and computing power. In our case, Canu ran for 21 days to finish a maize genome assembly (~2.3 Gb) on a 32-processor server with 187 Gb of RAM.
Figure 1. The three major steps of the protocol are described in this flowchart. Programs/software/algorithms used are indicated next to the arrows in blue.
Pre-processing of the PacBio raw reads
This protocol uses data files generated by the PacBio Sequel System to show how to perform pre-processing of the PacBio raw reads. The raw data of each SMRT cell include files named *.subreads.bam, *.subreads.pbi, and *.subreadset.xml. The subreads.bam file contains the multiple subreads generated from each single SMRTbell template, taken from the high-quality regions; it is analysis-ready, and will be used directly in the following analysis. Subreads containing unaligned base calls outside of high-quality regions, as well as excised adapter and barcode sequences, are retained in a scraps.bam file.
Convert the *.subreads.bam files to fastq or fasta files using the PacBio tools bam2fastq or bam2fasta, which are part of the free SMRT Tools.
$ bam2fastq -c 9 -o raw_PacBio_1 raw_PacBio_1.subreads.bam
$ bam2fasta -c 9 -o raw_PacBio_1 raw_PacBio_1.subreads.bam
Here, -c sets the compression level of the output; all subreads are retained at this step, leaving the assembler to decide which reads are suitable for genome assembly. The command bam2fastq will generate a fastq file (raw_PacBio_1.fastq, in our example), and the command bam2fasta will generate a fasta file (raw_PacBio_1.fasta, in our example). Note that only the fastq files will be used for the downstream analysis.
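If the project includes multiple SMRT cells, the conversion can be scripted with a simple shell loop; the sketch below assumes the raw_PacBio_<n>.subreads.bam naming used in this protocol:
$ for bam in raw_PacBio_*.subreads.bam; do
      prefix=${bam%.subreads.bam}   # strip the .subreads.bam suffix, e.g., raw_PacBio_1
      bam2fastq -c 9 -o ${prefix} ${bam}
  done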
Quality check: It is necessary to perform appropriate quality checking of the PacBio sequencing data to produce reliable downstream bioinformatics results. FastQC (Andrews, 2010) works well for quality control of short reads, but is not suitable for quality control of PacBio long reads, which do not have a meaningful Phred quality score. Therefore, we use SequelTools to perform quality control of the PacBio Sequel raw sequencing data from multiple SMRT cells. This tool provides several statistics for each SMRT cell, including the number of reads, total bases, mean and median read length, N50, L50, PSR (polymerase-to-subread ratio), and ZOR (ZMW-occupancy ratio). PSR is used to determine the effectiveness of library preparation, and ZOR is used to measure the effectiveness of introducing template into the ZMWs. The QC tool of SequelTools requires *.subreads.bam files; the *.scraps.bam files are optional. SequelTools will take longer when the *.scraps.bam files are provided, but they supply additional information for the QC plots.
Generate files listing the locations of the *.subreads.bam files and the *.scraps.bam files.
$ find $PWD/*.subreads.bam > subFiles.txt
$ find $PWD/*.scraps.bam > scrFiles.txt
In the above commands, $PWD is an environment variable that stores the path of the current directory.
Run the QC tool of SequelTools with *.scraps.bam files.
$ ./SequelTools.sh -t Q -u subFiles.txt -c scrFiles.txt
Run the QC tool of SequelTools without *.scraps.bam files.
$ ./SequelTools.sh -t Q -u subFiles.txt
The argument -t is mandatory and specifies which tool is being used; we use -t Q to run the QC tool. The argument -u is also mandatory and identifies a file listing the locations of the subread BAM files. The argument -c is optional and identifies a file listing the locations of the scraps BAM files. Note that SequelTools requires Samtools, R, and Python (version 2 or 3) to be pre-installed and available in the path.
Genome assembly
To perform the maize genome assembly, we provide instructions for Canu version 1.8, which was used for the sweet corn genome assembly (Hu et al., 2021). Canu assembles PacBio or Oxford Nanopore sequences in three phases: correction, trimming, and assembly. The recommended coverage for eukaryotic genomes is between 30 x and 60 x. Here, we use traditional PacBio long reads to show how to perform the genome assembly using Canu. If the users have Oxford Nanopore or PacBio HiFi reads, we suggest referring to the Canu/HiCanu manual for further information and troubleshooting (Nurk et al., 2020).
Canu is a very user-friendly tool for genome assembly. First, the users do not need to worry about abnormal termination of their Canu jobs: Canu can detect where it stopped, and resume incomplete jobs automatically. For example, some jobs in the cormhap step (generating correction overlaps in the correction phase) may be killed due to job timeout. If that happens, the user can manually increase the walltime and rerun the original Canu command; Canu will find the incomplete jobs and resubmit them automatically. Second, Canu does not require an upfront definition of computational resource allocation. If the users are unsure how much to allocate to the job, the software will detect the available memory and processors, and request resources based on the size of the genome being assembled; if there are not enough resources to do the assembly, Canu will not start. The maxMemory and maxThreads parameters can also be used to limit the amount of memory and the number of threads used. Finally, in a grid environment, Canu will automatically submit itself to the grid for execution by default. If no grid is detected, or if the user sets useGrid=false, Canu will run on a single local machine.
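As a minimal sketch, a default Canu run (all three phases) confined to the local machine with an explicit resource cap might look like the following; the memory and thread values are illustrative only and should be matched to the available hardware:
$ canu -p maize -d maize \
      genomeSize=2.3g \
      useGrid=false maxMemory=180g maxThreads=32 \
      -pacbio-raw raw_PacBio_1.fastq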
Canu supports sequence inputs in FASTA or FASTQ format, as well as compressed (.gz, .bz2, or .xz) versions of these formats. By default, it automatically performs correction, trimming, and assembly in series. However, the users can also run these three phases separately, if they want to test different parameters for each phase, or if they only want to run the trimming and assembly phases using corrected reads generated by other software. In this protocol, we show how to run each phase of Canu separately, and which parameters of each phase can be adjusted. If the users want to run the three phases automatically by default, please refer to the software's manual for further information.
Correct the raw reads
In this phase, Canu will do multiple rounds of overlapping and correction. To run the correction phase specifically, the users need to use the -pacbio-raw option to provide raw PacBio reads as input data, and the -correct option to let Canu only correct the raw reads. If the users have more than 4,096 input files, they must consolidate them into fewer files, as shown in the sketch below. The output of the correction phase will be one compressed fasta file with all corrected reads (maize.correctedReads.fasta.gz, in our example).
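If consolidation is needed, simply concatenating the fastq files is sufficient; a minimal sketch, assuming the file naming used above:
$ cat raw_PacBio_*.fastq > raw_PacBio_all.fastq
The correction phase itself is then run as follows: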
$ canu -correct \
-p maize -d maize \
genomeSize=2.3g \
-pacbio-raw raw_PacBio_1.fastq \
raw_PacBio_2.fastq \
raw_PacBio_3.fastq \
raw_PacBio_4.fastq \
raw_PacBio_5.fastq \
raw_PacBio_6.fastq \
raw_PacBio_7.fastq \
raw_PacBio_8.fastq \
raw_PacBio_9.fastq
The -p <string> option is mandatory and sets the file name prefix of intermediate and output files. The -d <assembly directory> option is optional; if it is not provided, Canu will run in the current directory. The genomeSize parameter is required by Canu, and is used to determine the coverage of the input reads. The users can provide the estimated genome size in bases, or with common SI prefixes (e.g., 2.3g).
[Tip 1] If the raw PacBio coverage is low (less than 30 x), one option is to increase the parameter correctedErrorRate (the allowed difference in an overlap between two corrected reads, expressed as a fraction) to 0.105 (the default value is 0.045). The parameter corMinCoverage (which limits read correction to regions with at least this minimum coverage) will be automatically set to 0 x. If the raw PacBio coverage is high (more than 60 x), a better correction will be obtained if the parameter correctedErrorRate is reduced to 0.040; the parameter corMinCoverage will be automatically set to 4 x.
[Tip 2] If the users have high raw PacBio coverage, they can consider increasing the parameters minReadLength (reads shorter than this are not loaded into the assembler) and minOverlapLength (overlaps shorter than this will not be discovered), to discard short reads and reads with short overlaps, and thereby improve the assembly quality.
[Tip 3] If the users’ genome is very heterozygous, they can increase the parameter corOutCoverage (which only corrects the longest reads up to this coverage) above the raw PacBio coverage; in that case, all raw reads will be corrected. However, when we tested this parameter in our maize genome assembly, it did not improve the assembly much and also increased the running time. Therefore, if the genome is not very heterozygous, we do not recommend changing the default value of corOutCoverage.
[Tip 4] Canu runs in two modes: locally, using just the local machine, or grid-supported, using multiple hosts managed by a grid engine, such as the Portable Batch System (PBS Pro). The grid engine works as a job scheduler: after the users submit the initial job, the grid engine will queue and run the jobs, based on the available resources and the size of the genome being assembled. By default, Canu will automatically detect grid support on the users' system and submit itself to the grid for execution. If they want to specify their grid options, they can use the parameter gridOptions="<your options list>" to provide memory and time limits and account information. For example, gridOptions="--mem=100gb --time=168:00:00 --qos=account_name" requests 100 gb of memory and a time limit of 168 hours, and specifies the account information, for every job submitted by Canu. However, we do not recommend that users define memory and time limits in this way, because Canu will reserve the specified memory and time for every job, even though each step of the three phases requires different amounts of memory and time to finish. If the users request too much memory in gridOptions, most of their jobs will not use that much, and the assembly will spend more time waiting to run than actually running. To disable grid support entirely, users can specify useGrid=false to run Canu on the local machine.
Trim the corrected reads
The trimming phase determines the high-quality regions using overlapping reads, and removes any remaining SMRTbell adapter sequences. The input data should be the output of the correction phase. The users need to use the -pacbio-corrected option to provide the corrected PacBio reads as input data, and the -trim option to let Canu only trim the corrected reads. The output of the trimming phase will be one compressed fasta file with all corrected and trimmed reads (maize.trimmedReads.fasta.gz, in our example).
$ canu -trim \
-p maize -d maize \
genomeSize=2.3g \
-pacbio-corrected maize/maize.correctedReads.fasta.gz
[Tip 5] If the users have high PacBio coverage (>50 x), they could speed up the trimming phase by increasing the minimum coverage and overlap, to perform more stringent overlap-based trimming; with >50 x coverage, the users can add the parameters trimReadsCoverage=2 trimReadsOverlap=500, as shown in the sketch below. The trimReadsCoverage and trimReadsOverlap parameters define, respectively, the minimum depth of evidence required to retain bases, and the minimum overlap between evidence reads required to make a contiguous trim.
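A minimal sketch of the trim command with these parameters added (the values follow the tip above and can be adjusted to the dataset):
$ canu -trim \
      -p maize -d maize \
      genomeSize=2.3g \
      trimReadsCoverage=2 trimReadsOverlap=500 \
      -pacbio-corrected maize/maize.correctedReads.fasta.gz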
Assemble the corrected and trimmed reads into unitigs
The assembly phase identifies consistent overlaps, orders and orients reads into contigs, and generates a consensus sequence for each unitig. The output of the trimming phase is used for unitig construction. The users need to use the -pacbio-corrected option to provide the corrected and trimmed PacBio reads as input data, and the -assemble option to let Canu only assemble the corrected and trimmed reads. Canu will generate three assembled sequence files, maize.contigs.fasta, maize.unitigs.fasta, and maize.unassembled.fasta, of which maize.contigs.fasta is the primary output.
$ canu -assemble \
-p maize -d maize \
genomeSize=2.3g \
-pacbio-corrected maize/maize.trimmedReads.fasta.gz
[Tip 6] There are several parameters that may need tweaking to get the best genome assembly. First, the users can vary correctedErrorRate to test the effect of different overlap stringencies on the assembly quality; we recommend setting correctedErrorRate to 0.035 for low coverage data (<30 x), and 0.055 for high coverage data (>50 x). Second, utgOvlErrorRate (overlaps between reads above this error rate are not computed during assembly) is another parameter that may need tweaking. If set too high, it will result in errors in the genome assembly and increase the running time; if set too low, real overlaps between low-quality reads will be missed, resulting in a truncated genome assembly. We recommend setting utgOvlErrorRate to 0.035 for low coverage data (<30 x), and 0.055 for high coverage data (>50 x). A sketch of how these options can be added to the assembly command is shown below.
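A minimal sketch, using the high-coverage values from the tip above (adjust to the dataset):
$ canu -assemble \
      -p maize -d maize \
      genomeSize=2.3g \
      correctedErrorRate=0.055 utgOvlErrorRate=0.055 \
      -pacbio-corrected maize/maize.trimmedReads.fasta.gz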
Assembly polishing
After assembly, align the raw PacBio subreads back to the assembled contigs with pbalign; Arrow will then use these alignments for consensus polishing.
$ pbalign raw_PacBio_1.subreads.bam maize.contigs.fasta raw_PacBio_1.subreads_aligned.bam
Optionally, the number of CPU threads (--nproc <int>) can be set.
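If there are multiple subreads BAM files, each one must be aligned to the assembly separately. A minimal sketch using a shell loop (the --nproc value is illustrative), producing one aligned BAM per SMRT cell with the file names used in the merging step below:
$ for bam in raw_PacBio_*.subreads.bam; do
      pbalign --nproc 16 ${bam} maize.contigs.fasta ${bam%.subreads.bam}.subreads_aligned.bam
  done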
If the users have multiple bam files, they can use sambamba to merge those aligned bam files into one. For instance, merge nine aligned bam files into one, as follows:
$ sambamba merge raw_PacBio.subreads_aligned_merged.bam \
raw_PacBio_1.subreads_aligned.bam \
raw_PacBio_2.subreads_aligned.bam \
raw_PacBio_3.subreads_aligned.bam \
raw_PacBio_4.subreads_aligned.bam \
raw_PacBio_5.subreads_aligned.bam \
raw_PacBio_6.subreads_aligned.bam \
raw_PacBio_7.subreads_aligned.bam \
raw_PacBio_8.subreads_aligned.bam \
raw_PacBio_9.subreads_aligned.bam
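Merging can be parallelized with sambamba's -t option, and a shell glob can be used instead of listing each file; a minimal sketch (the thread count is illustrative):
$ sambamba merge -t 16 raw_PacBio.subreads_aligned_merged.bam \
      raw_PacBio_*.subreads_aligned.bam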
Before polishing the assembled genome sequence, the reference genome should be indexed with samtools faidx.
$ samtools faidx maize.contigs.fasta
Run the variantCaller command line tool to call Arrow, either on the merged, aligned bam file (if the users have multiple bam files), or on a single aligned bam file (if the users have only one). The following command calls Arrow on the merged, aligned bam file.
$ variantCaller --algorithm=arrow raw_PacBio.subreads_aligned_merged.bam \
--referenceFilename maize.contigs.fasta \
-j 32 \
-o Maize.contigs.polished.arrow.fastq \
-o Maize.contigs.polished.arrow.fasta \
-o Maize.contigs.polished.arrow.gff
where --algorithm sets the algorithm to Arrow, --referenceFilename provides the file name of the assembled genome FASTA file, -j optionally sets the number of threads, and -o sets the output files. The users can generate multiple outputs in different formats, including FASTA, FASTQ, GFF, and VCF.
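If only one SMRT cell was sequenced, the same command can be run directly on the single aligned bam file; a minimal sketch:
$ variantCaller --algorithm=arrow raw_PacBio_1.subreads_aligned.bam \
      --referenceFilename maize.contigs.fasta \
      -j 32 \
      -o Maize.contigs.polished.arrow.fasta
Next, high-accuracy Illumina short reads are used for a further round of polishing: run ntHits on the paired-end Illumina reads to build a Bloom filter of non-error kmers, which ntEdit will then use to correct the Arrow-polished assembly.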
$ nthits -c 2 --outbloom -p maize -b 36 -k 25 -t 8 \
maize.R1.pair.fq maize.R2.pair.fq
where -c sets the coverage threshold for reporting kmers. We recommend setting -c to 1 for low-coverage Illumina short-read data (<20 x), to 2 for 20–30 x coverage, or running with the --solid option for high-coverage data (>30 x), to report non-error kmers. The option --outbloom outputs the coverage-thresholded kmers in a Bloom filter, and the option -p sets the prefix of the output file name (with the settings above, the ntHits output is named maize_k25.bf). The Bloom filter bit size is defined by the option -b (-b 36 keeps the Bloom filter false-positive rate low, ~0.0005), and the kmer size can be adjusted with the option -k. Optionally, the number of CPUs can be set (-t <int>). The input can be two paired-end fastq files, or a file listing the paths to all paired-end fastq files.
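For high-coverage Illumina data, a minimal sketch of the --solid variant mentioned above (all other options as before):
$ nthits --solid --outbloom -p maize -b 36 -k 25 -t 8 \
      maize.R1.pair.fq maize.R2.pair.fq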
Then, ntEdit polishes the Arrow-polished contigs of the assembled genome, using the Bloom filter generated by ntHits.
$ ntedit -f Maize.contigs.polished.arrow.fasta \
-r maize_k25.bf -k 25 -b Maize.contigs.polished.arrow.ntedit -t 24
where -f is the users’ assembled genome input, -r sets the Bloom filter file generated by ntHits, -k sets the kmer length, and -b sets the output file prefix (with the settings above, the ntEdit output is named Maize.contigs.polished.arrow.ntedit_edited.fa). Optionally, the number of CPUs can be set (-t <int>).
After genome assembly and polishing, it is necessary to check the completeness and duplication level of the assembly. BUSCO is a commonly used tool to assess the completeness of a genome assembly (Simão et al., 2015); check for the newest version at https://busco.ezlab.org/. The users can run BUSCO with the following command line:
$ run_BUSCO.py -i Maize.contigs.polished.arrow.ntedit_edited.fa \
-o PacBio_assembly.BUSCO -m geno -sp maize -l embryophyta_odb9
where -i is the assembled fasta sequence, -o is the output file name, -m sets the mode for BUSCO (geno or genome for genome assemblies, tran or transcriptome for transcriptome assemblies, or prot or proteins for annotated gene sets), -sp sets the Augustus species model, and -l is the lineage dataset used as the reference for comparison (the lineage dataset embryophyta_odb9 is appropriate for maize). The lineage data can be downloaded from the BUSCO website.
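Optionally, BUSCO can be run on multiple CPUs with the -c <int> option; a minimal sketch (the thread count is illustrative):
$ run_BUSCO.py -i Maize.contigs.polished.arrow.ntedit_edited.fa \
      -o PacBio_assembly.BUSCO -m geno -sp maize -l embryophyta_odb9 -c 16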
BUSCO assesses the completeness of the assembled genome by quantitatively checking it against evolutionarily informed expectations of gene content, based on near-universal single-copy orthologs. BUSCO reports the expected BUSCO genes in different categories: C:complete [S:single-copy, D:duplicated], F:fragmented, and M:missing. The results are reported as absolute numbers, as well as percentages of the total BUSCO genes. In the above example, BUSCO analyzed the completeness of the assembled maize genome in terms of complete single-copy, complete duplicated, fragmented, and missing BUSCOs, using a plant-specific database (embryophyta_odb9) consisting of 1440 BUSCO groups from 30 species. For model species, a good genome assembly should have a complete BUSCO score above 95%. The complete duplicated BUSCOs reflect duplications in the users' assembly. A large number of fragmented and missing BUSCOs indicates that the assembly is fragmented and does not cover the entire genome.
Example of the protocol application – sweet corn genome assembly
The dataset from our original publication (Hu et al., 2021) is used to describe this protocol. In summary, the sweet corn (Zea mays L.) inbred line Ia453 carrying the sh2-R allele (Ia453-sh2) was sequenced. DNA from 1-week-old etiolated seedlings was extracted using a modified CTAB method for PacBio sequencing. Large-insert (20 kb) SMRTbell libraries were sequenced using a PacBio Sequel system. DNA extracted from the same sample was used to build standard 300-bp Illumina libraries. All Illumina libraries were sequenced with 150 bp paired-end reads.
For the sweet corn genome assembly (Hu et al., 2021), around 19.9 million PacBio SMRT subreads were error-corrected and assembled using Canu version 1.8. The correction phase of Canu was run with default parameters, except that the minimum read length (minReadLength) was set to 5000, to only correct reads longer than 5 kb, and the coverage in corrected reads (corOutCoverage) was set to 60, to obtain more corrected reads. With the read length cutoff of 5 kb and read coverage of 60 x, around 12.3 million reads comprising 134 Gb (58.26 x coverage) were used for Canu correction. After correction, 9.8 million reads comprising 102.5 Gb remained. The trimming and assembly phases of Canu were run with the default parameters: rawErrorRate=0.300, correctedErrorRate=0.045, corMhapSensitivity=normal, corMinCoverage=4, corOutCoverage=40, minOverlapLength=500, and minReadLength=1000. After trimming, around 1.08 million reads remained unchanged (no trimming), 8.64 million reads were trimmed, and 30,678 and 62,457 reads were deleted, due to having no overlaps or a short trimmed length, respectively. The remaining 9.7 million reads, comprising 98.6 Gb (42.86 x coverage), were used for unitig construction. Finally, the assembly phase of Canu generated consensus sequences for the unitigs. The general statistics are shown in Table 1.
Table 1. The summary statistics of the sweet corn Ia453-sh2 assembly.
Genomic feature | Assembly
Length of genome assembly (bp) | 2,258,407,602
Max contig length (bp) | 2,583,225
Contig N50 (bp) | 385,558
Contig N90 (bp) | 62,492
Number of contigs | 15,550
Genome polishing
To improve the accuracy of the genome assembly, Arrow was used to correct sequencing errors with default parameters. A total of 1,573,052 bases, including 999,948 bases of insertions, 57,359 bases of deletions, and 515,745 bases of substitutions, were corrected.
Then, we ran ntEdit to further polish the genome assembly, using ~23 x coverage of paired-end Illumina whole-genome sequencing data. ntHits was first run with the parameters “-k 25 -c 2” to build a Bloom filter, which was then read by ntEdit to polish the assembly with default parameters. A total of 832,323 corrections were made, of which 31.29% were SNPs and 68.7% were small indels (2–25 bp).
Quality assessment
The completeness of the assembled genome sequence was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2. The assembly was tested against the plant BUSCO “embryophyta_odb9” database, which contains 1,440 protein sequences and orthogroup annotations for major clades. BUSCO analysis showed that 94.6% (1,363), 1.11% (18), and 4.09% (59) of the plant BUSCO genes are present in the Canu-assembled Ia453-sh2 genome as complete, fragmented, and missing genes, respectively. Of the 94.6% complete genes, 88.05% were single-copy and 6.59% were duplicated.
Result interpretation
To assemble the Ia453-sh2 genome, 150.5 Gb (~70-fold coverage; 19.9 million reads) of PacBio single-molecule long reads were self-corrected and assembled with Canu, generating 15,550 contigs with an N50 of 0.39 Mb (Table 1). The quality and completeness of the Ia453-sh2 genome were evaluated with BUSCO. The BUSCO results are similar to those obtained for field corn reference genomes, such as B73 v4 (Jiao et al., 2017), W22 (Springer et al., 2018), and Mo17 (Sun et al., 2018), indicating similar contiguity and completeness of our assembly.
Discussion
This protocol focuses on how to assemble a maize genome with traditional PacBio long reads using Canu version 1.8. However, for large and complex plant genomes, the assembly produced by Canu is usually fragmented, and additional scaffolding methods are required to improve the genome assembly. In our recent sweet corn genome assembly study (Hu et al., 2021), two other data sources, BioNano optical maps and Dovetail Hi-C mapping, were used to generate a high-quality and complete genome assembly. BioNano optical maps dramatically improved the genome assembly by anchoring the 15,550 PacBio contigs into 29 super-scaffolds and 8,486 unscaffolded contigs, increasing the N50 from 0.39 Mb to 120.9 Mb. To further anchor and orient the super-scaffolds and unscaffolded contigs into pseudochromosomes, Dovetail Hi-C mapping was used for scaffolding through a hierarchical clustering strategy. The final assembly has a genome length of 2.29 Gb and contains 10 pseudochromosomes with a total length of 2.11 Gb, as well as 8,440 unassigned contigs with a total length of 177.23 Mb. Therefore, single-molecule real-time (SMRT) long-read sequencing, combined with BioNano optical mapping and Dovetail Hi-C mapping technologies, allowed us to assemble a high-quality reference genome of sweet corn. For a detailed description of the methods and parameters for BioNano optical mapping and Dovetail Hi-C, the users are referred to our sweet corn study (Hu et al., 2021).
Acknowledgments
This work was supported by the National Institute of Food and Agriculture (SCRI 2018-51181-28419 to M.F.R.R.).
Competing interests
The authors declare no conflicts of interest.
References
Supplementary information