Published: Jul 5, 2022 DOI: 10.21769/BioProtoc.4456
Edited by: Sanzhen Liu Reviewed by: Weijia Su, Yong-Xin Liu
Abstract
Assembly of high-quality genomes is critical for the characterization of structural variations (SV), for the development of a high-resolution map of polymorphisms, and to serve as the foundation for gene annotation. In recent years, the advent of high-quality, long-read sequencing has enabled the affordable generation of highly contiguous de novo assemblies, as evidenced by the release of many reference genomes, including from species with large and complex genomes. Long-read sequencing technology is instrumental in accurately profiling highly abundant repetitive sequences, which otherwise challenge sequence alignment and assembly programs in eukaryotic genomes. In this protocol, we describe a step-by-step pipeline to assemble a maize genome with PacBio long reads using Canu, and to polish the genome using Arrow and ntEdit. Our protocol provides one option for genome assembly, and can be adapted to other plant species.
Keywords: Long-read sequencing
Background
Maize is one of the most important crops in the world, and has a long history of serving as a classical model organism in genetic studies. Maize is a diploid with 10 chromosomes, and approximately 85% of its genome is composed of transposable elements (TEs) (Schnable et al., 2009). Such abundant, repetitive, and mobile sequences pose computational challenges for accurately assembling the maize genome sequence. The first draft of the maize genome, released in 2009, was sequenced based on Sanger sequencing of bacterial artificial chromosomes and fosmids (Schnable et al., 2009). Since then, long-read sequencing technologies, such as PacBio and Oxford Nanopore, have greatly contributed to improving maize genome assemblies. These technologies generate reads tens of kilobases long, making them suitable for improving genome contiguity, closing gaps in current reference genomes, and identifying structural variations between genomes. In recent years, high-quality genome assemblies of more than thirty maize inbred lines, mostly based on PacBio sequencing, have been released (Jiao et al., 2017; Sun et al., 2018; Springer et al., 2018; Yang et al., 2019; Haberer et al., 2020; Hufford et al., 2021; Hu et al., 2021; Lin et al., 2021).
Compared to Illumina short reads, PacBio long reads usually have a relatively higher error rate, although recent improvements in chemistry and base-calling algorithms have significantly improved long-read sequencing quality. Furthermore, a large proportion of sequencing errors tend to be randomly distributed (Korlach et al., 2013), and can be corrected by increasing the sequencing coverage or by polishing the assembly with higher-accuracy Illumina short reads. Several assembly tools have been designed for PacBio long reads, including Canu (Koren et al., 2017), Falcon (Chin et al., 2016), and WTDBG2 (Ruan and Li, 2020). This protocol describes a step-by-step pipeline to assemble a maize genome with PacBio long reads using Canu version 1.8, and to polish the genome using Arrow and ntEdit (Warren et al., 2019). This approach was previously used for the sweet corn genome assembly (Hu et al., 2021). Other protocols in the literature have also resulted in high-quality assemblies, such as using Falcon to correct PacBio subreads, Canu version 1.8 for trimming and assembly, and Pilon (Walker et al., 2014) for genome polishing (Hufford et al., 2021). Note that this protocol is specifically applicable to genome assembly from traditional PacBio long reads, rather than PacBio HiFi reads.
Software
All the software can be downloaded/used from the following locations:
SMRT Tools (version 10.1.0; https://www.pacb.com/support/software-downloads/)
SequelTools (Hufnagel et al., 2020; version 1.1.0; https://github.com/ISUgenomics/SequelTools)
Canu (Koren et al., 2017; version 1.8; https://canu.readthedocs.io/en/latest/)
BUSCO (Simão et al., 2015; version 3.0.2; https://busco.ezlab.org/)
Pbalign (version 0.3.2; https://smrt.lbi.iq.usp.br/smrtanalysis/doc/bioinformatics-tools/pbalign/doc/howto.html)
Sambamba (Tarasov et al., 2015; version 0.6.9; https://github.com/biod/sambamba)
Samtools (Li et al., 2009; version 1.12; https://github.com/samtools/samtools)
Arrow (version 2.3.3; https://smrt.lbi.iq.usp.br/smrtanalysis/doc/bioinformatics-tools/GenomicConsensus/doc/index.html )
ntHits (version 1.2.1; https://github.com/bcgsc/ntHits)
ntEdit (Warren et al., 2019; version 1.2.1; https://github.com/bcgsc/ntEdit)
Case study
A workflow of the pipeline is shown in Figure 1. The raw PacBio BAM files are subjected to three major steps: (i) Pre-processing of the PacBio raw reads: first, convert the PacBio raw BAM files to fastq files using bam2fastq; then, check the quality metrics of the fastq files using SequelTools. (ii) Genome assembly, which includes the three phases of Canu (correction, trimming, and assembly). (iii) Genome assembly polishing using Arrow and ntEdit, and quality assessment using BUSCO. The protocol provides general instructions for each software tool, along with useful tips for genome assembly and polishing. The running time of each Canu step will depend on the user's dataset and computing power. In our case, Canu ran for 21 days to finish a maize genome assembly (~2.3 Gb) on a 32-processor server with 187 Gb of RAM.
Figure 1. The three major steps of the protocol are described in this flowchart. Programs/software/algorithms used are indicated next to the arrows in blue.
Pre-processing of the PacBio raw reads
This protocol uses data files generated by the PacBio Sequel System to show how to perform pre-processing of the PacBio raw reads. The raw data of each SMRT cell include files named *.subreads.bam, *.subreads.pbi, and *.subreadset.xml. The subreads.bam file contains the multiple subreads generated from each single SMRTbell template, taken from the high-quality regions; it is analysis-ready, and will be used directly in the following analysis. Subreads containing unaligned base calls outside of high-quality regions, as well as excised adapter and barcode sequences, are retained in a scraps.bam file.
Convert the *.subreads.bam files to fastq or fasta files using the PacBio tools bam2fastq or bam2fasta, which are part of the free SMRT Tools.
$ bam2fastq -c 9 -o raw_PacBio_1 raw_PacBio_1.subreads.bam
$ bam2fasta -c 9 -o raw_PacBio_1 raw_PacBio_1.subreads.bam
Here, -c sets the compression level of the output; all subreads are retained at this step, leaving the assembler to decide which reads are suitable for genome assembly. The command bam2fastq will generate a fastq file (raw_PacBio_1.fastq, in our example), and the command bam2fasta will generate a fasta file (raw_PacBio_1.fasta, in our example). Note that only the fastq files will be used for the downstream analysis.
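If the project includes multiple SMRT cells, the conversion can be scripted with a simple shell loop; the sketch below assumes the raw_PacBio_<n>.subreads.bam naming used in this protocol:
$ for bam in raw_PacBio_*.subreads.bam; do
      prefix=${bam%.subreads.bam}   # strip the .subreads.bam suffix, e.g., raw_PacBio_1
      bam2fastq -c 9 -o ${prefix} ${bam}
  done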
Quality check: It is necessary to perform appropriate quality checking of the PacBio sequencing data to produce reliable downstream bioinformatics results. FastQC (Andrews, 2010) works well for quality control of short reads, but is not suitable for quality control of PacBio long reads, which do not have a meaningful Phred quality score. Therefore, we use SequelTools to perform quality control of the PacBio Sequel raw sequencing data from multiple SMRT cells. This tool provides several statistics for each SMRT cell, including the number of reads, total bases, mean and median read length, N50, L50, PSR (polymerase-to-subread ratio), and ZOR (ZMW-occupancy ratio). PSR is used to determine the effectiveness of library preparation, and ZOR is used to measure the effectiveness of introducing template into the ZMWs. The QC tool of SequelTools requires *.subreads.bam files; the *.scraps.bam files are optional. SequelTools will take longer when the *.scraps.bam files are provided, but they supply additional information for the QC plots.
Generate files listing the locations of the *.subreads.bam files and the *.scraps.bam files.
$ find $PWD/*.subreads.bam > subFiles.txt
$ find $PWD/*.scraps.bam > scrFiles.txt
In the above commands, $PWD is an environment variable that stores the path of the current directory.
Run the QC tool of SequelTools with *.scraps.bam files.
$ ./SequelTools.sh -t Q -u subFiles.txt -c scrFiles.txt
Run the QC tool of SequelTools without *.scraps.bam files.
$ ./SequelTools.sh -t Q -u subFiles.txt
The argument -t is mandatory and specifies which tool is being used; we use -t Q to run the QC tool. The argument -u is also mandatory and identifies a file listing the locations of the subread BAM files. The argument -c is optional and identifies a file listing the locations of the scraps BAM files. Note that SequelTools requires Samtools, R, and Python (version 2 or 3) to be pre-installed and available in the path.
Genome assembly
To perform the maize genome assembly, we provide instructions for Canu version 1.8, which was used for the sweet corn genome assembly (Hu et al., 2021). Canu assembles PacBio or Oxford Nanopore sequences in three phases: correction, trimming, and assembly. The recommended coverage for eukaryotic genomes is between 30 x and 60 x. Here, we use traditional PacBio long reads to show how to perform the genome assembly using Canu. If the users have Oxford Nanopore or PacBio HiFi reads, we suggest referring to the Canu/HiCanu manual for further information and troubleshooting (Nurk et al., 2020).
Canu is a very user-friendly tool for genome assembly. First, the users do not need to worry about abnormal termination of their Canu jobs: Canu can detect where it stopped, and resume incomplete jobs automatically. For example, some jobs in the cormhap step (generating correction overlaps in the correction phase) may be killed due to job timeout. If that happens, the user can manually increase the walltime and rerun the original Canu command; Canu will find the incomplete jobs and resubmit them automatically. Second, Canu does not require an upfront definition of computational resource allocation. If the users are unsure how much to allocate to the job, the software will detect the available memory and processors, and request resources based on the size of the genome being assembled; if there are not enough resources to do the assembly, Canu will not start. The maxMemory and maxThreads parameters can also be used to limit the amount of memory and the number of threads used. Finally, in a grid environment, Canu will automatically submit itself to the grid for execution by default. If no grid is detected, or if the user sets useGrid=false, Canu will run on a single local machine.
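As a minimal sketch, a default Canu run (all three phases) confined to the local machine with an explicit resource cap might look like the following; the memory and thread values are illustrative only and should be matched to the available hardware:
$ canu -p maize -d maize \
      genomeSize=2.3g \
      useGrid=false maxMemory=180g maxThreads=32 \
      -pacbio-raw raw_PacBio_1.fastq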
Canu supports sequence inputs in FASTA or FASTQ format, as well as compressed (.gz, .bz2, or .xz) versions of these formats. By default, it automatically performs correction, trimming, and assembly in series. However, the users can also run these three phases separately, if they want to test different parameters for each phase, or if they only want to run the trimming and assembly phases using corrected reads generated by other software. In this protocol, we show how to run each phase of Canu separately, and which parameters of each phase can be adjusted. If the users want to run the three phases automatically by default, please refer to the software's manual for further information.
Correct the raw reads
In this phase, Canu will do multiple rounds of overlapping and correction. To run the correction phase specifically, the users need to use the -pacbio-raw option to provide raw PacBio reads as input data, and the -correct option to let Canu only correct the raw reads. If the users have more than 4,096 input files, they must consolidate them into fewer files, as shown in the sketch below. The output of the correction phase will be one compressed fasta file with all corrected reads (maize.correctedReads.fasta.gz, in our example).
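If consolidation is needed, simply concatenating the fastq files is sufficient; a minimal sketch, assuming the file naming used above:
$ cat raw_PacBio_*.fastq > raw_PacBio_all.fastq
The correction phase itself is then run as follows: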
$ canu -correct \
-p maize -d maize \
genomeSize=2.3g \
-pacbio-raw raw_PacBio_1.fastq \
raw_PacBio_2.fastq \
raw_PacBio_3.fastq \
raw_PacBio_4.fastq \
raw_PacBio_5.fastq \
raw_PacBio_6.fastq \
raw_PacBio_7.fastq \
raw_PacBio_8.fastq \
raw_PacBio_9.fastq
The -p <string> option is mandatory and sets the file name prefix of intermediate and output files. The -d <assembly directory> option is optional; if it is not provided, Canu will run in the current directory. The genomeSize parameter is required by Canu, and is used to determine the coverage of the input reads. The users can provide the estimated genome size in bases, or with common SI prefixes (e.g., 2.3g).
[Tip 1] If the raw PacBio coverage is low (less than 30 x), one option is to increase the parameter correctedErrorRate (the allowed difference in an overlap between two corrected reads, expressed as a fraction) to 0.105 (the default value is 0.045). The parameter corMinCoverage (which limits read correction to regions with at least this minimum coverage) will be automatically set to 0 x. If the raw PacBio coverage is high (more than 60 x), a better correction will be obtained if the parameter correctedErrorRate is reduced to 0.040; the parameter corMinCoverage will be automatically set to 4 x.
[Tip 2] If the users have high raw PacBio coverage, they can consider increasing the parameters minReadLength (reads shorter than this are not loaded into the assembler) and minOverlapLength (overlaps shorter than this will not be discovered), to discard short reads and reads with short overlaps, and thereby improve the assembly quality.
[Tip 3] If the users’ genome is very heterozygous, they can increase the parameter corOutCoverage (which only corrects the longest reads up to this coverage) above the raw PacBio coverage; in that case, all raw reads will be corrected. However, when we tested this parameter in our maize genome assembly, it did not improve the assembly much and also increased the running time. Therefore, if the genome is not very heterozygous, we do not recommend changing the default value of corOutCoverage.
[Tip 4] Canu runs in two modes: locally, using just the local machine, or grid-supported, using multiple hosts managed by a grid engine, such as the Portable Batch System (PBS Pro). The grid engine works as a job scheduler: after the users submit the initial job, the grid engine will queue and run the jobs, based on the available resources and the size of the genome being assembled. By default, Canu will automatically detect grid support on the users' system and submit itself to the grid for execution. If they want to specify their grid options, they can use the parameter gridOptions="<your options list>" to provide memory and time limits and account information. For example, gridOptions="--mem=100gb --time=168:00:00 --qos=account_name" requests 100 gb of memory and a time limit of 168 hours, and specifies the account information, for every job submitted by Canu. However, we do not recommend that users define memory and time limits in this way, because Canu will reserve the specified memory and time for every job, even though each step of the three phases requires different amounts of memory and time to finish. If the users request too much memory in gridOptions, most of their jobs will not use that much, and the assembly will spend more time waiting to run than actually running. To disable grid support entirely, users can specify useGrid=false to run Canu on the local machine.
Trim the corrected reads
The trimming phase determines the high-quality regions using overlapping reads, and removes any remaining SMRTbell adapter sequences. The input data should be the output of the correction phase. The users need to use the -pacbio-corrected option to provide the corrected PacBio reads as input data, and the -trim option to let Canu only trim the corrected reads. The output of the trimming phase will be one compressed fasta file with all corrected and trimmed reads (maize.trimmedReads.fasta.gz, in our example).
$ canu -trim \
-p maize -d maize \
genomeSize=2.3g \
-pacbio-corrected maize/maize.correctedReads.fasta.gz
[Tip 5] If the users have high PacBio coverage (>50 x), they could speed up the trimming phase by increasing the minimum coverage and overlap, to perform more stringent overlap-based trimming; with >50 x coverage, the users can add the parameters trimReadsCoverage=2 trimReadsOverlap=500, as shown in the sketch below. The trimReadsCoverage and trimReadsOverlap parameters define, respectively, the minimum depth of evidence required to retain bases, and the minimum overlap between evidence reads required to make a contiguous trim.
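A minimal sketch of the trim command with these parameters added (the values follow the tip above and can be adjusted to the dataset):
$ canu -trim \
      -p maize -d maize \
      genomeSize=2.3g \
      trimReadsCoverage=2 trimReadsOverlap=500 \
      -pacbio-corrected maize/maize.correctedReads.fasta.gz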
Assemble the corrected and trimmed reads into unitigs
The assembly phase identifies consistent overlaps, orders and orients reads into contigs, and generates a consensus sequence for each unitig. The output of the trimming phase is used for unitig construction. The users need to use the -pacbio-corrected option to provide the corrected and trimmed PacBio reads as input data, and the -assemble option to let Canu only assemble the corrected and trimmed reads. Canu will generate three assembled sequence files, maize.contigs.fasta, maize.unitigs.fasta, and maize.unassembled.fasta, of which maize.contigs.fasta is the primary output.
$ canu -assemble \
-p maize -d maize \
genomeSize=2.3g \
-pacbio-corrected maize/maize.trimmedReads.fasta.gz
[Tip 6] There are several parameters that may need tweaking to get the best genome assembly. First, the users can vary correctedErrorRate to test the effect of different overlap stringencies on the assembly quality; we recommend setting correctedErrorRate to 0.035 for low coverage data (<30 x), and 0.055 for high coverage data (>50 x). Second, utgOvlErrorRate (overlaps between reads above this error rate are not computed during assembly) is another parameter that may need tweaking. If set too high, it will result in errors in the genome assembly and increase the running time; if set too low, real overlaps between low-quality reads will be missed, resulting in a truncated genome assembly. We recommend setting utgOvlErrorRate to 0.035 for low coverage data (<30 x), and 0.055 for high coverage data (>50 x). A sketch of how these options can be added to the assembly command is shown below.
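A minimal sketch, using the high-coverage values from the tip above (adjust to the dataset):
$ canu -assemble \
      -p maize -d maize \
      genomeSize=2.3g \
      correctedErrorRate=0.055 utgOvlErrorRate=0.055 \
      -pacbio-corrected maize/maize.trimmedReads.fasta.gz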
Assembly polishing
After assembly, align the raw PacBio subreads back to the assembled contigs with pbalign; Arrow will then use these alignments for consensus polishing.
$ pbalign raw_PacBio_1.subreads.bam maize.contigs.fasta raw_PacBio_1.subreads_aligned.bam
Optionally, the number of CPU threads (--nproc <int>) can be set.
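If there are multiple subreads BAM files, each one must be aligned to the assembly separately. A minimal sketch using a shell loop (the --nproc value is illustrative), producing one aligned BAM per SMRT cell with the file names used in the merging step below:
$ for bam in raw_PacBio_*.subreads.bam; do
      pbalign --nproc 16 ${bam} maize.contigs.fasta ${bam%.subreads.bam}.subreads_aligned.bam
  done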
If the users have multiple bam files, they can use sambamba to merge those aligned bam files into one. For instance, merge nine aligned bam files into one, as follows:
$ sambamba merge raw_PacBio.subreads_aligned_merged.bam \
raw_PacBio_1.subreads_aligned.bam \
raw_PacBio_2.subreads_aligned.bam \
raw_PacBio_3.subreads_aligned.bam \
raw_PacBio_4.subreads_aligned.bam \
raw_PacBio_5.subreads_aligned.bam \
raw_PacBio_6.subreads_aligned.bam \
raw_PacBio_7.subreads_aligned.bam \
raw_PacBio_8.subreads_aligned.bam \
raw_PacBio_9.subreads_aligned.bam
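Merging can be parallelized with sambamba's -t option, and a shell glob can be used instead of listing each file; a minimal sketch (the thread count is illustrative):
$ sambamba merge -t 16 raw_PacBio.subreads_aligned_merged.bam \
      raw_PacBio_*.subreads_aligned.bam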
Before polishing the assembled genome sequence, the reference genome should be indexed with samtools faidx.
$ samtools faidx maize.contigs.fasta
Run the variantCaller command line tool to call Arrow, either on the merged, aligned bam file (if the users have multiple bam files), or on a single aligned bam file (if the users have only one). The following command calls Arrow on the merged, aligned bam file.
$ variantCaller --algorithm=arrow raw_PacBio.subreads_aligned_merged.bam \
--referenceFilename maize.contigs.fasta \
-j 32 \
-o Maize.contigs.polished.arrow.fastq \
-o Maize.contigs.polished.arrow.fasta \
-o Maize.contigs.polished.arrow.gff
where --algorithm sets the algorithm to Arrow, --referenceFilename provides the file name of the assembled genome FASTA file, -j optionally sets the number of threads, and -o sets the output files. The users can generate multiple outputs in different formats, including FASTA, FASTQ, GFF, and VCF.
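If only one SMRT cell was sequenced, the same command can be run directly on the single aligned bam file; a minimal sketch:
$ variantCaller --algorithm=arrow raw_PacBio_1.subreads_aligned.bam \
      --referenceFilename maize.contigs.fasta \
      -j 32 \
      -o Maize.contigs.polished.arrow.fasta
Next, high-accuracy Illumina short reads are used for a further round of polishing: run ntHits on the paired-end Illumina reads to build a Bloom filter of non-error kmers, which ntEdit will then use to correct the Arrow-polished assembly.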
$ nthits -c 2 --outbloom -p maize -b 36 -k 25 -t 8 \
maize.R1.pair.fq maize.R2.pair.fq
where -c sets the coverage threshold for reporting kmers. We recommend setting -c to 1 for low-coverage Illumina short-read data (<20 x), to 2 for 20–30 x coverage, or running with the --solid option for high-coverage data (>30 x), to report non-error kmers. The option --outbloom outputs the coverage-thresholded kmers in a Bloom filter, and the option -p sets the prefix of the output file name (with the settings above, the ntHits output is named maize_k25.bf). The Bloom filter bit size is defined by the option -b (-b 36 keeps the Bloom filter false-positive rate low, ~0.0005), and the kmer size can be adjusted with the option -k. Optionally, the number of CPUs can be set (-t <int>). The input can be two paired-end fastq files, or a file listing the paths to all paired-end fastq files.
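For high-coverage Illumina data, a minimal sketch of the --solid variant mentioned above (all other options as before):
$ nthits --solid --outbloom -p maize -b 36 -k 25 -t 8 \
      maize.R1.pair.fq maize.R2.pair.fq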
Then, ntEdit polishes the Arrow-polished contigs of the assembled genome, using the Bloom filter generated by ntHits.
$ ntedit -f Maize.contigs.polished.arrow.fasta \
-r maize_k25.bf -k 25 -b Maize.contigs.polished.arrow.ntedit -t 24
where -f is the users’ assembled genome input, -r sets the Bloom filter file generated by ntHits, -k sets the kmer length, and -b sets the output file prefix (with the settings above, the ntEdit output is named Maize.contigs.polished.arrow.ntedit_edited.fa). Optionally, the number of CPUs can be set (-t <int>).
After genome assembly and polishing, it is necessary to check the completeness and duplication level of the assembly. BUSCO is a commonly used tool to assess the completeness of a genome assembly (Simão et al., 2015); check for the newest version at https://busco.ezlab.org/. The users can run BUSCO with the following command line:
$ run_BUSCO.py -i Maize.contigs.polished.arrow.ntedit_edited.fa \
-o PacBio_assembly.BUSCO -m geno -sp maize -l embryophyta_odb9
where -i is the assembled fasta sequence, -o is the output file name, -m sets the mode for BUSCO (geno or genome for genome assemblies, tran or transcriptome for transcriptome assemblies, or prot or proteins for annotated gene sets), -sp sets the Augustus species model, and -l is the lineage dataset used as the reference for comparison (the lineage dataset embryophyta_odb9 is appropriate for maize). The lineage data can be downloaded from the BUSCO website.
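Optionally, BUSCO can be run on multiple CPUs with the -c <int> option; a minimal sketch (the thread count is illustrative):
$ run_BUSCO.py -i Maize.contigs.polished.arrow.ntedit_edited.fa \
      -o PacBio_assembly.BUSCO -m geno -sp maize -l embryophyta_odb9 -c 16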
BUSCO assesses the completeness of the assembled genome by quantitatively checking it against evolutionarily informed expectations of gene content, based on near-universal single-copy orthologs. BUSCO reports the expected BUSCO genes in different categories: C:complete [S:single-copy, D:duplicated], F:fragmented, and M:missing. The results are reported as absolute numbers, as well as percentages of the total BUSCO genes. In the above example, BUSCO analyzed the completeness of the assembled maize genome in terms of complete single-copy, complete duplicated, fragmented, and missing BUSCOs, using a plant-specific database (embryophyta_odb9) consisting of 1440 BUSCO groups from 30 species. For model species, a good genome assembly should have a complete BUSCO score above 95%. The complete duplicated BUSCOs reflect duplications in the users' assembly. A large number of fragmented and missing BUSCOs indicates that the assembly is fragmented and does not cover the entire genome.
Example of the protocol application – sweet corn genome assembly
The dataset from our original publication (Hu et al., 2021) is used to describe this protocol. In summary, the sweet corn (Zea mays L.) inbred line Ia453 carrying the sh2-R allele (Ia453-sh2) was sequenced. DNA from 1-week-old etiolated seedlings was extracted using a modified CTAB method for PacBio sequencing. Large-insert (20 kb) SMRTbell libraries were sequenced using a PacBio Sequel system. DNA extracted from the same sample was used to build standard 300-bp Illumina libraries. All Illumina libraries were sequenced with 150 bp paired-end reads.
For the sweet corn genome assembly (Hu et al., 2021), around 19.9 million PacBio SMRT subreads were error-corrected and assembled using Canu version 1.8. The correction phase of Canu was run with default parameters, except that the minimum read length (minReadLength) was set to 5000, to only correct reads longer than 5 kb, and the coverage in corrected reads (corOutCoverage) was set to 60, to obtain more corrected reads. With the read length cutoff of 5 kb and read coverage of 60 x, around 12.3 million reads comprising 134 Gb (58.26 x coverage) were used for Canu correction. After correction, 9.8 million reads comprising 102.5 Gb remained. The trimming and assembly phases of Canu were run with the default parameters: rawErrorRate=0.300, correctedErrorRate=0.045, corMhapSensitivity=normal, corMinCoverage=4, corOutCoverage=40, minOverlapLength=500, and minReadLength=1000. After trimming, around 1.08 million reads remained unchanged (no trimming), 8.64 million reads were trimmed, and 30,678 and 62,457 reads were deleted, due to having no overlaps or a short trimmed length, respectively. The remaining 9.7 million reads, comprising 98.6 Gb (42.86 x coverage), were used for unitig construction. Finally, the assembly phase of Canu generated consensus sequences for the unitigs. The general statistics are shown in Table 1.
Table 1. The summary statistics of the sweet corn Ia453-sh2 assembly.
Genomic feature | Assembly
Length of genome assembly (bp) | 2,258,407,602
Max contig length (bp) | 2,583,225
Contig N50 (bp) | 385,558
Contig N90 (bp) | 62,492
Number of contigs | 15,550
Genome polishing
To improve the accuracy of the genome assembly, Arrow was used to correct sequencing errors with default parameters. A total of 1,573,052 bases, including 999,948 bases of insertions, 57,359 bases of deletions, and 515,745 bases of substitutions, were corrected.
Then, we ran ntEdit to further polish the genome assembly, using ~23 x coverage of paired-end Illumina whole-genome sequencing data. ntHits was first run with the parameters “-k 25 -c 2” to build a Bloom filter, which was then read by ntEdit to polish the assembly with default parameters. A total of 832,323 corrections were made, of which 31.29% were SNPs and 68.7% were small indels (2–25 bp).
Quality assessment
The completeness of the assembled genome sequence was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2. The assembly was tested against the plant BUSCO “embryophyta_odb9” database, which contains 1,440 protein sequences and orthogroup annotations for major clades. BUSCO analysis showed that 94.6% (1,363), 1.11% (18), and 4.09% (59) of the plant BUSCO genes are present in the Canu-assembled Ia453-sh2 genome as complete, fragmented, and missing genes, respectively. Of the 94.6% complete genes, 88.05% were single-copy and 6.59% were duplicated.
Result interpretation
To assemble the Ia453-sh2 genome, 150.5 Gb (~70-fold coverage; 19.9 million reads) of PacBio single-molecule long reads were self-corrected and assembled with Canu, generating 15,550 contigs with an N50 of 0.39 Mb (Table 1). The quality and completeness of the Ia453-sh2 genome were evaluated with BUSCO. The BUSCO results are similar to those obtained for field corn reference genomes, such as B73 v4 (Jiao et al., 2017), W22 (Springer et al., 2018), and Mo17 (Sun et al., 2018), indicating similar contiguity and completeness of our assembly.
Discussion
This protocol focuses on how to assemble a maize genome with traditional PacBio long reads using Canu version 1.8. However, for large and complex plant genomes, the assembly produced by Canu is usually fragmented, and additional scaffolding methods are required to improve the genome assembly. In our recent sweet corn genome assembly study (Hu et al., 2021), two other data sources, BioNano optical maps and Dovetail Hi-C mapping, were used to generate a high-quality and complete genome assembly. BioNano optical maps dramatically improved the genome assembly by anchoring the 15,550 PacBio contigs into 29 super-scaffolds and 8,486 unscaffolded contigs, increasing the N50 from 0.39 Mb to 120.9 Mb. To further anchor and orient the super-scaffolds and unscaffolded contigs into pseudochromosomes, Dovetail Hi-C mapping was used for scaffolding through a hierarchical clustering strategy. The final assembly has a genome length of 2.29 Gb and contains 10 pseudochromosomes with a total length of 2.11 Gb, as well as 8,440 unassigned contigs with a total length of 177.23 Mb. Therefore, single-molecule real-time (SMRT) long-read sequencing, combined with BioNano optical mapping and Dovetail Hi-C mapping technologies, allowed us to assemble a high-quality reference genome of sweet corn. For a detailed description of the methods and parameters for BioNano optical mapping and Dovetail Hi-C, the users are referred to our sweet corn study (Hu et al., 2021).
Acknowledgments
This work was supported by the National Institute of Food and Agriculture (SCRI 2018-51181-28419 to M.F.R.R.).
Competing interests
The authors declare no conflicts of interest.
References
Supplementary information