Raw PacBio reads were error-corrected and assembled into contigs using CANU47 (v1.7.1) with default parameters, except that ‘OvlMerThreshold’ and ‘corOutCoverage’ were set to 500 and 200, respectively. PacBio reads were then aligned to the contigs and, based on the alignments, errors in the assembled contigs were corrected using the Arrow program implemented in SMRT-link-5.1 (PacBio). The Illumina paired-end reads were processed to remove adaptor and low-quality sequences using Trimmomatic48 (v0.36). The cleaned Illumina reads were aligned to the contigs using BWA-MEM49 (v0.7.17) with default parameters and, based on the alignments, two rounds of iterative error correction were performed using Pilon50 (v1.22) with parameters ‘--fix bases --diploid’. The final error-corrected contigs were then compared against the NCBI non-redundant nucleotide database, and those with more than 95% of their length similar to sequences of organelles (mitochondrion or chloroplast) or microorganisms (bacteria/fungi/viruses) were considered contaminants and discarded. The Redundans pipeline51 (v0.14a) was then used to remove redundancies in the assembled contigs with parameters ‘--identity 0.99 --overlap 0.97’.
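The contaminant filter described above can be illustrated with a minimal Python sketch. It is not the authors' script: the text does not state how the fraction of contig length was computed, so the sketch assumes that alignment intervals on the contig (for example, query start and end columns from tabular BLAST output) have already been restricted to organelle or microbial subjects, and that overlapping intervals are merged before the covered length is compared with the 95% cutoff. The function names `covered_length` and `is_contaminant` are hypothetical.

```python
from typing import Iterable, Tuple

# 1-based, inclusive alignment coordinates on the contig (assumed convention)
Interval = Tuple[int, int]


def covered_length(intervals: Iterable[Interval]) -> int:
    """Return the number of contig bases covered by at least one interval."""
    total = 0
    current_start = current_end = None
    # Normalise reversed hits (start > end), then sweep in coordinate order.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if current_end is None or start > current_end + 1:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def is_contaminant(contig_length: int,
                   hits: Iterable[Interval],
                   min_fraction: float = 0.95) -> bool:
    """Flag a contig when more than min_fraction of its length is covered
    by hits to organelle or microbial sequences (hits are pre-filtered)."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return covered_length(hits) / contig_length > min_fraction
```

For a 1,000-bp contig with hits at 1-600 and 500-980, the merged coverage is 980 bp (98%), so `is_contaminant(1000, [(1, 600), (500, 980)])` flags the contig; a contig covered over exactly 95% of its length is retained because the comparison is strict.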
To scaffold the assembled contigs, Illumina reads from the Hi-C library were processed with Trimmomatic48 (v0.36) to remove adaptor and low-quality sequences. The cleaned Hi-C reads were aligned to the assembled contigs and the alignments were filtered using the Arima-HiC mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline). Based on the alignments, the contigs were clustered into pseudomolecules using SALSA52 (v2.2) with parameters ‘-e GATC -i 3’. In parallel, contigs of LA2093 were also assembled into pseudomolecules by comparing them with the Heinz1706 reference genome20 (version 4.0) using RaGOO6 (v1.1). Inconsistencies between the pseudomolecules constructed using the Hi-C data and those constructed using synteny with the Heinz1706 genome were identified. The mis-joined scaffolds were manually corrected based on the Hi-C contact information, genome synteny information, and a genetic map constructed from a recombinant inbred line (RIL) population with LA2093 as one of the parents16, resulting in a consensus set of LA2093 pseudomolecules. Finally, the genetic map was also used to validate the final consensus set of LA2093 pseudomolecules using ALLMAPS53 (v0.8.12). Inconsistencies between the LA2093 pseudomolecules and the genetic map were also manually checked, and the accuracy of the LA2093 pseudomolecules was further validated using PacBio read alignment information.
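One way to screen for inconsistencies between the Hi-C-based and synteny-based pseudomolecules is sketched below. This is an illustrative assumption, not the procedure used here: it takes two hypothetical contig-level lookup tables (contig to Hi-C scaffold, and contig to reference chromosome) and reports Hi-C scaffolds whose contigs are assigned to more than one reference chromosome. It only detects inter-chromosomal conflicts; differences in contig order or orientation within a chromosome would need a separate check.

```python
from collections import defaultdict
from typing import Dict, Set


def find_conflicting_scaffolds(hic_scaffold_of: Dict[str, str],
                               ref_chrom_of: Dict[str, str]) -> Dict[str, Set[str]]:
    """Return Hi-C scaffolds whose contigs map to more than one reference chromosome.

    hic_scaffold_of: contig name -> Hi-C scaffold name (hypothetical input)
    ref_chrom_of:    contig name -> reference chromosome from synteny (hypothetical input)
    """
    chroms_per_scaffold = defaultdict(set)
    for contig, scaffold in hic_scaffold_of.items():
        chrom = ref_chrom_of.get(contig)
        # Contigs without a synteny-based placement carry no evidence either way.
        if chrom is not None:
            chroms_per_scaffold[scaffold].add(chrom)
    # Keep only scaffolds that mix contigs from different reference chromosomes.
    return {scaffold: chroms
            for scaffold, chroms in chroms_per_scaffold.items()
            if len(chroms) > 1}


# Toy example with made-up contig and scaffold names.
hic = {"ctg1": "scaffold_1", "ctg2": "scaffold_1",
       "ctg3": "scaffold_2", "ctg4": "scaffold_2", "ctg5": "scaffold_2"}
ref = {"ctg1": "chr01", "ctg2": "chr01", "ctg3": "chr02", "ctg4": "chr05"}
conflicts = find_conflicting_scaffolds(hic, ref)
```

In the toy example, only `scaffold_2` is reported, because its contigs are split between `chr02` and `chr05`; such candidates would then be inspected manually against the Hi-C contact information and the genetic map.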