We followed a hybrid assembly pipeline that combines short and long reads to assemble and annotate the genome (similar to our previous work; Wu et al. 2019). We generated the genome assembly with the MaSuRCA v3.2.2 pipeline (Zimin et al. 2017), in which Illumina paired-end reads from one male plant (121× coverage from 670-1190) and one female plant (116× coverage from 670-34) were used together to trim and correct the low base-call-accuracy long reads generated by PacBio sequencing from the same two plants (19× coverage from the male 670-1190 and 21× coverage from the female 670-34). The MaSuRCA assembler has been shown to work well with heterozygous genomes, splitting highly divergent regions into separate contigs while attempting to combine regions with up to approximately 6% divergence (Zimin et al. 2013).
Before assembly, genome size was estimated separately from the male and female reads based on the k-mer abundance distribution, using GenomeScope (Vurture et al. 2017) with a k-mer length of 25 bp and a maximum k-mer coverage of 10,000. After the initial assembly, all assembled scaffold sequences were aligned against bacterial, archaeal, fungal, and human databases using DeconSeq v0.4.3 (Schmieder and Edwards 2011) to remove potential contaminants from our assembly.
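To illustrate the idea behind estimating genome size from a k-mer abundance distribution, the following is a minimal Python sketch. It is not the GenomeScope method, which fits a mixture model to the full k-mer profile; it uses the simpler peak-based approximation (total k-mers divided by the coverage at the main peak). The function name, the `min_cov` error cutoff, and the toy histogram are assumptions made for illustration only; the `max_cov` default mirrors the maximum k-mer coverage of 10,000 stated above. In a highly heterozygous genome the tallest peak may be the heterozygous one, in which case this simple estimate would be inflated.

```python
def estimate_genome_size(histogram, min_cov=5, max_cov=10_000):
    """Rough genome size estimate from a k-mer abundance histogram.

    histogram maps k-mer coverage (int) to the number of distinct k-mers
    observed at that coverage. min_cov is a hypothetical cutoff below which
    k-mers are treated as sequencing errors; max_cov caps the coverage
    considered, as with the maximum k-mer coverage setting.
    """
    # Keep only coverages inside the window [min_cov, max_cov].
    kept = {c: n for c, n in histogram.items() if min_cov <= c <= max_cov}
    if not kept:
        raise ValueError("no k-mers within the coverage window")

    # Coverage value at which the most distinct k-mers occur (main peak).
    peak_cov = max(kept, key=kept.get)

    # Total number of k-mer occurrences in the retained window.
    total_kmers = sum(c * n for c, n in kept.items())

    # Peak-based approximation: occurrences divided by per-copy coverage.
    return total_kmers / peak_cov


# Toy histogram (made-up numbers): an error tail at coverage 1-2,
# a main peak at 20, and a small repeat peak at 40.
toy_hist = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100, 40: 50}
size = estimate_genome_size(toy_hist)
```

Running the estimate separately on the male and female histograms, as described above, would give two independent values that can be compared.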
We evaluated the completeness of the genome assembly using 1,515 plant near-universal single-copy orthologs within BUSCO v3 (Simão et al. 2015). To provide an estimate of assembly accuracy at the nucleotide level, we calculated a quality score for every position in the genome assembly using the program Referee (Thomas and Hahn 2019). Referee calculates the log-ratio of the sum of genotype likelihoods for the genotypes that contain the reference base (e.g., [A, A], [A, T], [A, C], and [A, G] for reference base “A”) to the sum of those for the genotypes that do not contain the reference base (e.g., [T, T], [T, C], [T, G], [C, C], [C, G], and [G, G] for reference base “A”). The input to Referee was the output pileup file from ANGSD (Korneliussen et al. 2014), which precalculated genotype likelihoods at each base of the genome assembly. Two genotype likelihood scores were calculated separately for every position in the genome assembly, one based on the BAM file of aligned Illumina reads from the male (770-34) and one based on the BAM file of aligned Illumina reads from the female (670-1190).
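The log-ratio described above can be sketched in a few lines of Python. This is only an illustration of the ratio itself, not Referee's implementation: the function name is hypothetical, the likelihoods are assumed to be on a linear (not log) scale, the base-10 logarithm is an assumption, and any scaling or special-case handling that Referee applies is not reproduced. The example likelihood values are made up.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = [a + b for a, b in combinations_with_replacement(BASES, 2)]


def ref_log_ratio(likelihoods, ref_base):
    """Log10 ratio of summed likelihoods: genotypes with vs. without ref_base.

    likelihoods maps each of the 10 genotypes (e.g. "AC") to a linear-scale
    genotype likelihood for one position.
    """
    with_ref = sum(likelihoods[g] for g in GENOTYPES if ref_base in g)
    without_ref = sum(likelihoods[g] for g in GENOTYPES if ref_base not in g)

    # Guard the degenerate cases so the logarithm is never taken of zero.
    if without_ref == 0:
        return math.inf
    if with_ref == 0:
        return -math.inf
    return math.log10(with_ref / without_ref)


# Made-up likelihoods for a position where the reads strongly support "A".
example = {g: 0.001 for g in GENOTYPES}
example.update({"AA": 0.9, "AC": 0.03, "AG": 0.03, "AT": 0.03})

score_a = ref_log_ratio(example, "A")  # positive: reads support the reference
score_c = ref_log_ratio(example, "C")  # negative: reads contradict the reference
```

A positive value indicates that the reads favor genotypes containing the reference base, and a negative value flags a position where the assembled base is poorly supported. Computing the score once per BAM file yields the two per-position scores described above.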