Genome Assembly and Quality Assessment

Meng Wu; David C Haak; Gregory J Anderson; Matthew W Hahn; Leonie C Moyle; Rafael F Guerrero

This method is extracted from research article: Mol Biol Evol, Mar 2021

Inferring the Genetic Basis of Sex Determination from the Genome of a Dioecious Nightshade

DOI: 10.1093/molbev/msab089

We followed a hybrid assembly pipeline that combines short and long reads to assemble and annotate the genome (similar to our previous work; Wu et al. 2019). We used the MaSuRCA v3.2.2 pipeline (Zimin et al. 2017) to generate the genome assembly. In this pipeline, Illumina paired-end reads from one male plant (121× coverage from 670-1190) and one female plant (116× coverage from 670-34) were used together to trim and correct the long reads, which have low base-call accuracy, generated by PacBio sequencing of the same two plants (19× coverage from the male 670-1190 and 21× coverage from the female 670-34). The MaSuRCA assembler has been shown to work well with heterozygous genomes, splitting highly divergent regions into separate contigs while attempting to combine regions with up to approximately 6% divergence (Zimin et al. 2013).

Genome size was estimated before assembly from the k-mer abundance distributions, computed separately for the male and female reads, using GenomeScope (Vurture et al. 2017) with a k-mer length of 25 bp and a maximum k-mer coverage of 10,000. After the initial assembly, all assembled scaffold sequences were aligned against bacterial, archaeal, fungal, and human databases using DeconSeq v0.4.3 (Schmieder and Edwards 2011) to remove potential contaminants from the assembly.
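To make the idea of a k-mer abundance distribution concrete, the following is a minimal toy sketch in Python. It is not how GenomeScope works: GenomeScope fits a mixture model to a histogram produced by a dedicated k-mer counter. The function names, the `error_cutoff` parameter, and the peak-based size estimate are illustrative assumptions, and the way the maximum-coverage cutoff is applied here (dropping k-mers above it) is also an assumption. Only the default values k = 25 and a maximum coverage of 10,000 come from the text.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=25, max_cov=10_000):
    """Map each abundance value to the number of distinct canonical k-mers with it.

    K-mers containing non-ACGT characters are skipped. K-mers whose abundance
    exceeds max_cov are dropped (an assumed way of applying the cutoff).
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1

    hist = Counter()
    for abundance in counts.values():
        if abundance <= max_cov:
            hist[abundance] += 1
    return dict(sorted(hist.items()))


def naive_genome_size(hist, error_cutoff=1):
    """Rough size estimate: total k-mer occurrences divided by the peak abundance.

    Abundances at or below error_cutoff are ignored as likely sequencing errors.
    This is a simplification and ignores heterozygosity and repeats.
    """
    usable = {c: n for c, n in hist.items() if c > error_cutoff}
    if not usable:
        raise ValueError("no k-mers above the error cutoff")
    peak = max(usable, key=usable.get)  # abundance shared by the most k-mers
    total_kmers = sum(c * n for c, n in usable.items())
    return total_kmers / peak
```

With real data, the histogram for each plant would be computed from that plant's reads only, matching the separate male and female estimates described above.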

We evaluated the completeness of the genome assembly using 1,515 plant near-universal single-copy orthologs within BUSCO v3 (Simão et al. 2015). To estimate assembly accuracy at the nucleotide level, we calculated a quality score for every position in the genome assembly using the program Referee (Thomas and Hahn 2019). Referee calculates the log-ratio of the sum of genotype likelihoods for the genotypes that contain the reference base (e.g., [A, A], [A, T], [A, C], and [A, G] for reference base “A”) to the sum for those that do not contain the reference base (e.g., [T, T], [T, C], [T, G], [C, C], [C, G], and [G, G] for reference base “A”). The input for the Referee calculation was the pileup output file from ANGSD (Korneliussen et al. 2014), which precalculated genotype likelihoods at each base of the genome assembly. Two sets of genotype likelihood scores for every position in the genome assembly were calculated separately, one from the BAM file of aligned Illumina reads from the male (670-1190) and one from the BAM file of aligned Illumina reads from the female (670-34).
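The log-ratio described above can be sketched for a single position as follows. This is an illustration of the ratio as stated in the text, not Referee's actual code. The function name, the use of linear-scale likelihoods as input, the base-10 logarithm, and the handling of zero sums are all assumptions, and Referee may apply additional scaling.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = [a + b for a, b in combinations_with_replacement(BASES, 2)]


def reference_log_ratio(ref_base, likelihoods):
    """Log10 ratio of summed likelihoods for genotypes with vs. without ref_base.

    `likelihoods` maps each genotype in GENOTYPES to a linear-scale (not log)
    likelihood. Positive values favor the reference base being present.
    """
    with_ref = sum(likelihoods[g] for g in GENOTYPES if ref_base in g)
    without_ref = sum(likelihoods[g] for g in GENOTYPES if ref_base not in g)
    if with_ref == 0 and without_ref == 0:
        raise ValueError("all genotype likelihoods are zero")
    if without_ref == 0:
        return math.inf   # no support for any genotype lacking the reference base
    if with_ref == 0:
        return -math.inf  # no support for any genotype containing the reference base
    return math.log10(with_ref / without_ref)
```

For reference base “A”, four genotypes contribute to the numerator and six to the denominator, so equal likelihoods across all ten genotypes give a slightly negative score, log10(4/6).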

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
