2.2. Dataset Criteria and Selection

Randy Ortiz; Priyanka Gera; Christopher Rivera; Juan C. Santos



This method is extracted from research article: Genes (Basel), Jun 2021

Pincho: A Modular Approach to High Quality De Novo Transcriptomics

DOI: 10.3390/genes12070953


We analyzed eight distinct non-model datasets from the SRA ([53]; Table 2). We focused on hyloid anurans (frogs), which have complex and usually large genomes (e.g., ~6.76 Gb for Dendrobates pumilio [54]). Data were chosen according to the following criteria: (a) publicly sourced RNA-seq data, (b) paired-end reads of various insert sizes (Table 2), (c) fastq format, (d) Illumina sequencing, (e) non-model organisms, (f) a base count lower than 2 Gb, and (g) a complete BUSCO score greater than 50% in Pincho’s rapid assessment. The rapid assessment consists of downloading the SRR raw reads with fasterq-dump, removing Illumina adaptors from the raw data with Trimmomatic if necessary, assembling the reads via succinct de Bruijn graphs with MEGAHIT, and assessing the assembly via BUSCO scores. The chosen SRA files were analyzed with FastQC [55], which revealed that all files were adapter free.
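Criteria (d), (b), (f) and (g) above can be expressed as a simple filter. The following is a minimal sketch, not Pincho's own code: the `SraRun` record, its field names, and the functions `complete_busco` and `passes_criteria` are hypothetical, and the sketch assumes the complete-BUSCO percentage is available as a BUSCO short-summary line of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`.

```python
import re
from dataclasses import dataclass

# Thresholds taken from criteria (f) and (g) in the text.
MAX_BASES = 2_000_000_000   # base count must be lower than 2 Gb
MIN_COMPLETE_BUSCO = 50.0   # complete BUSCO must be greater than 50%


@dataclass
class SraRun:
    """Hypothetical record for one SRA run; field names are illustrative."""
    accession: str
    platform: str       # e.g. "ILLUMINA"
    layout: str         # e.g. "PAIRED" or "SINGLE"
    bases: int          # total base count of the run
    busco_summary: str  # assumed BUSCO short-summary line


def complete_busco(summary: str) -> float:
    """Extract the complete (C) percentage from a BUSCO summary line."""
    match = re.search(r"C:(\d+(?:\.\d+)?)%", summary)
    if match is None:
        raise ValueError(f"no complete-BUSCO field in {summary!r}")
    return float(match.group(1))


def passes_criteria(run: SraRun) -> bool:
    """Apply criteria (d), (b), (f) and (g) to a single run."""
    return (
        run.platform.upper() == "ILLUMINA"
        and run.layout.upper() == "PAIRED"
        and run.bases < MAX_BASES
        and complete_busco(run.busco_summary) > MIN_COMPLETE_BUSCO
    )
```

A run would be kept only if `passes_criteria(run)` returns `True`. Criteria (a), (c) and (e) concern data provenance, file format and organism, and are assumed here to be checked when the candidate runs are collected.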

Table 2. Test NGS datasets from the NCBI SRA database.

¹ Complete BUSCO score using Pincho’s rapid assessment at default settings. ² Olfactory bulb.

Our datasets are purposely below the standard yield of RNA-seq experiments (2–4 GB) to highlight the potential of the selected assemblers on low-yield, low-coverage datasets. Because higher sequencing coverage leads to higher quality NGS data [56], we chose NGS data that are most likely to have low sample coverage owing to low read counts [57]. We selected smaller files, averaging 6.88 M reads, well beneath the recommended sequencing read number of 20 M [56], to ensure an NGS scenario of low coverage. As a balance, we made sure that all files scored above 50% in complete BUSCO to avoid scenarios where read coverage was insufficient. Low-coverage datasets are prone to many types of assembly errors (i.e., fragmentation and incompleteness [32]), which allows us to test the various algorithms employed by the transcriptome assemblers and their ability to work with problematic datasets. Only under this scope can we view assembler performance and synergy without relying on synthetic data. We expect that if assemblers succeed at reconstructing more from smaller datasets, then they are sensitive enough to use on larger datasets as well.
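The two design constraints described above (a mean read count well below the recommended 20 M, and every file above the 50% complete BUSCO floor) can be checked for a candidate set of runs. This is a minimal sketch under stated assumptions: the function `summarize_selection` and its input format are hypothetical, and any read counts passed to it in an example are placeholders, not values from Table 2.

```python
from statistics import mean

RECOMMENDED_READS = 20_000_000  # recommended read number cited in the text [56]
MIN_COMPLETE_BUSCO = 50.0       # complete BUSCO floor used in the text


def summarize_selection(runs):
    """Summarize a candidate set of runs.

    `runs` is a list of (read_count, complete_busco_percent) tuples;
    this input format is an assumption made for illustration.
    """
    avg_reads = mean(reads for reads, _ in runs)
    return {
        "mean_reads": avg_reads,
        # Share of the recommended depth that the average run reaches.
        "fraction_of_recommended": avg_reads / RECOMMENDED_READS,
        # True when the set represents a low-coverage scenario.
        "low_coverage": avg_reads < RECOMMENDED_READS,
        # True when no run falls at or below the BUSCO floor.
        "all_above_busco_floor": all(b > MIN_COMPLETE_BUSCO for _, b in runs),
    }
```

For the mean reported in the text, 6.88 M reads against the recommended 20 M gives 6.88 / 20 ≈ 0.34, i.e., roughly one third of the recommended depth.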

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
