Published: Jul 5, 2022 DOI: 10.21769/BioProtoc.4454
Edited by: Jinfeng Chen Reviewed by: Prashanth N Suravajhala, Guotian Li
Abstract
Quality control and preprocessing of sequences are essential before analyzing high-throughput sequence data. After raw reads are generated on high-throughput sequencing platforms, quality control and preprocessing should be applied to produce clean data for subsequent bioinformatic analysis. Different tools have been developed for this, such as FastQC, iTools, fastp, cutadapt, and FASTX. However, these tools can be difficult for first-time users. To address this, transcriptome data from Illumina HiSeq 2000 paired-end sequencing of the model plant Arabidopsis thaliana were used as a practical case to show the functions and usage of these tools, which are widely used and offer good performance, wide applicability, high speed, and low system requirements. For example, FastQC provides a modular set of quality control analyses and gives a quick overview of the areas in which there may be problems. iTools integrates algorithms and abundant sub-functions and provides a solid foundation for special demands. Cutadapt finds and removes adapter sequences, primers, poly-A tails, and other types of unwanted sequences from high-throughput sequencing reads. Fastp provides fast all-in-one preprocessing for FastQ files. FASTX provides a series of functions for preprocessing reads before mapping them to the genome, manipulating the sequences to produce better mapping results. Although these tools are widely used and perform well on short reads from next-generation sequencing, their applicability to long reads generated by third-generation sequencing is limited, except for FastQC for quality control. The commands used in this study help new learners understand these tools.
Graphical abstract:
The pipeline for quality control and preprocessing of sequencing reads.
Background
As next-generation sequencing technology is widely used, quality control and preprocessing of sequencing data are needed. Low data quality may result from adapter contamination, base content biases, overrepresented sequences, and errors in library preparation or sequencing steps. Quality control and preprocessing are effective ways to remove possible sequencing errors. Several relevant tools have been developed. For example, FastQC (Andrews, 2014) provides per-base and per-read quality profiling features. Cutadapt is used as an adapter trimmer (Martin, 2011); Trimmomatic is another adapter-trimming tool (Bolger et al., 2014). FASTX-Toolkit is a collection of Linux command line tools for processing FASTQ files (Gordon and Hannon, 2010). iTools calculates the quality scores of fastq files and includes multiple functions for analyzing other data formats (He et al., 2013). Fastp is an ultra-fast preprocessor, which can perform quality control, adapter trimming, quality filtering, and other functions (Chen et al., 2018). Cutadapt meets the needs of users who only have to remove adapters from reads. Fastp can perform all aspects of preprocessing in one step, such as adapter trimming, base correction, overlapping analysis, polyG tail trimming, sliding window cutting, and global trimming. FASTX provides many command line tools for preprocessing, such as the fastq-to-fasta converter, fastq/a collapser, fastq/a renamer, fastq/a reverse-complement, fastq quality changer, and fastq quality trimmer. Users can run these tools with self-defined parameters.
In this article, we show how to use these common tools for quality control and preprocessing of sequencing reads. Some of them share functions, and others have specific functions. Users can choose the tools to use based on the specific demands of their analysis. As the fastq format is universal for sequencing data, we use fastq data as the input for all tools. As Ubuntu is a widely used Linux distribution, it is used here.
Equipment
Computer (OS: a Linux distribution, such as CentOS or Ubuntu; we recommend at least 16 GB RAM and multiple cores)
Software
FastQC (Andrews, 2014)
FastQC is a program designed to spot potential problems in high-throughput sequencing datasets. It requires Java; if Java is not installed, you can install it as follows:
Ubuntu: sudo apt install default-jre
The software is available for download at https://codeload.github.com/s-andrews/FastQC/zip/refs/heads/master (last accessed date: 31/10/2021) or can be cloned with git (terminal command: git clone https://github.com/s-andrews/FastQC.git) (last accessed date: 31/10/2021). To install FastQC, unzip the zip file. A wrapper script called ‘fastqc’ is included in the top-level directory of the FastQC installation. You may need to make this file executable: chmod 755 fastqc. If you have conda installed on your computer, the easiest, recommended way is:
conda activate
conda install -c bioconda fastqc
iTools (He et al., 2013)
In this study, we use iTools to provide useful statistics of sequence data, which includes the Fqtools module to deal with fastq files. The Fqtools module provides multiple functions, as follows: a) summarizes the quality and amount of data, as well as the GC content; b) filters or trims the reads according to sequencing quality; c) removes reads contaminated with adapter sequences; and d) splits reads according to the index sequence. The software is available for download at https://github.com/BGI-shenzhen/Reseqtools/blob/master/iTools_Code20180520.tar.gz (last accessed date: 31/10/2021) and installation instructions are in the Install.Readme file.
Cutadapt (Martin, 2011)
Cutadapt searches for the adapter sequences in all reads and removes them. The software is available for download at https://codeload.github.com/jamescasbon/cutadapt/zip/refs/heads/master (last accessed date: 31/10/2021), and the git clone command is: git clone https://github.com/marcelm/cutadapt.git (last accessed date: 31/10/2021). Installation is done by: python setup.py install --user
The easiest way to install cutadapt is to use pip on the command line:
pip install --user --upgrade cutadapt
Fastp (Chen et al., 2018)
Fastp is a tool designed to provide fast all-in-one preprocessing for fastq files. The software is available for download at https://codeload.github.com/OpenGene/fastp/zip/refs/heads/master (last accessed date: 31/10/2021), and the git clone command is: git clone https://github.com/OpenGene/fastp.git (last accessed date: 31/10/2021). Installation instructions are in the README.md file. Another way is to install fastp with conda:
conda install -c bioconda fastp
FASTX (Gordon and Hannon, 2010)
The FASTX-Toolkit is a collection of command line tools for preprocessing short-read Fasta/FastQ files. The software is available for download at: https://codeload.github.com/agordon/fastx_toolkit/zip/refs/heads/master (last accessed date: 31/10/2021), and the git clone command is: git clone https://github.com/agordon/fastx_toolkit.git (last accessed date: 31/10/2021). Installation instructions are in the README file.
Here, we use two tools (FastQC and iTools) to perform quality control and three other tools (cutadapt, fastp, and FASTX) to preprocess sequencing reads. Command-line examples are provided to show the basic usage of these tools.
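Before running any of the tools, it can help to confirm that each one is reachable from the terminal. The following is a minimal sketch, not part of the original protocol; the executable names passed to the hypothetical helper check_tools (in particular iTools and fastx_clipper) are assumptions and may differ depending on how the tools were installed.

```shell
# Report which tools are reachable on PATH.
# check_tools is a hypothetical helper; executable names are assumptions.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # The return status is the number of missing tools (0 means all found).
    return "$missing"
}

check_tools fastqc iTools cutadapt fastp fastx_clipper \
    || echo "Install the missing tools before continuing."
```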
For FastQC, an HTML report can be generated by running FastQC in non-interactive mode. The report includes different modules, such as basic statistics, per base sequence quality, per sequence quality scores, per base sequence content, per sequence GC content, per base N content, sequence duplication levels, overrepresented sequences, and adapter content. A warning or failure icon on a module indicates some kind of systematic problem. Users can interpret FastQC reports with the help of the documentation (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
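A non-interactive FastQC run could look like the sketch below. The wrapper name run_fastqc, the output directory, the thread count, and the input file names are assumptions chosen for illustration; the exact command used by the authors is in their GitHub repository.

```shell
# Run FastQC non-interactively on one or more FASTQ files.
# $1 = output directory (created if absent); remaining arguments = FASTQ files.
run_fastqc() {
    outdir=$1
    shift
    mkdir -p "$outdir"
    # --threads 2 is an arbitrary choice; FastQC processes one file per thread.
    fastqc --outdir "$outdir" --threads 2 "$@"
}

# Example call (file names are assumptions based on the run accession in this protocol):
# run_fastqc fastqc_raw SRR2061397_1.fastq SRR2061397_2.fastq
```

For each input file, FastQC writes an HTML report and a zip archive into the output directory.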
For iTools, a txt file is generated by running iTools on the command line. This file contains several kinds of information, such as GC content, base quality distribution, and read quality distribution. The percentage of reads with a quality score of at least Q30 (99.9% base accuracy) in SRR2061397_1 and SRR2061397_2 is 93.83% and 88.24%, respectively (Table 1). The quality score is important for evaluating the reliability of reads, and a high score represents high sequence quality.
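The link between a quality score and base accuracy follows from the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. The short sketch below (the helper name phred_accuracy is hypothetical) converts a score into the corresponding accuracy.

```shell
# Convert a Phred quality score Q into base-call accuracy in percent,
# using P_error = 10^(-Q/10) and accuracy = (1 - P_error) * 100.
phred_accuracy() {
    awk -v q="$1" 'BEGIN { printf "%.2f\n", (1 - 10 ^ (-q / 10)) * 100 }'
}

phred_accuracy 20   # prints 99.00
phred_accuracy 30   # prints 99.90
```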
Table 1. Read quality distribution generated by iTools.
Quality score | SRR2061397_1 | SRR2061397_2 |
ReadQ:0--10 | 11,845 (0.12%) | 205,370 (2.09%) |
ReadQ:10--20 | 105,784 (1.08%) | 266,545 (2.71%) |
ReadQ:20--30 | 489,384 (4.98%) | 683,440 (6.96%) |
ReadQ:30--40 | 9,217,936 (93.82%) | 8,669,973 (88.24%) |
ReadQ:40--50 | 492 (0.01%) | 113 (0.00%) |
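To make the bins in Table 1 concrete, the sketch below reproduces a similar summary with awk. It assumes that a read's quality is the mean Phred score of its bases, that the file uses Phred+33 encoding, and that each record spans exactly four lines; how iTools itself defines ReadQ is not stated here, so the numbers may not match its output exactly. The function name read_quality_bins is hypothetical.

```shell
# Bin reads by mean base quality (assumption: ReadQ = mean Phred score per read).
# Assumes Phred+33 encoding and four-line FASTQ records. $1 = FASTQ file.
read_quality_bins() {
    awk '
    BEGIN {
        # Build a character-to-ASCII-code lookup table for printable characters.
        for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 0 {
        # Every fourth line holds the quality string of one read.
        n = length($0)
        sum = 0
        for (j = 1; j <= n; j++) sum += ord[substr($0, j, 1)] - 33
        if (n > 0) {
            bin = int(sum / n / 10) * 10
            count[bin]++
            total++
        }
    }
    END {
        # Print the five bins used in Table 1: label, read count, percentage.
        for (b = 0; b <= 40; b += 10)
            printf "ReadQ:%d--%d\t%d\t%.2f%%\n", b, b + 10, count[b], (total ? 100 * count[b] / total : 0)
    }' "$1"
}

# Example call (file name is an assumption):
# read_quality_bins SRR2061397_1.fastq
```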
For cutadapt, new fastq files are generated with adapters removed. Standard output shows a summary of the reads processed: 0.2% of reads contain adapters, and 6.4% of bases are removed by quality trimming (Table 2). Only bases satisfying ≥Q30 were retained in our analysis; low-quality bases (<Q30) and adapters are trimmed, and read pairs that become too short are discarded.
Table 2. Summary of cutadapt output.
Summary | |
Total read pairs processed | 9,825,441 |
Read 1 with adapter | 17,669 (0.2%) |
Read 2 with adapter | 15,187 (0.2%) |
Pairs that were too short | 639,968 (6.5%) |
Pairs written (passing filters) | 9,185,473 (93.5%) |
Quality-trimmed | 125,590,727 bp (6.4%) |
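A paired-end cutadapt call consistent with the description above might look like the following sketch. The adapter sequence (the common Illumina adapter prefix AGATCGGAAGAGC), the minimum length of 20, and the wrapper name run_cutadapt are assumptions, not values taken from this protocol; only the Q30 cutoff comes from the text.

```shell
# Trim adapters and low-quality 3' ends from paired-end reads with cutadapt.
# $1/$2 = input R1/R2 FASTQ, $3/$4 = output R1/R2 FASTQ.
# -a/-A : 3' adapter for read 1 / read 2 (assumed Illumina adapter prefix)
# -q 30 : trim 3' bases below Q30 (threshold stated in the text)
# -m 20 : discard pairs in which a read is shorter than 20 nt (assumed value)
run_cutadapt() {
    in1=$1; in2=$2; out1=$3; out2=$4
    cutadapt \
        -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
        -q 30 \
        -m 20 \
        -o "$out1" -p "$out2" \
        "$in1" "$in2"
}

# Example call (file names are assumptions):
# run_cutadapt SRR2061397_1.fastq SRR2061397_2.fastq trimmed_1.fastq trimmed_2.fastq
```

As a sanity check on the summary in Table 2, the pairs written and the pairs that were too short add up to the total processed: 9,185,473 + 639,968 = 9,825,441.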
Fastp supports automatic adapter trimming and is reported to work faster than other preprocessing tools, such as Trimmomatic or Cutadapt (Chen et al., 2018). Fastp improves the read quality to some extent (Table 3). The total reads and total bases are provided before and after filtering, together with the number of bases reaching Q20 and Q30. In total, 18,655,502 reads pass the filter. The remaining reads, which have low quality or too many N bases, are discarded.
Table 3. Summary of fastp output.
Summary of fastp output | |
Read1 before filtering: | Read1 after filtering: |
total reads: 9,825,441 | total reads: 9,327,751 |
total bases: 982,544,100 | total bases: 930,706,316 |
Q20 bases: 961,474,687 (97.8556%) | Q20 bases: 920,234,151 (98.8748%) |
Q30 bases: 918,941,057 (93.5267%) | Q30 bases: 885,474,154 (95.14%) |
Read2 before filtering: | Read2 after filtering: |
total reads: 9,825,441 | total reads: 9,327,751 |
total bases: 982,544,100 | total bases: 930,706,316 |
Q20 bases: 927,573,105 (94.4052%) | Q20 bases: 910,151,132 (97.7914%) |
Q30 bases: 869,710,432 (88.5162%) | Q30 bases: 857,626,618 (92.1479%) |
Filtering result: | |
reads passed filter: 18,655,502 | |
reads failed due to low quality: 991,868 | |
reads failed due to too many N: 3,512 | |
reads failed due to being too short: 0 | |
reads with adapter trimmed: 160,162 | |
bases trimmed due to adapters: 4,215,756 |
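A basic paired-end fastp run that produces a summary like Table 3 could be sketched as follows. The wrapper name run_fastp, the report file names, and the thread count are assumptions; all filtering options are left at fastp's defaults, which may differ from the settings used to generate the table.

```shell
# Preprocess paired-end reads with fastp using default filters.
# $1/$2 = input R1/R2 FASTQ, $3/$4 = output R1/R2 FASTQ.
# Report names and thread count are arbitrary choices for illustration.
run_fastp() {
    in1=$1; in2=$2; out1=$3; out2=$4
    fastp \
        --in1 "$in1" --in2 "$in2" \
        --out1 "$out1" --out2 "$out2" \
        --html fastp_report.html --json fastp_report.json \
        --thread 4
}

# Example call (file names are assumptions):
# run_fastp SRR2061397_1.fastq SRR2061397_2.fastq clean_1.fastq clean_2.fastq
```

The filtering summary in Table 3 is internally consistent: reads passed plus reads failed (18,655,502 + 991,868 + 3,512) equals twice the number of input pairs (2 × 9,825,441 = 19,650,882).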
FASTX provides the fastx_clipper tool to remove adapters and low-quality reads. Here, iTools shows the quality distribution of the fastq files filtered by fastx_clipper (Table 4). Compared with the data before filtering, read quality improves to some degree after filtering (Table 1 and Table 4).
Table 4. The read quality summary of fastx_clipper.
SRR2061397_1 | SRR2061397_2 |
ReadQ:0--10: 9,236 (0.10%) | ReadQ:0--10: 163,020 (1.81%) |
ReadQ:10--20: 81,765 (0.91%) | ReadQ:10--20: 210,196 (2.34%) |
ReadQ:20--30: 411,863 (4.56%) | ReadQ:20--30: 590,471 (6.57%) |
ReadQ:30--40: 8,461,300 (93.69%) | ReadQ:30--40: 7,991,767 (88.93%) |
ReadQ:40--50: 67,254 (0.74%) | ReadQ:40--50: 30,681 (0.34%) |
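A fastx_clipper call could be sketched as below. The adapter sequence, the minimum length of 20, and the wrapper name run_fastx_clipper are assumptions for illustration; the parameters used to produce Table 4 are not given in this section.

```shell
# Clip adapters from a single FASTQ file with fastx_clipper.
# $1 = input FASTQ, $2 = output FASTQ.
# -Q33 : qualities are Phred+33 encoded
# -a   : adapter sequence to clip (assumed Illumina adapter prefix)
# -l 20: discard reads shorter than 20 nt after clipping (assumed value)
# -v   : print a short summary of processed and discarded reads
run_fastx_clipper() {
    in_fq=$1; out_fq=$2
    fastx_clipper -Q33 -a AGATCGGAAGAGC -l 20 -v -i "$in_fq" -o "$out_fq"
}

# Example calls, one per mate file (file names are assumptions):
# run_fastx_clipper SRR2061397_1.fastq clipped_1.fastq
# run_fastx_clipper SRR2061397_2.fastq clipped_2.fastq
```

fastx_clipper processes one file at a time and has no paired-end mode, so the two mate files can end up with different numbers of reads. This is visible in Table 4, where the columns sum to 9,031,418 and 8,986,135 reads, and it may require re-pairing before tools that expect synchronized files.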
Discussion
Here, we demonstrated five tools for quality control and preprocessing. The examples in this study show how to use these tools. Users can explore more complex usage after learning the basic commands. Detailed information on the input, output, and sample data can be accessed on GitHub (https://github.com/Bio-protocol/Bioinformatics-Recipes-for-Plant-Genomics/tree/master).
Notes
Data used in this study are available in public databases (NCBI/EBI), and users can download them freely in different ways. Results in this study were generated by the tools described above; no additional tool was used. Because the tools share common functions, users can choose a subset of them, such as FastQC for quality control and fastp for preprocessing. After preprocessing, users can check their data quality by running FastQC again. We provide two ways (web page download or git clone in the terminal) to download the tools from GitHub. Before using conda, Anaconda should be installed (https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh). Run the installer in the terminal with the command: sh Anaconda3-2021.05-Linux-x86_64.sh. It will be installed in a default location, and the user can then activate the conda environment with the command: conda activate. Then, users can use conda to install the tools needed.
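Once conda is available, one possible way to keep the tools separate from the base installation is a dedicated environment. This is a sketch under assumptions: the environment name readqc and the helper create_qc_env are hypothetical, and only the three tools distributed through the bioconda channel that are named in this protocol's install commands or widely packaged there are listed.

```shell
# Create a dedicated conda environment containing FastQC, cutadapt, and fastp.
# $1 = environment name (defaults to the hypothetical name "readqc").
create_qc_env() {
    env_name=${1:-readqc}
    conda create --yes --name "$env_name" \
        --channel conda-forge --channel bioconda \
        fastqc cutadapt fastp
}

# Example:
# create_qc_env readqc
# conda activate readqc
```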
Acknowledgments
This work was funded by the National Science Foundation of China (No.31770333, No.31370329, and No.11631012), the Program for New Century Excellent Talents in University (NCET-12-0896), and the Fundamental Research Funds for the Central Universities (No. GK201403004). This paper describes protocols from https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (last accessed date: 31/10/2021) and other references (Gordon and Hannon, 2010; Martin, 2011; He et al., 2013; Chen et al., 2018).
Competing interests
The authors declare no competing interests.
References
Supplementary information
Data and code availability: All data and code have been deposited to GitHub: https://github.com/Bio-protocol/Bioinformatics-Recipes-for-Plant-Genomics.
Category
Plant Science > Plant molecular biology > Genetic analysis
Biochemistry > RNA > RNA-protein interaction