Quality Control and Preprocessing of Sequencing Reads

Zhiqiang  Hao; Xiaojuan  Liang; Guanglin  Li

doi:10.21769/BioProtoc.4454

Peer-reviewed

Published: Jul 5, 2022; DOI: 10.21769/BioProtoc.4454

Edited by: Jinfeng Chen; Reviewed by: Prashanth N Suravajhala, Guotian Li

Abstract

Quality control and preprocessing of sequences are essential before analyzing high-throughput sequence data. After raw reads are generated by high-throughput sequencing platforms, quality control and preprocessing should be performed to produce clean data for subsequent bioinformatic analysis. Different tools have been developed for this, such as FastQC, iTools, fastp, cutadapt, and FASTX. However, these tools can be difficult for first-time users. To address this, transcriptome data from Illumina HiSeq 2000 paired-end sequencing of the model plant Arabidopsis thaliana were used as a practical case to show the functions and usage of these tools, which are widely used and offer good performance, wide applicability, high speed, and low requirements. For example, FastQC provides a modular set of quality control analyses and gives a quick overview of the areas in which there may be problems. iTools integrates algorithms and abundant sub-functions and provides a solid foundation for special demands. Cutadapt finds and removes adapter sequences, primers, poly-A tails, and other types of unwanted sequences from high-throughput sequencing reads. Fastp provides fast all-in-one preprocessing for FastQ files and has high performance. FASTX provides a series of functions for preprocessing reads before mapping them to the genome, which manipulate the sequences to produce better mapping results. Although these tools are widely used and perform well for short reads from next-generation sequencing, their applicability to long reads generated by third-generation sequencing is limited, except for FastQC for quality control. The commands used in this study help new learners to understand these tools.

Graphical abstract:

The pipeline of quality control and preprocessing sequencing reads.

Keywords: Quality control

Background

As next-generation sequencing technology is widely used, quality control and preprocessing of sequencing data are needed. Low data quality may result from adapter contamination, base content biases, overrepresented sequences, and errors in library preparation or sequencing steps. Quality control and preprocessing are effective ways to eliminate possible sequencing errors. Several tools for quality control and preprocessing have been developed. For example, FastQC (Andrews, 2014) provides per-base and per-read quality profiling features. Cutadapt is an adapter trimmer (Martin, 2011); Trimmomatic is another adapter trimming tool (Bolger et al., 2014). FASTX-Toolkit is a collection of Linux command line tools for processing FASTQ files (Gordon and Hannon, 2010). iTools calculates the quality scores of fastq files and includes multiple functions for analyzing other data formats (He et al., 2013). Fastp is an ultra-fast preprocessor, which can perform quality control, adapter trimming, quality filtering, and other functions (Chen et al., 2018). Cutadapt can meet the demand of users who only need to remove the adapters in reads. Fastp can perform all aspects of preprocessing in one step, such as adapter trimming, base correction, overlapping analysis, polyG tail trimming, sliding window cutting, and global trimming. FASTX provides many command line tools for preprocessing, such as a fastq-to-fasta converter, fastq/a collapser, fastq/a renamer, fastq/a reverse-complement, fastq quality converter, and fastq quality trimmer. Users can run these tools with self-defined parameters.

In this article, we show how to use these common tools in quality control and preprocessing of sequencing reads. Some of them share functions, and others have specific functions. Users can choose the tools to use based on the specific demands of their analysis. As the fastq format is universal for sequencing data, we use fastq data as the input for all tools. As Ubuntu is widely used among Linux distributions, this system is used here.

Equipment

Computer (OS: a Linux distribution, such as CentOS or Ubuntu. We recommend at least 16 GB RAM and multiple cores)

Software

FastQC (Andrews, 2014)
FastQC is a program designed to spot potential problems in high-throughput sequencing datasets. It runs on Java; if Java is not installed, you can install it as follows:
Ubuntu: sudo apt install default-jre
The software is available for download at https://codeload.github.com/s-andrews/FastQC/zip/refs/heads/master (last accessed date: 31/10/2021) or can be cloned from https://github.com/s-andrews/FastQC.git with git (terminal command: git clone https://github.com/s-andrews/FastQC.git) (last accessed date: 31/10/2021). To install FastQC, unzip the zip file. A wrapper script called ‘fastqc’ is included in the top-level directory of the FastQC installation. You may need to make this file executable: chmod 755 fastqc. If you have conda installed on your computer, the easier, recommended way is:
conda activate
conda install fastqc
iTools (He et al., 2013)
In this study, we use iTools to provide useful statistics of sequence data, which includes the Fqtools module to deal with fastq files. The Fqtools module provides multiple functions, as follows: a) summarizes the quality and amount of data, as well as the GC content; b) filters or trims the reads according to sequencing quality; c) removes reads contaminated with adapter sequences; and d) splits reads according to the index sequence. The software is available for download at https://github.com/BGI-shenzhen/Reseqtools/blob/master/iTools_Code20180520.tar.gz (last accessed date: 31/10/2021) and installation instructions are in the Install.Readme file.
Cutadapt (Martin, 2011)
Cutadapt searches for the adapter sequences in all reads and removes them. The software is available for download at https://codeload.github.com/jamescasbon/cutadapt/zip/refs/heads/master (last accessed date: 31/10/2021), and the git clone command is: git clone https://github.com/marcelm/cutadapt.git (last accessed date: 31/10/2021). Installation is done by: python setup.py install --user
The easiest way to install cutadapt is to use pip on the command line:
pip install --user --upgrade cutadapt
Fastp (Chen et al., 2018)
Fastp is a tool designed to provide fast all-in-one preprocessing for fastq files. The software is available for download at https://codeload.github.com/OpenGene/fastp/zip/refs/heads/master (last accessed date: 31/10/2021), and the git clone command is: git clone https://github.com/OpenGene/fastp.git (last accessed date: 31/10/2021). Installation instructions are in the README.md file. Another way is to install fastp with conda:
conda install -c bioconda fastp
FASTX (Gordon and Hannon, 2010)
The FASTX-Toolkit is a collection of command line tools for preprocessing short-read Fasta/FastQ files. The software is available for download at: https://codeload.github.com/agordon/fastx_toolkit/zip/refs/heads/master (last accessed date: 31/10/2021), and the git clone command is: git clone https://github.com/agordon/fastx_toolkit.git (last accessed date: 31/10/2021). Installation instructions are in the README file.

Input data

FastQ format
This is the most widely used format in sequence analysis. The format contains more information than the fasta format, through integrating quality scores. Each sequence requires at least four lines:
1. The first line is the sequence header which starts with an ‘@’.
2. The second line is the sequence.
3. The third line starts with ‘+’.
4. The fourth line contains the quality scores.
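The four-line record layout can be checked directly in the shell. The following is a minimal sketch, assuming a hypothetical two-read file named toy.fastq (toy data, not part of the original protocol). It counts reads as the number of lines divided by four and counts lines that violate the layout described above.

```shell
# Write a hypothetical two-record FASTQ file (toy data, not from the protocol).
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2' 'TTGCAATGCA' '+' 'IIIIHHHGGF' > toy.fastq

# Each record spans four lines, so the read count is the line count divided by 4.
n_lines=$(wc -l < toy.fastq)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Count layout violations: the header must start with '@', the separator
# with '+', and the sequence and quality strings must have equal length.
n_bad=$(awk 'NR%4==1 && !/^@/ {bad++}
             NR%4==2 {len=length($0)}
             NR%4==3 && !/^\+/ {bad++}
             NR%4==0 && length($0)!=len {bad++}
             END {print bad+0}' toy.fastq)
echo "layout violations: $n_bad"
```

The same two commands can be pointed at a real file, such as SRR2061397_1.fastq, once it has been downloaded and decompressed.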
We downloaded paired-end fastq files for two transcriptome runs with wget. SRR run accession numbers or project numbers are usually given in published studies whose data were uploaded to the NCBI SRA. Users can find the SRR accession numbers in the SRA database based on the project number in NCBI and, as an alternative access route, download the fastq files from the EBI website (https://www.ebi.ac.uk). Search by the SRR number; the download links are on the result page. Record the download links of the fastq files to download the data in the terminal. In this study, the run accession numbers are SRR2061397 and SRR2061398. The sequencing platform is Illumina HiSeq 2000 (paired-end sequencing). The organism is Arabidopsis thaliana, and the download commands are as follows (last accessed date: 15/7/2021):
wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/007/SRR2061397/SRR2061397_1.fastq.gz
wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/007/SRR2061397/SRR2061397_2.fastq.gz
wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/008/SRR2061398/SRR2061398_1.fastq.gz
wget -c http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR206/008/SRR2061398/SRR2061398_2.fastq.gz
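The downloaded files are gzip-compressed (.fastq.gz), whereas the commands in the case study refer to uncompressed .fastq files. One possible way to bridge this gap is sketched below, assuming gzip is available. The helper name decompress_fastq is hypothetical, and the compressed copies are kept.

```shell
# Hypothetical helper: decompress a .fastq.gz file next to the original,
# keeping the compressed copy.
decompress_fastq() {
  gz_file=$1
  out_file=${gz_file%.gz}
  # Test archive integrity first; decompress only if the test succeeds.
  gzip -t "$gz_file" && gzip -dc "$gz_file" > "$out_file"
}

for run in SRR2061397 SRR2061398; do
  for mate in 1 2; do
    f="${run}_${mate}.fastq.gz"
    if [ -f "$f" ]; then
      decompress_fastq "$f"
    else
      echo "skipping $f (not downloaded yet)"
    fi
  done
done
```

The integrity test is useful after an interrupted transfer, because wget -c resumes a partial download but does not verify the finished file.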

Case study

FastQC provides a modular set of analyses, which give a quick impression of whether the data have any problems that you should be aware of before doing any further analysis. The main functions of FastQC are:

Importing data from BAM, SAM, or FastQ files.
Providing a quick overview to tell you in which areas there may be problems.
Providing summary graphs and tables to quickly assess your data.
Exporting results to an HTML-based permanent report.

FastQC can be run either as an interactive graphical application or in a non-interactive way from the command line.

You can run it directly:
./fastqc
If you do not specify any files to process, the program will try to open the interactive application. Click the file button and choose fastq files located on your computer. Click the confirm button, and wait several minutes for your reports.
To run fastqc non-interactively from the command line, list the files to process:
fastqc -t 8 -o outdir SRR2061397_1.fastq SRR2061397_2.fastq SRR2061398_1.fastq SRR2061398_2.fastq
Parameter description: -t for the number of threads, -o for the output directory.
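The FastQC call can also be wrapped in a small script. The sketch below is one possible arrangement, not the authors' script: it creates the output directory first (FastQC expects it to exist), builds the list of the four input files, and runs FastQC only if the program and the inputs are present. The variable names are illustrative.

```shell
outdir=outdir
threads=8

# Create the output directory before calling FastQC.
mkdir -p "$outdir"

# Collect the four input files (two runs, two mates each).
files=""
for run in SRR2061397 SRR2061398; do
  for mate in 1 2; do
    files="$files ${run}_${mate}.fastq"
  done
done
echo "inputs:$files"

# Run only when fastqc is installed and the inputs are present.
if command -v fastqc >/dev/null 2>&1 && [ -f SRR2061397_1.fastq ]; then
  # $files is intentionally unquoted so that it splits into separate arguments.
  fastqc -t "$threads" -o "$outdir" $files
else
  echo "fastqc or input files not found; command not executed"
fi
```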
iTools is a toolkit for analyzing next-generation sequencing data. One module of iTools is Fqtools, which processes the fastq sequence file. Here, we show one of its functions as follows: summarizes the quality and amount of data, as well as the GC content.
Run iTools from the command line like this:
iTools Fqtools stat -InFq SRR2061397_1.fastq -InFq SRR2061397_2.fastq -InFq SRR2061398_1.fastq -InFq SRR2061398_2.fastq -OutStat read.info -CPU 8
Parameter description: -InFq for an input file, -OutStat for the output file, -CPU for the number of CPUs.
Cutadapt searches for the adapter sequence in all reads and removes it.
The command line for cutadapt is:
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 30 -m 20 --trim-n -O 10 -o SRR2061397_1trimmed.fastq -p SRR2061397_2trimmed.fastq SRR2061397_1.fastq SRR2061397_2.fastq
Parameter description: -a for the sequence of an adapter ligated to the 3’ end of the first read; -A for the 3’ adapter to be removed from the second read in a pair; -q 30 to trim low-quality bases (with a single cutoff value, only the 3’ end of each read is trimmed; a 5’ cutoff can be added in the form -q 5’CUTOFF,3’CUTOFF); -m 20 to discard trimmed reads that are shorter than 20 bases; --trim-n to trim N bases from the ends of reads; -O MINLENGTH to leave a read unmodified if the overlap between the read and the adapter is shorter than MINLENGTH, which reduces the number of bases trimmed due to random adapter matches; -o for the output file of the first read; -p for the output file of the second read.
Users need to provide an adapter string, and the adapter may vary for different types of sequences.
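The cutadapt command shown here processes one run. A minimal sketch of applying the same options to both runs is given below. The loop, the guard, and the per-run log file name (used to keep the summary that cutadapt prints to standard output) are illustrative assumptions, not part of the original protocol.

```shell
adapter=AGATCGGAAGAGC   # adapter string used in the cutadapt command of this protocol

for run in SRR2061397 SRR2061398; do
  in1="${run}_1.fastq";         in2="${run}_2.fastq"
  out1="${run}_1trimmed.fastq"; out2="${run}_2trimmed.fastq"
  if command -v cutadapt >/dev/null 2>&1 && [ -f "$in1" ] && [ -f "$in2" ]; then
    # Same options as in the protocol; the log file name is an assumption.
    cutadapt -a "$adapter" -A "$adapter" -q 30 -m 20 --trim-n -O 10 \
      -o "$out1" -p "$out2" "$in1" "$in2" > "${run}.cutadapt.log"
  else
    echo "skipping $run (cutadapt or input files not found)"
  fi
done
```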
fastp can perform quality control, adapter trimming, quality filtering, and per-read quality pruning with a single scan of fastq data.
The command line for fastp is:
fastp -i SRR2061397_1.fastq -I SRR2061397_2.fastq -o SRR2061397_1clean.fastq -O SRR2061397_2clean.fastq
Parameter description: -i for the read 1 input file, -I for the read 2 input file, -o for the read 1 output file, -O for the read 2 output file.
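By default, fastp writes its reports to fastp.json and fastp.html in the working directory, so a second run overwrites the reports of the first. The sketch below shows one way to name the reports per run and to state the filtering thresholds explicitly. The report file names and the thread count are assumptions, and -q 15 and -l 15 restate fastp's default thresholds instead of changing the analysis.

```shell
threads=8

for run in SRR2061397 SRR2061398; do
  in1="${run}_1.fastq";        in2="${run}_2.fastq"
  out1="${run}_1clean.fastq";  out2="${run}_2clean.fastq"
  # Per-run report names (an assumption, to avoid overwriting reports).
  json="${run}.fastp.json";    html="${run}.fastp.html"
  if command -v fastp >/dev/null 2>&1 && [ -f "$in1" ] && [ -f "$in2" ]; then
    # -q: minimum Phred quality for a base to count as qualified.
    # -l: minimum read length to keep. -w: number of worker threads.
    fastp -i "$in1" -I "$in2" -o "$out1" -O "$out2" \
      -q 15 -l 15 -w "$threads" -j "$json" -h "$html"
  else
    echo "skipping $run (fastp or input files not found)"
  fi
done
```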
Different tools in the FASTX-Toolkit perform a list of preprocessing tasks, such as converting fastq files to fasta files, removing sequencing adapters, filtering sequences based on quality, shortening reads, and trimming sequences based on quality.
Here, one tool is shown as an example:
fastx_clipper -a AGATCGGAAGAGC -l 25 -d 0 -Q 33 -i SRR2061397_1.fastq -o SRR2061397_1trimmed.fastq
Parameter description: -a for the adapter string; -l N to discard sequences shorter than N nucleotides; -d N to keep the adapter and N bases after it (-d 0 is the same as not using -d); -Q 33 to indicate Phred+33 quality encoding; -i for the input file; -o for the output file.

Result Interpretation

Here, we use two tools (FastQC and iTools) to perform quality control and three other tools (cutadapt, fastp, and FASTX) for preprocessing of sequencing reads. Command-line examples are provided to show the basic usages of these tools.

For FastQC, an HTML report is generated by running FastQC in non-interactive mode. It includes different modules, such as basic statistics, per base sequence quality, per sequence quality scores, per base sequence content, per sequence GC content, per base N content, sequence duplication levels, overrepresented sequences, and adapter content. A warning or failure icon on a module indicates a potential systematic problem. Guidance on interpreting FastQC reports is available at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/.
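Besides the HTML report, FastQC writes a zip archive per input file that contains a tab-separated summary.txt with one line per module (status, module name, input file). The following sketch shows how modules with a warning or failure could be listed from such a file. The file toy_summary.txt and its three lines are hypothetical, written in that layout for illustration only.

```shell
# Hypothetical excerpt in the layout of FastQC's summary.txt:
# status <TAB> module name <TAB> input file.
printf 'PASS\tBasic Statistics\ttoy.fastq\n'          >  toy_summary.txt
printf 'WARN\tPer base sequence content\ttoy.fastq\n' >> toy_summary.txt
printf 'FAIL\tAdapter Content\ttoy.fastq\n'           >> toy_summary.txt

# List only the modules that raised a warning or a failure.
flagged=$(awk -F '\t' '$1 != "PASS" {print $1 ": " $2}' toy_summary.txt)
echo "$flagged"
n_flagged=$(printf '%s\n' "$flagged" | wc -l | tr -d ' ')
echo "modules needing attention: $n_flagged"
```

For real data, the same awk filter can be applied to the summary.txt extracted from each archive in the output directory, for example after unzipping SRR2061397_1_fastqc.zip.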

For iTools, a text file is generated by running iTools on the command line. This file contains several statistics, such as GC content, base quality distribution, and read quality distribution. The percentage of reads with a quality score of at least Q30 (99.9% base accuracy) is 93.82% for SRR2061397_1 and 88.24% for SRR2061397_2 (Table 1). The quality score is important for evaluating the reliability of reads, and a high score represents high sequence quality.

Table 1. Read quality distribution generated by iTools.

Quality score	SRR2061397_1	SRR2061397_2
ReadQ:0--10	11,845 (0.12%)	205,370 (2.09%)
ReadQ:10--20	105,784 (1.08%)	266,545 (2.71%)
ReadQ:20--30	489,384 (4.98%)	683,440 (6.96%)
ReadQ:30--40	9,217,936 (93.82%)	8,669,973 (88.24%)
ReadQ:40--50	492 (0.01%)	113 (0.00%)
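The share of reads at Q30 or above can be recomputed from the bin counts in Table 1, which is a useful check when reading such tables. The sketch below is a worked calculation with awk. It takes the first SRR2061397_1 bin as 11,845 reads, the value for which the column sums to the 9,825,441 read pairs reported by cutadapt and fastp.

```shell
# Read counts per quality bin, copied from Table 1.
# Q30 or above = bins 30--40 and 40--50.
q30_r1=$(awk 'BEGIN {
  total = 11845 + 105784 + 489384 + 9217936 + 492
  printf "%.2f", 100 * (9217936 + 492) / total }')
q30_r2=$(awk 'BEGIN {
  total = 205370 + 266545 + 683440 + 8669973 + 113
  printf "%.2f", 100 * (8669973 + 113) / total }')
echo "SRR2061397_1: ${q30_r1}% of reads at Q30 or above"
echo "SRR2061397_2: ${q30_r2}% of reads at Q30 or above"
```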

For cutadapt, new fastq files are generated after adapter removal. Standard output shows a summary of the reads processed: 0.2% of reads contain adapters, and 6.4% of bases are quality-trimmed (Table 2). A quality cutoff of Q30 was used in our analysis: low-quality bases (<Q30) and adapters are trimmed, and read pairs that are too short after trimming (6.5%) are discarded.

Table 2. Summary of cutadapt output.

Summary
Total read pairs processed	9,825,441
Read 1 with adapter	17,669 (0.2%)
Read 2 with adapter	15,187 (0.2%)
Pairs that were too short	639,968 (6.5%)
Pairs written (passing filters)	9,185,473 (93.5%)
Quality-trimmed	125,590,727 bp (6.4%)

Fastp supports automatic adapter trimming and works faster than other preprocessing tools, such as Trimmomatic or Cutadapt. Fastp improves the read quality to some extent (Table 3). The total reads and total bases are reported before and after filtering, together with the numbers of Q20 and Q30 bases. In total, 18,655,502 reads pass the filter. Reads with low quality or too many N bases are discarded.

Table 3. Summary of fastp output.

Read1	Before filtering	After filtering
Total reads	9,825,441	9,327,751
Total bases	982,544,100	930,706,316
Q20 bases	961,474,687 (97.8556%)	920,234,151 (98.8748%)
Q30 bases	918,941,057 (93.5267%)	885,474,154 (95.14%)

Read2	Before filtering	After filtering
Total reads	9,825,441	9,327,751
Total bases	982,544,100	930,706,316
Q20 bases	927,573,105 (94.4052%)	910,151,132 (97.7914%)
Q30 bases	869,710,432 (88.5162%)	857,626,618 (92.1479%)

Filtering result	Count
Reads passed filter	18,655,502
Reads failed due to low quality	991,868
Reads failed due to too many N	3,512
Reads failed due to being too short	0
Reads with adapter trimmed	160,162
Bases trimmed due to adapters	4,215,756
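The filtering result in Table 3 counts individual reads (read 1 plus read 2), not read pairs. The following worked check, written as a small shell sketch with illustrative variable names, confirms that the passed and failed reads add up to twice the number of input pairs and that the passed reads correspond to the pairs reported after filtering.

```shell
pairs_in=9825441     # total reads per mate before filtering (Table 3)
passed=18655502      # reads passed filter
low_quality=991868   # reads failed due to low quality
too_many_n=3512      # reads failed due to too many N
too_short=0          # reads failed due to being too short

reads_in=$((2 * pairs_in))
reads_accounted=$((passed + low_quality + too_many_n + too_short))
pairs_kept=$((passed / 2))
echo "input reads: $reads_in, accounted for: $reads_accounted"
echo "pairs kept: $pairs_kept"
```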

FASTX provides the fastx_clipper tool to remove adapters and low-quality reads. Here, iTools is used to show the read quality distribution of the fastq files filtered by fastx_clipper (Table 4). Compared with the data before filtering, the read quality improves to some degree after filtering (Table 1 and Table 4).

Table 4. The read quality summary of fastx_clipper.

Quality score	SRR2061397_1	SRR2061397_2
ReadQ:0--10	9,236 (0.10%)	163,020 (1.81%)
ReadQ:10--20	81,765 (0.91%)	210,196 (2.34%)
ReadQ:20--30	411,863 (4.56%)	590,471 (6.57%)
ReadQ:30--40	8,461,300 (93.69%)	7,991,767 (88.93%)
ReadQ:40--50	67,254 (0.74%)	30,681 (0.34%)
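fastx_clipper processes one fastq file at a time, so the two mates of a paired-end run are filtered independently. Summing the bins of Table 4 (taking the last SRR2061397_2 bin as 30,681 reads) shows that the two files no longer contain the same number of reads. This is an observation drawn from the table, not a statement from the original protocol: paired-end tools generally expect both files to list the same reads in the same order, so the mates would need to be matched again before such tools are used.

```shell
# Reads kept per mate after fastx_clipper, summed from the bins in Table 4.
kept_r1=$((9236 + 81765 + 411863 + 8461300 + 67254))
kept_r2=$((163020 + 210196 + 590471 + 7991767 + 30681))
echo "reads kept: mate 1 = $kept_r1, mate 2 = $kept_r2"

if [ "$kept_r1" -ne "$kept_r2" ]; then
  echo "mates differ by $((kept_r1 - kept_r2)) reads; files are no longer pair-synchronized"
fi
```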

Discussion

Here, we demonstrated five tools for quality control and preprocessing. The examples in this study show how to use these tools. Users can explore more complex usage after learning the basic commands. Detailed information on the input, output, and sample data can be accessed on GitHub (https://github.com/Bio-protocol/Bioinformatics-Recipes-for-Plant-Genomics/tree/master).

Notes

Data used in this study are available in public databases (NCBI/EBI), and users can download them freely in different ways. Results in this study were generated by the tools described above; no additional tool was used. Because the tools share some functions, users can choose a subset of them, such as FastQC for quality control and fastp for preprocessing. After preprocessing, users can check their data quality by running FastQC again. We provide two ways (web page download or git clone in the terminal) to download the tools from GitHub. Before using conda, Anaconda should be installed (https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh). Run the installer in the terminal with the command: sh Anaconda3-2021.05-Linux-x86_64.sh. Anaconda will be installed in a default location, and the user can activate the conda environment with the command: conda activate. Then, users can use conda to install the tools needed.

Acknowledgments

This work was funded by the National Science Foundation of China (No.31770333, No.31370329, and No.11631012), the Program for New Century Excellent Talents in University (NCET-12-0896), and the Fundamental Research Funds for the Central Universities (No. GK201403004). This paper describes protocols from https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (last accessed date: 31/10/2021) and other references (Gordon and Hannon, 2010; Martin, 2011; He et al., 2013; Chen et al., 2018).

Competing interests

The authors declare no competing interests.

References

Andrews, S. (2014). FastQC A Quality Control tool for High Throughput Sequence Data.
Bolger, A. M., Lohse, M. and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15): 2114-2120.
Chen, S., Zhou, Y., Chen, Y. and Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34(17): i884-i890.
Gordon, A. and Hannon, G. J. (2010). Fastx-toolkit. FASTQ/A short-reads pre-processing tools.
He, W., Zhao, S., Liu, X., Dong, S., Lv, J., Liu, D., Wang, J. and Meng, Z. (2013). ReSeq Tools: An integrated toolkit for large-scale next-generation sequencing based resequencing analysis. Genet Mol Res 12(4): 6275-6283.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. Embnet Journal 17(1).

Supplementary information

Data and code availability: All data and code have been deposited to GitHub: https://github.com/Bio-protocol/Bioinformatics-Recipes-for-Plant-Genomics.
