优化双RNA-Seq比对方法以精准检测复杂真核宿主中的病原体

Infanta Saleth Teresa Eden M.; Umashankar Vetrivel

doi:10.21769/BioProtoc.5182

Improve Research Reproducibility A Bio-protocol resource

提交稿件
订阅
CN
- EN - English
- CN - 中文

Peer-reviewed

Optimal Dual RNA-Seq Mapping for Accurate Pathogen Detection in Complex Eukaryotic Hosts

优化双RNA-Seq比对方法以精准检测复杂真核宿主中的病原体

IM Infanta Saleth Teresa Eden M.

UV Umashankar Vetrivel email

发布: 2025年02月05日第15卷第3期 DOI: 10.21769/BioProtoc.5182 浏览次数: 1252

评审: Prashanth N SuravajhalaSoumya MoonjelyWilly R Carrasquel-UrsulaezAnonymous reviewer(s)

PDF

Q&A

引用

Cited by

实验方案合集

Cell Imaging - A Special Collection for Cell Bio 2023

相关实验方案

蚊虫和白蛉中肠组织的优化解离方法：用于高质量单细胞RNA测序

Ana Beatriz F. Barletta [...] Carolina Barillas-Mury

2025年06月20日 1505 阅读

RACE-Nano-Seq：解析基因组特定位点的转录组多样性

Lu Tang [...] Philipp Kapranov

2025年07月05日 1552 阅读

适用于肿瘤与正常转录组数据的加权基因共表达网络分析及模块保留与功能富集分析方案

Phuong Nguyen and Erliang Zeng

2025年09月20日 718 阅读

Abstract

Dual RNA-Seq technology has significantly advanced the study of biological interactions between two organisms by allowing parallel transcriptomic analysis. Existing analysis methods employ various combinations of open-source bioinformatics tools to process dual RNA-Seq data. Upon reviewing these methods, we intend to explore crucial criteria for selecting standard tools and methods, especially focusing on critical steps such as trimming and mapping reads to the reference genome. In order to validate the different combinatorial approaches, we performed benchmarking using top-ranking tools and a publicly available dual RNA-Seq Sequence Read Archive (SRA) dataset. An important observation while evaluating the mapping approach is that when the adapter trimmed reads are first mapped to the pathogen genome, more reads align to the pathogen genome than the unmapped reads derived from the traditional host-first mapping approach. This mapping method prevents the misalignment of pathogen reads to the host genome due to their shorter length. In this way, the pathogenic read information found at lesser proportions in a complex eukaryotic dataset is precisely obtained. This protocol presents a comprehensive comparison of these possible approaches, resulting in a robust unified standard methodology.

Key features

• Benchmarking of top-ranking software for quality control, adapter trimming, and read mapping.

• Emphasizes the importance of read mapping criteria for dual RNA-Seq datasets: (i) high count of uniquely host mapped reads, (ii) low count of host multi-mapped reads, and (iii) high count of unmapped reads belonging to pathogens.

• Elaborates the best mapping approach to precisely extract the pathogen reads as these get captured comparatively less in dual RNA-Seq datasets.

Keywords: Dual RNA-Seq (双RNA-Seq)

Host–pathogen interactions (宿主-病原体相互作用)

Host-first mapping (宿主优先比对)

Pathogen-first mapping (病原体优先比对)

Misalignment of pathogen reads (病原体序列错配)

Graphical overview

Background

Dual RNA sequencing is a powerful tool to precisely investigate the complete gene expression profile of two actively interacting organisms. When a pathogen infects a host, both develop adaptation mechanisms by revealing a series of changes at molecular levels. These biological changes are interrelated and can be effectively elucidated by analyzing the whole transcriptome data of the host–pathogen interaction. Until 2012, transcriptomics sequencing of infectious diseases was limited to separately capturing the cascade of biological events in either the host or the pathogen [1]. This allows us to study gene expression changes only from a linear perspective. Over the years, insightful research in transcriptomics analysis has encouraged researchers across the globe to devise the concept of extracting and sequencing whole RNA from the infection site. This idea will provide a broader scope for unraveling the role of smaller molecules and studying the biological changes in organisms at different stages of interaction. Compared to conventional RNA sequencing technology, dual RNA sequencing caters to a sufficient amount of host and pathogen RNA and other small RNA molecules (microRNA, long non-coding RNA, and other non-coding RNA) when pathogen-infected samples are subjected to the well-established and published methods of RNA extraction and sequencing procedures [2–5].

Dual sequencing data generated as reads or fragments comprises information on both the host and the pathogen. The infected host cells may not always reflect pathogen-related data in enormous quantities. Therefore, a detail-oriented analysis is necessary to capture the minimally found essential information on host–pathogen interplay. Several bioinformatics analysis methods and tools are available to explore the complex datasets involving eukaryotic and prokaryotic reads [6–8]. Most of these methods suggest a sequential mapping approach to extract host and pathogen reads. Some studies recommend a combined mapping approach where the reference genomes of the host and pathogen are concatenated, indexed, and used as a single reference [9,10]. As far as quality control and adapter removal are concerned, standard methods for read mapping have been receiving greater attention as pathogen data is found in lesser quantities. Espindula et al. [10] conducted studies on different infection models and showed that other alternative mapping approaches outperform the traditional host-first mapping approach. When trimmed reads are mapped first to the host reference, there are high chances of pathogen read mismapping due to its shorter read length. To avoid this, alternative mapping approaches were introduced, and this bio-protocol emphasizes one such mapping technique—the pathogen-first mapping approach—in detail. A comparison of mapping results proved that most of the pathogen reads have been restored in the pathogen-first mapping approach. The results obtained using this approach are now based on a higher confidence scale and can be used for further processing.

This protocol is presented with the aim of demonstrating a standardized bioinformatics procedure for productive mapping. In this protocol, human monocyte-derived macrophages (HMDMs) infected with Mycobacterium tuberculosis (Mtb) were used as a dual RNA-Seq model. Apart from humans infected with Mtb, other host and pathogen models can still be used in dual RNA-Seq experiments. For example, other dual RNA-Seq datasets are publicly available in Gene Expression Omnibus (GEO); one such dataset is Triticum aestivum infected with Fusarium graminearum (Accession: SRP439529; GEO: GSE233409). Dual RNA-Seq test dataset of human–Mtb was downloaded from Sequence Read Archive (SRA), National Centre for Biotechnology Information (Accession: SRP359986 [11]). This study primarily focuses on a compound that restricts Mycobacterium tuberculosis from catabolizing cholesterol by binding with iron. The data quality control before and after read trimming was performed using FastQC. The FastQC tool can be accessed online at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. For trimming low-quality bases and adapter removal, benchmarking was performed using the topmost adapter trimming software, namely fastp [12] and Trim-Galore (https://github.com/FelixKrueger/TrimGalore). Following this, the topmost splice-aware alignment tools STAR [13] and HISAT2 [14] were also benchmarked to obtain optimal results. The most important part of the analysis is to apply the best mapping method that serves the experimental purpose of dual RNA-Seq analysis. After mapping the reads to their respective genomes, the reads mapping to genes were quantified using featureCounts [15]. Other downstream analysis methods, which include read count normalization, differential expression analysis, gene ontology, and pathway enrichment, are not demonstrated here, as the key aim of this protocol is to highlight the crucial preparatory steps like quality control, adapter removal, and read mapping in a descriptive manner.

Software and datasets

1. Data

Dual RNA-Seq datasets are publicly available on NCBI Sequence Read Archive (SRA) and can be downloaded and analyzed for learning purposes. For this protocol, a dataset from SRA (accession: SRP359986, Gene Expression Omnibus datasets; GEO: GSE196816) was downloaded and utilized.

2. Bioinformatics tools (all tools were installed using conda)

• SRA-Toolkit (version 3.1.0) includes tools like prefetch and fasterq-dump for fetching SRA datasets and extracting individual fastq files.

• FastQC (version 0.12.1). For assessing the quality of fastq reads before and after trimming.

• MultiQC (version 1.19). For integrating results into interactive visualization reports throughout the analysis.

• TrimGalore (version 0.6.10) and Cutadapt (version 4.6). For quality-trimming bases from reads, automatic adapter detection and removal, and filtering reads based on lengths.

• HISAT2 (version 2.2.1). For indexing the reference genome and mapping trimmed high-quality reads to the reference genome of eukaryotes.

• BWA (version 0.7.17-r1188). For indexing the reference genome and mapping trimmed high-quality reads to the reference genome of prokaryotes.

• SAMtools (version 1.19). For converting huge files from mapping results (.sam) into binary formatted .bam files enabling easy processing, to sort reads with their mate pairs, and to check the statistical distribution of reads after mapping.

• Bedtools (version v2.31.1). For extracting the interleaved reads inside .bam files into paired-end separate fastq files.

• featureCounts from Subread package (version v2.0.6). For quantifying all reads mapped to genomic coordinates using annotation feature file(.gtf/.gff3) of the reference genome.

3. Platform used: Linux, Ubuntu

• CPU: Architecture, 64 bit; 24 cores, 96 threads

• Memory: 512 GB RAM

Note: The threads/cores mentioned in each step of the analysis need to be modified by users as per the computational resources available.

Procedure

English

中文翻译

文章信息

稿件历史记录

提交日期: Aug 28, 2024

接收日期: Dec 8, 2024

在线发布日期: Dec 26, 2024

出版日期: Feb 5, 2025

版权信息

如何引用

Eden M., I. S. T. and Vetrivel, U. (2025). Optimal Dual RNA-Seq Mapping for Accurate Pathogen Detection in Complex Eukaryotic Hosts. Bio-protocol 15(3): e5182. DOI: 10.21769/BioProtoc.5182.