Extraction of Orthologs from Genome-Sequencing Data for Phylogenetic Analysis

Guan Pang; Feng M. Cai

doi:10.21769/BioProtoc.5008


Peer-reviewed


Published: Jun 5, 2024


Abstract

Homologs, including paralogs and orthologs, are genes that share sequence homology within or between species. Determining single-copy orthologs for phylogenomic analysis is a first step in comparative genomic research. This protocol provides a detailed bioinformatic pipeline, from sequence data acquisition to phylogenetic reconstruction, using two commonly adopted tools: OrthoFinder and IQ-TREE. The protocol is demonstrated using genomic data from five fungi: four Trichoderma spp. and Escovopsis weberi, which serves as the outgroup. We also demonstrate a partitioned analysis for concatenated multi-locus datasets. The protocol is simple, does not require extensive bioinformatic training or special equipment, and can be easily reproduced with genome-sequencing data from other taxonomic groups.

Keywords: Gene tree; Maximum likelihood phylogeny

Background

With major advances in both evolutionary theory and sequencing technologies, phylogenetic analysis is entering a new era: phylogenomics. Current methods for phylogenomic inference can generally be categorized into two types: supertree and supermatrix methods [1–3]. The supertree approach combines individually inferred gene trees, each containing information from partially overlapping sets of taxa, into a single tree. The supermatrix approach instead analyzes the concatenated alignment of individual genes, with unavailable genes/loci coded as missing data [2,4]. Likelihood-based reconstruction methods are particularly suited to the analysis of supermatrices: they account for heterogeneity in evolutionary rates across genes by using partitioned-likelihood models, which allow each gene to evolve under a different substitution model. In line with the total evidence principle of using all relevant available data, the supermatrix method is somewhat more widely adopted, and it is the strategy used in the pipeline demonstrated here.
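The supermatrix idea can be illustrated with a minimal Python sketch. It assumes each gene alignment is already available as a mapping from taxon name to aligned sequence; the function names, the toy sequences, and the choice of `?` for missing data are illustrative assumptions, not part of the published pipeline. The second helper writes the gene boundaries as a NEXUS `sets` block, a partition format that IQ-TREE can read.

```python
from typing import Dict, List, Tuple

Partition = Tuple[str, int, int]  # (gene name, first column, last column), 1-based


def build_supermatrix(
    gene_alignments: Dict[str, Dict[str, str]],
    missing_char: str = "?",
) -> Tuple[Dict[str, str], List[Partition]]:
    """Concatenate per-gene alignments; taxa lacking a gene get missing data."""
    taxa = sorted({taxon for aln in gene_alignments.values() for taxon in aln})
    pieces: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    partitions: List[Partition] = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are empty or not equally long")
        length = lengths.pop()
        for taxon in taxa:
            # Code an unavailable gene as a run of missing-data characters.
            pieces[taxon].append(aln.get(taxon, missing_char * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, partitions


def partitions_to_nexus(partitions: List[Partition]) -> str:
    """Render gene boundaries as a NEXUS sets block (one charset per gene)."""
    lines = ["#nexus", "begin sets;"]
    lines += [f"    charset {name} = {first}-{last};" for name, first, last in partitions]
    lines.append("end;")
    return "\n".join(lines) + "\n"


# Toy example with hypothetical taxa and sequences.
toy = {
    "geneA": {"sp1": "MKV", "sp2": "MKL"},
    "geneB": {"sp1": "AG", "sp3": "AC"},
}
matrix, partitions = build_supermatrix(toy)
print(matrix)
print(partitions_to_nexus(partitions))
```

Keeping the column ranges alongside the concatenated matrix is what later allows each gene to be assigned its own substitution model in a partitioned analysis.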

The two crucial steps of standard phylogenetic inference are the identification of homologous sequences and tree reconstruction. Therefore, besides the accuracy of the tree-building method, the reliability of a phylogenomic tree also depends largely on the quality of homology inference, that is, on the determination of paralogs and orthologs within and between genomes [5]. Paralogs are derived from gene duplication and should thus be excluded from phylogenetic analyses, whereas orthologs are genes derived from speciation events; orthology, in this case, refers to the relationship between the corresponding genes in different species. So far, the most widely used methods for orthology inference can be classified into two groups [6]. One group infers pairwise relationships between genes in two species and then extends them to multiple species, while the other identifies complete orthogroups (OGs), each defined as the set of genes descended from a single gene in the last common ancestor of all the species considered [5,7].
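To make the link between orthogroups and paralog exclusion concrete, the following sketch selects single-copy OGs, meaning those with exactly one gene in every species under study. It assumes the per-species gene counts are already in memory as a nested dictionary (OrthoFinder reports a comparable table of gene counts per orthogroup); the species labels, OG identifiers, and counts below are hypothetical.

```python
from typing import Dict, List


def single_copy_orthogroups(
    gene_counts: Dict[str, Dict[str, int]],
    species: List[str],
) -> List[str]:
    """Return the OGs that contain exactly one gene in each listed species."""
    return [
        og
        for og, counts in gene_counts.items()
        # A count of 0 means the gene is absent; a count above 1 means
        # paralogs are present. Either case disqualifies the orthogroup.
        if all(counts.get(sp, 0) == 1 for sp in species)
    ]


# Hypothetical species labels and counts, for illustration only.
species = ["Trichoderma_sp1", "Trichoderma_sp2", "Escovopsis_weberi"]
counts = {
    "OG0000001": {"Trichoderma_sp1": 1, "Trichoderma_sp2": 1, "Escovopsis_weberi": 1},
    "OG0000002": {"Trichoderma_sp1": 2, "Trichoderma_sp2": 1, "Escovopsis_weberi": 1},
    "OG0000003": {"Trichoderma_sp1": 1, "Trichoderma_sp2": 1},
}
print(single_copy_orthogroups(counts, species))
```

In this toy table only the first orthogroup qualifies: the second contains a duplicated gene in one species, and the third is missing from the outgroup.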

In the current pipeline, we use OrthoFinder, a popular method for inferring OGs of protein-coding genes. An advantage of this program is that, starting from the input sequence files, it infers by default the OGs, the orthologs, the complete set of gene trees for all OGs, the rooted species tree, and the gene duplication events. It also provides extensive comparative genomics statistics [5,7]. Although OrthoFinder generates individual gene trees and the species tree, we recommend IQ-TREE (here, IQ-TREE 2) for customized tree building in subsequent analyses. In our experience with multiple fungal genomes, IQ-TREE runs fast and provides automatic model selection (including data partitioning), an efficient search algorithm for maximum likelihood (ML) trees, ultrafast bootstrapping, and more [8]. With this protocol, we aim to demonstrate effective examples of orthology inference, ortholog extraction, paralog exclusion, individual gene tree reconstruction with the ML method, and data partitioning. This is done using five fungal genomes, four of which belong to the genus Trichoderma. Trichoderma spp. are among the best-studied groups of filamentous fungi owing to their high value in applications ranging from agriculture to industrial enzyme production [9]. The present protocol is simple and can be easily adapted to genomic data from other organisms.
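As a rough sketch of how the IQ-TREE features named above map onto a command line, the helper below assembles an argument list without running anything. The flags reflect our reading of the IQ-TREE 2 options (`-s` alignment, `-p` partition file, `-m MFP` or `-m MFP+MERGE` for model selection, `-B` ultrafast bootstrap replicates, `-T AUTO` thread detection); the executable name `iqtree2` and the file names are assumptions, so check `iqtree2 --help` for the installed version before relying on them.

```python
from typing import List, Optional


def iqtree_command(
    alignment: str,
    partition_file: Optional[str] = None,
    bootstrap_replicates: int = 1000,
    executable: str = "iqtree2",  # assumed name of the installed binary
) -> List[str]:
    """Build (but do not run) an IQ-TREE 2 command for an ML tree search."""
    if bootstrap_replicates < 1000:
        # IQ-TREE requires at least 1000 replicates for ultrafast bootstrap.
        raise ValueError("ultrafast bootstrap needs at least 1000 replicates")
    cmd = [executable, "-s", alignment]
    if partition_file is not None:
        # Partitioned analysis: select a model per partition and try merging them.
        cmd += ["-p", partition_file, "-m", "MFP+MERGE"]
    else:
        # Single alignment: automatic model selection followed by tree search.
        cmd += ["-m", "MFP"]
    cmd += ["-B", str(bootstrap_replicates), "-T", "AUTO"]
    return cmd


# Hypothetical file names, for illustration only.
print(" ".join(iqtree_command("gene1_aligned.faa")))
print(" ".join(iqtree_command("supermatrix.faa", partition_file="partitions.nex")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the binary is installed and the input files exist.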

Equipment

We explicitly assume that the user has some basic skills in working in a Linux-based operating system.

Linux cluster
In the present study, we used a cluster with AMD processors (vendor ID: AuthenticAMD) comprising two nodes, each containing 32 cores (model name: AMD EPYC 7452 32-Core Processor), and 256 GB of memory in total.
Personal computer (PC)
We recommend using a PC with an Intel Core i7-10510U CPU or higher and at least 16 GB of RAM for downstream data processing.

Software

OrthoFinder [5,7] (v2.5.4, https://github.com/davidemms/OrthoFinder)
IQ-TREE [8,10] (v2.2.0.3, https://github.com/iqtree/iqtree2)
Note: The required software and its dependencies (including IQ-TREE, which is also integrated in OrthoFinder) should be installed before the analysis, following the instructions at the links above.

Procedure