We constructed a custom Python script (Abaccus) based on the ETE v2.3.10 package (Huerta-Cepas et al., 2010) that uses the previously described species taxonomy and the constructed gene trees to infer HGT events. This script is publicly available at https://github.com/Gabaldonlab/Abaccus; the version used here is v1.0. The Abaccus algorithm works as follows (see the schematic example in Figure 1).
Schematic example of the Abaccus algorithm. We use a simple example of a tree in which a sequence from Fusarium oxysporum has been used as a seed. The tree node that is most distant from the seed has been set as the root. (A) In a first step, Abaccus progresses from the seed sequence (blue dot) toward the root and finds the sister branch of the seed sequence (red dot); in this case that node has a single descendant leaf, a sequence from F. graminearum. The taxonomy of the sequences contained in the current (seed) and sister nodes is compared. In this case both sequences share the genus level (Fusarium). The number of taxonomic levels from the current node (Fusarium oxysporum) to the lowest common taxonomic category (Fusarium) is just one (from species to genus level). Thus, parameter J = 1. Losses (L) are then established by counting how many lineages are present in the database but do not appear in the considered subtree. In this case the number of losses is L = 0. Since J and L are both below the cutoffs (J ≥ 2 and L ≥ 3), we conclude that this particular node is not the result of HGT. (B) Abaccus proceeds by setting the current node as the next node in the direction of the root (blue dot), and establishes the sister node (red dot) in the same manner. In this case, the current node includes the sequences of both Fusarium species. The sister node contains a sequence from Aspergillus nidulans. The lowest category shared by the genus Fusarium and Aspergillus nidulans is the phylum Ascomycota, making a total of 4 taxonomic jumps (J = 4, species, genus, order, and family). The family Nectriaceae contains other genera beyond Fusarium (e.g., Nectria, Gibberella), and thus we count at least one loss (L = 1). At the next level, order Hypocreales, we have members that are in the database and are not part of Nectriaceae (e.g., Hypocrea, Cordyceps), so we count an additional loss (L = 1 + 1 = 2). We repeat the process for the next taxonomic levels, reaching a total of 4 losses (L = 4).
Since J ≥ 2 and L ≥ 3, we assume that this may be an HGT event. (C) Abaccus performs a confirmation step by repeating the same procedure with the next sister branch (red dot). The next sister branch includes members of three genera in the family Trichocomaceae, for which the first shared taxonomic level is again the phylum Ascomycota. Now we have J = 4 and L = 4, so J ≥ 2 and L ≥ 3 holds. Given this second positive result, we accept the F. oxysporum sequence, along with the sequence in F. graminearum, as an HGT event.
Given a tree, we define the seed protein as the eukaryotic protein of interest. The taxonomic classification of a node is the lowest taxonomic category shared by all species included in that node. For instance, the taxonomic classification of a node containing Fusarium oxysporum and F. graminearum is genus (Fusarium), while that of a node containing F. oxysporum and Aspergillus nidulans is phylum (Ascomycota).
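The definition of a node's taxonomic classification can be illustrated with a minimal sketch. It is not taken from the Abaccus source: the rank list, the lineage table, and the function name `lowest_common_category` are assumptions made for illustration, and the ranks Abaccus actually considers may differ.

```python
# Assumed rank order, lowest to highest (illustrative only).
RANKS = ["species", "genus", "family", "order", "class", "phylum"]

# Hypothetical lineage table: one taxon name per rank, in RANKS order.
LINEAGES = {
    "Fusarium oxysporum": ("Fusarium oxysporum", "Fusarium", "Nectriaceae",
                           "Hypocreales", "Sordariomycetes", "Ascomycota"),
    "Fusarium graminearum": ("Fusarium graminearum", "Fusarium", "Nectriaceae",
                             "Hypocreales", "Sordariomycetes", "Ascomycota"),
    "Aspergillus nidulans": ("Aspergillus nidulans", "Aspergillus", "Trichocomaceae",
                             "Eurotiales", "Eurotiomycetes", "Ascomycota"),
}


def lowest_common_category(species):
    """Return (rank, taxon) for the lowest rank shared by all species, or None."""
    lineages = [LINEAGES[name] for name in species]
    for index, rank in enumerate(RANKS):
        taxa = {lineage[index] for lineage in lineages}
        if len(taxa) == 1:  # every species agrees at this rank
            return rank, taxa.pop()
    return None  # no shared category within the assumed ranks


print(lowest_common_category(["Fusarium oxysporum", "Fusarium graminearum"]))
# ('genus', 'Fusarium')
print(lowest_common_category(["Fusarium oxysporum", "Aspergillus nidulans"]))
# ('phylum', 'Ascomycota')
```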
We first root the tree at the leaf farthest from the seed protein. Then we traverse every node from the seed protein to the root node. For each node we determine the taxonomic classification of the node and of its parent node, and compare the two. The “jump” parameter (J) is defined as the difference between the taxonomic level found at the parent node and the one found at the current node. As seen in Figure 1A, the jump parameter between F. oxysporum and F. graminearum is equal to 1 because we move from species level to genus level, while in Figure 1B the jump parameter between Fusarium and A. nidulans is equal to 4 because we jump from genus level to phylum level. We then compute the minimal number of loss events (L) between the node and its sister node. We use a very parsimonious approach: for each taxonomic level of difference between a node and its parent node, a single loss event is inferred only if at least one species in our database belongs to that taxonomic level and is not present in the nodes. This also implies that no loss is inferred if no other member of a given taxonomic category is present in the database. In this study we consider nodes that have J ≥ 2 and L ≥ 3 as possible HGT events.
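The computation of J and L can be sketched as follows. This is one reading of the description above, not the Abaccus implementation: the rank list, the toy database, and the names `common_rank` and `jump_and_losses` are hypothetical. A loss is counted at a level when the database holds a species in the taxon at that level that lies outside the taxon one rank below and is absent from both nodes.

```python
# Assumed rank order, lowest to highest (illustrative only).
RANKS = ["species", "genus", "family", "order", "class", "phylum"]

# Hypothetical species database: one taxon name per rank, in RANKS order.
DATABASE = {
    "Fusarium oxysporum": ("Fusarium oxysporum", "Fusarium", "Nectriaceae",
                           "Hypocreales", "Sordariomycetes", "Ascomycota"),
    "Fusarium graminearum": ("Fusarium graminearum", "Fusarium", "Nectriaceae",
                             "Hypocreales", "Sordariomycetes", "Ascomycota"),
    "Nectria haematococca": ("Nectria haematococca", "Nectria", "Nectriaceae",
                             "Hypocreales", "Sordariomycetes", "Ascomycota"),
    "Cordyceps militaris": ("Cordyceps militaris", "Cordyceps", "Cordycipitaceae",
                            "Hypocreales", "Sordariomycetes", "Ascomycota"),
    "Neurospora crassa": ("Neurospora crassa", "Neurospora", "Sordariaceae",
                          "Sordariales", "Sordariomycetes", "Ascomycota"),
    "Aspergillus nidulans": ("Aspergillus nidulans", "Aspergillus", "Trichocomaceae",
                             "Eurotiales", "Eurotiomycetes", "Ascomycota"),
    "Saccharomyces cerevisiae": ("Saccharomyces cerevisiae", "Saccharomyces",
                                 "Saccharomycetaceae", "Saccharomycetales",
                                 "Saccharomycetes", "Ascomycota"),
}


def common_rank(species):
    """Index in RANKS of the lowest rank shared by all listed species."""
    lineages = [DATABASE[name] for name in species]
    for index in range(len(RANKS)):
        if len({lineage[index] for lineage in lineages}) == 1:
            return index
    raise ValueError("species share no rank within RANKS")


def jump_and_losses(current, sister):
    """Return (J, L) for a current node and its sister, each a list of species."""
    low = common_rank(current)            # classification of the current node
    high = common_rank(current + sister)  # classification of the parent node
    lineage = DATABASE[current[0]]
    in_nodes = set(current) | set(sister)
    losses = 0
    for index in range(low + 1, high + 1):
        # Database species inside the taxon at this rank, outside the taxon
        # one rank below, and missing from both nodes.
        absent = [name for name, other in DATABASE.items()
                  if other[index] == lineage[index]
                  and other[index - 1] != lineage[index - 1]
                  and name not in in_nodes]
        if absent:
            losses += 1  # at most one loss per taxonomic level
    return high - low, losses


print(jump_and_losses(["Fusarium oxysporum"], ["Fusarium graminearum"]))
# (1, 0)
print(jump_and_losses(["Fusarium oxysporum", "Fusarium graminearum"],
                      ["Aspergillus nidulans"]))
# (4, 4)
```

With this toy database the two calls reproduce the values given for panels A and B of Figure 1; a different database would change L but not J.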
When both the taxonomic distance and minimal losses criteria are met (J ≥ 2, L ≥ 3), the program checks that both criteria are also met for the next sister branch (see Figure 1). This double check was used to limit the number of false positives and to provide information on the taxonomic range of the putative donor. If this second condition is also true, the program retrieves the phylogenetic tree as a candidate HGT event and selects it for manual inspection. The manual inspection consists of BLAST searches against the whole UniProt database to ensure that the predicted scope of the event is coherent and does not disappear with additional data; that the detected homology is not spurious due to low percent identity saturating the phylogenetic signal; that the percent identity is not too high, which may imply contaminating sequences in the primary genomic data rather than true HGT; and that the observed relationships are not due to fragmented or mispredicted genes. This step is performed manually because all these tasks would be difficult to automate and would require the application of arbitrary filters that would miss some events. The manual inspection also allows the detection of particular cases, such as species belonging to clades with reduced genomes (for example, intracellular parasites), for which we expect a higher gene loss rate. We assessed the accuracy of Abaccus following several criteria, including (i) agreement with manual curation of predicted cases; (ii) ability to detect previously detected cases (Table 2); and (iii) agreement of the automated method with the manual inspection of the phylogenetic tree of the whole Asp_glu_race family. This tree contains dozens of independent eukaryotic clades, suggesting several independent HGT events into different lineages. Abaccus is able to identify most of the clades as putative HGT (Marcet-Houben and Gabaldón, 2010).
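The double check can be sketched as below, assuming the (J, L) pair has already been computed for each successive step from the seed toward the root. The function names and the list-of-pairs input are hypothetical; the default cutoffs are the ones used in this study.

```python
def passes_cutoffs(jump, losses, min_jump=2, min_losses=3):
    """True when a node meets both cutoffs used in this study (J >= 2, L >= 3)."""
    return jump >= min_jump and losses >= min_losses


def confirmed_candidates(scores):
    """Return indices of steps whose (J, L) pass and are confirmed by the next step.

    `scores` is a list of (J, L) pairs ordered from the seed toward the root.
    """
    return [index for index in range(len(scores) - 1)
            if passes_cutoffs(*scores[index])
            and passes_cutoffs(*scores[index + 1])]


# Values from the Figure 1 example: panel A, panel B, panel C.
print(confirmed_candidates([(1, 0), (4, 4), (4, 4)]))
# [1]
```

In this sketch a tree would be passed on to manual inspection whenever the returned list is non-empty.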
List of HGT events described in bibliography and detected in this study.
The symbol “+” after the species name indicates that the HGT event affects several species.