To add further layers of information to the predicted interactions, we compared the proteins involved against several databases that contain information about the infection phenotype, namely BacFITBase [20], DualSeqDB [21], and PHI-base [22] (Fig. 3B).
1. Sequence alignment against BacFITBase: We downloaded BacFITBase v1.0 (accessed 2 September 2021), a database that contains information on bacterial fitness, as measured by transposon mutagenesis. To determine the similarity between pathogen proteins in the predicted interactome and entries in BacFITBase, HPIPred performs a BLAST sequence alignment between the query pathogen protein and the entire database, keeping all hits with a percentage of identity ≥ 40 % and an E-value < 10 that have a significant fitness score in BacFITBase (adjusted p-value ≤ 0.05). If multiple entries are retrieved, HPIPred assigns the average fitness score to the query and stores the standard deviation. Queries with no hits are labeled as “NA”. Finally, as BacFITBase assigns the lowest fitness scores to the most relevant proteins, the values are normalized from 0 to 1 in inverted order, assigning a value of 1 to the lowest fitness score reported and a value of 0 to the highest.
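The filtering, averaging, and inverted normalization described in this step can be sketched as follows. This is a minimal illustration, not HPIPred's actual code: the tuple layout of the hit records, the function names, and the example values are hypothetical. It also assumes a sample standard deviation (0 for a single hit) and that the minimum and maximum used for normalization are taken over the values passed in.

```python
from statistics import mean, stdev

# Hypothetical hit record layout (an assumption for this sketch):
# (query_id, percent_identity, e_value, adjusted_p_value, fitness_score)


def summarize_hits(hits, min_identity=40.0, max_evalue=10.0, max_adj_p=0.05):
    """Average the fitness scores of the hits that pass all three filters."""
    kept = {}
    for query, identity, evalue, adj_p, score in hits:
        if identity >= min_identity and evalue < max_evalue and adj_p <= max_adj_p:
            kept.setdefault(query, []).append(score)
    summary = {}
    for query, scores in kept.items():
        # Assumption: sample standard deviation, 0.0 when only one hit survives.
        sd = stdev(scores) if len(scores) > 1 else 0.0
        summary[query] = (mean(scores), sd)
    return summary


def normalize_inverted(values):
    """Min-max scale so that the lowest value maps to 1 and the highest to 0."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {key: 0.0 for key in values}  # degenerate case: no spread
    return {key: (hi - value) / (hi - lo) for key, value in values.items()}


hits = [
    ("pA", 85.0, 1e-50, 0.01, -4.0),
    ("pA", 60.0, 1e-20, 0.03, -2.0),
    ("pA", 35.0, 1e-05, 0.01, -9.0),  # dropped: identity below 40 %
    ("pB", 90.0, 1e-80, 0.20, -6.0),  # dropped: fitness score not significant
    ("pC", 70.0, 1e-30, 0.04, 1.0),
]
summary = summarize_hits(hits)
averages = {query: avg for query, (avg, _sd) in summary.items()}
normalized = normalize_inverted(averages)

# Queries with no surviving hit are labeled "NA".
fitness = {query: normalized.get(query, "NA") for query in ("pA", "pB", "pC")}
```

In this toy input, pA keeps two hits, pB keeps none and is labeled “NA”, and pC keeps one. Because the scale is inverted, the query with the lower average fitness receives the higher normalized value.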
2. Sequence alignment against DualSeqDB: We downloaded DualSeqDB 1.0 (accessed 2 September 2021), a database that contains information on gene expression changes in bacterial infection models. Changes in gene expression are represented as the log2 fold change, as measured by dual RNA-Seq experiments. HPIPred performs a protein sequence alignment between each query protein and DualSeqDB, for both the bacterial and host fractions, keeping those hits with a percentage of identity ≥ 40 % and an E-value < 10 that have a significant expression change score in DualSeqDB (adjusted p-value ≤ 0.05). The average log2 fold change is assigned to the query protein and the standard deviation is stored. Queries with no remaining hits are labeled as “NA”. The standard 0–1 normalization is performed at the end, assigning a value of 1 to the highest reported fold change score, and a value of 0 to the lowest.
3. Sequence alignment against PHI-base: We downloaded PHI-base (accessed 2 September 2021), a dataset containing information on the role of pathogenic genes in bacteria. PHI-base assigns to each entry a “mutant phenotype”, depending on how the gene deletion or mutation affects the pathogenicity of the organism. In some cases, the same gene can have more than one entry, since it may have been measured in different experiments. We filtered out those entries referring to pathogens that do not belong to the bacterial kingdom. Then, we only kept entries with mutant phenotype tags that matched “unaffected pathogenicity”, “loss of pathogenicity”, “reduced virulence”, “lethal” or “increased virulence (hypervirulence)”, and transformed them into numerical values, 0, 0.5 or 1 as follows: “lethal” = 1, “loss of pathogenicity” = 1, “reduced virulence” = 0.5, “increased virulence (hypervirulence)” = 0.5, “unaffected pathogenicity” = 0.
When discrepant phenotypes were reported for the same database entry, the most abundant tag was assigned. HPIPred then performs a protein sequence alignment between each query protein and PHI-base. Hits with a percentage of identity ≥ 40 % and an E-value < 10 are retained. Queries with surviving hits are assigned the average PHI-base score of those hits, and the standard deviation is stored. Queries with no hits are labeled as “NA”.
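The mapping from mutant phenotype tags to numerical values, together with the majority rule for discrepant tags, can be illustrated with a short sketch. The function name and the example tags are hypothetical. The source does not say how ties between equally frequent tags are resolved, so the sketch simply keeps the tag seen first, which is an assumption.

```python
from collections import Counter

# Numerical values for the retained mutant phenotype tags, as listed in the text.
PHENOTYPE_SCORE = {
    "lethal": 1.0,
    "loss of pathogenicity": 1.0,
    "reduced virulence": 0.5,
    "increased virulence (hypervirulence)": 0.5,
    "unaffected pathogenicity": 0.0,
}


def entry_score(tags):
    """Score one entry from its list of mutant phenotype tags.

    Tags outside the retained set are ignored. Returns None when no retained
    tag is present, meaning the entry would be filtered out.
    """
    known = [tag.lower() for tag in tags if tag.lower() in PHENOTYPE_SCORE]
    if not known:
        return None
    # Most abundant tag wins; ties keep the first tag seen (assumption).
    most_common_tag, _count = Counter(known).most_common(1)[0]
    return PHENOTYPE_SCORE[most_common_tag]


discrepant = entry_score(["reduced virulence", "lethal", "reduced virulence"])
unretained = entry_score(["some other phenotype"])
```

Here `discrepant` takes the value of the majority tag, “reduced virulence”, and `unretained` is `None` because none of its tags is in the retained set.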
4. Betweenness centrality of host proteins: As suggested by the centrality-lethality rule [23], [24], proteins that are central in the interactome are more likely to be essential for the organism. In this sense, betweenness centrality (BC) is a relevant centrality measure, as nodes with high betweenness are located on key communication routes and control network integrity [25]. Hence, we used BC as a proxy for protein relevance in the host. To measure protein essentiality in the Homo sapiens proteome, we calculated the BC score for all proteins. The Homo sapiens interactome was downloaded from the STRING database [26] (accessed 2 September 2021). We filtered out all PPIs with a confidence score lower than 0.9. We then used the R package igraph [27] to build an undirected network, calculated the BC score for all nodes in the graph, each representing a human protein, and performed a standard 0–1 normalization, with 0 assigned to the protein with the lowest BC score and 1 to the protein with the highest (Fig. 3B).
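To make the computation in this step concrete, the sketch below filters edges by confidence, computes betweenness centrality, and applies the 0–1 normalization. It is not the authors' pipeline, which used the R package igraph. It is a dependency-free Python stand-in that implements Brandes' algorithm for an unweighted, undirected graph. The edge tuples, the function names, and the assumption that confidence scores lie on a 0–1 scale are illustrative.

```python
from collections import deque


def betweenness_centrality(edges):
    """Brandes' algorithm for an unweighted, undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    bc = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search from source, counting shortest paths (sigma).
        order = []
        preds = {node: [] for node in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                bc[w] += delta[w]
    # Every unordered pair was counted once from each endpoint.
    return {node: score / 2 for node, score in bc.items()}


def min_max(scores):
    """Standard 0-1 normalization: lowest value maps to 0, highest to 1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}  # degenerate case: no spread
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


# Hypothetical interactions: (protein_a, protein_b, confidence on a 0-1 scale).
interactions = [
    ("A", "B", 0.95), ("B", "C", 0.99), ("C", "D", 0.92),
    ("D", "E", 0.90), ("A", "E", 0.40),  # last edge falls below the cutoff
]
high_confidence = [(a, b) for a, b, conf in interactions if conf >= 0.9]
bc_raw = betweenness_centrality(high_confidence)
bc_norm = min_max(bc_raw)
```

After filtering, the toy network is the path A-B-C-D-E, so the middle protein C has the highest betweenness and the two end proteins have none.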
5. Calculation of the ranked score: For each predicted PPI in the combined interactome, we compiled all the normalized scores obtained in the previous steps (BacFITBase, DualSeqDB, and PHI-base scores for the pathogen proteins, and betweenness centrality and DualSeqDB scores for the host proteins) and calculated an average score with a value ranging from 0 to 1 using the following equation:
AvS = (F + Eh + Ep + P + BC) / NM
where AvS is the average score, F is the fitness value, Eh and Ep are the log2 fold change in expression for host and pathogen, respectively, P is the infection phenotype, BC is the host betweenness centrality, and NM is the number of non-missing values. Then, a phenotypic weight (PW) was calculated to account for the number of missing values (NA) in the previous formula. Hence, a PW of 5 means that there are no missing values, and a PW of 0 means that no values were reported for that specific PPI. To account for both the average score and the phenotypic weight, we calculated a normalized ranked score (RS) (Fig. 3B):
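As an illustration of the averaging step and the phenotypic weight defined above, the following sketch computes AvS as the mean of the non-missing normalized scores and PW as the count of those scores. The dictionary keys, the use of `None` for NA, and the example values are assumptions made for this sketch. It stops at AvS and PW and does not attempt to compute RS.

```python
def average_score(scores):
    """Return (AvS, PW) for one predicted PPI.

    `scores` maps the five score names to a normalized value in [0, 1],
    or to None when the value is missing (NA).
    """
    present = [value for value in scores.values() if value is not None]
    pw = len(present)  # phenotypic weight: number of non-missing values (0 to 5)
    avs = sum(present) / pw if pw else None  # undefined when every value is NA
    return avs, pw


# Hypothetical PPI with one missing value (host expression change).
ppi = {"F": 0.75, "Eh": None, "Ep": 0.5, "P": 1.0, "BC": 0.25}
avs, pw = average_score(ppi)
```

For this example the four available values sum to 2.5, giving an AvS of 0.625 with a PW of 4. A PPI with all five values missing would return no average and a PW of 0.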