Each Legionella effector ortholog group (LEOG) was represented by a hidden Markov model (HMM), which was constructed as follows. The proteins belonging to a given ortholog group were aligned with MAFFT73 version 7.164b using the 'einsi' strategy. HMMs were then constructed from the multiple sequence alignments using hmmbuild from the HMMER suite74 version 3.1b1.
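The two steps above (alignment, then HMM construction) can be sketched as a small Python wrapper. This is only an illustration under stated assumptions, not the authors' actual script: the function names, file naming scheme, and directory layout are hypothetical, and the 'einsi' strategy is written out with the explicit MAFFT options that the `einsi` wrapper script corresponds to.

```python
import subprocess
from pathlib import Path


def mafft_einsi_command(fasta_in: Path) -> list[str]:
    # E-INS-i written with explicit MAFFT options; MAFFT prints the
    # alignment to standard output.
    return ["mafft", "--genafpair", "--maxiterate", "1000", str(fasta_in)]


def hmmbuild_command(hmm_out: Path, alignment: Path) -> list[str]:
    # hmmbuild expects the output HMM file first, then the alignment file.
    return ["hmmbuild", str(hmm_out), str(alignment)]


def build_leog_hmm(fasta_in: Path, workdir: Path) -> Path:
    """Align one ortholog group and build its profile HMM.

    The output file names are hypothetical and chosen for illustration.
    """
    alignment = workdir / (fasta_in.stem + ".aln.fasta")
    hmm_out = workdir / (fasta_in.stem + ".hmm")
    # Step 1: multiple sequence alignment of the group members.
    with open(alignment, "w") as handle:
        subprocess.run(mafft_einsi_command(fasta_in), stdout=handle, check=True)
    # Step 2: profile HMM built from the alignment.
    subprocess.run(hmmbuild_command(hmm_out, alignment), check=True)
    return hmm_out
```

Calling `build_leog_hmm` once per ortholog group FASTA file would yield one HMM per LEOG, provided that `mafft` and `hmmbuild` are available on the PATH.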
Characterized domains were identified by comparing LEOG HMMs to domain databases using hhsearch version 2.0.15 from the HH-suite75. Specifically, an hhsearch with an E-value threshold of 10⁻⁵ was used to find similarities between the LEOG HMMs and HMMs derived from the following databases: (1) NCBI's Conserved Domain Database (CDD)76, (2) Pfam77, and (3) SMART78, which were downloaded from the HH-suite ftp site (ftp://toolkit.genzentrum.lmu.de/pub/HH-suite/). The resulting hits were manually curated to filter out domains of unknown function and non-informative domains. Additional characterized domains were identified during the process of novel domain detection.
Novel domains were identified as follows. An all-against-all BLAST79 search of all 5,885 putative Legionella effectors was performed with an E-value cutoff of 0.001. From the BLAST hits that received a bit score > 40, we extracted maximal joined segments longer than 50 amino acids that were nearly non-overlapping (overlap < 10 amino acids). The extracted segments were searched using BLAST against the putative effector dataset using a bit-score threshold of 40. The hits of segments that had four or more hits were aligned and used to construct HMMs (as described above). These HMMs, representing conserved domains, were compared to each other using hhsearch. HMMs with a homology probability score ≥ 95% and an E-value < 0.01 across at least 50% of their length were designated as describing the same domain. The detected domain HMMs were scanned for coiled-coil domains using COILS80, and domains that were ≥ 80% covered by coiled-coil domains were labeled as coiled-coil domains. The domain HMMs were further scanned against the HMM databases of CDD76, Pfam77, and SMART78, and those with a homology probability score ≥ 95% and an E-value < 0.01 across at least 50% of their length were annotated according to the characterized domain (after excluding non-informative hits). The domain HMMs were used to scan the putative effector dataset. A domain was considered a novel Legionella effector domain if it did not overlap any characterized domain and appeared in at least 80% of the members of two different ortholog groups, each composed of at least two putative effectors.
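The final criterion for calling a domain novel can be expressed as a short predicate. The following is a minimal sketch, assuming that the domain scan has already been reduced to a set of effector IDs containing the domain and that overlap with characterized domains has been determined separately. The function name, argument names, and data layout are hypothetical; "two different ortholog groups" is read here as at least two groups.

```python
def is_novel_domain(domain_hits, ortholog_groups, overlaps_characterized,
                    min_fraction=0.8, min_groups=2, min_group_size=2):
    """Apply the novel-domain criterion to one domain HMM.

    domain_hits: set of effector IDs in which the domain was detected
    ortholog_groups: dict mapping group ID -> set of member effector IDs
    overlaps_characterized: True if the domain overlaps a characterized domain
    """
    if overlaps_characterized:
        return False
    supporting_groups = 0
    for members in ortholog_groups.values():
        # Only groups with at least two putative effectors count.
        if len(members) < min_group_size:
            continue
        fraction = len(domain_hits & members) / len(members)
        # The domain must appear in at least 80% of the group members.
        if fraction >= min_fraction:
            supporting_groups += 1
    return supporting_groups >= min_groups


# Hypothetical toy data: three ortholog groups and one domain.
groups = {
    "LEOG_A": {"a1", "a2", "a3", "a4", "a5"},
    "LEOG_B": {"b1", "b2"},
    "LEOG_C": {"c1"},  # singleton group, ignored by the criterion
}
hits = {"a1", "a2", "a3", "a4", "b1", "b2", "c1"}
novel = is_novel_domain(hits, groups, overlaps_characterized=False)
```

In this toy case the domain covers 4 of 5 members of LEOG_A and both members of LEOG_B, so two qualifying groups support it; LEOG_C is skipped because it has a single member.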
In the effector-domain network, each node represents an architecture, i.e., a combination of domains present in the same effector. An edge between two architecture nodes represents a domain that is shared by the two architectures. The size of each node is proportional to the number of putative effectors that had the architecture represented by the node. The network was visualized using the igraph package81 of R82. The topology of the domain architecture trees is that of the species trees built from 78 single-copy genes, as specified above. The trees were visualized using iTOL83.
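The network definition above can be illustrated with a small data-structure sketch. The original network was built and drawn with igraph in R; the pure-Python version below is only an assumption-laden illustration of the node and edge rules, with hypothetical function, effector, and domain names. Effectors without any detected domain are skipped here, which is also an assumption.

```python
from collections import Counter
from itertools import combinations


def build_architecture_network(effector_domains):
    """effector_domains: dict mapping effector ID -> iterable of domain names."""
    # Node = architecture (the set of domains found in one effector);
    # node size = number of effectors sharing that architecture.
    node_sizes = Counter(
        frozenset(domains) for domains in effector_domains.values() if domains
    )
    # Edge = any pair of distinct architectures sharing at least one domain,
    # labeled with the shared domain(s).
    edges = {}
    for arch_a, arch_b in combinations(node_sizes, 2):
        shared = arch_a & arch_b
        if shared:
            edges[frozenset((arch_a, arch_b))] = shared
    return node_sizes, edges


# Hypothetical toy input: four effectors and three domains.
effectors = {
    "e1": ["domA", "domB"],
    "e2": ["domA", "domB"],
    "e3": ["domA"],
    "e4": ["domC"],
}
node_sizes, edges = build_architecture_network(effectors)
```

Here the architecture {domA, domB} is a node of size 2, {domA} and {domC} are nodes of size 1, and the only edge links {domA, domB} to {domA} through the shared domain domA.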