When calculating how frequently domains and architectures occur, overrepresented redundant sequences can bias the results. To reduce this bias and compute statistics from more diverse representatives, the sequences are pre-clustered using CD-HIT [77] at a fixed identity threshold, and only the representative sequence from each cluster is used. Results were generated at 90%, 95%, and 99% identity, and the same general patterns emerged at each threshold. The results presented here use 95% identity.
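As an illustration of the representative-selection step, the following minimal sketch reads a CD-HIT cluster listing and keeps one sequence per cluster. It assumes the usual `.clstr` layout, in which each cluster starts with a `>Cluster N` line and the representative member line ends with `*`. The function name and the sequence identifiers are hypothetical, and this is not the authors' script.

```python
def representatives_from_clstr(lines):
    """Return the representative sequence ID of each cluster.

    Assumes CD-HIT's usual .clstr layout: a '>Cluster N' header, then one
    line per member such as '0<TAB>310aa, >seq_id... *', where the trailing
    '*' marks the cluster representative.
    """
    reps = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">Cluster"):
            continue  # skip blank lines and cluster headers
        if line.endswith("*"):
            # The ID sits between '>' and the '...' that CD-HIT appends.
            seq_id = line.split(">", 1)[1].split("...", 1)[0]
            reps.append(seq_id)
    return reps


# Hypothetical two-cluster listing (identifiers invented for illustration).
example = """>Cluster 0
0\t310aa, >lysin_A... *
1\t308aa, >lysin_B... at 97.40%
>Cluster 1
0\t255aa, >lysin_C... *
""".splitlines()

reps = representatives_from_clstr(example)
print(reps)
```

Only the sequences returned here would be carried into the frequency calculations; the non-representative members are set aside.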

To ensure that the proteins analyzed are lytic enzymes and not merely peptidoglycan-associated proteins, sequences are pre-filtered to those containing an annotated CAT domain. The one exception is the assessment of RUFs, where the filter is not applied because these regions might themselves provide the catalytic function.

Domain frequencies are assessed from two different viewpoints. The first viewpoint considers how many “copies” of a particular domain are necessary to construct all the representative proteins. Frequencies are computed as the number of instances of a domain divided by the total number of domain instances over all representative proteins; they sum to 1 and can thus be presented as a waffle chart (generated here using the Python pywaffle package [81]). The second viewpoint considers how many of the representative proteins contain a particular domain at all. Here, frequencies are computed as the number of representative proteins with one or more copies of the domain, divided by the number of representative proteins. Because a protein can contain several different domains, these frequencies need not sum to 1; they are presented as bar charts.
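The two viewpoints can be made concrete with a small sketch. It assumes each representative protein is available as an ordered list of domain names; the protein IDs, domain names, and function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical representatives: protein ID -> domains from N- to C-terminus.
proteins = {
    "p1": ["CAT_A", "CBD_X"],
    "p2": ["CAT_A", "CBD_X", "CBD_X"],
    "p3": ["CAT_B", "CAT_A"],
    "p4": ["CAT_B"],
}


def copy_frequencies(proteins):
    """Viewpoint 1: domain instances / all domain instances (sums to 1)."""
    counts = Counter(d for domains in proteins.values() for d in domains)
    total = sum(counts.values())
    return {d: n / total for d, n in counts.items()}


def presence_frequencies(proteins):
    """Viewpoint 2: proteins containing the domain / number of proteins."""
    # set() ensures a protein is counted once even with repeated copies.
    counts = Counter(d for domains in proteins.values() for d in set(domains))
    return {d: n / len(proteins) for d, n in counts.items()}


copies = copy_frequencies(proteins)
presence = presence_frequencies(proteins)
print(copies)
print(presence)
```

In this toy data set, `CBD_X` accounts for 3 of the 8 domain instances (0.375) but occurs in only 2 of the 4 proteins (0.5), which shows how the two viewpoints can diverge when a domain is repeated within a protein.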

The frequency of an architecture is computed as the fraction of the representative proteins with exactly that architecture. The frequency of a domain A -> domain B connection is computed as the fraction of the representative proteins containing that pair of domains in that order. For clarity of visualization, architectures are filtered for sufficient frequency, and the domains and edges are presented as graphs drawn by the Graphviz software [82] via the Python pygraphviz package [83]. A custom script sets the graphical parameters and lays out the domains on the page according to how frequently each domain occupies a given position between the N- and C-terminus.
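The architecture and connection frequencies can be sketched as follows. Two points are assumptions rather than statements from the text: a connection is taken to mean two domains that are directly adjacent in the architecture, and the cutoff `min_freq` is a hypothetical value standing in for the unspecified "sufficient frequency". All names and data are illustrative.

```python
from collections import Counter

# Hypothetical representatives: protein ID -> domains from N- to C-terminus.
proteins = {
    "p1": ["CAT_A", "CBD_X"],
    "p2": ["CAT_A", "CBD_X", "CBD_X"],
    "p3": ["CAT_A", "CBD_X"],
    "p4": ["CAT_B"],
}


def architecture_frequencies(proteins):
    """Fraction of proteins with exactly each ordered domain tuple."""
    counts = Counter(tuple(domains) for domains in proteins.values())
    return {arch: n / len(proteins) for arch, n in counts.items()}


def edge_frequencies(proteins):
    """Fraction of proteins containing each A -> B pair.

    Assumption: A and B must be directly adjacent. Each protein contributes
    at most once per pair, even if the pair repeats within the protein.
    """
    counts = Counter()
    for domains in proteins.values():
        counts.update(set(zip(domains, domains[1:])))
    return {edge: n / len(proteins) for edge, n in counts.items()}


arch_freq = architecture_frequencies(proteins)
edge_freq = edge_frequencies(proteins)

min_freq = 0.3  # hypothetical cutoff, not taken from the source
frequent = {a: f for a, f in arch_freq.items() if f >= min_freq}
print(frequent)
print(edge_freq)
```

The filtered architectures and the edge frequencies are the kind of quantities that could then be handed to a graph-drawing layer as node and edge attributes.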

Sequence diversity plots use all unique sequences rather than just the representatives, in order to give a sense of the overall commonalities and differences in the proteomes. The same general analysis is performed for variants of each domain, for variants by repeat position, and for RUFs. Sequences are grouped (by organism, repeat position, etc.) and clustered by percent identity using the Python fastcluster package [84]. Heatmaps are generated using the Python matplotlib [85] and seaborn [86] packages.
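The text does not specify how percent identity is computed, so the following is only a minimal sketch under stated assumptions: the sequences are already aligned to equal length with `-` as the gap character, columns where both sequences have a gap are ignored, and a gap opposite a residue counts as a mismatch. The sequences and function name are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences.

    Assumptions: '-' marks a gap; columns that are gaps in both sequences
    are skipped; a gap opposite a residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Hypothetical aligned sequences.
aligned = {
    "s1": "MKTAY-L",
    "s2": "MKTAYQL",
    "s3": "MRTGY-L",
}

# Pairwise distances (1 - identity fraction) in the order
# (s1, s2), (s1, s3), (s2, s3), i.e. a condensed distance vector.
names = list(aligned)
distances = [
    1.0 - percent_identity(aligned[p], aligned[q]) / 100.0
    for p, q in combinations(names, 2)
]
print(distances)
```

A condensed distance vector of this form is the kind of input that hierarchical clustering routines typically accept, and the resulting ordering could then be used to arrange the rows and columns of a heatmap.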
