We collected the Pfam annotations for the 64,756 proteins encoded by publicly available genomes from IMG. Using our eCIS operon database, we marked every Pfam domain found in that database as a “Pfam domain within eCIS operons”. Duplicated domains within a single gene (usually repeat domains) were counted once to avoid inflating the enrichment. We then counted the occurrences of each Pfam domain within eCIS operons, outside eCIS operons, and in total. From these counts we tested each Pfam domain for enrichment in eCIS operons, relative to the rest of the genomes in the analysis, using Fisher's exact test. Multiple hypothesis testing correction was performed using the Benjamini-Hochberg procedure. The adjusted P value (q value) of the Core Component was zero, which cannot be shown on the logarithmic axis of a volcano plot, so for plotting we set it to 1e−250. The data were then plotted using the R EnhancedVolcano package (ref. 72). For each Pfam domain, we counted the number of phyla in which it appeared within eCIS operons. Pfam domains appearing in >10 phyla were marked as “Core Domains”, domains found in 4–10 phyla as “Shell Domains”, and domains found in <4 phyla as “Cloud Domains”. These terms were borrowed from pangenomics.
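The counting, testing, correction and classification steps described above can be illustrated with a minimal Python sketch. It is not the code used in the study. All function names and the input layout (a mapping from gene to Pfam list, plus a set of eCIS genes) are hypothetical. The sketch also makes assumptions that the text does not specify: a one-sided (enrichment) alternative for Fisher's exact test, and a background made up of all other domain occurrences inside and outside eCIS operons.

```python
from collections import Counter
from math import comb


def count_pfams(gene_to_pfams, ecis_genes):
    """Count Pfam occurrences inside and outside eCIS operons.

    Each domain is counted at most once per gene, so repeat domains
    within a single gene do not inflate the counts.
    """
    inside, outside = Counter(), Counter()
    for gene, pfams in gene_to_pfams.items():
        target = inside if gene in ecis_genes else outside
        target.update(set(pfams))
    return inside, outside


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the table [[a, b], [c, d]].

    a: this domain inside eCIS     b: this domain outside eCIS
    c: other domains inside eCIS   d: other domains outside eCIS
    The one-sided alternative is an assumption of this sketch.
    """
    n_domain = a + b
    n_inside = a + c
    total = a + b + c + d
    upper = min(n_domain, n_inside)
    # Exact integer sum of hypergeometric numerators; slow for very large counts.
    numerator = sum(
        comb(n_domain, k) * comb(total - n_domain, n_inside - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, n_inside)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values (q values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def pfam_enrichment(gene_to_pfams, ecis_genes):
    """Map each Pfam domain to (P value, q value) for enrichment in eCIS operons."""
    inside, outside = count_pfams(gene_to_pfams, ecis_genes)
    total_in, total_out = sum(inside.values()), sum(outside.values())
    pfams = sorted(set(inside) | set(outside))
    pvalues = [
        fisher_greater(inside[p], outside[p],
                       total_in - inside[p], total_out - outside[p])
        for p in pfams
    ]
    return dict(zip(pfams, zip(pvalues, benjamini_hochberg(pvalues))))


def q_for_plot(q, floor=1e-250):
    """Replace q values below the floor (including zero) so -log10(q) is finite."""
    return max(q, floor)


def classify_by_phyla(n_phyla):
    """Apply the phylum-count thresholds: >10 Core, 4-10 Shell, <4 Cloud."""
    if n_phyla > 10:
        return "Core Domains"
    if n_phyla >= 4:
        return "Shell Domains"
    return "Cloud Domains"


if __name__ == "__main__":
    # Toy data with made-up gene and domain identifiers.
    toy = {"g1": ["PF_A", "PF_A", "PF_B"], "g2": ["PF_A"],
           "g3": ["PF_C"], "g4": ["PF_C", "PF_B"]}
    for pfam, (p, q) in pfam_enrichment(toy, {"g1", "g2"}).items():
        print(pfam, p, q_for_plot(q))
```

For genome-scale counts, a library routine such as `scipy.stats.fisher_exact` would be a faster alternative to the exact integer sum used here.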