Taxonomic affiliation of sequences was based on hits to the RefSeqABVir database. Each profile in the database was first affiliated based on the origin of its members, with a 75% majority rule: at each taxonomic level, a profile is affiliated to a taxon if more than 75% of the profile sequences are affiliated to this taxon. Then, for each of the 12,498 viral sequences identified by VirSorter, a set of relevant hits was selected: (i) first the profile with the best hit across all genes along the sequence, and (ii) the best hit from other genes with a score close to this ‘absolute’ best hit in the sequence (>75% of the score of the first best hit). The sequence was then affiliated to the Lowest Common Ancestor (LCA) of this set of relevant hits. Hence, a predicted protein will only be affiliated if it points toward sequences or profiles typical of a viral group, and a sequence detected by VirSorter will only be affiliated if its best hits are consistent. Functional affiliation for each PC was based on the comparison of its members (predicted proteins) with PFAM (v. 27, threshold of 50 on score). VCs were affiliated based on their members' affiliations if >75% of these were consistent.
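The affiliation logic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study: the function names, the representation of a lineage as a tuple of taxa ordered from highest to lowest rank, and the choice to stop at the first rank that fails the 75% rule are all assumptions made for the example.

```python
from collections import Counter

def majority_affiliation(lineages, threshold=0.75):
    """Affiliate a profile rank by rank: keep a taxon only if more than
    `threshold` of ALL member sequences carry it (assumed denominator)."""
    total = len(lineages)
    if total == 0:
        return ()
    result = []
    for level in range(max(len(lin) for lin in lineages)):
        counts = Counter(
            lin[level] for lin in lineages
            if level < len(lin) and lin[level] is not None
        )
        if not counts:
            break
        taxon, count = counts.most_common(1)[0]
        if count / total > threshold:
            result.append(taxon)
        else:
            break  # assumption: lower ranks are not considered once one fails
    return tuple(result)

def select_relevant_hits(hits, fraction=0.75):
    """hits: iterable of (gene, lineage, score). Keep the overall best hit,
    plus the best hit of every other gene scoring > fraction of that best."""
    best_per_gene = {}
    for gene, lineage, score in hits:
        if gene not in best_per_gene or score > best_per_gene[gene][1]:
            best_per_gene[gene] = (lineage, score)
    if not best_per_gene:
        return []
    top_score = max(score for _, score in best_per_gene.values())
    return [
        lineage for lineage, score in best_per_gene.values()
        if score == top_score or score > fraction * top_score
    ]

def lowest_common_ancestor(lineages):
    """Return the shared prefix of all lineages (empty tuple if none)."""
    if not lineages:
        return ()
    shared = []
    for taxa in zip(*lineages):
        if taxa[0] is not None and all(t == taxa[0] for t in taxa):
            shared.append(taxa[0])
        else:
            break
    return tuple(shared)
```

In this sketch, a sequence whose two relevant hits point to different families within the same order would be affiliated only at the order level, which mirrors the requirement that the best hits be consistent.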
For the detection of new genera in the VCs, we chose to ignore the 79 VCs that lacked large (>30 kb) genome sequences. This 30 kb threshold is conservative as it avoids considering short genome fragments as new genera but would also overlook small non-circular viral genomes (such as some Tectiviridae). However, because the latter comprise a minority (∼0.1% of 12,498 sequences) of the VirSorter data set (Figure 2), we chose to retain the larger, more conservative threshold.
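As a minimal illustration of this length filter, the sketch below keeps only clusters that contain at least one sequence longer than 30 kb. The dictionary layout (cluster identifier mapped to a list of sequence lengths in bp) and the identifiers are hypothetical and chosen only for the example.

```python
MIN_GENOME_LENGTH = 30_000  # 30 kb threshold, in bp

def clusters_with_large_genome(cluster_lengths, min_length=MIN_GENOME_LENGTH):
    """Keep clusters that contain at least one sequence longer than min_length."""
    return {
        cluster_id: lengths
        for cluster_id, lengths in cluster_lengths.items()
        if any(length > min_length for length in lengths)
    }

# Hypothetical input: cluster ID -> lengths of member sequences (bp)
example = {"VC_1": [45_000, 12_000], "VC_2": [8_000, 29_000]}
retained = clusters_with_large_genome(example)
```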
The seven short circular sequences from Bacteroidia that were detected only with the Viromes database (gi 319430465, 298484481, 329959038, 423221334, 423242675, 423298785, 345651594) were targeted for further examination. Hits to PFAM domains could be found on two proteins: a relaxase (PF03432.9, score ∼170) and a replication initiator protein (PF01051.16, score ∼80). Genome organization was compared with Easyfig (Sullivan et al., 2011) after aligning all genomes to the same starting point (one base before the start of the Rep-domain protein). Recruitment plots of virome contigs (extracted from Kim et al., 2011; Minot et al., 2012) were generated with ggplot2 and based on blastn comparisons.
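Setting a common starting point for circular genomes amounts to rotating each sequence. The sketch below shows one way to do this, assuming the Rep-domain gene lies on the forward strand and its start is given as a 1-based coordinate; the function name and the toy sequence are hypothetical, and strand handling is deliberately left out.

```python
def rotate_circular(sequence, rep_start):
    """Rotate a circular sequence so that it begins one base before
    `rep_start` (1-based start of the Rep-domain gene, forward strand assumed)."""
    if not sequence:
        return sequence
    # 1-based position (rep_start - 1) is 0-based index (rep_start - 2);
    # the modulo wraps around the origin when the gene starts at position 1.
    new_origin = (rep_start - 2) % len(sequence)
    return sequence[new_origin:] + sequence[:new_origin]

# Toy example: gene starts at position 4, so the rotated genome starts at position 3
rotated = rotate_circular("ACGTTGCA", 4)
```

Rotation preserves sequence length and content, so downstream comparisons are affected only in their coordinates.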