Identify regions of unknown function (RUFs)
This protocol is extracted from research article:
Building blocks and blueprints for bacterial autolysins
PLoS Comput Biol, Apr 1, 2021; DOI: 10.1371/journal.pcbi.1008889

When the final residue position in one annotated domain is sufficiently far from the initial residue position in the next annotated domain, we label the intervening residues as a RUF. Short inter-domain spacing is presumably just a linker, but longer RUFs may be as-yet-undiscovered domains; the results presented here used a minimum RUF size of 50 residues. (We call them “regions” because we do not yet know whether they are truly domains, or whether they might even contain multiple domains.) To ensure that a RUF is truly unannotated, we expand the annotation search beyond Pfam to all annotation services provided in CDD.
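The gap rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `find_rufs`, the 1-based inclusive `(start, end)` tuples, and the reading of "minimum RUF size of 50" as "at least 50 residues" are all assumptions made for illustration. The input is assumed to hold every annotated domain for one protein (from all CDD annotation sources), and only gaps between consecutive domains are reported, as in the text.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) residue positions, 1-based, inclusive.
Region = Tuple[int, int]


def find_rufs(domains: List[Region], min_size: int = 50) -> List[Region]:
    """Return inter-domain gaps that contain at least `min_size` residues.

    Assumption: `domains` lists all annotated domains of a single protein.
    Terminal regions before the first or after the last domain are ignored,
    since the text only describes gaps between consecutive domains.
    """
    rufs: List[Region] = []
    ordered = sorted(domains)
    if not ordered:
        return rufs
    prev_end = ordered[0][1]
    for start, end in ordered[1:]:
        gap_start, gap_end = prev_end + 1, start - 1
        # Gap length is negative or zero when annotations overlap or touch.
        if gap_end - gap_start + 1 >= min_size:
            rufs.append((gap_start, gap_end))
        # Track the furthest annotated residue so nested domains do not
        # create spurious gaps.
        prev_end = max(prev_end, end)
    return rufs
```

For example, `find_rufs([(10, 100), (180, 250), (260, 300)])` returns `[(101, 179)]`: the 79-residue gap is kept as a RUF, while the 9-residue gap between positions 250 and 260 is treated as a linker.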

To identify similar RUFs, the RUF sequences are clustered with CD-HIT using a global identity threshold of 40% and a length difference cutoff of 85%. Each cluster is given a unique identifier (RUF1, RUF2, etc.) and an associated reference sequence region based on the accession number and residue positions of the CD-HIT cluster representative. Much like domains, RUFs are stored as individual entities, including their unique identifier and associated reference sequence region. Associations between proteins and their RUFs are also stored, recording the start and stop residues of each RUF within the protein.
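The clustering step might be sketched as follows. This is a hedged illustration, not the authors' pipeline. The mapping of the stated thresholds onto CD-HIT options (`-c 0.4`, `-G 1`, `-s 0.85`) is an assumption; `-n 2` is the word size CD-HIT requires for identity thresholds between 0.4 and 0.5, and `-d 0` keeps full sequence names in the cluster file. The helper names, the naming of sequences as accession plus residue range, and numbering clusters in file order are likewise hypothetical.

```python
import re
from typing import Dict, List, Optional


def cdhit_command(fasta_in: str, out_prefix: str) -> List[str]:
    """Build (but do not run) a CD-HIT command line for RUF clustering."""
    return [
        "cd-hit", "-i", fasta_in, "-o", out_prefix,
        "-c", "0.4",   # sequence identity threshold (40%)
        "-n", "2",     # word size needed for thresholds of 0.4 to 0.5
        "-G", "1",     # use global sequence identity
        "-s", "0.85",  # length difference cutoff (assumed mapping of "85%")
        "-d", "0",     # keep names up to the first space in the .clstr file
    ]


# Member lines in a .clstr file look like:
#   0<TAB>120aa, >NAME... *            (cluster representative)
#   1<TAB>118aa, >NAME... at 46.61%    (other member)
_MEMBER = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(text: str) -> Dict[str, dict]:
    """Map RUF identifiers (RUF1, RUF2, ...) to reference and member names."""
    clusters: Dict[str, dict] = {}
    current: Optional[dict] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = {"reference": None, "members": []}
            clusters[f"RUF{len(clusters) + 1}"] = current
            continue
        match = _MEMBER.search(line)
        if match is None or current is None:
            continue
        name = match.group(1)
        current["members"].append(name)
        if line.endswith("*"):  # CD-HIT marks the representative with '*'
            current["reference"] = name
    return clusters
```

The command list can be passed to `subprocess.run` once CD-HIT is installed, and the resulting `.clstr` file read and handed to `parse_clstr`. If each RUF sequence is named by accession and residue range, the `reference` field gives the reference sequence region for that cluster.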
