Our HC-SEP set included more than one member of each of several distinct protein families: beta-defensins (16), keratin-associated proteins (17), metallothioneins (12), thymosins (6), and guanine nucleotide-binding proteins (6). Since the members of a protein family tend to have similar characteristics, we selected only representative sequences from each of these families to avoid skewing the biophysical characterizations of smORFs in this study.
A multiple sequence alignment of each family using T-COFFEE(47) identified very high levels of similarity (≥95) among the metallothioneins, thymosins, and guanine nucleotide-binding proteins. As such, only one representative sequence from each of those families was selected for biophysical characterization. The genes MT1A (encoding metallothionein-1A), TMSB4Y (encoding thymosin beta-4), and GNG5 (encoding guanine nucleotide-binding protein subunit gamma-5) were chosen as the representatives of their respective families based on their high conservation scores relative to the other sequences in those families.
The keratin-associated proteins (KAPs) did not exhibit high similarity using T-COFFEE. However, several families and subfamilies of KAPs have been previously identified(48) based on their characteristic sequences. From each predetermined subfamily, we selected the representative sequence with the highest T-COFFEE conservation score. The exception was KAP subfamily 22, for which we used both sequences because they did not align well to each other in an EMBOSS Needle pairwise alignment (identity and similarity below 38% and an alignment score of 73.5). In total, eight KAPs were included in the biophysical characterization.
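The representative-selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' script: the function name `pick_representatives`, the record layout, and all identifiers and scores in the example are hypothetical, and the conservation scores are assumed to have been exported from T-COFFEE beforehand.

```python
from collections import defaultdict


def pick_representatives(records, keep_all=frozenset()):
    """Pick one sequence per group, the one with the highest score.

    records  : iterable of (seq_id, group, conservation_score) tuples
    keep_all : groups whose members are all retained (e.g. a subfamily
               whose sequences do not align well to each other)
    """
    by_group = defaultdict(list)
    for seq_id, group, score in records:
        by_group[group].append((seq_id, score))

    chosen = []
    for group, members in by_group.items():
        if group in keep_all:
            # Divergent group: keep every member.
            chosen.extend(seq_id for seq_id, _ in members)
        else:
            # Otherwise keep only the highest-scoring member.
            best_id, _ = max(members, key=lambda m: m[1])
            chosen.append(best_id)
    return chosen


# Hypothetical identifiers and scores, for illustration only.
example = [
    ("kapA1", "subfamily_A", 88.0),
    ("kapA2", "subfamily_A", 93.0),
    ("kapB1", "subfamily_B", 71.0),
    ("kapB2", "subfamily_B", 64.0),
]
representatives = pick_representatives(example, keep_all={"subfamily_B"})
```

Here `subfamily_B` stands in for a group handled like KAP subfamily 22, where both members are kept; every other group contributes a single sequence.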
The largest family of proteins in the HC-SEP set is the beta-defensins. Based on a multiple sequence alignment using T-COFFEE, there is overall poor similarity among the beta-defensins in the HC-SEP set. Furthermore, the available literature does not define any clear families or subfamilies of beta-defensins. This is unsurprising, as the beta-defensin genes are amongst the most rapidly evolving mammalian genes(49) and influence the evolution of adjacent genes(50). Lineage-specific duplications are common among them, alongside rapid sequence divergence. We performed pairwise alignments of all beta-defensins in this study against each other using EMBOSS Needle and removed one member of any pair with identity above 95%, reducing the number of beta-defensins used for biophysical characterization from 16 to 10.
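As a rough sketch of the pairwise redundancy filter, the following Python function greedily retains a sequence only if its identity to every already-retained sequence does not exceed the threshold. It assumes the percent identities have already been computed (for example, parsed from EMBOSS Needle output) and stored in a dictionary. The source does not state which member of a redundant pair was removed; keeping the first one encountered is an assumption of this sketch, and the function name and identifiers are hypothetical.

```python
def filter_redundant(ids, identity, threshold=95.0):
    """Greedy redundancy filter over precomputed pairwise identities.

    ids       : sequence identifiers, in the order they should be considered
    identity  : dict mapping frozenset({a, b}) -> percent identity
    threshold : pairs with identity above this value count as redundant
    """
    kept = []
    for candidate in ids:
        # Keep the candidate only if it is not redundant with anything kept.
        # Pairs missing from the dict are treated as 0% identity (assumption).
        if all(
            identity.get(frozenset((candidate, other)), 0.0) <= threshold
            for other in kept
        ):
            kept.append(candidate)
    return kept


# Hypothetical identifiers and identities, for illustration only.
pairs = {
    frozenset(("defB_1", "defB_2")): 98.5,
    frozenset(("defB_1", "defB_3")): 41.0,
    frozenset(("defB_2", "defB_3")): 40.2,
}
retained = filter_redundant(["defB_1", "defB_2", "defB_3"], pairs)
```

Because the filter is order-dependent, sorting `ids` by a preference criterion before calling it would determine which member of each redundant pair survives.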
The set also contained groups that appeared, based on their names, to be other protein families, e.g. small integral membrane proteins. However, a literature search and multiple sequence alignments of these groups revealed no basis for treating them as distinct protein families. Thus, all of these smORFs were included in the biophysical characterization.
This pruning yielded a list of 140 non-redundant SEPs, which we refer to as the NR-SEP set. The lists of retained and removed sequences are available in the Supporting materials. Comparisons in this work are always made between size-matched lists, so that the negative and randomized control sets (described below) are matched to the NR-SEP set.