To test whether the bilaterian-specific gene set of 157 orthogroups is enriched for transcription factors, we used the human reference proteome (20,205 protein sequences) as a control. It was downloaded from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/, and transcription factors were predicted in this dataset using the PfamScan software (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/) with an E-value cutoff of 5 × 10^-5. We then determined the abundance of 10 prevalent DNA-binding domains in the dataset: "Basic; bZIP_2; HLH; HNF-1_N; Homeobox; Hox9_act; HPD; SOBP; THAP; zf-". These domains were identified in 1,756 of the 20,205 human reference proteins. Next, we randomly drew 10 control sets of 157 genes each from the reference set and counted the transcription factors (proteins with any of the above-mentioned domains) in each subset. The 10 control sets contained on average 12.8 ± 4.44 transcription factors (mean ± standard deviation), whereas the equally sized bilaterian-specific gene set (157 orthogroups) contained 37. Modelling a normal distribution from this mean and standard deviation yielded a p-value of 2.512e-08 for the transcription factor content of the bilaterian-specific genes (upper tail, R function "pnorm"). Likewise, a Pearson's chi-squared test on the corresponding data matrix (1,756:20,205; 37:157), using the R function "chisq.test", yielded a p-value of 3.805e-08. Finally, under the assumption of a binomial distribution (R function "pbinom") and given that there are 1,756 transcription factors among 20,205 human proteins, the probability of obtaining 37 or more transcription factors when drawing 157 random proteins is p = 1.841e-08.
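The three tests described above can be retraced with a short script. The following is a minimal sketch in Python using only the standard library, not the original R code. The function names, the seed and the `control_tf_counts` helper are hypothetical and introduced only for illustration. The sketch also assumes that "chisq.test" was run with its default continuity (Yates) correction, and that the p-values are upper-tail probabilities for a count of 37. The counts (20,205; 1,756; 157; 37) and the control mean and standard deviation (12.8; 4.44) are taken from the text.

```python
import math
import random

# Counts reported in the text
N_PROTEINS = 20205   # human reference proteome size
N_TF = 1756          # proteins carrying one of the listed DNA-binding domains
SET_SIZE = 157       # bilaterian-specific orthogroups
OBSERVED_TF = 37     # transcription factors among those orthogroups


def control_tf_counts(n_sets=10, seed=0):
    """Draw n_sets random subsets of SET_SIZE proteins and count the TFs in each.

    Hypothetical helper: the real analysis sampled annotated proteins; here the
    proteome is represented only by TF / non-TF flags built from the totals.
    """
    rng = random.Random(seed)  # seed is an arbitrary choice for reproducibility
    is_tf = [True] * N_TF + [False] * (N_PROTEINS - N_TF)
    return [sum(rng.sample(is_tf, SET_SIZE)) for _ in range(n_sets)]


def normal_upper_tail(x, mean, sd):
    """P(X >= x) for Normal(mean, sd); analogous to pnorm(..., lower.tail=FALSE)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))


def binomial_upper_tail(k, n, p):
    """P(X >= k) for Binomial(n, p); analogous to pbinom(k - 1, n, p, lower.tail=FALSE)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chi2_yates_2x2(a, b, c, d):
    """Pearson chi-squared test with Yates correction on [[a, b], [c, d]] (1 df)."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            stat += diff**2 / expected
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2))
    return stat, math.erfc(math.sqrt(stat / 2))


if __name__ == "__main__":
    print("control TF counts:", control_tf_counts())
    print("normal   p =", normal_upper_tail(OBSERVED_TF, 12.8, 4.44))
    print("chi-sq   p =", chi2_yates_2x2(N_TF, N_PROTEINS, OBSERVED_TF, SET_SIZE)[1])
    print("binomial p =", binomial_upper_tail(OBSERVED_TF, SET_SIZE, N_TF / N_PROTEINS))
```

Under these assumptions the three p-values should come out close to those reported above. The simulated control counts depend on the random seed and will not reproduce the original 10 draws.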