TSS and TES clusters were assigned to their nearby genes using “bedtools closest” [89], which was also used to assign each TSS to its nearest TES. Each TSS/TES was assigned to only a single closest, non-overlapping gene feature. To remove spurious TSSs, we sequenced a non-decapping control sample for the 5′ end data. We noticed that some TSS clusters located within gene bodies also showed enriched read signal in the non-decapping control, suggesting that these clusters might not be genuine TSSs. To remove this type of potential artifact, for each called TSS cluster we calculated the Spearman correlation between the non-decapping sample and the 0 h sample using the base-wise read counts in the region spanning the TSS cluster ± 5 bp. We excluded from downstream analysis the 273 suspicious TSS clusters that showed a significant correlation between the non-decapping sample and the 0 h sample (Spearman’s r > 0.5 and p < 0.05).
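The correlation filter can be illustrated with a minimal Python sketch. It assumes the base-wise read counts for the TSS cluster ± 5 bp are already available as two equal-length lists. The text does not state how the p-value was computed, so the Monte Carlo permutation test used here is an assumption, as are all function names and the `n_perm` and `seed` parameters.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat an undefined correlation as none
    return sxy / (sxx * syy) ** 0.5


def spearman_permutation(x, y, n_perm=2000, seed=0):
    """Spearman's r plus a two-sided permutation p-value (assumed test)."""
    rx, ry = average_ranks(x), average_ranks(y)
    rho = pearson(rx, ry)
    rng = random.Random(seed)
    shuffled = ry[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(rx, shuffled)) >= abs(rho) - 1e-12:
            hits += 1
    return rho, (hits + 1) / (n_perm + 1)


def is_suspicious(control_counts, sample_counts, rho_min=0.5, alpha=0.05):
    """Flag a TSS cluster whose control profile tracks the 0 h profile."""
    rho, p = spearman_permutation(control_counts, sample_counts)
    return rho > rho_min and p < alpha
```

A cluster for which `is_suspicious` returns `True` would be dropped under the thresholds quoted above (r > 0.5 and p < 0.05).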
The remaining TSS clusters were first assigned to downstream gene features in the same orientation if they were within 100 bp of the start of the feature. Some TSS clusters lay farther than 100 bp from, but still within 1000 bp of, a downstream gene feature in the same orientation. In such cases, a 30-bp window was slid from the TSS cluster to the start of the gene, and the cluster was assigned to the gene only if the median read count in every window was greater than zero.
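The two-tier TSS rule can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the window step size is not given in the text (1 bp is assumed here), the distance thresholds are treated as inclusive, and `gap_coverage` is a hypothetical list of per-base read counts between the TSS cluster and the gene start.

```python
from statistics import median


def window_medians(coverage, window=30, step=1):
    """Median read count of each sliding window over per-base coverage."""
    if not coverage:
        return []
    if len(coverage) < window:
        return [median(coverage)]  # assumption: one short window
    last_start = len(coverage) - window
    return [median(coverage[i:i + window]) for i in range(0, last_start + 1, step)]


def gap_is_covered(coverage, window=30, step=1):
    """True if every window between cluster and gene has median > 0."""
    medians = window_medians(coverage, window, step)
    return bool(medians) and all(m > 0 for m in medians)


def assign_tss(distance_bp, gap_coverage, near=100, far=1000):
    """Decide whether a TSS cluster is assigned to the downstream gene."""
    if distance_bp <= near:
        return True  # close clusters are assigned directly
    if distance_bp <= far:
        return gap_is_covered(gap_coverage)  # needs continuous signal
    return False
```

Because the test uses the window median rather than the minimum, a short dropout (fewer than half of the 30 positions in any window) does not break the link, whereas a stretch of 30 or more uncovered bases does.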
TES clusters were assigned to upstream gene features in the same orientation if they were within 100 bp of the end of the feature. Some TES clusters lay farther than 100 bp from, but still within 1000 bp of, an upstream gene feature in the same orientation. In such cases, additional assignment criteria were adopted, modified from a previously described approach [11]. A 30-bp window was slid from the TES cluster to the end of the gene, and the cluster was assigned to the gene only if the median read count in every window was greater than zero. In addition, the median read count in each window had to be greater than or equal to 5% of the median read count over the gene feature. Lastly, the maximum read count in each of the intervals had to be less than or equal to five times the maximum read count over the gene feature.
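The three criteria for distant TES clusters can be combined in a short sketch. It assumes that the "intervals" in the last criterion are the same 30-bp windows, that the window step is 1 bp, and that `gap_cov` and `gene_cov` are hypothetical per-base read-count lists for the region between gene end and TES cluster and for the gene feature, respectively.

```python
from statistics import median


def tes_passes(gap_cov, gene_cov, window=30, step=1,
               min_frac=0.05, max_fold=5.0):
    """Check the three window criteria for a TES cluster 100-1000 bp away."""
    gene_median = median(gene_cov)
    gene_max = max(gene_cov)
    last_start = max(len(gap_cov) - window, 0)
    for start in range(0, last_start + 1, step):
        win = gap_cov[start:start + window]
        if not win:
            return False  # no coverage data between gene and cluster
        win_median = median(win)
        if win_median <= 0:
            return False  # criterion 1: median must exceed zero
        if win_median < min_frac * gene_median:
            return False  # criterion 2: at least 5% of the gene median
        if max(win) > max_fold * gene_max:
            return False  # criterion 3: no peak above 5x the gene maximum
    return True
```

Read this way, the second criterion rejects regions where the signal collapses relative to the gene, and the third rejects regions containing a peak far stronger than the gene itself, which would more plausibly belong to an independent transcript.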
The output from “bedtools closest” was also used to determine the 5′ and 3′ UTR lengths. The 5′ UTR length is defined as the distance, in nucleotides, from the apex of a TSS cluster to the AUG of an annotated ORF. The 3′ UTR length is defined as the distance, in nucleotides, from the apex of a TES cluster to the stop codon of an annotated ORF. TSS/TES clusters were also assigned to genes if they were located within gene bodies in the same orientation.
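As a strand-aware illustration of the UTR length definitions, the sketch below assumes 1-based inclusive genomic coordinates with `orf_start` <= `orf_end`, where the ORF spans the first base of the AUG through the last base of the stop codon. The coordinate convention and function names are assumptions and are not taken from the original analysis.

```python
def utr5_length(tss_apex, orf_start, orf_end, strand):
    """Nucleotides from the TSS cluster apex to the first base of the AUG."""
    if strand == "+":
        return orf_start - tss_apex
    if strand == "-":
        return tss_apex - orf_end  # on the minus strand the AUG is at orf_end
    raise ValueError("strand must be '+' or '-'")


def utr3_length(tes_apex, orf_start, orf_end, strand):
    """Nucleotides from the last base of the stop codon to the TES apex."""
    if strand == "+":
        return tes_apex - orf_end
    if strand == "-":
        return orf_start - tes_apex  # stop codon ends at orf_start
    raise ValueError("strand must be '+' or '-'")
```

In this sketch a negative value means the apex lies inside the ORF, as can happen for clusters located within gene bodies, and no UTR length is defined for that cluster.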