Quality controls and preliminary analyses

Mehrnush Forutan; Elizabeth Ross; Amanda J. Chamberlain; Loan Nguyen; Brett Mason; Stephen Moore; Josie B. Garner; Ruidong Xiang; Ben J. Hayes

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Quality controls and preliminary analyses

MF Mehrnush Forutan

ER Elizabeth Ross

AC Amanda J. Chamberlain

LN Loan Nguyen

BM Brett Mason

SM Stephen Moore

JG Josie B. Garner

RX Ruidong Xiang

BH Ben J. Hayes

This method is extracted from research article: Commun Biol, Jul 2021

Evolution of tissue and developmental specificity of transcription start sites in Bos taurus indicus

DOI: 10.1038/s42003-021-02340-6

Ask a question

Favorite

Only primary alignments with a quality of greater than 20 (<99% chance of true) were considered for TSS calling. Furthermore, the CAGE tag starting sites (CTSSs), a nucleotide position on the genome from which an alignment of CAGE tag starts, supported by 3 or more CAGE read 5’-ends in a single sample were selected. The total number of reads before and after quality control and selected numbers of supporting CAGE tags in each tissue is shown in Supplementary Table 1. The flowchart illustrating the entire flow of the analysis is shown in Fig. 1. Different scenarios were investigated based on the scope of the analysis and the way biological replicates/tissues were taken into account:

The pairwise comparisons between Bos taurus and Bos indicus subspecies for spleen, muscle, and adult liver were carried out by following the workflow in Fig. 1. For this scenario, Bos taurus biological replicates for each tissue were merged together resulting in a single sample that contained a union of CTSSs present in the replicates and raw tag counts for those CTSSs. Then, TSSs were called in each subspecies tissue and then TSSs from Bos taurus and Bos indicus samples were aggregated into a single set of consensus TSS clusters for each tissue. Different TSS properties, including the proportion of divergent TSSs (i.e., TSSs observed only in one subspecies), distance distribution between dominant TSSs across subspecies, and shifting of usage from one TSS to an alternative TSS in the other subspecies were compared (see Differential TSS usage and shifting score calculation).

To investigate the changes in Minor Allele Frequency (MAF) in Bos taurus population vs. Bos indicus population in significant differential TSS regions, all variants within significant differential TSS cluster regions across subspecies was extracted (2,926 loci) from 200 Brahman (Bos indicus) and 1053 Holstein (Bos taurus) animals available from 1000 bull genomes project (Run8). To identify SNP loci that had been under selection, pairwise F_ST-values comparing Bos taurus and Bos indicus was calculated for SNPs with MAF greater than 0.005 using Weir and Cockerham’s method⁵⁴ implemented in vcftools (version 0.1.13)⁵⁵. Then, a Chi-square test was used to identify SNPs with significant F_ST at P-value < 0.05, using the test statistic X² = 2 NF_ST, where 2N = the sum of genotyped gametes in the two populations⁵⁶. Finally, corresponding overall estimates of the false discovery rate (FDR) were calculated for each significant SNP as l*P/d, where l = number of SNP loci tested (667 SNPs), P = the significance level of the individual chi-square tests, and d = the number of SNP loci identified with a significant F_ST (475 SNPs). To test for differences between the MAF of significant SNPs identified using F_ST method (FDR < 0.05) within significant differential TSS cluster compared to within the whole genome, a hypergeometric distribution test was used (P-value < 0.01). The difference in MAF within each of 50 bins, where each bin contained a 1% range of MAF values was tested using hypergeometric probability density function:

where p(x, N, n, m) is the probability of observing x SNPs within a given MAF bin in a sample of 472 SNPs (n) drawn from a Bos indicus (Bos taurus) genome-wide of 34,645,762 (15,870,794) SNPs (N) containing m SNPs within a given MAF bin. The values of x are defined based on the number of significant SNPs identified using F_ST method (FDR < 0.05) in a given MAF bin for each subspecies. The value of m are obtained based on the number of SNPs observed in a given MAF bins for each subspecies. This was done for all 50 MAF bins for both subspecies and P-values were adjusted for multiple testing (FDR < 0.001).

Due to the lack of biological replicates, the experimental design does not allow support of the conclusions from the evolutionary comparative analyses. Consequently, the reported differences obtained from the scenario1 cannot be reliably assigned to the factor of interest (the subspecies). To support our conclusion from evolutionary TSS discovery (scenario1) and get insight about expected variation within subspecies, pairwise comparisons were implemented between the taurus biological replicates for spleen, muscle and liver. TSSs were called in each biological replicate in three tissues and then TSSs from two biological replicates were aggregated into a single set of consensus TSS clusters for each tissues (Fig. 1). Similar to first scenario, the proportion of divergent TSSs (i.e., TSSs observed only in one subspecies), distance distribution between dominant TSSs across biological replicates, and shifting of usage from one TSS to an alternative TSS in the other replicate were compared (see Shifting score measurement paragraph below). This could estimate the expected difference observed between samples from the same subspecies, and potentially help to compare samples from the Bos indicus animal.

The pairwise comparisons between fetal and adult stages for Bos taurus liver, Bos indicus lung and liver were done by following the workflow in Fig. 1. Similar to first scenario, the Bos taurus biological replicates were merged together before TSS calling. Then TSSs called in fetal and adult stage were aggregated into a single set of consensus TSS clusters separately for each tissue and different properties of TSSs between fetal and adult stages were compared.

To investigate the crossover switching event, TSSs called in individual fetal and adult taurus replicates (four samples for each tissue) were aggregated into a single set of consensus TSS clusters. Only genes having at least two consensus TSSs with at least 1 TPM expressed in both replicates were used. The null hypothesis was that there was no switching for the two TSS clusters. The test of this hypothesis was performed using lm() function in R (version 4.0.2) as follows and candidate switching events identified at this preliminary stage if the p-value for interaction term was less than 0.01.

where Y represents the TSS expression, TSS represents the effect of the consensus TSSs (presumed here to be 1 and 2), stage is the effect of developmental stage (fetal and adult), and the interaction term between TSS and stage.

A crossover TSS switching event was detected if one TSS cluster was used more frequently at one developmental stage compared to the other, and that the dominant TSS cluster switches between fetal and adult stages. Corresponding overall estimates of the false discovery rate (FDR) were calculated for all significant genes based on the number of gene tested (388 genes), the significance level of the individual linear regression model for interaction term, the number of genes with crossover TSS switching event.

For this scenario, biological replicates were merged before TSS calling for Bos taurus subspecies. The TSS were called for each tissue and then to be able to compare samples at the level of clusters of TSSs between tissues, TSS clusters from all samples were aggregated into a single set of consensus TSS clusters (Fig. 1) separately for each subspecies. Then, for each supspecies (1) the distance distribution between dominant TSSs, (2) the proportion of TSS clusters with differential TSS usage, and (3) significant ‘shifting’ TSS clusters for pairs of tissues was estimated.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol