Bioinformatics and statistical analysis

Matthew Hamilton
Stewart Russell
Grace M. Swanson
Stephen A. Krawetz
Karen Menezes
Sergey I. Moskovtsev
Clifford Librach

Sequence processing and analyses were performed as previously described [26,27]. Briefly, reads were first aligned to the telomere-to-telomere (T2T) genome (T2T-CHM13v2.0) and then processed with the RNA Element Discovery algorithm (REDa), which identifies exon-sized RNA fragments called RNA Elements (REs). REs originate from exonic and intronic regions, from regions within 10 kb of a known exon, and from regions more than 10 kb from a known exon (novel orphan) [26]. Read counts were normalized to reads per kilobase of exon model per million mapped reads (RPKM) to compare expression levels among samples [28]. Welch’s one-way ANOVA was used to test for significant differences in RE and read categories among groups.
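The RPKM normalization mentioned above can be illustrated with a minimal sketch. The function name `rpkm` and the example numbers are hypothetical and are not taken from the REDa pipeline; the sketch only shows the standard RPKM arithmetic.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of element per million mapped reads.

    Standard definition: count / (length in kb) / (mapped reads in millions),
    which simplifies to count * 1e9 / (length_bp * total_mapped_reads).
    """
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length_bp and total_mapped_reads must be positive")
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp element
# in a library of 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
print(example_rpkm)
```

Because the value is scaled by both element length and library size, REs of different lengths can be compared across samples sequenced to different depths.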

Transcript integrity was evaluated using the transcript integrity index (TII) algorithm, with reads aligned to the GENCODE hg38 version 41 genome as described previously, to assess the quality of the RNA present in the sperm samples despite the high level of fragmentation inherent in sperm [27]. A different human genome build was used for this step because the TII algorithm requires R version 3.6.0, which precludes use of the T2T genome. A transcript was considered intact if it received a TII score greater than 0.5, which indicates that at least half of the transcript was covered by a minimum of 5 reads per million (RPM). Samples not meeting TII thresholds were excluded from the analysis. Transcript integrity was also confirmed visually using the T2T genome track in the UCSC Genome Browser.
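The intactness rule can be sketched as follows. This is an illustration of the threshold logic only, assuming the score is the fraction of transcript positions with coverage of at least 5 RPM; the published TII algorithm may compute coverage differently, and the function names and coverage values here are hypothetical.

```python
def coverage_fraction(per_base_rpm, min_rpm=5.0):
    """Fraction of transcript positions covered at or above min_rpm.

    Assumption: this stands in for the TII score; it is not the
    published implementation.
    """
    if not per_base_rpm:
        raise ValueError("per_base_rpm must not be empty")
    covered = sum(1 for depth in per_base_rpm if depth >= min_rpm)
    return covered / len(per_base_rpm)


def is_intact(per_base_rpm, min_rpm=5.0, min_score=0.5):
    """A transcript is called intact if its score is greater than min_score."""
    return coverage_fraction(per_base_rpm, min_rpm) > min_score


# Hypothetical per-position coverage (RPM) for a 10-position transcript.
toy_coverage = [0, 0, 6, 7, 8, 9, 5, 1, 0, 12]
toy_score = coverage_fraction(toy_coverage)
toy_intact = is_intact(toy_coverage)
print(toy_score, toy_intact)
```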

Identified REs were filtered to retain only those with greater than 1 RPKM in at least one third of the samples, removing low-abundance REs. Mfuzz (version 2.54.0) clustering of group median RPKM values was performed in R with 8 clusters and a membership probability cutoff of 0.4. A Kruskal-Wallis test in R was used to assign a p-value to each RE present in the cluster patterns. Principal component analysis (PCA) of significant REs was performed and plotted using ClustVis (https://biit.cs.ut.ee/clustvis/). The singular value decomposition PCA method was used with imputation of missing values. Pareto scaling was applied to rows, dividing each row by the square root of its standard deviation.
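Two of the steps above, the abundance filter and Pareto scaling, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' R or ClustVis code: the function names are hypothetical, rows are assumed to be mean-centered before scaling, and the sample standard deviation is assumed.

```python
import math
import statistics


def passes_abundance_filter(rpkm_values, min_rpkm=1.0):
    """True if more than min_rpkm is reached in at least one third of samples."""
    n_above = sum(1 for value in rpkm_values if value > min_rpkm)
    # Integer comparison avoids floating-point issues with 1/3.
    return 3 * n_above >= len(rpkm_values)


def pareto_scale(row):
    """Mean-center a row and divide by the square root of its standard deviation."""
    mean = statistics.fmean(row)
    sd = statistics.stdev(row)  # sample SD (assumption)
    if sd == 0:
        return [0.0 for _ in row]
    scale = math.sqrt(sd)
    return [(value - mean) / scale for value in row]


# Hypothetical RPKM values for one RE across six samples.
kept = passes_abundance_filter([0.2, 1.5, 3.0, 0.0, 0.9, 0.1])
dropped = passes_abundance_filter([0.2, 1.5, 0.0, 0.0, 0.9, 0.1])
scaled = pareto_scale([2.0, 4.0, 6.0])
print(kept, dropped, scaled)
```

Pareto scaling sits between no scaling and unit-variance scaling: high-variance rows are damped, but not to the point where every row contributes equally to the principal components.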

Gene ontology enrichment analysis was completed using the following programs: EnrichR (https://maayanlab.cloud/Enrichr/); GeneMANIA (Cytoscape module version 3.5.2); STRINGdb (version 11.5); MSigDB (Human MSigDB v2023.1.Hs); and Metascape (v3.5.20230501). For patterns containing multiple clusters, ontology analysis was performed on the major patterns rather than on individual clusters. Significant Mfuzz RE-associated RNAs were compared to miRNA target gene expression. TargetScan and miRTarBase were used to determine and catalog target gene names for previously reported miRNAs [22]. Significant REs were cross-referenced with the paternally provided RE lists from Estill et al., 2019, as modified by Swanson et al., 2023 (5× enriched and 2× enriched in sperm compared to oocyte) [26,30].
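The cross-referencing step amounts to a set intersection between the significant REs and each reference list. A minimal sketch follows; the RE identifiers and the function name are hypothetical placeholders, not values from the cited lists.

```python
def cross_reference(significant_res, reference_res):
    """Return the sorted REs present in both the significant set and a reference list."""
    return sorted(set(significant_res) & set(reference_res))


# Hypothetical identifiers for illustration only.
significant = ["RE_0001", "RE_0002", "RE_0003", "RE_0004"]
paternal_5x = ["RE_0002", "RE_0004", "RE_0099"]
paternal_2x = ["RE_0001", "RE_0002", "RE_0050"]

shared_5x = cross_reference(significant, paternal_5x)
shared_2x = cross_reference(significant, paternal_2x)
print(shared_5x, shared_2x)
```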
