Data processing

Vivian Link; Yuómi Jhony A. Zavaleta; Rochelle-Jan Reyes; Linda Ding; Judy Wang; Rori V. Rohlfs; Michael D. Edge

This method is extracted from research article: bioRxiv, Mar 2023

Microsatellites used in forensics are located in regions unusually rich in trait-associated variants

DOI: 10.1101/2023.03.07.531629

We sought to describe the genomic neighborhoods of all 1.6 million STR regions identified in the hipSTR reference in terms of their density of key annotated features: in particular, coding genes, common SNPs, trait-associated variants, and DNase I hypersensitivity sites. Before doing so, we preprocessed the feature data from UCSC as described below.

For coding gene locations, we used the RefSeq Select set, which contains one entry per curated coding gene (21,432 genes). We also located the transcription start site (TSS) of each gene as either the start or end coordinate of transcription, depending on whether the gene was annotated on the + (TSS = start) or − (TSS = end) strand. To identify SNPs common in people of European ancestries, who are heavily represented in GWAS (Martin et al., 2019; Popejoy & Fullerton, 2016), we filtered to SNPs with a minor allele frequency of at least 1% in the HapMap CEU data, reducing the number of variants from 4,029,798 to 2,705,918. We limited ClinVar variants to those classified as “Pathogenic,” reducing the number of variants from 1,491,509 to 113,412. For DNase I hypersensitivity sites, we limited to sites with the highest signal level (score 1000/1000), reducing the number of sites from 1,949,038 to 160,870.
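The strand rule for the TSS and the three threshold filters can be sketched as follows. This is only an illustration in Python, not the authors' R code; the record layouts and the field names `maf`, `clinical_significance`, and `score` are hypothetical stand-ins for whatever columns the UCSC tables actually provide.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical record layout; real annotation tables use their own column names.
    name: str
    tx_start: int
    tx_end: int
    strand: str  # "+" or "-"


def tss(gene: Gene) -> int:
    """TSS is the transcription start on '+' and the transcription end on '-'."""
    if gene.strand == "+":
        return gene.tx_start
    if gene.strand == "-":
        return gene.tx_end
    raise ValueError(f"unexpected strand: {gene.strand!r}")


def keep_common_snps(snps, min_maf=0.01):
    """Keep SNPs whose minor allele frequency is at least min_maf (1% by default)."""
    return [s for s in snps if s["maf"] >= min_maf]


def keep_pathogenic(variants):
    """Keep variants whose classification is exactly 'Pathogenic'."""
    return [v for v in variants if v["clinical_significance"] == "Pathogenic"]


def keep_top_dnase_sites(sites, top_score=1000):
    """Keep DNase I hypersensitivity sites with the maximum signal score."""
    return [s for s in sites if s["score"] == top_score]


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    genes = [Gene("GENE_A", 1000, 5000, "+"), Gene("GENE_B", 2000, 9000, "-")]
    for g in genes:
        print(g.name, tss(g))
```

Exact string matching on the classification is an assumption of this sketch; it would exclude compound labels such as "Pathogenic/Likely pathogenic", and the text does not say how such labels were handled.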

We preprocessed the GWAS catalog in two distinct ways. The GWAS catalog contains one row per unique combination of SNP locus (rsid), study (PubMed ID), and trait, for a total of 392,271 entries. To obtain information about the number of SNPs identified as trait-associated in any GWAS, we first filtered the GWAS catalog to contain only one row per SNP locus, reducing it to 183,014 rows. Thus, for counts of GWAS hits, each SNP rsid counts only once, regardless of how many studies identified it and regardless of how many traits it was associated with. Next, we sought to identify traits with nearby GWAS associations for each STR. The trait identifiers in the GWAS catalog are not standardized, and many similar traits receive distinct names (for example, “HDL cholesterol” and “HDL cholesterol levels,” or “Mean corpuscular hemoglobin” and “Mean corpuscular hemoglobin concentration”). To reduce this redundancy and focus on commonly studied traits when counting the number of distinct traits near each STR, we limited the set to traits with associations reported in at least three distinct studies under the exact same trait name. This reduced the number of traits from 10,399 to 493.
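Both GWAS catalog reductions are simple grouping operations. The sketch below illustrates them in Python on toy rows; the dictionary keys `rsid`, `pubmed_id`, and `trait` are hypothetical names for the three columns described above, and the example rows are invented.

```python
from collections import defaultdict


def one_row_per_snp(rows):
    """Keep the first row seen for each rsid, so every SNP counts once."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row["rsid"], row)
    return list(first_seen.values())


def commonly_studied_traits(rows, min_studies=3):
    """Return trait names reported, under the exact same string, in at
    least min_studies distinct studies."""
    studies_by_trait = defaultdict(set)
    for row in rows:
        studies_by_trait[row["trait"]].add(row["pubmed_id"])
    return {t for t, studies in studies_by_trait.items() if len(studies) >= min_studies}


if __name__ == "__main__":
    # Invented rows: one row per (rsid, study, trait) combination.
    catalog = [
        {"rsid": "rs1", "pubmed_id": 101, "trait": "HDL cholesterol"},
        {"rsid": "rs1", "pubmed_id": 102, "trait": "HDL cholesterol"},
        {"rsid": "rs2", "pubmed_id": 103, "trait": "HDL cholesterol"},
        {"rsid": "rs2", "pubmed_id": 104, "trait": "HDL cholesterol levels"},
    ]
    print(len(one_row_per_snp(catalog)))
    print(commonly_studied_traits(catalog))
```

Because trait names are compared as raw strings, "HDL cholesterol" and "HDL cholesterol levels" remain separate traits here, which mirrors the behavior described in the text.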

For all features and all STRs, we recorded the distance from the STR midpoint to the nearest feature, and the number of features within 1 kb, 10 kb, and 100 kb of the STR midpoint. For coding gene locations, we kept track of the distance to the nearest gene (defined as the distance to the start or end of transcription, whichever is shorter, or 0 if the STR is intragenic) and the distance to the nearest TSS separately. For the GWAS catalog, we kept track of the number of GWAS hits within each distance window as well as the number of distinct associated traits (where again, distinctness merely means a non-identical character string). Because of the large size of the dbSNP common variants catalog, we recorded these locations only for the 20 CODIS markers. Additionally, for the CODIS markers only, we recorded the names of the traits reported as associated in ClinVar and the GWAS catalog, as well as the names of nearby protein-coding genes.
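The proximity measures can be illustrated with a short Python sketch using binary search over sorted positions. It rests on several assumptions not stated in the text: features are treated as single positions on the same chromosome as the STR, window boundaries are inclusive, and 1 kb is taken as 1,000 bp. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

WINDOWS_BP = (1_000, 10_000, 100_000)  # 1 kb, 10 kb, 100 kb


def str_midpoint(start, end):
    """Midpoint of an STR given its start and end coordinates."""
    return (start + end) / 2


def nearest_distance(sorted_positions, mid):
    """Distance from mid to the closest feature, or None if there are no features."""
    if not sorted_positions:
        return None
    i = bisect_left(sorted_positions, mid)
    # Only the neighbors on either side of the insertion point can be nearest.
    neighbors = sorted_positions[max(i - 1, 0): i + 1]
    return min(abs(p - mid) for p in neighbors)


def count_within(sorted_positions, mid, window):
    """Number of features whose position lies in [mid - window, mid + window]."""
    lo = bisect_left(sorted_positions, mid - window)
    hi = bisect_right(sorted_positions, mid + window)
    return hi - lo


def gene_distance(tx_start, tx_end, mid):
    """0 if mid falls inside the transcribed region, else the distance to the
    nearer of the transcription start and end."""
    if tx_start <= mid <= tx_end:
        return 0
    return min(abs(mid - tx_start), abs(mid - tx_end))


if __name__ == "__main__":
    features = sorted([100, 5000, 20000])  # invented feature positions
    mid = str_midpoint(4490, 4510)
    print(nearest_distance(features, mid))
    print({w: count_within(features, mid, w) for w in WINDOWS_BP})
```

Sorting each feature set once per chromosome and searching it for every STR keeps the cost per query logarithmic, which matters with 1.6 million STRs and millions of features.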

The data processing and analysis scripts, written in R (v. 4.1.2; R Core Team, 2021) and using the data.table package (Dowle et al., 2019), are available at https://github.com/edgepopgen/CODIS_proximity. The output files recording the features proximal to each STR are available in supplementary files.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
