We sought to describe the genomic neighborhoods of all 1.6 million STR regions identified in the hipSTR reference in terms of their density of key annotated features, in particular coding genes, common SNPs, trait-associated variants, and DNase I hypersensitivity sites. Before doing so, we preprocessed the feature data obtained from UCSC as described below.
For coding gene locations, we used the RefSeq Select set, which contains one entry per curated coding gene (21,432 genes). We located the transcription start site (TSS) of each gene as either the start or end coordinate of transcription, depending on whether the gene was annotated on the + strand (TSS = start) or the − strand (TSS = end). To identify SNPs common in people of European ancestries, who are heavily represented in GWAS (Martin et al., 2019; Popejoy & Fullerton, 2016), we retained SNPs with a minor allele frequency of at least 1% in the HapMap CEU data, reducing the number of variants from 4,029,798 to 2,705,918. We limited ClinVar variants to those classified as “Pathogenic,” reducing their number from 1,491,509 to 113,412. For DNase I hypersensitivity sites, we retained only sites with the highest signal level (score 1000/1000), reducing the number of sites from 1,949,038 to 160,870.
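The strand-dependent TSS rule and the threshold filters described above can be illustrated with a minimal Python sketch. This is not the R code used in the study; the record layout and the names `Gene`, `tss`, and `keep_at_least` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; field names are assumptions, not the UCSC schema."""
    name: str
    chrom: str
    tx_start: int  # start coordinate of transcription
    tx_end: int    # end coordinate of transcription
    strand: str    # "+" or "-"


def tss(gene: Gene) -> int:
    """Return the TSS: tx_start on the + strand, tx_end on the - strand."""
    if gene.strand == "+":
        return gene.tx_start
    if gene.strand == "-":
        return gene.tx_end
    raise ValueError(f"unexpected strand {gene.strand!r} for gene {gene.name}")


def keep_at_least(records: list[dict], key: str, threshold: float) -> list[dict]:
    """Keep records whose value under `key` is at least `threshold`.

    The same pattern covers a minor allele frequency cutoff of 0.01 and a
    DNase I score cutoff of 1000 (assuming 1000 is the maximum score).
    """
    return [r for r in records if r[key] >= threshold]


# Toy data, invented for this example.
genes = [Gene("geneA", "chr1", 100, 500, "+"), Gene("geneB", "chr1", 900, 1400, "-")]
tss_by_gene = {g.name: tss(g) for g in genes}

snps = [{"rsid": "rs1", "maf": 0.005}, {"rsid": "rs2", "maf": 0.2}]
common_snps = keep_at_least(snps, "maf", 0.01)
```

Raising an error on an unexpected strand value, rather than silently defaulting, is a design choice of this sketch so that malformed annotation rows are noticed early.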
For the GWAS catalog, we preprocessed the data in two distinct ways. The GWAS catalog contains one row per unique combination of SNP locus (rsid), study (PubMed ID), and trait, for a total of 392,271 entries. To count the SNPs identified as trait-associated in any GWAS, we first filtered the GWAS catalog to contain only one row per SNP locus, reducing it to 183,014 rows. Thus, for counts of GWAS hits, each SNP rsid counts only once, regardless of how many studies identified it and how many traits it was associated with. Next, we sought to identify traits with nearby GWAS associations for each STR. The trait identifiers in the GWAS catalog are not standardized, and many similar traits receive distinct names (for example, “HDL cholesterol” and “HDL cholesterol levels” or “Mean corpuscular hemoglobin” and “Mean corpuscular hemoglobin concentration”). To reduce this redundancy and focus on commonly studied traits when counting the number of distinct traits near each STR, we limited the analysis to traits with associations reported in at least three distinct studies under exactly the same trait name. This reduced the number of traits from 10,399 to 493.
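The two GWAS catalog preprocessing steps above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' R implementation: each catalog row is assumed to be a `(rsid, pubmed_id, trait)` tuple, and the function names `unique_snp_loci` and `commonly_studied_traits` are hypothetical.

```python
from collections import defaultdict

# Assumed row layout: (rsid, pubmed_id, trait). Toy data, invented for this example.
Row = tuple[str, str, str]


def unique_snp_loci(rows: list[Row]) -> set[str]:
    """Collapse the catalog to one entry per SNP locus (rsid).

    Each rsid counts once, however many studies or traits it appears with.
    """
    return {rsid for rsid, _pubmed_id, _trait in rows}


def commonly_studied_traits(rows: list[Row], min_studies: int = 3) -> set[str]:
    """Return trait names reported in at least `min_studies` distinct studies.

    Trait names are compared as exact character strings, so near-duplicates
    such as "HDL cholesterol" and "HDL cholesterol levels" stay separate.
    """
    studies_by_trait: dict[str, set[str]] = defaultdict(set)
    for _rsid, pubmed_id, trait in rows:
        studies_by_trait[trait].add(pubmed_id)
    return {t for t, studies in studies_by_trait.items() if len(studies) >= min_studies}


catalog = [
    ("rs1", "111", "HDL cholesterol"),
    ("rs1", "222", "HDL cholesterol"),
    ("rs2", "333", "HDL cholesterol"),
    ("rs2", "333", "HDL cholesterol levels"),
    ("rs3", "444", "Height"),
]
n_hits = len(unique_snp_loci(catalog))          # 3 distinct rsids
kept_traits = commonly_studied_traits(catalog)  # only "HDL cholesterol" has 3 studies
```

Counting distinct PubMed IDs per trait, rather than rows per trait, matters here because a single study can contribute many rows for the same trait.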
For all features and all STRs, we recorded the distance from the STR midpoint to the nearest feature, and the number of features within 1 kb, 10 kb, and 100 kb of the STR midpoint. For coding gene locations, we tracked the distance to the nearest gene (defined as the distance to the start or end of transcription, whichever is shorter, or 0 if the STR is intragenic) and the distance to the nearest TSS separately. For the GWAS catalog, we tracked the number of GWAS hits within each distance window as well as the number of distinct associated traits (where, again, distinctness merely means a non-identical character string). Because of the large size of the dbSNP common variants catalog, we recorded these locations only for the 20 CODIS markers. Additionally, for the CODIS markers only, we recorded the names of the traits reported as associated in ClinVar and the GWAS catalog, as well as the names of nearby protein-coding genes.
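The proximity measures above can be sketched in Python as follows. This is one possible approach, not the authors' R code. It makes several assumptions that the text does not specify: features are treated as single positions on one chromosome, positions are pre-sorted, and windows include their boundaries. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence


def nearest_distance(sorted_pos: Sequence[int], midpoint: float) -> Optional[float]:
    """Distance from the STR midpoint to the nearest feature, or None if no features."""
    if not sorted_pos:
        return None
    i = bisect_left(sorted_pos, midpoint)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = sorted_pos[max(i - 1, 0): i + 1]
    return min(abs(p - midpoint) for p in candidates)


def count_within(sorted_pos: Sequence[int], midpoint: float, window: int) -> int:
    """Number of features within `window` bp of the midpoint (boundaries included)."""
    lo = bisect_left(sorted_pos, midpoint - window)
    hi = bisect_right(sorted_pos, midpoint + window)
    return hi - lo


def gene_distance(tx_start: int, tx_end: int, midpoint: float) -> float:
    """Distance to a gene: 0 if the midpoint is intragenic, else the shorter
    of the distances to the start and end of transcription."""
    if tx_start <= midpoint <= tx_end:
        return 0
    return min(abs(midpoint - tx_start), abs(midpoint - tx_end))


# Toy feature positions on one chromosome, invented for this example.
features = [100, 1_500, 20_000, 150_000]
str_midpoint = 1_000
counts = {w: count_within(features, str_midpoint, w) for w in (1_000, 10_000, 100_000)}
```

Binary search over sorted positions keeps each query logarithmic in the number of features, which is one way to make 1.6 million STR queries tractable; the nearest-gene distance would then be the minimum of `gene_distance` over candidate genes on the same chromosome.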
The data processing and analysis scripts, written in R (v. 4.1.2, R Core Team, 2021) and using the data.table package (Dowle et al., 2019), are available at https://github.com/edgepopgen/CODIS_proximity. The output files recording the features proximal to each STR are available in supplementary files.