Polygenic score calculations in the GRAS data collection

VB V. Bansal
MM M. Mitjans
CB C. A. P. Burik
RL R. K. Linnér
AO A. Okbay
CR C. A. Rietveld
MB M. Begemann
SB S. Bonn
SR S. Ripke
RV R. de Vlaming
MN M. G. Nivard
HE H. Ehrenreich
PK P. D. Koellinger
request Request a Protocol
ask Ask a question
Favorite

PGS were calculated using PLINK version 1.944,45. We calculated eight different scores, which are described below. Supplementary Fig. 6 shows the distribution of the MAFs of all SNPs that were included in these PGS.

SZ scores: We received the GWAS summary statistics for SZ from the PGC excluding the data from our replication sample (GRAS). We constructed a PGS using the 132 EA lead SNPs (PEA < 10−5) that are also nominally associated with SZ (PSZ < 0.05). This score (SZ_132) is used for replication of the proxy-phenotype analyses.

Furthermore, we constructed a PGS using all 8,240,280 SNPs that survived quality control (SZ_all). Next, we applied the clumping procedure using r² > 0.1 and 1000 kb as the clumping parameters and the 1000 Genomes phase 1 version 3 European reference panel to estimate LD among SNPs, eventually leaving a set of 349,357 SNPs ready for profile scoring. The vast majority of the SNPs we included in the SZ PGS (79%) have MAF > 1% and MAF < 99% (Supplementary Fig. 6).

For secondary analyses on the prediction of SZ symptoms, we constructed two PGS using the 349,357 SNPs that have concordant (+ and +; or – and –) or discordant signs (+ and –; or – and +) for EA and SZ, respectively. This resulted in 174,734 and 174,623 independent SNPs with concordant or discordant, respectively. We used these approximately independent SNPs for profile scoring and call the resulting PGS Concordant and Discordant. We note that this approach of constructing the Concordant and Discordant scores is based on a very conservative LD-pruning algorithm because we first LD-pruned all SNPs that survived quality control for both phenotypes and then sort them based on sign concordance. Thus, this approach prunes for LD both within and across the Concordant and Discordant scores.

As an alternative, we also used a less conservative approach that only prunes for LD within scores. Specifically, this alternative approach first sorts all SNPs that pass quality control for both phenotypes (8,240,280 SNPs) based on sign concordance (4,147,926 SNPs with concordant and 4,092,354 SNPs with discordant signs) and then LD-prunes the resulting set, yielding 260,441 and 261,062 independent SNPs with concordant or discordant, respectively. We call the PGS resulting from this approach Concordant_more_SNPs and Discordant_more_SNPs.

To enable tests for the sensitivity of our results to the MAF distribution of SNPs included in our scores, we also constructed PGS using SNPs with MAF > 1% (i.e. dropping SNPs with allele frequency ≤0.01 or ≥0.99) and with MAF > 10% (i.e. dropping SNPs with allele frequency ≤0.10 or ≥0.90), respectively, from the 8,240,280 SNPs that survived quality control. Next, we used r2 > 0.1 and 1000 kb as the clumping parameters and the 1000 Genomes phase 1 version 3 European reference panel to estimate LD among SNPs, eventually leaving a set of 302,150 approximately independent lead-SNPs with MAF > 1% and 108,075 SNPs with MAF > 10% ready for profile scoring. The resulting scores are called SZ_all_1% and SZ_all_10%, respectively. We also constructed PGS from this set of SNPs that took the sign concordance between EA and SZ into account and call these PGS Concordant_1% (151,265 SNPs), Discordant_1% (150,885 SNPs) and Concordant_10% (54,287 SNPs), Discordant_10% (53,788 SNPs), respectively.

All scores were calculated using the –score function in PLINK using the natural log of the OR of the SNPs for SZ as effect sizes.

Educational attainment scores: Beta coefficients for the EA GWAS were approximated using βj ^ = zjNj*2*MAFj*(1-MAFj), see Rietveld et al.47 for the derivation. Using these betas, we constructed a PGS using the 132 EA lead SNPs (PEA < 1 × 10−5) that are also nominally associated with SZ (PSZ < 0.05). The resulting score is called EA_132.

Furthermore, we constructed a PGS using all 8,240,280 SNPs that survived quality control. Next, we applied the clumping procedure using r2 > 0.1 and 1000 kb as the clumping parameters and the 1000 Genomes phase 1 version 3 European reference panel to estimate LD among SNPs, eventually leaving a set of 348,429 SNPs ready for profile scoring (Supplementary Fig. 6). The resulting score is called EA_all. Eighty percent of the SNPs we included in the EA PGS have MAF > 1% & MAF < 99% (Supplementary Fig. 6).

We also constructed PGS using SNPs with MAF > 1% i.e. dropping SNPs with allele frequency ≤0.01 or ≥0.99) and with MAF > 10% (i.e. dropping SNPs with allele frequency ≤0.10 or ≥0.90), respectively, from the 8,240,280 SNPs that survived quality control. Next, we used r2 > 0.1 and 1000 kb as the clumping parameters and the 1000 Genomes phase 1 version 3 European reference panel to estimate LD among SNPs, eventually leaving a set of 306,977 approximately independent lead-SNPs with MAF > 1% and 106,607 SNPs with MAF > 10% ready for profile scoring. The resulting scores are called EA_all_1% and EA_all_10%, respectively.

BIP score: We obtained GWAS summary statistics on BIP from the PGC58. We used the LD-pruned GWAS summary from PGC (‘pgc.bip.clump.2012–04.txt’) with a set of 108,834 LD-pruned SNPs ready for profile scoring. PGS for the GRAS data collection were calculated by the application of the –score function in PLINK using the natural log of the OR. The resulting score is called BIP_all.

Neuroticism score: We obtained GWAS summary statistics on Neuroticism from the SSGAC. The results are based on the analyses reported in Okbay et al.26 containing 6,524,432 variants. We applied the clumping procedure using r2 > 0.1 and 1000 kb as the clumping parameters and the 1000 Genomes phase 1 version 3 European reference panel to estimate LD among SNPs, eventually leaving a set of 232,483 SNPs ready for profile scoring (Supplementary Fig. 6). PGS for the GRAS data collection were calculated by the application of –score function in PLINK using the Neuroticism beta values. The resulting score is called Neuro_all.

Note that our replication sample (GRAS) was not included in the GWAS summary statistics of any of these traits.

Polygenic score correlations: We calculated Pearson correlations between all PGS that we constructed in the GRAS data collection (SZ_all, SZ_132, EA_all, EA_132, Concordant, Discordant, BIP_all and Neuro_all). Results for SZ patients and healthy controls together are reported in Supplementary Data 12a. We found very similar results among the SZ cases (Supplementary Data 12b) and healthy controls when we analysed them separately from each other (Supplementary Data 12c). These results were used to inform the correct multiple regression model specification for the polygenic prediction analyses (Supplementary Note 6 and 8).

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A