2.3. Simulation Study

Yingjie Guo
Chenxi Wu
Maozu Guo
Quan Zou
Xiaoyan Liu
Alon Keinan

To evaluate the accuracy of SGL-LMM and previous methods for association mapping, we considered a semi-empirical example based on the genotypic and phenotypic data for up to 1307 world-wide accessions of Arabidopsis thaliana from Atwell et al. (2010). The data can be downloaded from https://github.com/Gregor-Mendel-Institute/atpolydb. Following the quality control commonly applied in GWAS, we excluded a SNP if its minor allele frequency (MAF) was < 0.05, if its missing rate was > 0.05 of the population, or if its allele frequencies were not in Hardy-Weinberg equilibrium (P < 0.0001). After filtering, 200155 SNPs remained.
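The three exclusion rules above can be illustrated with a short Python sketch. It is not the authors' pipeline: the function names `hwe_pvalue` and `qc_mask` are hypothetical, genotypes are assumed to be coded 0/1/2 with NaN for missing calls, and the Hardy-Weinberg check is assumed to be a 1-df chi-square goodness-of-fit test, which the text does not specify.

```python
import math
import numpy as np

def hwe_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square goodness-of-fit test against
    Hardy-Weinberg proportions (assumed test; the source names none)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:                    # monomorphic SNP, test undefined
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def qc_mask(G, maf_min=0.05, miss_max=0.05, hwe_min=1e-4):
    """Boolean mask of SNPs to keep.

    G: (n_samples, n_snps) float array, coded 0/1/2, NaN = missing (assumed).
    Thresholds default to the values quoted in the text.
    """
    keep = np.zeros(G.shape[1], dtype=bool)
    for j in range(G.shape[1]):
        col = G[:, j]
        called = col[~np.isnan(col)]
        if called.size == 0:
            continue                          # nothing genotyped, drop
        miss = 1.0 - called.size / col.size
        freq = called.mean() / 2.0            # frequency of the counted allele
        maf = min(freq, 1.0 - freq)
        counts = [int(np.sum(called == g)) for g in (0, 1, 2)]
        keep[j] = bool(maf >= maf_min and miss <= miss_max
                       and hwe_pvalue(*counts) >= hwe_min)
    return keep
```

Applying `G[:, qc_mask(G)]` would then give the filtered genotype matrix under these assumptions.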

To simulate the effect of population structure, we used the real phenotype leaf number at flowering time (LN, 16°C, 16 h daylight), which is available for 177 of the 1307 plants of A. thaliana. Univariate analyses showed that this phenotype had an excess of associations when population structure was not taken into account (Atwell et al., 2010). After correction for population structure, the p-values were approximately uniformly distributed, which means that this phenotype is dominated by population structure. Hence, we used this phenotype to simulate the confounding effect. First, to determine the ratio δ between the genetic and residual variances, we fit a random effects model to LN, which we subsequently used to predict the population structure for the remaining 1130 plants. We ran the random effects model multiple times and chose as the final dataset the one for which the genetic variance parameters of the real and synthetic data differed by less than 0.0001. In addition to this empirical background, we added simulated associations with different effect sizes and a range of complexities of genetic models.
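One common way to carry out a step like this is sketched below in Python. It is an illustration under stated assumptions, not the authors' procedure: δ is taken to be the residual-to-genetic variance ratio, it is estimated by a grid search over the profiled maximum-likelihood objective of y ~ N(0, σg²(K + δI)), and the background for unphenotyped plants is predicted with the BLUP formula. The function names `fit_delta` and `predict_pop`, the kinship matrix K, and the grid are hypothetical choices.

```python
import numpy as np

def fit_delta(y, K, grid=None):
    """Grid-search ML estimate of delta in y ~ N(0, sigma_g2 * (K + delta * I)).

    y: mean-centered phenotype of the phenotyped plants (assumed).
    K: their kinship matrix (assumed symmetric positive semi-definite).
    Returns (delta, sigma_g2).
    """
    if grid is None:
        grid = np.logspace(-3, 3, 61)         # hypothetical search range
    n = y.size
    s, U = np.linalg.eigh(K)                  # K = U diag(s) U^T
    s = np.clip(s, 0.0, None)                 # guard tiny negative eigenvalues
    y_rot = U.T @ y
    best = None
    for delta in grid:
        d = s + delta
        sigma_g2 = np.mean(y_rot ** 2 / d)    # closed-form ML given delta
        ll = -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(d))
                     + n + n * np.log(sigma_g2))
        if best is None or ll > best[0]:
            best = (ll, delta, sigma_g2)
    return best[1], best[2]

def predict_pop(y_train, K_train, K_new_train, delta):
    """BLUP of the genetic background for plants without a phenotype:
    K_new_train @ (K_train + delta * I)^-1 @ y_train."""
    n = y_train.size
    alpha = np.linalg.solve(K_train + delta * np.eye(n), y_train)
    return K_new_train @ alpha
```

Here `K_new_train` would hold the kinship between the 1130 unphenotyped and the 177 phenotyped plants; the repeated-fit selection criterion described in the text is not reproduced.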

We then simulated the phenotype as follows:

where ysig = Xkβ and Xk is the genotype data for the k causal SNPs. To introduce the group structure, we considered Ng = 200 genes (groups) on chromosome 1, covering 2000 SNPs, and set m of these groups to be active. We varied the sparsity level within the active groups so that the total number of active SNPs was k. We drew β ~ N(0, I) and φ ~ N(0, I). During the simulation, we maintained the original LD structure within each gene.
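The construction of the signal term ysig = Xkβ can be sketched as follows in Python. The sketch covers only this term, not its combination with the population and noise components. The function name `simulate_signal` is hypothetical, and the usage example assumes equal-sized groups of 10 SNPs each, which the text does not state.

```python
import numpy as np

def simulate_signal(X, groups, m=3, per_group=5, rng=None):
    """Draw y_sig = X_k @ beta with m active groups and per_group causal
    SNPs in each, so that k = m * per_group.

    X: (n_samples, n_snps) genotype matrix.
    groups: (n_snps,) array giving the gene (group) label of each SNP.
    Returns (y_sig, causal_idx, beta, active_groups).
    """
    rng = np.random.default_rng(rng)
    group_ids = np.unique(groups)
    # Pick the m active groups without replacement.
    active = rng.choice(group_ids, size=m, replace=False)
    # Within each active group, pick the causal SNPs without replacement.
    causal = np.concatenate([
        rng.choice(np.flatnonzero(groups == g), size=per_group, replace=False)
        for g in active
    ])
    beta = rng.standard_normal(causal.size)   # beta ~ N(0, I)
    y_sig = X[:, causal] @ beta
    return y_sig, causal, beta, active

# Usage with placeholder genotypes: 200 groups x 10 SNPs (assumed sizes).
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 3, size=(50, 2000)).astype(float)
groups_demo = np.repeat(np.arange(200), 10)
y_sig, causal, beta, active = simulate_signal(X_demo, groups_demo, m=3,
                                              per_group=5, rng=2)
```

Because real genotype columns are used for X in the study, the LD structure within each gene is inherited from the data rather than simulated; the random `X_demo` here is only a stand-in.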

The initial setting used for the simulation was 3 active groups, each containing 5 effective SNPs (k = 15 and m = 3). To investigate the influence of the strength of the confounding effect and of the overall noise, we varied σpop ∈ {0.5, 0.7, 0.9} and σsig ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. For each combination of σpop and σsig, we generated 10 datasets, resulting in 120 datasets in total for the 12 combinations.
