Model II: Dirichlet Process Gaussian mixture model (DP-GMM)

Gengxin Li

Model I is a fast and efficient genotyping model for SNPs with a large minor allele frequency (MAF). In real experiments, however, many SNPs have a low MAF, which can cause one or two genotype clusters to disappear. Even when a low-MAF SNP displays all three genotype groups, some clusters may contain too few observations to be recognized reliably. Model II, the DP Gaussian mixture model, is therefore motivated by the need to carry out model selection for SNPs with an uncertain number of genotype clusters [24]. It is a nonparametric Bayesian method that allows a flexible number of mixture components and provides estimates of both the component parameters and the corresponding mixing proportions.

A DP Gaussian mixture model [24] fits the raw intensity pairs x_is with a K-component Gaussian mixture model, with K approaching a large number. The model is expressed as,
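
Assuming the standard finite Gaussian mixture density, and writing N(x | μ, Σ) for the bivariate Gaussian density (a symbol introduced here for illustration), the expression can be sketched as:

```latex
p(x_{is} \mid \Theta_s) \;=\; \sum_{k=1}^{K} \pi_{ks}\, \mathcal{N}\!\left(x_{is} \mid \mu_{ks}, \Sigma_{ks}\right)
```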

where K is the total number of clusters and Θ_s = (π_s, μ_s, Σ_s) denotes the unknown parameters at the s-th SNP, with π_s = (π_1s, ..., π_Ks), μ_s = (μ_1s, ..., μ_Ks), and Σ_s = (Σ_1s, ..., Σ_Ks). The n_s observations at the s-th SNP are partitioned into K components of sizes (n_1s, n_2s, ..., n_Ks) with corresponding mixing proportions (π_1s, π_2s, ..., π_Ks). The counts n_1s, n_2s, ..., n_Ks follow a multinomial distribution, whose probability mass function is given by,
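
For a multinomial distribution with n_s trials and cell probabilities π_1s, ..., π_Ks, the probability mass function takes the standard form:

```latex
p(n_{1s}, \ldots, n_{Ks} \mid \pi_s)
  \;=\; \frac{n_s!}{\prod_{k=1}^{K} n_{ks}!}\; \prod_{k=1}^{K} \pi_{ks}^{\,n_{ks}}
```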

where n_s = Σ_{k=1}^{K} n_ks denotes the total number of individuals at the s-th SNP. Each raw intensity pair x_is at the s-th SNP has its own indicator z_is (i = 1, ..., n_s), and the distribution of the indicator variables is expressed as,
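
Assuming each indicator is drawn independently with P(z_is = k) = π_ks, so that n_ks counts the indicators equal to k, the joint distribution of the indicators would read (the multinomial coefficient drops out because the indicators are ordered):

```latex
p(z_{1s}, \ldots, z_{n_s s} \mid \pi_s) \;=\; \prod_{k=1}^{K} \pi_{ks}^{\,n_{ks}},
\qquad n_{ks} = \sum_{i=1}^{n_s} \mathbf{1}\{z_{is} = k\}
```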

The model can then be expressed as:
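
A sketch of the hierarchy, reconstructed from the prior descriptions in this section and resting on several assumptions: R_ks is treated as the precision matrix of component k (so Σ_ks is its inverse), "relative precision" is read as r scaling R_ks in the prior of μ_ks, and Gamma(1, 1) is shorthand for the Gamma distribution with 1 degree of freedom and mean 1.

```latex
\begin{aligned}
x_{is} \mid z_{is} = k,\ \mu_{ks}, R_{ks} &\sim \mathcal{N}\!\left(\mu_{ks},\, R_{ks}^{-1}\right), \\
z_{is} \mid \pi_s &\sim \mathrm{Categorical}(\pi_{1s}, \ldots, \pi_{Ks}), \\
\pi_s \mid \alpha &\sim \mathrm{Dirichlet}(\alpha/K, \ldots, \alpha/K), \\
\alpha^{-1} &\sim \mathrm{Gamma}(1, 1), \\
\mu_{ks} \mid R_{ks} &\sim \mathcal{N}\!\left(m,\, (r R_{ks})^{-1}\right), \\
R_{ks} &\sim \mathcal{W}\!\left(\nu,\, S^{-1}\right).
\end{aligned}
```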

where α is the DP concentration parameter, which can be thought of as the inverse variance of the DP. The reciprocal of α follows a Gamma distribution with 1 degree of freedom and mean 1. With K the maximum number of clusters, π_s follows a symmetric Dirichlet distribution with parameter α/K. The hyperparameters m and r are the mean and relative precision of μ_ks, and the hyperparameters ν and S^-1 are the degrees of freedom and inverse mean of R_ks, which follows a Wishart distribution with parameters ν and S^-1.
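
The priors on α and the mixing proportions can be illustrated with a short simulation. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the Gamma distribution "with 1 degree of freedom and mean 1" is read as a chi-square distribution with 1 degree of freedom (shape 1/2, scale 2), and the Dirichlet draw uses the standard construction of normalizing independent Gamma variables.

```python
import random

def draw_alpha(rng):
    """Draw alpha with 1/alpha ~ Gamma(shape=1/2, scale=2).

    Assumption: this is the chi-square(1) reading of
    'Gamma with 1 degree of freedom and mean 1'.
    """
    inv_alpha = rng.gammavariate(0.5, 2.0)
    return 1.0 / inv_alpha

def draw_mixing_proportions(alpha, K, rng):
    """Draw pi ~ symmetric Dirichlet(alpha/K) by normalizing Gamma(alpha/K, 1) draws."""
    while True:
        g = [rng.gammavariate(alpha / K, 1.0) for _ in range(K)]
        total = sum(g)
        if total > 0.0:  # redraw if every value underflowed to zero (tiny alpha/K)
            return [v / total for v in g]

rng = random.Random(0)          # fixed seed so the run is repeatable
alpha = draw_alpha(rng)
pi = draw_mixing_proportions(alpha, K=10, rng=rng)
print(alpha, [round(p, 3) for p in pi])
```

Because each Dirichlet parameter is α/K, a small α tends to concentrate most of the mass on a few components, which is how a large K can still yield only one, two, or three occupied genotype clusters.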

Inference for Model II relies on the posterior distribution of each parameter conditional on all other parameters: the parameters, hyperparameters, and indicator variables are sampled repeatedly from their conditional posterior distributions. Each conditional posterior is proportional to the product of the likelihood and the corresponding priors. The posterior probabilities of the cluster indicator variable z_is, conditional on all other variables, are expressed as:
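
Assuming the usual form obtained by integrating out the mixing proportions and letting K grow large, the update would read as follows. Here n_{-i,ks} denotes the number of observations other than i currently assigned to component k (notation introduced for this sketch); the first line covers occupied components and the second covers all unoccupied components combined.

```latex
\begin{aligned}
p(z_{is} = k \mid z_{-i,s}, x_{is}, \mu_{ks}, R_{ks}, \alpha)
  &\propto \frac{n_{-i,ks}}{n_s - 1 + \alpha}\; p(x_{is} \mid \mu_{ks}, R_{ks}),
  \qquad n_{-i,ks} > 0, \\
p(z_{is} \neq z_{i's}\ \text{for all } i' \neq i \mid z_{-i,s}, x_{is}, \alpha)
  &\propto \frac{\alpha}{n_s - 1 + \alpha}
  \int p(x_{is} \mid \mu, R)\; p(\mu, R \mid m, r, \nu, S)\; d\mu\, dR .
\end{aligned}
```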

Note that p(x_is | μ_ks, R_ks) is the likelihood function and p(μ_ks, R_ks | m, r, ν, S) is the joint prior density of the parameters μ_ks and R_ks. Once the optimal genotype clusters and their component parameters are obtained, two measures of the quality of the s-th SNP, the Posterior Rate (PR) and the Average Posterior Rate (APR), can be calculated in a similar way.
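
The indicator update described in this section can be sketched in code. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it uses the finite-K form, in which the prior weight of component k is proportional to n_{-i,k} + α/K once the mixing proportions are integrated out, it treats R as a 2x2 precision matrix, and all function and variable names are hypothetical.

```python
import math
import random

def log_gauss2(x, mu, R):
    """Log density of a bivariate Gaussian with mean mu and precision matrix R."""
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    det_R = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    quad = R[0][0] * dx * dx + (R[0][1] + R[1][0]) * dx * dy + R[1][1] * dy * dy
    return -math.log(2.0 * math.pi) + 0.5 * math.log(det_R) - 0.5 * quad

def indicator_probs(x, counts_minus_i, mus, Rs, alpha):
    """Conditional posterior P(z = k | rest) for one observation x.

    counts_minus_i[k] is the number of other observations in component k.
    The common factor 1 / (n - 1 + alpha) cancels in the normalization.
    """
    K = len(mus)
    log_w = [math.log(counts_minus_i[k] + alpha / K) + log_gauss2(x, mus[k], Rs[k])
             for k in range(K)]
    top = max(log_w)                       # subtract the max for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def sample_indicator(probs, rng):
    """Draw a component label from the conditional posterior."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy example: three components with precision 100 * I (standard deviation 0.1).
mus = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
Rs = [[[100.0, 0.0], [0.0, 100.0]] for _ in mus]
probs = indicator_probs((0.05, 0.95), [10, 0, 5], mus, Rs, alpha=1.0)
z = sample_indicator(probs, random.Random(0))
print(probs, z)
```

In a full sampler this step would alternate with updates of μ_ks, R_ks, α, and the hyperparameters; an empty component (count 0) keeps a small prior weight α/K, so a missing genotype cluster can still be populated if the data support it.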
