Methods for model selection

Andrew J. Sedgewick
Ivy Shi
Rory M. Donovan
Panayiotis V. Benos

K-fold cross-validation (CV) [14] splits the data into K subsets and holds each subset out once for validation while training on the rest. We use K = 5 and average the negative log-pseudolikelihood of the test sets given the trained models. The Akaike information criterion (AIC) [15] and Bayesian information criterion (BIC) [16] are model selection methods that penalize the likelihood of a model according to its size, measured in degrees of freedom. To calculate the AIC and BIC, we substitute the pseudolikelihood for the likelihood and define the degrees of freedom of the learned network as described below.
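As an illustrative sketch of how these three scores could be computed (not the authors' implementation), assume hypothetical routines fit_mgm, which fits an MGM at a given sparsity setting, and neg_log_pl, which evaluates the negative log-pseudolikelihood of data under a fitted model:

import numpy as np

def aic(neg_log_pl_value, df):
    # AIC with the pseudolikelihood substituted for the likelihood:
    # 2 * (degrees of freedom) + 2 * (negative log-pseudolikelihood)
    return 2.0 * df + 2.0 * neg_log_pl_value

def bic(neg_log_pl_value, df, n):
    # BIC penalizes each degree of freedom by log(sample size)
    return np.log(n) * df + 2.0 * neg_log_pl_value

def cv_score(data, fit_mgm, neg_log_pl, lam, K=5, seed=0):
    # K-fold CV: hold each fold out once, train on the remaining folds,
    # and average the held-out negative log-pseudolikelihood (K = 5 here).
    n = data.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    scores = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit_mgm(data[train], lam)      # hypothetical fitting routine
        scores.append(neg_log_pl(model, data[test]))
    return float(np.mean(scores))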

In the standard lasso problem, the degrees of freedom is simply the number of non-zero regression coefficients [17]. So, in the continuous case, the degrees of freedom of a graphical lasso model is the number of edges in the learned network. In the mixed case, edges incident to discrete variables have additional coefficients corresponding to each level of the variable. Lee and Hastie’s MGM uses group penalties on the edge vectors, ρ, and matrices, ϕ, to ensure that all dimensions sum to zero. So, in the model, an edge between two continuous variables adds one degree of freedom, an edge between a continuous variable and a categorical variable with L levels adds L - 1 degrees of freedom, and an edge between two discrete variables with Li and Lj levels adds (Li - 1)(Lj - 1) degrees of freedom.
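A minimal sketch of this degrees-of-freedom count, representing each learned edge as a pair of level counts (None for a continuous variable, the number of categorical levels for a discrete one); the function names are illustrative, not from the source:

def edge_df(levels_i, levels_j):
    if levels_i is None and levels_j is None:
        return 1                               # continuous-continuous edge
    if levels_i is None:
        return levels_j - 1                    # continuous-discrete edge
    if levels_j is None:
        return levels_i - 1
    return (levels_i - 1) * (levels_j - 1)     # discrete-discrete edge

def model_df(edges):
    # edges: iterable of (levels_i, levels_j) pairs for the learned network
    return sum(edge_df(li, lj) for li, lj in edges)

For example, a network with one continuous-continuous edge, one edge from a continuous variable to a three-level categorical variable, and one edge between three- and four-level categorical variables has model_df([(None, None), (None, 3), (3, 4)]) = 1 + 2 + 6 = 9 degrees of freedom.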

We compare these model selection methods to an oracle selection method. For the oracle model, we select the sparsity parameters that minimize the total number of false positive and false negative edges between the estimated graph and the true graph. Although the true graph is unknown in practice, and none of the other methods use it, the oracle shows the best possible model selection performance under our experimental conditions.
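In sketch form, oracle selection over a set of candidate models could look like the following, with graphs represented as Python sets of edges (the candidate list is assumed to come from models fit across the sparsity grid; nothing here is from the source code):

def oracle_select(candidates, true_edges):
    # candidates: list of (sparsity_setting, estimated_edge_set) pairs.
    # Choose the setting that minimizes false positives + false negatives
    # relative to the true edge set.
    def errors(estimated):
        fp = len(estimated - true_edges)   # estimated edges not in the true graph
        fn = len(true_edges - estimated)   # true edges that were missed
        return fp + fn
    return min(candidates, key=lambda c: errors(c[1]))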

AIC, BIC, and CV all require calculating the pseudolikelihood from a learned model, so to optimize over separate sparsity penalties for each edge type, we perform a cubic grid search of λcc, λcd, and λdd over {0.64, 0.32, 0.16, 0.08, 0.04}.
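The grid search itself can be sketched as below; fit_and_score is a hypothetical stand-in that fits an MGM at a given (λcc, λcd, λdd) and returns the chosen criterion (AIC, BIC, or the CV score), so the minimizing combination is selected:

from itertools import product

GRID = [0.64, 0.32, 0.16, 0.08, 0.04]

def grid_search(fit_and_score):
    # Evaluate all 5^3 = 125 combinations of (lambda_cc, lambda_cd, lambda_dd)
    # and return the combination with the lowest score.
    best_score, best_lams = float("inf"), None
    for lam_cc, lam_cd, lam_dd in product(GRID, repeat=3):
        score = fit_and_score(lam_cc, lam_cd, lam_dd)
        if score < best_score:
            best_score, best_lams = score, (lam_cc, lam_cd, lam_dd)
    return best_lams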
