Maximum information coefficient (MIC)

Qinqing Xiong
Wenju Wang
Mingya Wang
Chunhui Zhang
Xuechun Zhang
Chun Chen
Mingshi Wang

Predictors are critical to a model's prediction performance: too many irrelevant variables, or missing key variables, reduce prediction accuracy.24 Predictor screening methods fall mainly into trial-and-error26 and analytical approaches.38 The trial-and-error method is simple to apply, but it is computationally expensive and does not reveal the relationship between predictors and accuracy; analytical methods based on factor correlation are therefore preferable. However, the Pearson and Spearman correlation coefficients are sensitive mainly to linear and monotonic relationships, respectively, and cannot effectively capture the more general nonlinear relationships between ozone and its meteorological factors and precursors. Mutual information (MI) performs well in analyzing nonlinear relationships between variables, but because the probability density functions of the variables are unknown, it is difficult to estimate.39,40 In contrast, the maximum information coefficient (MIC) applies to any functional relationship, linear or nonlinear, and is less affected by outliers. Therefore, this study used MIC to select factors correlated with ozone as predictors.

Reshef et al. proposed the maximum information coefficient (MIC) to analyze nonlinear correlations in large data sets.27 MIC is computed from mutual information evaluated over grid partitions of the data. Mutual information is an important indicator of the degree of dependence between variables; MI and MIC are defined in Equations 5 to 8.
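For reference, the four quantities can be written in the notation of this section roughly as follows. This is a sketch of the standard formulation from the MIC literature in its discrete (gridded) form, not a verbatim copy of the authors' equations: the assignment of the numbers 5 to 8, the auxiliary symbol $I^{*}$, and the base-2 logarithm are assumptions.

```latex
\begin{align}
I(A;B) &= \sum_{a \in A}\sum_{b \in B} p(a,b)\,\log_2\frac{p(a,b)}{p(a)\,p(b)} \tag{5}\\
I^{*}(D,x,y) &= \max_{G}\, I(D/G) \tag{6}\\
M(D)_{x,y} &= \frac{I^{*}(D,x,y)}{\log_2 \min\{x,y\}} \tag{7}\\
\mathrm{MIC}(D) &= \max_{x \times y < B(n)} M(D)_{x,y} \tag{8}
\end{align}
```

In Equation 6 the maximum is taken over all grids $G$ with $x$ columns and $y$ rows, so that Equation 7 normalizes the best achievable MI for that grid size to the interval [0, 1].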

Here, $A = \{a_i,\ i = 1, 2, \ldots, n\}$ and $B = \{b_i,\ i = 1, 2, \ldots, n\}$; $n$ is the number of samples; $p(a, b)$ is the joint probability density of $A$ and $B$; $p(a)$ and $p(b)$ are the marginal probability densities of $A$ and $B$, respectively; MIC is the maximum information coefficient; $D/G$ denotes the data $D$ partitioned by the grid $G$; $M(D)_{x,y}$ is the maximum normalized MI obtained over the different $x \times y$ grid partitions of the data; and $B(n)$ is the upper limit on the grid size $x \times y$, generally chosen such that $\omega(1) \le B(n) \le O(n^{1-\varepsilon})$ with $0 < \varepsilon < 1$.
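To make the roles of the grid size, the normalization, and the bound B(n) concrete, the following is a minimal Python sketch of a simplified MIC estimate. It is not the authors' implementation and not the optimized grid search of Reshef et al.: as a simplifying assumption, each x-by-y grid uses equal-frequency bins on both axes instead of searching over all partitions, so the result is only a lower-bound approximation of MIC. The function names (`mutual_information`, `equal_frequency_bins`, `approx_mic`) are hypothetical, and `eps=0.4`, giving B(n) = n^0.6, is an assumed setting.

```python
import numpy as np


def mutual_information(a_bins, b_bins, x, y):
    """Discrete mutual information in bits from integer bin labels."""
    joint = np.zeros((x, y))
    np.add.at(joint, (a_bins, b_bins), 1.0)   # count samples per grid cell
    joint /= joint.sum()                      # joint probabilities p(a, b)
    pa = joint.sum(axis=1, keepdims=True)     # marginal p(a), shape (x, 1)
    pb = joint.sum(axis=0, keepdims=True)     # marginal p(b), shape (1, y)
    mask = joint > 0                          # empty cells contribute 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))


def equal_frequency_bins(v, k):
    """Assign each value to one of k roughly equal-count bins by rank.

    Assumption: ties are broken by position, which is adequate for
    continuous data but not for heavily tied data.
    """
    ranks = np.argsort(np.argsort(v, kind="stable"), kind="stable")
    return (ranks * k // len(v)).astype(int)


def approx_mic(a, b, eps=0.4):
    """Simplified MIC: maximum normalized MI over equal-frequency grids.

    Only grids with x * y < B(n) = n ** (1 - eps) are considered.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    budget = len(a) ** (1.0 - eps)            # B(n), assumed n^(1 - eps)
    best = 0.0
    x = 2
    while x * 2 < budget:
        a_bins = equal_frequency_bins(a, x)
        y = 2
        while x * y < budget:
            b_bins = equal_frequency_bins(b, y)
            mi = mutual_information(a_bins, b_bins, x, y)
            best = max(best, mi / np.log2(min(x, y)))  # normalized MI
            y += 1
        x += 1
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = np.linspace(-1.0, 1.0, 500)
    print("quadratic:", approx_mic(a, a ** 2))
    print("Pearson r:", np.corrcoef(a, a ** 2)[0, 1])
    print("noise:    ", approx_mic(rng.normal(size=500), rng.normal(size=500)))
```

The synthetic quadratic example illustrates the motivation given above: the Pearson coefficient is close to zero for a symmetric nonlinear relationship, while the grid-based score is high. For predictor screening, the same function would be applied to each candidate factor paired with the ozone series, using real observations rather than the synthetic data shown here.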
