To evaluate the goodness-of-fit of the inferred network in a prediction-error framework, one must balance the measurement and process errors. This optimization is performed during the leave-one-out procedure (Algorithm 1), using the CVX convex optimization package (v1.22)33 for MATLAB, in which the left-out gene (g) is expressed as a linear combination of the other experiments (cross-validation). The aim of this procedure is to balance the measurement and process errors equally when predicting the left-out gene under cross-validation.
In the BalanceFitError algorithm, A contains the inferred GRN structure (topology), with each non-zero value representing a regulatory interaction, i.e. a pseudo-direct influence, and each zero a lack of interaction. The algorithm estimates each gene's perturbation and response from the balanced measurement and process errors of all other genes, and compares the estimates to the intended perturbation and the observed response. Since the error is a function of the degrees of freedom of the given matrix, relative errors (Erel, Frel) are used to balance these errors more equally. From step (i) to (iii), all matrices have the perturbation experiments of the left-out gene g removed and are therefore denoted !g. However, A retains all genes and remains square throughout, so the left-out experiments can later be predicted from the remaining data. This goodness-of-fit evaluation is applied to all inferred networks. All inference methods used here have a regularization parameter that determines the number of non-zero parameters in the model; this parameter is varied to span the complete range from the empty to the full network.
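The core cross-validation step above — predicting a left-out gene as a linear combination of the remaining genes — can be sketched as follows. This is a minimal NumPy illustration, not the authors' MATLAB/CVX implementation: it uses plain least squares in place of the balanced-error convex program, and the function name and the simple relative-error metric are assumptions for illustration.

```python
import numpy as np

def loo_gene_prediction_error(Y):
    """Leave-one-out sketch: predict each gene's expression profile as a
    linear combination of the other genes' profiles (ordinary least squares
    here, standing in for the balanced-error fit), and return a relative
    prediction error per gene.

    Y : (n_genes, n_experiments) expression/response matrix.
    """
    n_genes = Y.shape[0]
    rel_err = np.empty(n_genes)
    for g in range(n_genes):
        others = np.delete(Y, g, axis=0)   # profiles of all other genes
        y = Y[g]                           # left-out gene's profile
        # Least-squares weights w such that y ~ others.T @ w
        w, *_ = np.linalg.lstsq(others.T, y, rcond=None)
        y_hat = others.T @ w
        # Relative error, analogous in spirit to Erel/Frel in Algorithm 1
        rel_err[g] = np.linalg.norm(y - y_hat) / np.linalg.norm(y)
    return rel_err
```

A gene whose profile truly lies in the span of the other genes' profiles yields a near-zero relative error, while genes carrying independent signal do not.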
Because our leave-out procedure assesses individual gene prediction errors, we assembled null GRN performance distributions by shuffling the GRN links and fitting these shuffled networks to the data, yielding standardized and fairly conservative link weights via a constrained least squares (CLS) algorithm30,34. To this end, we implemented a Monte Carlo sampling method that samples links while maintaining each node's in-degree and preserving hubs, thereby approximating a link null distribution based on the inferred GRN, judged to generate conservative and fair null GRNs. For a fair comparison, both the inferred and the shuffled GRNs are fit to the original data; topology and link signs are preserved in the GRNs. To measure the goodness of fit of both inferred and shuffled GRNs, cross-validation was used to calculate the weighted residual sum of squares (wRSS) on the original training data while balancing the measurement and process errors as described in Algorithm 1. In steps c and d, a left-out gene is predicted by expressing it as a linear combination of the other genes. The same goodness-of-fit measure was used to compare the inferred GRN's ability to predict the original data under cross-validation against the distribution of prediction errors obtained with shuffled data. The relative error metric comparing measured and shuffled wRSS (Figs. 2, 3, S3, S4) is complemented by R2 values (Fig. S5). Before calculating these, each GRN's parameters were modified to ensure that the predicted response remained bounded similarly to the observed gene expression. This was done by performing a singular value decomposition of the GRN, setting singular values below a cutoff to zero, and reconstructing the GRN without the smallest singular values. The resulting GRN was then fit to the training data under cross-validation to generate predicted expression responses in the same way as described above.
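One step of the link-shuffling null model can be sketched as below. This is an illustrative NumPy version under stated assumptions: rows of A are taken to hold each gene's regulators, each row's in-degree and link-weight multiset (including signs) are preserved while regulator identities are redrawn, and the function name is hypothetical; the authors' pipeline additionally refits the shuffled topologies with CLS, which is not shown here.

```python
import numpy as np

def shuffle_grn_links(A, rng=None):
    """Shuffle GRN links while preserving each gene's in-degree.

    A : (n, n) weight matrix; row i holds the regulators of gene i.
    The nonzero entries of each row (its incoming links, signs included)
    are reassigned to randomly chosen regulator columns, keeping the
    number of incoming links per gene fixed.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    A_shuf = np.zeros_like(A)
    for i in range(n):
        weights = A[i, A[i] != 0]                          # existing link weights
        new_regs = rng.choice(n, size=weights.size, replace=False)
        A_shuf[i, new_regs] = rng.permutation(weights)     # same weights, new positions
    return A_shuf
```

Repeating this sampling many times and scoring each shuffled network with the same cross-validated wRSS yields the null distribution against which the inferred GRN is compared.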
The cutoff on the minimum singular value was set independently for each GRN to ensure that the predicted expression values fell within the range of the measured values. When the data is ill-conditioned, the small singular values generally represent noise, and removing them reduces its effect.
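The SVD truncation described above can be sketched as follows; a minimal NumPy illustration with a hypothetical function name, where the per-GRN cutoff selection is assumed to happen outside this function.

```python
import numpy as np

def truncate_grn_svd(A, sv_cutoff):
    """Reconstruct a GRN matrix after discarding small singular values.

    Singular values below sv_cutoff are set to zero before the matrix is
    rebuilt, damping the noise directions of an ill-conditioned GRN.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_trunc = np.where(s >= sv_cutoff, s, 0.0)   # zero out small singular values
    return U @ np.diag(s_trunc) @ Vt             # reconstruct without them
```

The truncated matrix is the closest (in least-squares sense) matrix to A whose singular values all meet the cutoff, which is what keeps the predicted responses bounded.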
Validation of inferred GRNs’ topologies. Each x-axis tick mark shows the prediction performance, in terms of the wRSS error, of an inferred GRN topology (circles) fit to training data under cross-validation, compared to its shuffled topologies. The box displays the median and interquartile range, and the whiskers extend to the most extreme points within 1.5 times this range; points beyond are shown as outliers.
Validation of the inferred GRNs’ fit to the measured data. Each x-axis tick mark shows the prediction performance, in terms of the wRSS, of an inferred GRN topology fit to training data under cross-validation, compared to its ability to fit shuffled data. X marks represent the inferred GRNs. The filled color box displays the median and interquartile range, and the whiskers extend to the most extreme points within 1.5 times this range; points beyond are shown as outliers.
To further verify both predictiveness and generalizability, these GRNs were also applied to a second, independent validation dataset in which the same genes were knocked down in pairs, in single replicates. While these data were not used to infer the GRNs, we applied the same cross-validation strategy as for the original data to validate them. This is necessary for parameter fitting, since the process error differs from that of the single-knockdown data. Moreover, by running the same pipeline we obtain a comparable measure of how well the independent data fit our inferred GRNs, and we build null distributions of expected error from shuffled GRNs to examine an inferred GRN’s ability to predict the data.