We assume that the dense representations of patients' EHRs (i.e., patient embeddings) follow a Gaussian process over time.
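As a sketch of this assumption, writing $\mathbf{x}_i(t)$ for patient $i$'s embedding at time $t$ and $m$ and $K$ for the mean and covariance functions (these symbols are chosen here for concreteness rather than taken from the original notation):
\[
\mathbf{x}_i(t) \sim \mathcal{GP}\big(m(t),\, K(t, t')\big),
\]
so that for any finite set of timepoints $t_1, \dots, t_T$, the collection $\big(\mathbf{x}_i(t_1), \dots, \mathbf{x}_i(t_T)\big)$ is jointly multivariate normal.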
Since the feature embeddings are engineered to approximately follow a multivariate normal distribution, as described in the Producing feature embeddings section of the Supplementary Materials, it is reasonable to model a patient's embedding as a Gaussian process over time. We further specify its mean and covariance functions. For some set of parameters, we assume the parametric forms sketched below.
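One parameterization consistent with the description that follows, with illustrative notation only ($x_{ij}(t)$ is component $j$ of patient $i$'s embedding at time $t$, $\mu_{ij}(t)$ its mean, $\sigma_j^2$ a baseline variance, $s_{ij}(t)$ a scaling factor, and $r_{jk}$ a cross-component correlation):
\[
\mathbb{E}\big[x_{ij}(t)\big] = \mu_{ij}(t), \qquad
\operatorname{Var}\big[x_{ij}(t)\big] = \sigma_j^2\, s_{ij}(t), \qquad
\operatorname{Corr}\big[x_{ij}(t),\, x_{ik}(t)\big] = r_{jk}.
\]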
In summary, we assume that patient $i$'s expected embedding at time $t$ is a function of the parameters specified above. We assume that the marginal variance of each embedding component can be represented by some baseline variance multiplied by a scaling factor. We denote the correlation between any two embedding components by a coefficient that we assume to be constant over time. Between timepoints, we employ a first-order univariate autoregressive (AR(1)) kernel structure, such that the residual at each timepoint is a linear function of its value at the preceding timepoint, with an autocorrelation coefficient, as sketched below.
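A minimal sketch of this AR(1) structure, writing $e_{ij}(t)$ for the residual of embedding component $j$ of patient $i$ at timepoint $t$ and $\rho$ for the autocorrelation coefficient (symbols chosen here for illustration; the stationary innovation variance is an added assumption):
\[
e_{ij}(t) = \rho\, e_{ij}(t-1) + \varepsilon_{ij}(t), \qquad
\varepsilon_{ij}(t) \sim \mathcal{N}\!\big(0,\, (1-\rho^2)\,\sigma_j^2\big),
\]
which implies a between-timepoint correlation of $\operatorname{Corr}\big[e_{ij}(t),\, e_{ij}(t')\big] = \rho^{|t - t'|}$.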
The autoregression regularization hyperparameter is tuned separately via fivefold cross-validation maximizing the AUROC of predictions: at its lower extreme it ignores intertemporal correlation, while at its upper extreme it denotes undampened autoregression. We chose first-order autoregression over higher-order models for computational ease and to mitigate overfitting. We provide a sensitivity analysis with respect to the choice of k in k-fold cross-validation in Supplementary Fig. S2, which demonstrates no significant effect of k on predictive accuracy.
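To make the role of this hyperparameter concrete, the sketch below builds an AR(1) temporal correlation matrix whose intertemporal correlation is dampened by a hyperparameter, here called lam; the name, the convex-combination dampening form, and the visit times are illustrative assumptions rather than the authors' implementation. Its lower extreme reproduces the "ignores intertemporal correlation" case and its upper extreme the undampened AR(1) case.

```python
import numpy as np

def ar1_temporal_corr(times, rho, lam):
    """AR(1) temporal correlation across visits with a dampening hyperparameter.

    lam interpolates between the two extremes described in the text:
      lam = 0 -> identity matrix, i.e. intertemporal correlation is ignored
      lam = 1 -> undampened AR(1) correlation rho ** |t - t'|
    The convex-combination form used here is an illustrative assumption.
    """
    times = np.asarray(times, dtype=float)
    gaps = np.abs(times[:, None] - times[None, :])  # |t - t'| for every pair of visits
    ar1 = rho ** gaps                               # undampened AR(1) correlation matrix
    identity = np.eye(len(times))
    return (1.0 - lam) * identity + lam * ar1

# Hypothetical visits at months 0, 2, and 5 with autocorrelation 0.8.
# In the paper, the dampening hyperparameter is tuned by fivefold
# cross-validation maximizing the AUROC of downstream predictions.
print(ar1_temporal_corr([0, 2, 5], rho=0.8, lam=0.0))  # no intertemporal correlation
print(ar1_temporal_corr([0, 2, 5], rho=0.8, lam=1.0))  # full AR(1) structure
```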