The cryoDRGN method performs heterogeneous cryo-EM reconstruction by learning a neural network representation of 3D structure. In particular, we use a positionally encoded multilayer perceptron (MLP) to approximate the function V: ℝ³⁺ⁿ → ℝ, which models structures as generated from an n-dimensional continuous latent space. We refer to this architecture as a coordinate-based neural network41,42, as we explicitly model the volume as a function of Cartesian coordinates.
Without loss of generality, we model volumes on the domain [−0.5, 0.5]³. Instead of directly supplying the 3D Cartesian coordinates, k, to the deep coordinate network, coordinates are featurized with a fixed positional encoding function43 consisting of sinusoids whose wavelengths follow a geometric progression from 1 up to the Nyquist limit:

\mathrm{pe}^{(2i)}(k_j) = \sin\!\left(k_j D \pi \left(\tfrac{2}{D}\right)^{i/(D/2 - 1)}\right), \quad i = 0, \ldots, D/2 - 1

\mathrm{pe}^{(2i+1)}(k_j) = \cos\!\left(k_j D \pi \left(\tfrac{2}{D}\right)^{i/(D/2 - 1)}\right), \quad i = 0, \ldots, D/2 - 1

where D is set to the image size used in training. Empirically, we found that excluding the highest frequencies of the positional encoding led to better performance when training on noisy data, and we provide an option to modify the positional encoding function by increasing all wavelengths by a factor of 2π.
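As an illustration, a positional encoding of this form can be implemented in a few lines of NumPy. This is a minimal sketch of the encoding described above, not the cryoDRGN source; the function name and the exact frequency normalization are our assumptions.

import numpy as np

def positional_encoding(coords, D):
    # Featurize coordinates in [-0.5, 0.5] with sinusoids whose wavelengths
    # follow a geometric progression from 1 down to the Nyquist limit (2/D).
    # coords: array of shape (..., 3); returns an array of shape (..., 3 * D).
    i = np.arange(D // 2)
    # angular frequencies from D*pi (Nyquist) down to 2*pi (wavelength 1)
    freqs = D * np.pi * (2.0 / D) ** (i / (D // 2 - 1))
    args = coords[..., None] * freqs                      # (..., 3, D/2)
    features = np.concatenate([np.sin(args), np.cos(args)], axis=-1)
    return features.reshape(*coords.shape[:-1], -1)       # (..., 3 * D)

# example: encode one frequency-space coordinate for a 64-pixel box
k = np.array([0.1, -0.25, 0.0])
print(positional_encoding(k, D=64).shape)                 # (192,)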
This neural representation of 3D structure is learned via an image-encoder/volume-decoder architecture based on the variational autoencoder (VAE)30,44. We follow the standard image formation model in single particle cryo-EM where observed images are generated from projections of a volume at a random unknown orientation, R ∈ SO(3). We use an additive Gaussian white noise model. Volume heterogeneity is generated from a continuous latent space, modeled by the latent variable z, where the dimensionality of z is a hyperparameter of the model.
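Putting this formation model in symbols (our notation, summarizing the text above rather than an equation from the paper): writing g_i for the CTF of image i, t_i for its in-plane shift, R_i for its pose, and z_i for its latent variable, the Fourier transform of an observed image is modeled as

\hat{X}_i(k_x, k_y) = g_i(k_x, k_y)\, \hat{V}\!\left(R_i (k_x, k_y, 0)^{\top},\, z_i\right) e^{-2\pi \mathrm{i}\,(k_x, k_y) \cdot t_i} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)

that is, a CTF-modulated, phase-shifted central slice of the volume plus Gaussian white noise.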
Given an image X, the variational encoder, q_ξ(z|X), produces a mean and variance, μ_{z|X} and Σ_{z|X}, the statistics that parameterize a Gaussian distribution with diagonal covariance as the variational approximation to the true posterior p(z|X). The prior on the latent variable is a standard normal distribution, 𝒩(0, I). The positionally encoded MLP is used as the probabilistic decoder, p_θ(V|k, z), and models structures in frequency space. Given a Cartesian coordinate k ∈ ℝ³ and latent variable z, the probabilistic decoder predicts a Gaussian distribution over V(k, z). The encoder and decoder are parameterized with fully connected neural networks with parameters ξ and θ, respectively.
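The following PyTorch sketch illustrates one possible encoder/decoder pair of this form. The class names, layer widths and depths are our assumptions for illustration and do not reproduce cryoDRGN's exact architecture.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Variational encoder q_xi(z|X): flattened image -> (mu, log-variance)
    # of a diagonal Gaussian over the n-dimensional latent z.
    def __init__(self, D, zdim, hidden=1024):
        super().__init__()
        self.zdim = zdim
        self.net = nn.Sequential(
            nn.Linear(D * D, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * zdim),
        )

    def forward(self, img):                        # img: (B, D, D)
        h = self.net(img.flatten(1))
        return h[:, :self.zdim], h[:, self.zdim:]  # mu, logvar

class Decoder(nn.Module):
    # Probabilistic decoder p_theta(V|k, z): positionally encoded 3D
    # coordinate concatenated with z -> predicted volume value at k.
    def __init__(self, D, zdim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * D + zdim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pe_coords, z):               # (B, N, 3*D), (B, zdim)
        zz = z[:, None, :].expand(-1, pe_coords.shape[1], -1)
        return self.net(torch.cat([pe_coords, zz], dim=-1)).squeeze(-1)

def reparameterize(mu, logvar):
    # One Monte Carlo sample z ~ q_xi(z|X) via the reparameterization trick.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)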
Since 2D projection images can be related to volumes as 2D central slices in Fourier space29, oriented 3D coordinates for a given image can be obtained by rotating a D × D lattice spanning [−0.5, 0.5]², originally on the x-y plane, by R, the orientation of the volume during imaging. Then, given a sample drawn from q_ξ(z|X) and the oriented coordinates, an image can be reconstructed pixel by pixel through the decoder. The reconstructed image is then translated by the image's in-plane shift and multiplied by the CTF before it is compared to the input image. The negative log likelihood of a given image under our model is computed as the mean squared error between the reconstructed image and the input image.
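A minimal sketch of this slice-rendering step, reusing the hypothetical Decoder above (pe_fn stands for a torch analogue of the positional encoding sketched earlier; the in-plane translation, which in Fourier space is a per-pixel phase factor, is omitted for brevity):

import torch

def render_slice(decoder, pe_fn, R, z, ctf, D):
    # Rotate a D x D frequency-space lattice (initially on the x-y plane)
    # by the pose R, decode each oriented coordinate, and apply the CTF.
    ax = torch.linspace(-0.5, 0.5, D)
    x, y = torch.meshgrid(ax, ax, indexing="ij")
    lattice = torch.stack([x, y, torch.zeros_like(x)], dim=-1)  # (D, D, 3)
    coords = lattice.reshape(-1, 3) @ R.T         # oriented 3D coordinates
    pred = decoder(pe_fn(coords)[None], z[None])  # decode pixel by pixel
    return pred.reshape(D, D) * ctf               # CTF-modulated slice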
Following the standard VAE framework, the optimization objective is a variational lower bound of the model evidence:

\mathcal{L}(X; \xi, \theta) = \mathbb{E}_{q_\xi(z|X)}\left[\log p_\theta(X|z)\right] - \beta\, D_{\mathrm{KL}}\!\left(q_\xi(z|X)\,\|\,p(z)\right)

where the first term is the reconstruction error estimated with one Monte Carlo sample, the second term is a regularization term on the latent representation, and β is an additional hyperparameter, which we set by default to 1/|z|. By training on many 2D slices with sufficiently diverse orientations, the 3D volume can be learned through feedback from the 2D views. For further details, we refer the reader to a preliminary version of the method described in the proceedings of the International Conference on Learning Representations41. The results presented here employ the training regime described in Zhong et al.41, using previously determined poses from a consensus reconstruction.
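This objective can be written as a per-batch training loss in a few lines; the sketch below assumes the Gaussian encoder statistics (mu, logvar) from the encoder sketched above and uses the closed-form KL divergence between a diagonal Gaussian and the standard normal prior.

import torch
import torch.nn.functional as F

def beta_vae_loss(recon, target, mu, logvar, beta):
    # Negative ELBO: MSE reconstruction error (one Monte Carlo sample)
    # plus a beta-weighted KL term pulling q_xi(z|X) toward N(0, I).
    # By default beta = 1 / zdim, i.e., 1/|z| as described above.
    recon_err = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return recon_err + beta * kld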