Training database generation
This protocol is extracted from research article:
Deep neural network processing of DEER data
Sci Adv, Aug 24, 2018; DOI: 10.1126/sciadv.aat5218

Neural network training requires a library of inputs and their corresponding outputs covering a range that is representative of all possibilities (48, 49, 59). Real distance distributions between spin labels are rarely known exactly and, therefore, collating experimental data is not an option. Fortunately, high-accuracy simulations, taking into account most of the relevant effects, have recently become possible (19, 25, 60). They can be time-consuming (19) but only need to be run once to generate multiple simulated DEER traces with different artificial noise and background functions. These traces are then stored in a database alongside the “true” distance distributions they were generated from. An example is shown in Fig. 2.

Fig. 2. (Left) Randomly generated distance distribution. (Right) The corresponding DEER form factor (purple), a randomly generated noise track (yellow), a randomly generated intermolecular background signal (red, marked BG), and the resulting “experimental” DEER signal (blue). a.u., arbitrary units.

The size and shape of the training database are entirely at the trainer’s discretion—a wide variety of spin systems, parameter ranges, secondary interactions, and instrumental artifacts may be included. This exploratory work uses the DEER kernel for a pair of spin-½ particles, but the DEER simulation module in Spinach is not restricted in any way (60)—training data sets may be generated for any realistic combinations of spins, interactions, and pulse frequencies. The following parameters are relevant:

(1) Minimum and maximum distances in the distribution. Because the dipolar modulation frequency scales as the inverse cube of the distance, there is a scaling relationship between the distance range and the signal duration (Eq. 10). The salient parameter here is the “dynamic range”: the ratio of the longest distance to the shortest. Training signals must be long enough and discretized well enough to reproduce all the frequencies present.

(2) Functions used to represent distance peaks and their number. A random number of skew normal distribution functions (61) with random positions within the distance interval and random full widths at half maximum were used in this work

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - x_0)^2}{2\sigma^2}\right] \left[1 + \operatorname{erf}\left(\alpha\,\frac{x - x_0}{\sigma\sqrt{2}}\right)\right] \qquad (11)$$

where σ is the SD of the underlying normal distribution, x_0 is the location of its peak, and α is the shape parameter regulating the extent of the skew. Distance distributions were integrated with the DEER kernel in Eq. 2 to obtain DEER form factors. We found that generating distance distributions with up to three peaks was sufficient to ensure that the networks could generalize to an arbitrary number of distances (see the “Measures of uncertainty” section).

(3) Noise parameters and modulation depth. Because DEER traces were recorded in the indirect dimension of a pseudo-2D experiment, the noise was not expected to be colored—this was confirmed by experiments (36). We used Gaussian white noise with the SD chosen randomly between zero and a user-specified fraction of the modulation depth, which was also chosen randomly from within the user-specified ranges.

(4) Background function model and its parameters. We used Eq. 5 with the dimensionality parameter selected randomly from the user-specified range.

(5) Discretization grids in the time and the distance domains. The point count must satisfy the Nyquist condition for all frequencies expected within the chosen ranges of other parameters. The number of discretization points dictates the dimension of the transfer matrices and bias vectors in Eq. 8, which, in turn, determine the minimum training set size.

(6) Training set size. A fully connected neural network with n layers of width k has n(k² + k) parameters. Each of the “experimental” DEER traces is k points long, meaning that n(k + 1) is the absolute minimum number of DEER traces in the training set. At least 100 times that amount is in practice necessary to generate high-quality networks.
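The sizing rules in items (1), (5), and (6) reduce to simple arithmetic, sketched below in Python. This is an illustration under stated assumptions, not the code used in the original work: the function names are hypothetical, the dipolar constant of 52.04 MHz nm³ assumes two electrons with g close to 2, the highest frequency in the form factor is taken to be twice the dipolar frequency at the shortest distance (parallel orientation), and the example values (2 µs trace, 1.5 nm minimum distance, 5 layers of width 256) are chosen only for demonstration.

```python
import math

# Assumption: dipolar frequency of two g ~ 2 electrons is 52.04 MHz at 1 nm
DIPOLAR_CONST = 52.04  # MHz nm^3


def max_time_step(r_min):
    """Largest time step (us) allowed by Nyquist for the shortest distance r_min (nm)."""
    f_max = 2.0 * DIPOLAR_CONST / r_min ** 3  # parallel orientation gives 2 * nu_dd
    return 1.0 / (2.0 * f_max)


def min_time_points(t_max, r_min):
    """Smallest point count for a trace of duration t_max (us) on a uniform grid."""
    return math.ceil(t_max / max_time_step(r_min)) + 1


def parameter_count(n_layers, width):
    """Weights plus biases of n fully connected layers of width k: n(k^2 + k)."""
    return n_layers * (width ** 2 + width)


def min_training_traces(n_layers, width):
    """Parameter count divided by the k points per trace: n(k + 1)."""
    return n_layers * (width + 1)


print(min_time_points(2.0, 1.5))           # 125
print(parameter_count(5, 256))             # 328960
print(min_training_traces(5, 256))         # 1285
print(100 * min_training_traces(5, 256))   # 128500, the practical target
```

Because the time step bound scales with the cube of the shortest distance, halving the minimum distance multiplies the required point count by roughly eight for the same trace duration.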

The parameter ranges entering the training data set are crucial for the success of the resulting network ensemble—the training data set must be representative of the range of distances, peak widths, noise amplitudes, and other attributes of the data sets being processed. The parameters entering the current DEERNet training database generation process are listed in Table 1.

Where a maximum value and a minimum value are given, the parameter is selected randomly within the interval indicated for each new entry in the database. Ranges in the suggested values indicate recommended intervals for the corresponding parameter.
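A single database entry could be assembled along the following lines. This is a minimal Python sketch under several assumptions, not the Spinach implementation: the kernel is the standard powder-averaged kernel for two spin-½ electrons (assumed to correspond to Eq. 2), the background is a stretched exponential exp[-(kt)^(d/3)] with dimensionality d (assumed to correspond to Eq. 5), each random full width at half maximum is converted to the SD of the underlying normal distribution, and every numerical range is a placeholder rather than a value from Table 1. All function and argument names are hypothetical.

```python
import math
import random


def skew_normal(x, x0, sigma, alpha):
    """Skew normal density: normal of SD sigma centred at x0, skewed by alpha."""
    z = (x - x0) / sigma
    gauss = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return gauss * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))


def random_distribution(r_grid, rng, max_peaks=3, fwhm_range=(0.2, 1.0),
                        alpha_range=(-3.0, 3.0)):
    """Sum of 1..max_peaks skew normal peaks with random position, width, skew."""
    p = [0.0] * len(r_grid)
    for _ in range(rng.randint(1, max_peaks)):
        x0 = rng.uniform(r_grid[0], r_grid[-1])
        # Simplification: treat the FWHM as that of the underlying normal
        sigma = rng.uniform(*fwhm_range) / (2.0 * math.sqrt(2.0 * math.log(2.0)))
        alpha = rng.uniform(*alpha_range)
        for i, r in enumerate(r_grid):
            p[i] += skew_normal(r, x0, sigma, alpha)
    return p


def deer_kernel(t, r, n_quad=200):
    """Powder-averaged dipolar kernel, t in us and r in nm (g ~ 2 assumed)."""
    w_dd = 2.0 * math.pi * 52.04 / r ** 3  # rad/us
    total = 0.0
    for j in range(n_quad):  # midpoint rule over x = cos(theta) in [0, 1]
        x = (j + 0.5) / n_quad
        total += math.cos((1.0 - 3.0 * x * x) * w_dd * t)
    return total / n_quad


def form_factor(t_grid, r_grid, p):
    """Integrate p(r) against the kernel on a uniform r grid; equals 1 at t = 0."""
    norm = sum(p)
    return [sum(pk * deer_kernel(t, r) for r, pk in zip(r_grid, p)) / norm
            for t in t_grid]


def make_entry(t_grid, r_grid, rng, depth_range=(0.05, 0.6),
               max_noise_fraction=0.1, dim_range=(2.0, 3.5),
               rate_range=(0.0, 0.3)):
    """Draw every parameter uniformly from its interval and build one entry."""
    p = random_distribution(r_grid, rng)
    f = form_factor(t_grid, r_grid, p)
    depth = rng.uniform(*depth_range)
    noise_sd = rng.uniform(0.0, max_noise_fraction) * depth
    dim = rng.uniform(*dim_range)
    rate = rng.uniform(*rate_range)
    bg = [math.exp(-(rate * t) ** (dim / 3.0)) for t in t_grid]  # needs t >= 0
    # Background multiplies the noise-free signal; white noise is added last
    trace = [b * (1.0 - depth + depth * fk) + rng.gauss(0.0, noise_sd)
             for b, fk in zip(bg, f)]
    return {"distribution": p, "trace": trace, "depth": depth,
            "noise_sd": noise_sd, "dimension": dim}


rng = random.Random(0)                       # fixed seed for reproducibility
r_grid = [1.5 + 0.1 * i for i in range(46)]  # 1.5 to 6.0 nm
t_grid = [0.05 * i for i in range(32)]       # 0 to 1.55 us
entry = make_entry(t_grid, r_grid, rng)
```

Storing the drawn parameters next to the trace and the “true” distribution makes it possible to check later that the database covers the intended ranges. The slow part is the form factor, which in a real workflow would be computed once per distribution and reused with many noise and background realizations.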

Reliable neural network training requires signals in the database to be consistently scaled and to fall within the dynamic range of the transfer functions. The peak amplitude of each distance distribution was therefore brought by uniform scaling to 0.75, and all DEER traces were uniformly scaled and shifted so as to have the first point equal to 1 and the last point equal to 0.
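The two scaling operations described above can be written in a few lines of Python. This is a sketch with hypothetical function names; it assumes each signal is held as a plain list of floats and that the first and last points of a trace differ.

```python
def scale_distribution(p, peak=0.75):
    """Uniformly scale a distance distribution so its maximum equals `peak`."""
    top = max(p)
    return [v * peak / top for v in p]


def scale_trace(v):
    """Scale and shift a DEER trace so that the first point is 1 and the last is 0."""
    first, last = v[0], v[-1]
    return [(x - last) / (first - last) for x in v]


p_scaled = scale_distribution([0.0, 0.4, 2.0, 0.9, 0.1])
v_scaled = scale_trace([0.98, 0.80, 0.71, 0.66, 0.64])
```

For noisy traces the first and last points carry noise themselves, so the scaled values in between may fall slightly outside the interval from 0 to 1.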

The training process requires vast computing resources, but using the trained networks does not. For the networks and databases described in this communication, the training process for a 100-network ensemble takes about a week on a pair of NVIDIA Tesla K40 cards. Once the training process is finished, the networks can be used without difficulty on any computer capable of running MATLAB.
