The neural network model weights and biases from Section 2.3 were extracted and saved into a CSV file, and C code for the neural network model was then generated using the Keras2c library (Conlin et al., 2021). This C code was converted to C++ for use in Buccaneer. As part of this work, the Keras2c library was extended to support the masking layer. The output of the Keras2c-generated code was validated against the Keras Python framework.
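To illustrate this validation step, a minimal C++ sketch is shown below. It simply compares the ported network's outputs with reference outputs computed by the Keras Python framework; the function `predict_cpp`, the CSV-derived reference data and the tolerance are illustrative assumptions, not the actual Keras2c test harness.

```cpp
// Hedged sketch of the validation step: compare the ported network's outputs
// with reference outputs produced by the Keras Python framework.
// predict_cpp() and the tolerance are placeholders, not the Keras2c test code.
#include <cmath>
#include <cstddef>
#include <vector>

// Placeholder for the C++ port of the Keras2c-generated network.
std::vector<float> predict_cpp(const std::vector<float>& input) {
  return input;  // stand-in only; the real function evaluates the network
}

bool outputs_match(const std::vector<std::vector<float>>& inputs,
                   const std::vector<std::vector<float>>& keras_outputs,
                   float tolerance = 1e-5f) {
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    const std::vector<float> predicted = predict_cpp(inputs[i]);
    if (predicted.size() != keras_outputs[i].size()) return false;
    for (std::size_t j = 0; j < predicted.size(); ++j)
      if (std::fabs(predicted[j] - keras_outputs[i][j]) > tolerance) return false;
  }
  return true;
}
```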
As shown in Fig. 2, the neural network model is used in the joining step of Buccaneer, after the fragments built by Buccaneer in earlier steps have been split into tripeptides and before Buccaneer performs its tracing substep. The role of the neural network is to partition the set of tripeptides into a subset of ‘favourable’ fragments that are used in the tracing substep and a subset of ‘unfavourable’ fragments that are disregarded (i.e. are not used for this tracing). To this end, a threshold is applied to the outputs of the neural network, such that tripeptides are deemed ‘favourable’ if their associated neural network outputs (i.e. their estimated scores of being ‘favourable’) are above this threshold. To improve the likelihood of producing a good protein model, multiple thresholds are used to generate a small set of such models, and a decision tree developed in our project is employed to select the best of these models at the end of the Buccaneer model-building cycle.
Figure 2. Creating the training data sets, the neural network architecture and the use of the neural network in Buccaneer.
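The selection applied at a single threshold can be pictured with the short sketch below; the `Fragment` struct, the `nn_score` field and the function name are illustrative placeholders rather than Buccaneer's internal types.

```cpp
// Minimal sketch: keep only tripeptides whose neural network score exceeds
// the threshold. Fragment and nn_score are illustrative placeholders for
// Buccaneer's internal representation.
#include <vector>

struct Fragment {
  double nn_score;  // network's estimate of the tripeptide being 'favourable'
};

std::vector<Fragment> select_favourable(const std::vector<Fragment>& tripeptides,
                                        double threshold) {
  std::vector<Fragment> favourable;
  for (const Fragment& f : tripeptides)
    if (f.nn_score > threshold) favourable.push_back(f);  // kept for tracing
  return favourable;
}
```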
Two mechanisms for determining the thresholds were developed. The first mechanism is to set a fixed number of thresholds (for example, ten thresholds) that divide the score range into equal intervals. The second mechanism is to use the Freedman–Diaconis rule to determine the number of thresholds based on the score distribution (Freedman & Diaconis, 1981). The Freedman–Diaconis bin width is calculated as

bin width = 2 × IQR / n^(1/3),
where IQR is the interquartile range (the difference between the third and first quartiles) and n is the number of samples. The resulting bin width is used to split the score range and thereby determines the thresholds.
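A minimal sketch of this second mechanism is given below; the simple quartile indexing and the use of the lowest score as the starting threshold are assumptions rather than Buccaneer's exact implementation.

```cpp
// Sketch: derive thresholds from the Freedman-Diaconis bin width,
// bin width = 2 * IQR / cbrt(n). Quartile handling is simplified.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> fd_thresholds(std::vector<double> scores) {
  if (scores.empty()) return {};
  std::sort(scores.begin(), scores.end());
  const std::size_t n = scores.size();
  const double iqr = scores[(3 * n) / 4] - scores[n / 4];  // Q3 - Q1 (approximate)
  const double bin_width = 2.0 * iqr / std::cbrt(static_cast<double>(n));
  if (bin_width <= 0.0) return {scores.front()};

  std::vector<double> thresholds;
  for (double t = scores.front(); t < scores.back(); t += bin_width)
    thresholds.push_back(t);  // the first threshold is the lowest score
  return thresholds;
}
```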
A model is built for each threshold by eliminating the tripeptides whose scores are lower than that threshold. Moreover, we run either one or two Buccaneer confirmation building cycles to estimate how each candidate structure will evolve in subsequent building cycles, and then pick the best model.
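The per-threshold loop could look like the sketch below; `build_and_confirm()` is a hypothetical stand-in for Buccaneer's building and confirmation cycles, and `CandidateModel` is not an actual Buccaneer type.

```cpp
// Sketch: build one candidate model per threshold and collect them so the
// best can be chosen later. build_and_confirm() is a hypothetical stand-in
// for Buccaneer's building and confirmation cycles.
#include <vector>

struct CandidateModel {
  double threshold;
  double completeness;  // illustrative evaluation indicator
};

// Placeholder: would rebuild the model using only fragments scoring above
// 'threshold' and run one or two confirmation building cycles.
CandidateModel build_and_confirm(double threshold) {
  return CandidateModel{threshold, 0.0};
}

std::vector<CandidateModel> build_candidates(const std::vector<double>& thresholds) {
  std::vector<CandidateModel> candidates;
  for (double t : thresholds)
    candidates.push_back(build_and_confirm(t));  // one model per threshold
  return candidates;  // the decision tree later selects the best of these
}
```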
A decision tree was trained to predict the best indicators to use in picking the best model (from the models built at different thresholds) using the Weka framework version 3.8.5 (Eibe et al., 2016). Reduced-error pruning (REP) was used to reduce the size of the decision tree by replacing subtrees with leaves labelled with the most frequently predicted class; such a change is kept only if the performance of the tree is not negatively affected (Elomaa & Kaariainen, 2001). The decision tree is used separately from the neural network to pick the best model. The training data set for the decision tree was obtained by running Buccaneer with two different seeds and no neural network, as using nondefault seeds leads to changes in the built model. Using a nondefault seed changes the noise in the training map and causes very small changes in the LLK targets, although those changes are entirely within the uncertainties in the data. This may lead to seeds being found in very slightly different positions and orientations or, more rarely, to one seed being pushed off the bottom of the list and replaced by another.
Growing is affected more, because these small changes are sometimes amplified as a chain is grown until a point is reached where two alternative paths are possible and a different one is selected. The outcome is that the resulting traces are similar, but some will usually differ significantly.
The differences in the Buccaneer evaluation indicators, Rwork and Rfree were calculated between models built from the same data set (Table 2). The number of residues uniquely allocated to a chain is determined by estimating how many chains are present (from how many independent copies of the sequence appear to have been built), allocating each sequenced chain fragment to one chain based on a score that favours compactness and completeness of each chain, and then counting how many residues of the expected sequences have actually been accounted for in this way. The Buccaneer evaluation indicators are interpreted in combination; for example, a model with a high number of residues uniquely allocated to a chain and a low number of residues built is better than a model with a high number of residues built and a low number of residues uniquely allocated to a chain. We deemed a model to be better when its structure completeness was at least 5% higher. The actual differences between the evaluation indicators were replaced by binary labels: ‘Y’ when the indicator is better according to Table 2 and ‘N’ otherwise. (An example of the labelling of training features is given in the supporting information.) Under-sampling was applied to class ‘N’ to reduce it from 906 to 281 protein structures, and tenfold cross-validation was used to train the decision tree. Each fold had the same proportion of each class as the training data set after under-sampling, as Weka uses stratified cross-validation by default.
Table 2. Whether a higher or lower value of each indicator is better is indicated.
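The labelling of the training features can be illustrated with the sketch below; the `higher_is_better` flag and the completeness comparison mirror the rules described above, but the function names are illustrative and are not Buccaneer's or Weka's code.

```cpp
// Sketch: turn indicator differences between two models into the binary
// 'Y'/'N' features used to train the decision tree. Function names are
// illustrative only.

char label_indicator(double value_a, double value_b, bool higher_is_better) {
  // 'Y' if the second model's indicator moved in the favourable direction
  // (as defined in Table 2), 'N' otherwise.
  const bool better = higher_is_better ? (value_b > value_a) : (value_b < value_a);
  return better ? 'Y' : 'N';
}

bool second_model_is_better(double completeness_a, double completeness_b) {
  // The second model is deemed better only if its structure completeness is
  // at least 5 percentage points higher.
  return (completeness_b - completeness_a) >= 5.0;
}
```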
The first of these multiple models is built from all of the fragments, as the first threshold used to partition tripeptides into ‘favourable’ and ‘unfavourable’ is always the lowest score. The number of confirmation building cycles is the number of remaining building cycles. For example, if Buccaneer runs for three building cycles, we run two confirmation building cycles in the first building cycle and one in the second; no confirmation building cycle is run in the third. As our neural network model is limited to 2849 tripeptides, Buccaneer batches the tripeptides and runs the neural network multiple times when the number of tripeptides exceeds this limit.
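The batching can be sketched as below; `Tripeptide` and `score_batch()` are illustrative placeholders for Buccaneer's fragment representation and the ported network call, and the only detail taken from the text is the 2849-tripeptide limit.

```cpp
// Sketch: score tripeptides in batches of at most 2849, the fixed input size
// of the network. Tripeptide and score_batch() are illustrative placeholders.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxTripeptides = 2849;

struct Tripeptide { /* features omitted */ };

// Placeholder for one call to the ported neural network.
std::vector<double> score_batch(const std::vector<Tripeptide>& batch) {
  return std::vector<double>(batch.size(), 0.0);
}

std::vector<double> score_all(const std::vector<Tripeptide>& tripeptides) {
  std::vector<double> scores;
  for (std::size_t start = 0; start < tripeptides.size(); start += kMaxTripeptides) {
    const std::size_t end = std::min(start + kMaxTripeptides, tripeptides.size());
    const std::vector<Tripeptide> batch(tripeptides.begin() + start,
                                        tripeptides.begin() + end);
    const std::vector<double> batch_scores = score_batch(batch);
    scores.insert(scores.end(), batch_scores.begin(), batch_scores.end());
  }
  return scores;
}
```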