The features used by the ANN are described below (the number in parentheses represents the number of input neurons each feature occupies); a hypothetical sketch of how these features could be packed into the input vector follows the list:
Aligned/unaligned percentages flanking the novel adjacency (5)
Alignment E values flanking the novel adjacency (2)
Relative alignment bit scores flanking the novel adjacency (2)
Alignment identities flanking the novel adjacency (2)
The fraction of mismatches in alignments flanking the novel adjacency (2)
The fraction of gaps in alignments flanking the novel adjacency (2)
SV complexity: the number of coexisting SVs found at the novel adjacency (1)
Total number of alignments found on read (1)
Total number of SVs that appear to be captured by the read (1)
Number of different chromosomes the read aligns to (1)
The fraction of alignments shorter than 5% of the read length (1)
Number of breakend-supporting reads B (1)
The fraction of breakend-supporting reads B over the total read depth B + O (1)
If the SV is an insertion or deletion, the size of the inserted or deleted segment (1)
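As a point of reference, the following sketch shows one way the fourteen feature groups above could map onto the 23-dimensional input vector. The dictionary keys are hypothetical names chosen for illustration and are not taken from the original implementation; only the per-feature neuron counts come from the list above.

```python
# Hypothetical mapping of the 14 feature groups to the 23 input neurons.
FEATURE_DIMS = {
    "flank_aligned_pct": 5,     # aligned/unaligned percentages
    "flank_evalue": 2,          # alignment E values
    "flank_rel_bitscore": 2,    # relative alignment bit scores
    "flank_identity": 2,        # alignment identities
    "flank_mismatch_frac": 2,   # fraction of mismatches
    "flank_gap_frac": 2,        # fraction of gaps
    "sv_complexity": 1,         # coexisting SVs at the novel adjacency
    "n_alignments_on_read": 1,
    "n_sv_on_read": 1,
    "n_chromosomes": 1,
    "short_alignment_frac": 1,  # alignments < 5% of read length
    "n_breakend_reads": 1,      # B
    "breakend_read_frac": 1,    # B / (B + O)
    "indel_size": 1,            # insertion/deletion size, if applicable
}
assert sum(FEATURE_DIMS.values()) == 23  # matches the input layer size
```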
The value of each feature is scaled to the range [0, 1] by min-max normalization. The Python library Keras [41], with TensorFlow [42] as its backend engine, was used to build the ANN model and perform inference. The neural network is a feed-forward model consisting of a 23-neuron input layer, two hidden layers of 12 and 5 neurons, and a single-neuron output layer. The rectified linear unit (ReLU) activation function is used for the two hidden layers, while the sigmoid activation function is used for the output layer. Dropout regularization is applied after each hidden layer, with probabilities of 0.4 and 0.3, respectively. If $y_{k,i}$ denotes the value of the $i$th neuron in the $k$th layer, we have that

$$y_{k,i} = F\left(\sum_{j} w^{(k)}_{ji}\, y_{k-1,j}\right),$$

where $F(x) = \max(x, 0)$ denotes the ReLU non-linearity and $w^{(k)}_{ji}$ is the neural weight between the $j$th neuron of the $(k-1)$th layer and the $i$th neuron of the $k$th layer.
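A minimal Keras sketch of the architecture described above (23-12-5-1 feed-forward network with ReLU and sigmoid activations, and dropout of 0.4 and 0.3 after the hidden layers). The `min_max_scale` helper is an assumption about how the normalization might be written, not the authors' code.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def min_max_scale(x):
    """Scale each feature column to [0, 1] (min-max normalization)."""
    x = np.asarray(x, dtype="float32")
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Feed-forward network: 23 -> 12 -> 5 -> 1, as described in the text.
model = keras.Sequential([
    layers.Input(shape=(23,)),
    layers.Dense(12, activation="relu"),    # first hidden layer
    layers.Dropout(0.4),                    # dropout after first hidden layer
    layers.Dense(5, activation="relu"),     # second hidden layer
    layers.Dropout(0.3),                    # dropout after second hidden layer
    layers.Dense(1, activation="sigmoid"),  # single-neuron output layer
])
```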
Ten million in silico 3GS reads generated from a simulated genome containing 61,316 mixed-zygosity SVs were used to train a binary classifier ANN model through supervised learning. The ten million reads were distributed randomly into 20 sub-datasets before read-depth clustering to reduce the sequencing depth to 1X. The entire training dataset consists of 933,351 true and 41,186 false examples of novel adjacencies. Another simulated dataset (4X) with a different SV profile was used as the test dataset. Binary cross-entropy was used as the loss function, and stochastic gradient descent (SGD) was used as the optimizer, both with their default parameters. Classification accuracy was collected and reported as the metric for assessing model performance. Sixty-three epochs were performed for model training, with each epoch drawing 12,000 true and 12,000 false randomly selected examples and using a batch size of 400 examples per iteration.
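The per-epoch balanced resampling could be expressed as follows. This is a sketch under stated assumptions: `X_true` and `X_false` stand for the normalized feature matrices of the 933,351 true and 41,186 false examples and are placeholders introduced here for illustration; `model` is the network from the previous sketch, and the loss, optimizer, and metric are those named above.

```python
import numpy as np

rng = np.random.default_rng()
EPOCHS, PER_CLASS, BATCH_SIZE = 63, 12_000, 400

# Loss, optimizer, and metric as stated in the text (default parameters).
model.compile(loss="binary_crossentropy", optimizer="sgd",
              metrics=["accuracy"])

for epoch in range(EPOCHS):
    # Draw a balanced sample: 12,000 true and 12,000 false examples.
    t = rng.choice(len(X_true), PER_CLASS, replace=False)
    f = rng.choice(len(X_false), PER_CLASS, replace=False)
    X = np.concatenate([X_true[t], X_false[f]])
    y = np.concatenate([np.ones(PER_CLASS), np.zeros(PER_CLASS)])
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=1, verbose=0)
```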