Variant validation is based on the PEM signature. First, for each suspicious variant, we extracted from the corresponding BAM file all read pairs in which both ends were mapped within the region of interest (ROI). To provide an adaptive zoom, the ROI is defined as the genomic region of the variant extended both upstream and downstream by 1 kbp plus half of the variant length.
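The ROI definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the 0-based half-open coordinate convention, the integer rounding of "half of the variant length", and the clamping at position 0 are all assumptions made for this example.

```python
def region_of_interest(var_start, var_end, flank=1000):
    """Return (roi_start, roi_end) for a variant spanning [var_start, var_end).

    Assumptions (not stated in the text): 0-based half-open coordinates,
    integer division for half of the variant length, and clamping at 0.
    """
    var_length = var_end - var_start
    pad = flank + var_length // 2  # 1 kbp plus half of the variant length
    return max(0, var_start - pad), var_end + pad


def pair_within_roi(pair_start, pair_end, roi):
    """True if the whole read-pair span lies inside the ROI.

    pair_start is the leftmost mapped base of the pair and pair_end is one
    past the rightmost mapped base, so both ends are inside the ROI exactly
    when the whole span is.
    """
    roi_start, roi_end = roi
    return roi_start <= pair_start and pair_end <= roi_end
```

For a hypothetical 2 kbp variant at positions 10,000 to 12,000, the padding is 1,000 + 1,000 = 2,000 bp on each side, so the ROI scales with the variant size.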
Next, the F-score [23] is used to measure the quality of the overlap between the span of a suspicious variant and that of a mapped read pair. F-scores range from 0 to 1 (see Figure 1, which shows several typical scores): a value close to 0 indicates a poor overlap, whereas a value close to 1 indicates a good overlap. The F-score is calculated as follows: if a test span has no overlap with the reference span, the F-score is set to 0; otherwise, F = 2PR/(P + R), where P is the precision (the fraction of the test span that overlaps the reference span) and R is the recall (the fraction of the reference span that overlaps the test span).
Illustration of the relationship between F-score and overlap quality. The bottom red span is the reference, and the 11 blue spans are test spans, whose F-scores with respect to the reference are shown, ranging from 0.1 (very poor overlap) to 1 (perfect overlap).
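The F-score of two spans, as defined above, can be computed as in the following minimal sketch. The function name and the representation of spans as half-open `(start, end)` tuples are assumptions made for illustration only.

```python
def f_score(test, reference):
    """F-score of the overlap between two spans given as (start, end).

    Spans are assumed to be half-open intervals with end > start.
    """
    t_start, t_end = test
    r_start, r_end = reference
    overlap = min(t_end, r_end) - max(t_start, r_start)
    if overlap <= 0:
        return 0.0  # no overlap: the F-score is set to 0
    precision = overlap / (t_end - t_start)  # share of the test span covered
    recall = overlap / (r_end - r_start)     # share of the reference covered
    return 2 * precision * recall / (precision + recall)


reference = (0, 100)
print(f_score((0, 100), reference))    # identical spans
print(f_score((50, 150), reference))   # half of each span overlaps
print(f_score((200, 300), reference))  # disjoint spans
```

Because F = 2PR/(P + R) simplifies to twice the overlap length divided by the sum of the two span lengths, the score is symmetric: swapping the test and reference spans gives the same value.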
In our study, mapped read pairs with F-scores larger than 0.7 were selected, and the average and sum of the mapping qualities of all selected pairs were calculated. A suspicious variant was identified as a true positive if the average and sum were above 30 and 90, respectively; otherwise, it was classified as a false positive. Therefore, one pair with a mapping quality of 90, two pairs with a mapping quality of 45, or three pairs with a mapping quality of 30 constitute the minimal requirement to confirm a true positive. Figure 2 shows two typical examples. Both regions have high GC content and low mappability, but (a) shows no PEM signature whereas (b) does. Therefore, (a) is a false positive and (b) is a true positive.
Two examples of suspicious variants. (a) A false positive and (b) a true positive of sample NA18507. The upper, middle, and lower left panels of each subfigure display the profiles of GC content, mappability, and DOC, respectively; the right panel displays the PEM profile, in which each horizontal line represents a read pair and the face colour encodes the mapping quality (yellow and black represent low and high mapping quality, respectively). The green bar in each panel marks the studied DGV variant.
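The selection and classification rule described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name, the parameter names, and the input format (a list of `(f_score, mapping_quality)` tuples) are hypothetical. The text states that the average and sum must be "above" 30 and 90, while its minimal examples (for instance, three pairs with a mapping quality of 30) sit exactly on those values, so this sketch assumes the thresholds are inclusive.

```python
def is_true_positive(pairs, min_f=0.7, min_mean_mapq=30, min_sum_mapq=90):
    """Classify a suspicious variant from its candidate read pairs.

    pairs: iterable of (f_score, mapping_quality) tuples, one per read pair.
    Thresholds on the mean and sum are treated as inclusive (an assumption
    chosen to agree with the minimal examples in the text).
    """
    # Keep only pairs whose span overlaps the variant well enough.
    selected = [mapq for f, mapq in pairs if f > min_f]
    if not selected:
        return False  # no supporting pair: no PEM signature
    total = sum(selected)
    mean = total / len(selected)
    return mean >= min_mean_mapq and total >= min_sum_mapq


# The three minimal cases named in the text.
print(is_true_positive([(0.9, 90)]))
print(is_true_positive([(0.8, 45), (0.75, 45)]))
print(is_true_positive([(0.9, 30), (0.85, 30), (0.8, 30)]))
```

Requiring both conditions means that many low-quality pairs cannot confirm a variant through their sum alone, and a single moderate-quality pair cannot confirm it through its average alone.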