TR segmentation and filtering algorithm

DC Dongju Chen
MS Minghui Shao
PM Pei Meng
CW Chunli Wang
QL Qi Li
YC Yuhang Cai
CS Chengcheng Song
XW Xi Wang
TS Taiping Shi
request Request a Protocol
ask Ask a question
Favorite

The Tree recursion (TR) Segmentation and Filtering Algorithm was developed by C++ . The input data format of the algorithm is (i) BAF data and (ii) LRR data. To reduce the noise in the input data, both BAF and LRR are preprocessed by a specially designed segmentation and filtering algorithm. First, if BAF ≥ 0.95 or BAF ≤ 0.05, defined as homozygous, the data would be removed from the BAF track because of its uselessness. Then the remaining BAF value is mirrored and flipped upward with 0.5 as the center, thus BAF =|BAF − 0.5|+ 0.5. For LRR, the bin LRR values are also first optionally filtered for outliers, defined as the total probability density is below the 30% quantile in all bins.

Next, the in-house TR segmentation algorithm, based on the calculation of the run-length, was used to roughly segment each chromosome, as shown in Additional file 1: Fig. S1. In this algorithm, the whole chromosome is taken as the root node, all the segmented sub-nodes are taken as the child nodes. The segmentation process can be simply described as the following steps:

Calculate the cumulative run-length of data (here refers to BAF and LRR) deviated from the mean (x-x¯) and select its maximum and minimum points as candidate breakpoints.

Make appropriate trade-offs of candidate breakpoints according to the location of breakpoints, length of segments, number of data points in segments, etc., that is, determine whether breakpoints should be recorded.

If none of the subfragment of the current fragment satisfies the record condition, a recursive judgment is initiated. Otherwise, it recursively slices its last child node.

After the termination condition is reached, recursion is carried out on horizontal child nodes.

If all child objects have been processed, the parent's level object will continue to be processed until it is finally traced back to the root node and no new child objects are created.

Then the fragments are merged in a cyclic manner. Firstly, for each segmented fragment of chromosome was traversed by the kernel density estimation, to find out the two fragments, which are closest to the same distribution and combine them. Secondly, the statistics of the newly merged fragment and its adjacent fragments are recalculated until all indicators meet the requirements. Besides, segmentation of BAF and LRR is carried out separately, and then the union set of the merged BAF and LRR segmentation list is taken, but the regions with too short or insufficient data points are iteratively removed.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A