TR segmentation and filtering algorithm

Dongju Chen; Minghui Shao; Pei Meng; Chunli Wang; Qi Li; Yuhang Cai; Chengcheng Song; Xi Wang; Taiping Shi

This method is extracted from research article: BMC Bioinformatics, Nov 2021

GSA: an independent development algorithm for calling copy number and detecting homologous recombination deficiency (HRD) from target capture sequencing

DOI: 10.1186/s12859-021-04487-9

The Tree Recursion (TR) segmentation and filtering algorithm was implemented in C++. It takes two inputs: (i) B-allele frequency (BAF) data and (ii) log R ratio (LRR) data. To reduce noise in the input, both BAF and LRR are preprocessed by a specially designed segmentation and filtering algorithm. First, sites with BAF ≥ 0.95 or BAF ≤ 0.05 are defined as homozygous and removed from the BAF track, because they are uninformative. The remaining BAF values are then mirrored upward around 0.5, so that BAF = |BAF − 0.5| + 0.5. For LRR, the bin-level LRR values are first optionally filtered for outliers, defined as bins whose total probability density falls below the 30% quantile across all bins.
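The preprocessing described above can be illustrated with a minimal Python sketch. It is not the authors' C++ code. The function names, the Gaussian kernel, the `bandwidth` value, and the reading of "total probability density" as a kernel density estimate evaluated at each bin's LRR value are all assumptions made for illustration; only the 0.05/0.95 thresholds, the mirroring formula, and the 30% quantile come from the text.

```python
import math


def preprocess_baf(baf_values, low=0.05, high=0.95):
    """Drop homozygous sites (BAF <= low or BAF >= high), then mirror around 0.5."""
    kept = [b for b in baf_values if low < b < high]
    return [abs(b - 0.5) + 0.5 for b in kept]


def gaussian_kde(points, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate.

    Assumption: the source does not say which kernel or bandwidth is used.
    """
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


def filter_lrr_outliers(lrr_values, quantile=0.30, bandwidth=0.1):
    """Keep bins whose estimated density is at or above the given quantile.

    Assumption: a bin counts as an outlier when the density at its LRR value
    lies below the `quantile` cutoff of the densities over all bins.
    """
    density = gaussian_kde(lrr_values, bandwidth)
    dens = [density(x) for x in lrr_values]
    ranked = sorted(dens)
    cutoff = ranked[int(quantile * (len(ranked) - 1))]
    return [x for x, d in zip(lrr_values, dens) if d >= cutoff]
```

In this sketch, a heterozygous site with BAF 0.3 and one with BAF 0.7 both map to 0.7 after mirroring, so the BAF track only carries the magnitude of the allelic imbalance.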

Next, the in-house TR segmentation algorithm, which is based on calculating the run-length, is used to roughly segment each chromosome, as shown in Additional file 1: Fig. S1. In this algorithm, the whole chromosome is taken as the root node and all segmented sub-nodes are taken as child nodes. The segmentation process can be summarized in the following steps:

1. Calculate the cumulative run-length of the data (here, BAF and LRR), that is, the cumulative sum of deviations from the mean, $\sum (x - \bar{x})$, and select its maximum and minimum points as candidate breakpoints.

2. Make appropriate trade-offs among the candidate breakpoints according to the location of the breakpoints, the length of the segments, the number of data points in the segments, etc.; that is, determine whether each breakpoint should be recorded.

3. If none of the subfragments of the current fragment satisfies the recording condition, a recursive judgment is initiated. Otherwise, the last child node is recursively sliced.

4. After the termination condition is reached, recursion is carried out on the horizontal (same-level) child nodes.

5. Once all child nodes have been processed, processing continues at the parent's level until it is finally traced back to the root node and no new child nodes are created.
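The steps above can be sketched in Python as a simplified recursive splitter. This is a sketch under stated assumptions, not the published implementation. The source does not give the exact recording condition or traversal order, so the `min_points` and `min_shift` parameters, the mean-difference test, and the plain depth-first recursion over child segments are hypothetical stand-ins.

```python
def cusum(values):
    """Cumulative sum of deviations from the mean, one entry per data point."""
    mean = sum(values) / len(values)
    total, out = 0.0, []
    for v in values:
        total += v - mean
        out.append(total)
    return out


def segment(values, start=0, end=None, min_points=5, min_shift=0.1):
    """Return sorted breakpoint indices found inside values[start:end].

    Hypothetical recording condition (min_points must be >= 1): a candidate
    is kept only if both neighbouring pieces hold at least `min_points`
    values and their means differ by at least `min_shift`.
    """
    if end is None:
        end = len(values)
    if end - start < 2 * min_points:
        return []  # termination: fragment too small to split further

    curve = cusum(values[start:end])
    i_max = max(range(len(curve)), key=curve.__getitem__)
    i_min = min(range(len(curve)), key=curve.__getitem__)

    accepted = []
    for i in sorted({i_max, i_min}):
        cut = start + i + 1  # the breakpoint sits just after the extreme point
        prev = accepted[-1] if accepted else start
        left, right = values[prev:cut], values[cut:end]
        if len(left) < min_points or len(right) < min_points:
            continue
        shift = abs(sum(left) / len(left) - sum(right) / len(right))
        if shift >= min_shift:
            accepted.append(cut)

    if not accepted:
        return []

    # Recurse into every child fragment created by the accepted breakpoints.
    bounds = [start] + accepted + [end]
    result = set(accepted)
    for lo, hi in zip(bounds, bounds[1:]):
        result.update(segment(values, lo, hi, min_points, min_shift))
    return sorted(result)
```

For a track with a flat baseline, a raised plateau, and a return to baseline, the cumulative curve reaches its minimum at the start of the plateau and its maximum at the end, so both edges appear as candidate breakpoints in a single pass.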

The fragments are then merged in a cyclic manner. First, the segmented fragments of each chromosome are traversed using kernel density estimation to find the two fragments whose distributions are most similar, and these two fragments are combined. Second, the statistics of the newly merged fragment and its adjacent fragments are recalculated, and the process repeats until all indicators meet the requirements. Segmentation of BAF and LRR is carried out separately, and the union of the merged BAF and LRR segment lists is then taken; regions that are too short or contain too few data points are iteratively removed.
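The merging and union steps can be sketched in Python as follows. The source does not specify the similarity measure, the stopping criteria, or how short regions are removed. This sketch therefore assumes that only neighbouring fragments are compared, that similarity is the integrated absolute difference between two Gaussian kernel density estimates, that merging stops at a hypothetical `max_distance`, and that a short region is removed by dropping one of its bounding breakpoints. All names and default values are illustrative.

```python
import math


def kde_distance(a, b, bandwidth=0.05, grid_size=200):
    """Approximate integral of |KDE(a) - KDE(b)| on a shared grid.

    Values near 0 mean similar distributions; values near 2 mean disjoint.
    """
    lo = min(min(a), min(b)) - 3 * bandwidth
    hi = max(max(a), max(b)) + 3 * bandwidth
    step = (hi - lo) / (grid_size - 1)

    def dens(points, x):
        norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return step * sum(
        abs(dens(a, lo + k * step) - dens(b, lo + k * step))
        for k in range(grid_size)
    )


def merge_segments(values, breakpoints, max_distance=0.5):
    """Repeatedly merge the most similar pair of neighbouring fragments."""
    bps = sorted(breakpoints)
    while bps:
        bounds = [0] + bps + [len(values)]
        # Statistics are recomputed from scratch after every merge.
        dists = [
            kde_distance(values[bounds[i]:bounds[i + 1]],
                         values[bounds[i + 1]:bounds[i + 2]])
            for i in range(len(bps))
        ]
        best = min(range(len(bps)), key=dists.__getitem__)
        if dists[best] > max_distance:
            break  # hypothetical stopping rule
        del bps[best]
    return bps


def union_breakpoints(baf_bps, lrr_bps, chrom_length, min_length):
    """Union of two breakpoint lists (genomic positions), then iteratively
    drop a bounding breakpoint of the shortest region while it is too short."""
    bps = sorted(set(baf_bps) | set(lrr_bps))
    while True:
        bounds = [0] + bps + [chrom_length]
        lengths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
        shortest = min(range(len(lengths)), key=lengths.__getitem__)
        if not bps or lengths[shortest] >= min_length:
            return bps
        # Drop the region's right-hand breakpoint (left-hand for the last region).
        del bps[min(shortest, len(bps) - 1)]
```

Taking the union on genomic positions rather than on data indices matters here because BAF sites and LRR bins generally do not share the same coordinates.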

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
