Endogenous DSBs in NHEK cells were obtained from a previous study (18), considering only the 84,946 high-confidence DSBs common to the DSBCapture replicates. Transcription was assessed using polyA-depleted, strand-specific RNA sequencing (RNA-seq) data from nuclear fractions of NHEK cells (GSM2072453). DNA accessibility profiles (DNase-seq) for NHEK cells were previously generated by the National Institutes of Health Roadmap Epigenomics Mapping Consortium. DSB and DNA accessibility coordinates were converted from the hg19 to the hg38 human genome assembly using liftOver (38). Gene annotations were obtained from GENCODE (v26) and merged into a single transcript model per gene, and overlapping genes were removed using BEDtools (39). Only DSBs located in the gene body region [from 500 bp downstream of the transcription start site (TSS) to the transcription termination site (TTS)] of transcriptionally active genes were considered in the downstream analyses. Transcriptionally active genes were defined as those with expression levels [transcripts per million (TPM)] from Kallisto (40) above the 50th percentile (median). DSBs with antisense transcription were determined with a cutoff of 0.01 RPKM (reads per kilobase per million mapped reads) in the 500-bp flanking regions, according to the strand-specific information. For the metaprofiles, DSBs were aligned by the median region, and the read density for the flanking 1 kb was averaged in 20-bp windows. All profiles and heatmaps were plotted as normalized RPKM in 20-bp windows. A set of in-house scripts for data processing and graphical visualization was written in bash and in the R language and environment (www.R-project.org). SAMtools (41) and BEDtools were used for alignment manipulation, filtering steps, file format conversion, and comparison of genomic features. Fold enrichment over random and statistical significance of the overlap between transcription and DSBs were assessed by permutation analysis.
Briefly, random DSB datasets were generated 1000 times from transcriptionally active genes using the BEDtools shuffle function (maintaining the number and length of the original datasets). The P value was determined as the frequency of random datasets showing an overlap at least as extreme as that of the observed data.
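
The permutation analysis described above can be illustrated with a minimal sketch. It is written in plain Python rather than the bash, R, and BEDtools scripts used in the study, and it is not the authors' code. The function names (`count_overlapping`, `shuffle_within`, `permutation_test`), the `(chrom, start, end)` tuple layout, and the choice to place each shuffled interval uniformly over all valid start positions are assumptions made for illustration only; the BEDtools shuffle function may place intervals differently. The sketch also assumes that every DSB fits inside at least one active gene.

```python
import random
from statistics import mean


def count_overlapping(intervals, features):
    """Count intervals that overlap at least one feature on the same chromosome.

    Intervals and features are (chrom, start, end) tuples, half-open as in BED.
    """
    return sum(
        any(c == fc and s < fe and fs < e for fc, fs, fe in features)
        for c, s, e in intervals
    )


def shuffle_within(intervals, regions, rng):
    """Relocate each interval to a random position inside the given regions.

    The number of intervals and the length of each one are preserved.
    Assumption: each interval is no longer than at least one region.
    """
    shuffled = []
    for _, start, end in intervals:
        length = end - start
        candidates = [r for r in regions if r[2] - r[1] >= length]
        # Weight each region by its number of valid start positions so that
        # every admissible placement is equally likely.
        weights = [r[2] - r[1] - length + 1 for r in candidates]
        chrom, r_start, r_end = rng.choices(candidates, weights=weights)[0]
        new_start = rng.randint(r_start, r_end - length)
        shuffled.append((chrom, new_start, new_start + length))
    return shuffled


def permutation_test(dsbs, features, regions, n_perm=1000, seed=0):
    """Return (observed overlap count, fold enrichment over random, P value)."""
    rng = random.Random(seed)
    observed = count_overlapping(dsbs, features)
    null = [
        count_overlapping(shuffle_within(dsbs, regions, rng), features)
        for _ in range(n_perm)
    ]
    expected = mean(null)
    fold = observed / expected if expected > 0 else float("inf")
    # Frequency of random datasets at least as extreme as the observed one.
    p_value = sum(n >= observed for n in null) / n_perm
    return observed, fold, p_value
```

Here `regions` would hold the gene bodies of transcriptionally active genes and `features` the transcribed regions being tested for overlap. With 1000 permutations the smallest nonzero P value this frequency can take is 0.001, so a result of 0 is better read as P < 0.001.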
