Analysis of high-throughput sequencing datasets

EA E. Agirre
AO A. J. Oldfield
NB N. Bellora
AS A. Segelle
RL R. F. Luco
request Request a Protocol
ask Ask a question
Favorite

Available paired-end RNA-seq data was downloaded from the ENCODE portal (https://www.encodeproject.org/), which compiles both ENCODE and ROADMAP data. Percent of inclusion (PSI) of alternatively spliced exons was calculated using the raw fastq files, in which reads were filtered following the ENCODE guidelines, for human H1 hESC, K562, IMR90, A549, Gm12878, GM19238, HEK293, HelaS3, HepG2 and MCF7 cell lines and mouse mESC files. RNA-seq from the nuclear fraction was used for downstream analysis if total RNA-seq was not available. Data was aligned in hg19 genome with STAR 2.3.0e53 with parameters;–runThreadN 4–outFilterMismatchNmax 2–clip3pAdapterSeq AAAAAA–clip3pAdapterMMp 0 –outSJfilterReads Unique –alignSJDBoverhangMin 3 –alignSJoverhangMin 5–outSJfilterOverhangMin 30 12 12 12–outSJfilterCountUniqueMin 5 1 1 1–outSJfilterIntronMaxVsReadN 50000 100000 200000–sjdbScore 2–outFilterType BySJout–outSAMattributes All–seedSearchStartLmax 50, using an overhang of 99nt for long RNAseq reads and a custom database of exon–intron junction annotations from https://github.com/nellore/intropolis. All the exons and introns were extracted from BioMart using Ensembl72 annotations. Alternatively spliced cassette exons, in which an alternatively spliced exon is flanked by two constitutive exons, were extracted. First and second exons were excluded from the analysis to avoid a chromatin effect from the transcription start sites. Using the Ensembl biotype term, we also discarded from the final dataset all the exons not labelled as protein coding or noncoding RNA. We considered as constitutive exons all the exons annotated with the constitutive exon term in Ensembl72 that were coming from the same transcripts as the selected alternatively splice exons and were included at a PSI > 95% in more than 75% of the 10 cell lines analysed. For each cell line the SJ.out.tab file from STAR output was filtered to recover all the exon-intron junctions present at least with a count of 5 reads. We extracted the number of reads at all the exon–intron (for inclusion) and exon–exon (for exclusion) junctions from the final exon triplets dataset. Final PSI was calculated as a ratio of the number of reads including the exon, divided by the sum of the exclusion and inclusion reads. For the inclusion reads we considered the average value, since we expect reads coming from the acceptor and donor sites. Based on the cumulative distribution of PSI in H1 and IMR90 cell lines, four splicing groups were created: excluded (0% < PSI < 20%), mid-excluded (20% < PSI < 40%), mid-included (40% < PSI < 80%) and included (80% < PSI < 100%).

Regarding gene expression, it was calculated with Salmon 0.8.254, first we quantified the transcriptome for each cell line, using the Ensembl72 transcripts fasta file, and then selected for genes with TPM values ≥10 to discard lowly expressed gene. When comparing different cell lines, alternatively spliced genes with different expression levels were excluded to avoid confounding transcriptional effects.

ChIP-Seq and MeDip-seq data were obtained from the ENCODE portal (https://www.encodeproject.org/), which compiles both ENCODE and ROADMAP data. Reads were mapped to the reference genome hg19 using Bowtie54, keeping the best unique matches, with at most two mismatches to the reference (−v 2 –best –strata -m 1). Reads were extended to 200 nt in the 5′ to 3′ direction using Pyicos55. Then using BedTools56, we removed repetitive reads overlapping centromeres, gaps, satellites and pericentromeric regions. For each sample, we used Pyicos to build clusters with overlapping reads along the genome, discarding single-read clusters. In order to avoid the usage of clusters that are possibly part of the background signal (and not of the real ChIP-Seq signal), we used input samples (when available) for the peak calling normalisation to identify clusters that are significantly above input values19.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A