TRACE-seq Data analysis

Bo Lu; Liting Dong; Danyang Yi; Meiling Zhang; Chenxu Zhu; Xiaoyu Li; Chengqi Yi

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Preprint

TRACE-seq Data analysis

Last updated date: Nov 10, 2020 Views: 1046 Forks: 0

An abbreviated version of this protocol was published in eLife in Jul, 2020

Transposase-assisted tagmentation of RNA/DNA hybrid duplexes

Download PDF

Ask a question

How to cite

Favorite

1. Quality control and adaptor trimming by Trim Galore:

mkdir 1.adaptor_trimmed

for sample in `find rawdata/ -name "*_R1.fastq.gz"`

sample2=${sample/R1/R2}

trim_galore -q 20 --phred33 --stringency 3 --length 20 -o 1.adaptor_trimmed/ --paired

$sample $sample2

done

2. Down-sampling by seqtk:

mkdir 2.DownSampling

for sample in `find 1.adaptor_trimmed/ -name "*fq.gz" |sort`

file=$(basename $sample)

name=${file%%_val*}

seqtk sample -s 100 $sample 30000000 >2.DownSampling/${name}_30M.fq

#seqtk sample -s 100 $sample 60000000 >2.DownSampling/${name}_60M.fq

done

3. Read mapping by STAR:

(1)

REFDATH=Reference/human/hg19

STAR --runThreadN 20 --runMode genomeGenerate --genomeDir ${REFDATH}/genome/

--genomeFastaFiles ${REFDATH}/genome/hg19_genome.fa --sjdbGTFfile${REFDATH}/

transcriptome/hg19_refGene.gtf --sjdbOverhang 149

#REFDATH=Reference/mouse/mm10

#STAR --runThreadN 20 --runMode genomeGenerate --genomeDir ${REFDATH}/genome/

--genomeFastaFiles ${REFDATH}/genome/mm10_genome.fa --sjdbGTFfile ${REFDATH}/

transcriptome/mm10_refGene.gtf --sjdbOverhang 149 40

(2)

REFDATH=Reference/human/hg19

#REFDATH=Reference/mouse/mm10

mkdir 3.STAR_out

for sample in `find 2.DownSampling -name "*R1_30M.fq" |sort`

sample2=${sample/R1/R2}

file=$(basename $sample)

name=${file%%_R1*}

STAR --runThreadN 20 --outSAMattributes NH HI AS nM NM MD jM jI XS MC ch --

genomeDir ${REFDATH}/genome/ --readFilesIn $sample $sample2--outSAMtype BAM

SortedByCoordinate --outMultimapperOrder Random --outFileNamePrefix

3.STAR_out/${name}_ --outFilterIntronMotifs RemoveNoncanonicalUnannotated

done

4. Filtering mapped Read by samtools:

for sample in `find 3.STAR_out -name "*sortedByCoord.out.bam"`

samtools view -f 3 -F 256 -b $sample -o ${sample/Aligned*out/primary_alignment

_only}

samtools index ${sample/Aligned*out/primary_alignment_only}

done

5. counting Reads per gene with HTSeq:

(1)

for sample in `find 3.STAR_out -name "*_primary_alignment_only.bam"`

samtools sort -n -@ 10 -o ${sample/primary_alignment_only/name_sorted}$sample

done

(2)

gtf=Reference/mouse/mm10/transcriptome/mm10_refGene.gtf

mkdir 4.HTseq_count

for sample in `find 3.STAR_out -name "*name_sorted.bam"|sort`

file=`basename $sample`

name=${file%%_name*}

htseq-count -m intersection-strict --secondary-alignments=ignore --supplementary-

alignments=ignore -f bam -s no $sample $gtf >../4.HTseq_count/${name}_strict.count

done

6. Differential gene expression analysis with DESeq2:

library(DESeq2)

htseq_dir<-"~/Documents/TRACE-seq/htseq_count_strict/"

sample_files<-sort(grep(".count",list.files(htseq_dir),value = T))

sample_names<-sub("(.*)_strict.count","\\1",sample_files)

sample_condition<-c("treated","treated","treated","untreated","untreated","untreated")

sampleTable<-data.frame(sampleName=sample_names, fileName=sample_files,

condition=sample_condition)

ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory =

htseq_dir, design = ~ condition)

ddsHTSeq<-ddsHTSeq[rowSums(counts(ddsHTSeq))>=10,]

ddsHTSeq$condition<-relevel(ddsHTSeq$condition, ref = "untreated")

ddsHTSeq<-DESeq(ddsHTSeq)

resLFC <- lfcShrink(ddsHTSeq, coef="condition_treated_vs_untreated", type="normal",

lfcThreshold = 1)

res_total<-data.frame(resLFC)

for (i in 1:nrow(res_total)) {

if (res_total[i,'padj'] < 0.05 & res_total[i,'log2FoldChange'] > 1) {

res_total[i,'type'] <- 'AA'

} else if (res_total[i,'padj'] < 0.05 & res_total[i,'log2FoldChange'] < -1) {

res_total[i,'type'] <- 'BB'

} else {

res_total[i,'type'] <- 'Non_sig'

105 }

106 }

sum(res_total$type == 'AA')

sum(res_total$type == 'BB')

pdf("volcano.pdf", height = 5, width = 7)

library('ggplot2')

as.factor(res_total$type)

ggplot(res_total,aes(log2FoldChange,-log10(padj),colour=type))+

geom_point()+

xlim(-10,10)+

ggtitle("NEBNext Ultra II RNA")+

theme(plot.title = element_text(size = 20,hjust = 0.5),

axis.title = element_text(size = 20),

panel.grid.major = element_blank(),

panel.grid.minor = element_blank(),

panel.background = element_blank(),

axis.line = element_line(colour = "black"))+

labs(colour="Differential Expressed Genes(4452)")+

xlab('log2FoldChange')+

ylab('-log10(padj)')+

scale_color_manual(labels=c("Up regulated: 3157", "Down regulated: 1295", "Non_

significant"), values = c('#F8766D','#00BFC4','black'))

dev.off()

7. Gene expression measurement by cuffnorm:

gtf=Reference/human/hg19/transcriptome/hg19_refGene.gtf

mkdir cuffnorm

cuffnorm -p 28 $gtf 3.STAR_out/trace_200ng_1_primary_alignment_only.bam,3.STAR_o

ut/trace_200ng_2_primary_alignment_only.bam,3.STAR_out/trace_200ng_3_primary_ali

gnment_only.bam -o cuffnorm

8. Gene body coverage analysis by QoRTs:

gtf=Reference/human/hg19/transcriptome/hg19_refGene.gtf

for sample in `find 3.STAR_out/ -name "*primary_alignment_only.bam"`

file=`basename $sample`

name=${file%%_primary*}

mkdir -p QoRTs/whole_genebody/genebody_${name}

java -Xmx4G -jar QoRTs.jar QC --maxReadLength 150 --prefilterImproperPairs --

runFunctions writeGeneBody --minMAPQ 50 --generatePlots $sample $gtf

QoRTs/whole_genebody/genebody_${name}/

done

9. Nucleotide-vs-Cycle counts by QoRTs:

gtf=Reference/human/hg19/transcriptome/hg19_refGene.gtf

for sample in `find 3.STAR_out/ -name "*primary_alignment_only.bam"`

file=`basename $sample`

name=${file%%_primary*}

mkdir NVC_${name}

java -Xmx4G -jar QoRTs.jar QC --maxReadLength 150 --prefilterImproperPairs --

runFunctions NVC --minMAPQ 50 --generatePlots $sample $gtf NVC_${name}

done

10. GC content analysis by RSeQC:

for sample in `find 3.STAR_out/ -name "*primary_alignment_only.bam"`

file=`basename $sample`

name=${file%%_primary*}

mkdir -p RSeQC/GC_${name}

read_GC.py -i $sample -o RSeQC/ GC_${name}

done

11. Reads distribution analysis by RSeQC:

bed=Reference/human/hg19/transcriptome/hg19_refseq.bed

for sample in `find 3.STAR_out/ -name "*primary_alignment_only.bam"`

read_distribution.py -i $sample -r $bed

done

12. Calculating median coefficient of variation of gene coverage by Picard Tools:
mkdir CV
REFDATH=Reference/human/hg19
for sample in `find 3.STAR_out/ -name "*primary_alignment_only.bam"`
do
file=`basename $sample`

name=${file%%_primary*}

java -jar picard.jar CollectRnaSeqMetrics I=$sample O=CV/${name}.RNA_Metrics

REF_FLAT=${REFDATH}/hg19_refFlat.txt STRAND=NONE RIBOSOMAL_INTERVALS=

${REFDATH}/hg19.rRNA.interval_list

done

13. Calculating insert size of library by Picard Tools:

mkdir insert_size

for sample in `find 3.STAR_out -name "*primary_alignment_only.bam"`

file=`basename $sample`

name=${file%%_primary*}

java -jar picard.jar CollectInsertSizeMetrics M=0.01 I=$sample O=insert_size/insert_

size_${name}_metrics.txt H=insert_size/insert_size_${name}_histogram.pdf

done

14. Library complexity analysis by Preseq:

mkdir preseq_out

for sample in `find 3.STAR_out -name "*primary_alignment_only.bam"`

file=`basename $sample`

name=${file%%_primary*}

preseq c_curve seg_len 10000000000000000 -o preseq_out/${name}_c_curve.txt -P -B

$sample

done

How to cite：

Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:

Lu, B, Dong, L, Yi, D, Zhang, M, Zhu, C, Li, X and Yi, C(2020). TRACE-seq Data analysis. Bio-protocol Preprint. bio-protocol.org/prep617.
Lu, B., Dong, L., Yi, D., Zhang, M., Zhu, C., Li, X. and Yi, C.(2020). Transposase-assisted tagmentation of RNA/DNA hybrid duplexes. eLife. DOI: 10.7554/eLife.54919