RNA-seq expression analysis

Stefan Bohn; Nevan J. Krogan

Last updated: Dec 30, 2020

An abbreviated version of this protocol was published in Science in December 2020:

Genetic interaction mapping informs integrative structure determination of protein complexes

1. Harvest 10 ml of an overnight S. cerevisiae culture in mid-log phase (OD600 = 1.0) and wash with DEPC-H2O.

2. Extract RNA with hot acid phenol as described (Ref. 1).

3. Generate RNA-seq libraries. For the following workflow, the QuantSeq 3' mRNA-Seq Library Prep Kit FWD for Illumina (Lexogen) was used.

4. Sequence the libraries as single-end, 50 base pair reads. Here, an Illumina HiSeq 4000 sequencer was used.

5. Download the S. cerevisiae genome from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/). Here, version sacCer3 was used.

6. Convert sacCer3.2bit to FASTA format with "twoBitToFa" (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/).

./twoBitToFa sacCer3.2bit saccer3.fa

7. Create index files for the yeast genome sacCer3 with bowtie2-build (used by tophat in Steps 9 and 13).

bowtie2-build saccer3.fa index/saccer3

8. Generate a "genes.gtf" file containing annotations for the genome used (sacCer3) with the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables). Here, the following options were chosen: clade: other; genome: S. cerevisiae; assembly: Apr. 2011 (sacCer3); group: genes and predictions; track: SGD; table: sgdGene; region: genome; output: gtf.

9. Build the transcriptome index from "genes.gtf" (Step 8) and the genome index (Step 7) using tophat (https://ccb.jhu.edu/software/tophat/index.shtml).

tophat -p 4 -G annotation/genes.gtf --transcriptome-index=index/saccer3 -o out/out1 --no-novel-juncs index/saccer3 data/wt_0_S46_L004_R1_001.fastq

10. Create index-files for the yeast rRNAs for bowtie2 (*.bt2).

bowtie2-build index/sc-rrna.fa index/saccerrrna

11. Remove the random primer sequence, adapter contamination, and low quality tails (example given from https://www.lexogen.com/quantseq-data-analysis/ for library kit used in step 3).

polyA = 14;

for sample in *.fastq; do cat ${sample} | bbmap/bbduk.sh in=stdin.fq out=${sample}_trimmed_clean ref=bbmap/resources/polyA.fa.gz,bbmap/resources/truseq.fa.gz k=13 ktrim=r forcetrimleft=11 useshortkmers=t mink=5 qtrim=t trimq=10 minlength=20; done

12. Align each fastq file to the rRNA index (index/saccerrrna from Step 10) and save unaligned sequences to new fastq file using bowtie (https://sourceforge.net/projects/bowtie-bio/). This filters out ribosomal RNA reads.

bowtie -v2 -p4 index/saccerrrna data/{sample}.fastq_trimmed_clean --un data/{sample}_trim_saccerrrna-unalign.fastq >/dev/null
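
As a sanity check on the rRNA filtering in Step 12, the fraction of reads removed can be estimated by comparing read counts before and after. The following is a minimal sketch, not part of the original protocol. It assumes standard four-line FASTQ records (no wrapped sequence lines), and the function names count_fastq_reads and rrna_fraction are hypothetical helpers introduced here for illustration.

```shell
# Count reads in a FASTQ file, assuming exactly 4 lines per record.
count_fastq_reads() (
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
)

# Print the percentage of reads that were removed between two FASTQ files.
# Usage: rrna_fraction BEFORE.fastq AFTER.fastq
rrna_fraction() (
    before=$(count_fastq_reads "$1")
    after=$(count_fastq_reads "$2")
    awk -v b="$before" -v a="$after" 'BEGIN {
        if (b == 0) { print "NA"; exit }
        printf "%.1f\n", 100 * (b - a) / b
    }'
)

# Example call with the file names used in Step 12 (replace {sample}):
# rrna_fraction data/{sample}.fastq_trimmed_clean data/{sample}_trim_saccerrrna-unalign.fastq
```

The printed value is the percentage of trimmed reads that aligned to the rRNA index and were therefore discarded.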

13. Create alignments with tophat of the filtered data.

tophat -p 4 --transcriptome-index=index/saccer3 --no-novel-juncs -o out/trim-rrna index/saccer3 data/{sample}_trim_saccerrrna-unalign.fastq

14. Filter the data based on their quality using MAPQ filtering.

for fff in out/trim-rrna/*.bam; do echo "Running on this file: $fff"; samtools view -bq 50 $fff > $fff.mapq50.bam; done
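
To make explicit what the MAPQ filter in Step 14 keeps, the same rule can be sketched on uncompressed SAM text: header lines (starting with "@") pass through, and an alignment is kept only if its fifth field (MAPQ) is at least the threshold. This is an illustration under that assumption, not a replacement for samtools. The function name filter_sam_by_mapq is hypothetical.

```shell
# Keep SAM header lines and alignments whose MAPQ (field 5) is >= MIN.
# Usage: filter_sam_by_mapq MIN < in.sam > out.sam
filter_sam_by_mapq() (
    min="$1"
    awk -F'\t' -v min="$min" '
        /^@/ { print; next }            # header lines are always kept
        $5 + 0 >= min + 0 { print }     # alignment lines: compare MAPQ numerically
    '
)

# Example call (assumes a SAM file produced with "samtools view -h"):
# filter_sam_by_mapq 50 < accepted_hits.sam > accepted_hits.mapq50.sam
```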

15. Create index files for each .bam file from Step 14 using samtools (http://www.htslib.org/).

samtools index out/trim-rrna/{sample}_accepted_hits.bam

16. Extract counts for each sample using htseq-count (https://htseq.readthedocs.io/en/release_0.11.1/install.html).

htseq-count -f bam out/trim-rrna/{sample}_accepted_hits.bam genes.gtf > out/trim-rrna/{sample}_accepted_hits_count.txt
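
The {sample} placeholder in Steps 15 and 16 can be expanded with a loop over sample names. The sketch below is one possible way to do this and is not part of the original protocol. It reuses the paths from the commands above. The function name count_samples and the DRY_RUN switch are hypothetical additions: with DRY_RUN=1 the commands are only printed, so the expansion can be checked before anything is run.

```shell
# Run Steps 15 and 16 for every sample name given as an argument.
# With DRY_RUN=1 in the environment, the commands are printed instead of executed.
count_samples() (
    for sample in "$@"; do
        bam="out/trim-rrna/${sample}_accepted_hits.bam"
        counts="${bam%.bam}_count.txt"   # .../{sample}_accepted_hits_count.txt
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf '%s\n' "samtools index $bam"
            printf '%s\n' "htseq-count -f bam $bam genes.gtf > $counts"
        else
            samtools index "$bam" || exit 1
            htseq-count -f bam "$bam" genes.gtf > "$counts" || exit 1
        fi
    done
)

# Example call (sample names are placeholders):
# DRY_RUN=1 count_samples wt_rep1 wt_rep2 wt_rep3
```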

17. Calculate counts-based expression values using R (https://cran.r-project.org/bin/windows/base/), RStudio (https://rstudio.com/products/rstudio/download/#download), RTools (https://cran.r-project.org/bin/windows/Rtools/), BiocManager (https://bioconductor.org/install/) and DESeq2 (http://bioconductor.org/packages/release/bioc/html/DESeq2.html).

18. Generate combined "count" files that contain, as columns, the counts of each replicate of a given sample and of the reference sample (here "wildtype") in a tab-delimited text document. The first row contains the sample names and the first column designates the target gene region. The data matrix has the size (number of gene regions) x (number of samples).

region_name wt-rep#1 wt-rep#2 wt-rep#3 sample1-rep1 sample1-rep2 sample1-rep3

YAL069W 1 0 2 0 0 0

[…] […] […] […] […] […] […]
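
One way to assemble the combined count file of Step 18 from the per-sample htseq-count outputs of Step 16 is sketched below. It is a minimal sketch, not the authors' method. It assumes that every input file lists the same gene regions in the same order (which should hold when all samples were counted against the same genes.gtf) and that htseq-count summary rows start with "__" (for example "__no_feature"). The function name merge_counts and the LABEL=FILE argument convention are hypothetical.

```shell
# Merge two-column htseq-count files into one tab-delimited count table.
# Usage: merge_counts OUT.txt LABEL1=FILE1 LABEL2=FILE2 ...
merge_counts() (
    out="$1"; shift
    tab=$(printf '\t')
    header="region_name"
    n=$#
    i=0
    while [ "$i" -lt "$n" ]; do
        spec="$1"; shift
        header="${header}${tab}${spec%%=*}"   # text before the first "=" is the column label
        set -- "$@" "${spec#*=}"              # re-append the file path; after n rounds "$@" holds only paths
        i=$((i + 1))
    done
    {
        printf '%s\n' "$header"
        # paste puts the files side by side: gene, count, gene, count, ...
        paste "$@" | awk -F'\t' -v OFS='\t' '
            $1 ~ /^__/ { next }               # skip htseq-count summary rows
            {
                line = $1
                for (i = 3; i <= NF; i += 2)
                    if ($i != $1) {
                        print "gene order differs at input line " NR > "/dev/stderr"
                        exit 1
                    }
                for (i = 2; i <= NF; i += 2) line = line OFS $i
                print line
            }'
    } > "$out"
)

# Example call (file names are placeholders):
# merge_counts combined_counts_wt-sample1.txt "wt-rep#1=wt1_count.txt" "sample1-rep1=s1_count.txt"
```

The column labels given on the command line end up in the first row, so they can be chosen to match the sample_name entries of the table file.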

19. Generate a "table" file for each combined "count" file that annotates each data column. The entries in the column sample_name should match the sample names in the first row of the "count" file from Step 18. The column condition allows DESeq2 to correctly identify replicates (Step 21).

sample_name condition

wt-rep#1 WT

wt-rep#2 WT

wt-rep#3 WT

sample1-rep1 sample1

sample1-rep2 sample1

sample1-rep3 sample1
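
Because the sample names in the table file must match the column names of the count file, a quick consistency check before loading both into R can save debugging time. The following sketch assumes both files are tab-delimited, contain no blank lines, and list the samples in the same order. The function name check_sample_names is hypothetical.

```shell
# Compare the count-file header (minus the first field) with the
# sample_name column of the table file (minus its header row).
# Usage: check_sample_names COUNT_FILE TABLE_FILE
check_sample_names() (
    counts="$1"; table="$2"
    from_counts=$(head -n 1 "$counts" | cut -f 2- | tr '\t' '\n')
    from_table=$(tail -n +2 "$table" | cut -f 1)
    if [ "$from_counts" = "$from_table" ]; then
        echo "OK: sample names match"
    else
        echo "Mismatch between count-file columns and sample_name entries" >&2
        echo "count file: $(printf '%s\n' "$from_counts" | tr '\n' ' ')" >&2
        echo "table file: $(printf '%s\n' "$from_table" | tr '\n' ' ')" >&2
        exit 1
    fi
)

# Example call with the file names used in Step 20:
# check_sample_names combined_counts_wt-sample1.txt table_wt-sample1.txt
```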

20. In R, load the DESeq2 library, the combined "count" file from Step 18 and the "table" file from Step 19.

library(DESeq2)

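# check.names=FALSE keeps column names such as "wt-rep#1" unchanged so that they match sample_name
count_table <- read.delim("combined_counts_wt-sample1.txt", sep="\t", header=TRUE, row.names="region_name", check.names=FALSE)

sample_table <- read.delim("table_wt-sample1.txt", sep="\t", header=TRUE, row.names="sample_name")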

21. Run the DESeq2 analysis, plot the data and write the RNA-seq expression values to file.

dds <- DESeqDataSetFromMatrix(countData = count_table, colData = sample_table, design = ~ condition)

dds <- DESeq(dds)

res <- results(dds)

resOrdered <- res[order(res$padj),]

plot <- plotMA(res, main = "mutant", ylim = c(-2,2), xlab = "mean count")

write.table(as.data.frame(resOrdered), sep="\t", quote=FALSE, file="out/wt_sample1_p-values.txt")
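
The results file written in Step 21 can be screened from the command line. The sketch below is an illustration only. It assumes the file has the columns log2FoldChange and padj produced by DESeq2, and that write.table() left the row-name column without a header entry, so header field i describes data field i + 1. The cut-offs (adjusted p-value below 0.05, absolute log2 fold change of at least 1) are arbitrary example values, not thresholds from this protocol, and the function name significant_genes is hypothetical.

```shell
# Print gene, log2FoldChange and padj for rows passing both cut-offs.
# Usage: significant_genes RESULTS_FILE [ALPHA] [MIN_ABS_LOG2FC]
significant_genes() (
    file="$1"; alpha="${2:-0.05}"; min_lfc="${3:-1}"
    awk -F'\t' -v OFS='\t' -v alpha="$alpha" -v min_lfc="$min_lfc" '
        NR == 1 {
            # The header has no entry for the row-name column: shift by one.
            for (i = 1; i <= NF; i++) col[$i] = i + 1
            next
        }
        {
            lfc = $(col["log2FoldChange"]); padj = $(col["padj"])
            if (lfc == "NA" || padj == "NA") next      # DESeq2 writes NA for filtered genes
            abs_lfc = (lfc + 0 < 0) ? -(lfc + 0) : lfc + 0
            if (padj + 0 < alpha + 0 && abs_lfc >= min_lfc + 0) print $1, lfc, padj
        }' "$file"
)

# Example call with the output file of Step 21:
# significant_genes out/wt_sample1_p-values.txt 0.05 1
```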

References

1. M. A. Collart, S. Oliviero, Preparation of yeast RNA. Curr. Protoc. Mol. Biol. Chapter 13, Unit 13.12 (2001).

How to cite:

Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:

Bohn, S. and Krogan, N. J. (2020). RNA-seq expression analysis. Bio-protocol Preprint. bio-protocol.org/prep727.
Braberg, H., Echeverria, I., Bohn, S., Cimermancic, P., Shiver, A., Alexander, R., Xu, J., Shales, M., Dronamraju, R., Jiang, S., Dwivedi, G., Bogdanoff, D., Chaung, K. K., Hüttenhain, R., Wang, S., Mavor, D., Pellarin, R., Schneidman, D., Bader, J. S., Fraser, J. S., Morris, J., Haber, J. E., Strahl, B. D., Gross, C. A., Dai, J., Boeke, J. D., Sali, A. and Krogan, N. J. (2020). Genetic interaction mapping informs integrative structure determination of protein complexes. Science 370(6522). DOI: 10.1126/science.aaz4910