Computational analysis

Chadi A El Farran


Last updated: Mar 22, 2021

An abbreviated version of this protocol was published in Science Advances in Sep, 2020

Diversification of reprogramming trajectories revealed by parallel single-cell transcriptome and chromatin accessibility sequencing


1- Obtaining proper GTF for UCSC assemblies

### Use this if you want to obtain a proper GTF for a UCSC assembly

# 1. Download the program "genePredToGtf" from UCSC and
# place the correct version of the executable somewhere in your PATH.

# 2. Create the following file in your home directory:

echo 'db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password' > ~/.hg.conf

# The file's permissions must be user-only:

chmod 0600 ~/.hg.conf

# 3. Run "genePredToGtf" with any organism and any table that is in "genePred" format, for example:

genePredToGtf mm9 refFlat mm9_refFlat.gtf

genePredToGtf hg19 refFlat hg19_refFlat.gtf

### I now highly recommend using GenCode assemblies instead, because their GTFs are better suited to downstream analyses and can be downloaded ready-made.
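Whichever source the GTF comes from, a quick sanity check before building an index can save time. The sketch below is not part of the original protocol: `gtf_summary` is a hypothetical helper name, and it only assumes the standard GTF layout (9 tab-separated columns, feature type in column 3, attributes such as gene_id in column 9).

```shell
# Summarise a GTF file: records per feature type (column 3), plus the number
# of records whose attribute column (column 9) has no gene_id tag.
# Hypothetical helper, not part of genePredToGtf or any other tool.
# Usage: gtf_summary /Path/To/Annotations.gtf
gtf_summary() {
    awk -F '\t' '
        /^#/ || NF == 0 { next }             # skip comment and empty lines
        NF < 9 { malformed++; next }         # GTF records have 9 tab-separated columns
        { feature[$3]++ }                    # count each feature type
        $9 !~ /gene_id "[^"]+"/ { missing++ }
        END {
            for (f in feature) printf "%s\t%d\n", f, feature[f]
            printf "records_without_gene_id\t%d\n", missing + 0
            printf "malformed_lines\t%d\n", malformed + 0
        }
    ' "$1" | sort
}
```

For example, `gtf_summary mm9_refFlat.gtf` prints one line per feature type followed by the two diagnostic counts. A non-zero value in either diagnostic line suggests the file should be inspected before it is used with STAR or cuffquant.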

2- Creating a genome index for STAR mapper

Genome indices are used as references for mapping fastq files. Genome indices were created using STAR (Dobin & Gingeras 2015).

For creation of the genome indices, the whole genome sequence of mouse was downloaded in FASTA format (Appendix 3) from the UCSC Genome Database (http://hgdownload.soe.ucsc.edu/downloads.html). Additionally, the gene annotation file for the same genome was downloaded in Gene Transfer Format (GTF)¹. For mouse, the genome assembly mm9 was used.

The following script was used for the creation of the indices:

{/Path/To/STAR --runThreadN 10 --runMode genomeGenerate --genomeDir /Path/To/Directory/ --genomeFastaFiles /Path/To/Genome/FASTA/file --sjdbGTFfile /Path/To/Genome/GTF/file}

/Path/To/STAR indicated the location of the STAR executable.

--runThreadN 10 indicated that 10 CPU threads were used to create the genome indices.

--runMode genomeGenerate denoted that STAR was run to create genome index files.

--genomeDir /Path/To/Directory/ indicated the directory in which the created genome index files were stored.

--genomeFastaFiles /Path/To/Genome/FASTA/file indicated the path to the genome FASTA file.

--sjdbGTFfile /Path/To/Genome/GTF/file indicated the path to the genome GTF file. For libraries of human origin, the hg19 assembly and its GTF file were used².

¹ See section 1 (Obtaining proper GTF for UCSC assemblies) for how to obtain a good-quality GTF file.

² More recently I have been using the human GenCode assembly, which provides both the FASTA files and a ready-made GTF (https://www.gencodegenes.org/human/). Their assembly is more comprehensive than hg19. If you use their GTF, you should also use their FASTA file.

### If you are creating an index for a very small genome (for example, when mapping to an exogenous sequence), you will need to add the following option:

--genomeSAindexNbases, set to a small number. Use the formula ~log2(GenomeLength)/2-1 as a starting point, then try values near the result until the genome generation script no longer crashes.
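The formula can be evaluated directly from the FASTA file. The following is a minimal sketch, not the author's script: `suggest_sa_index_nbases` is a hypothetical helper, the result is rounded down, and the cap at 14 is an addition that reflects STAR's default value for this option rather than anything stated above.

```shell
# Suggest a starting value for --genomeSAindexNbases from a FASTA file,
# using log2(GenomeLength)/2 - 1, rounded down and capped at 14
# (14 is STAR's default; the cap is an assumption added for this sketch).
# Usage: suggest_sa_index_nbases /Path/To/Genome/FASTA/file
suggest_sa_index_nbases() {
    awk '
        /^>/ { next }                                # skip sequence headers
        { gsub(/[ \t\r]/, ""); len += length($0) }   # add up bases on sequence lines
        END {
            if (len == 0) exit 1                     # no sequence found
            n = int(log(len) / log(2) / 2 - 1)       # log2 via natural logs
            if (n > 14) n = 14
            if (n < 1) n = 1
            print n
        }
    ' "$1"
}
```

For a 1,000-base exogenous sequence the formula gives log2(1000)/2 - 1 ≈ 3.98, so the function prints 3. Treat the value only as a starting point and adjust it up or down as described above if genome generation still fails.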

### You only need to do this once. The same genome index can be used for mapping as many libraries as you wish.

3- Quality Control of the fastq files

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used to assess the quality of the NGS libraries. The raw NGS library files (.fastq extension) were loaded into the graphical user interface (GUI) version of the software.

The libraries which satisfied the following set of criteria were considered of good quality.

a) The depth of the library should be more than 20 million reads.

b) The average quality score of most of the reads should be above 35.

c) The library should be free of adapter contamination (adapter level below ~0.1%).

d) The libraries should not show any bias in the prevalence of any base at a particular position within the sequenced reads (e.g. the percentage of A at the first nucleotide of every sequenced read should equal the percentage of A at the 10th nucleotide of every sequenced read).

e) The GC content of the sequenced library should be ~45% - 55%.

### For point d), trimming from the 5’ end usually removes such biases, which can arise from adaptor contamination or from the first few (about 10) bases generally being of lower quality. For trimming reads, refer to the trimming document or to the STAR mapping section (section 4).

### If the mapping percentage is still low despite the library passing all these quality controls, suspect mycoplasma contamination. Map the library to a mycoplasma genome and check the mapping percentage.
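Criteria a) and e) can also be checked from the command line when the GUI is not convenient. This is a rough sketch rather than a replacement for FastQC: `fastq_stats` is a hypothetical helper, and it assumes uncompressed FASTQ files with standard 4-line records.

```shell
# Report the read count and GC content of an uncompressed FASTQ file.
# Assumes standard 4-line FASTQ records (sequences not wrapped over several lines).
# Hypothetical helper, not part of FastQC.
# Usage: fastq_stats /Path/To/R1.fastq
fastq_stats() {
    awk '
        NR % 4 == 2 {                        # the sequence is line 2 of each record
            reads++
            seq = toupper($0)
            bases += length(seq)
            gc += gsub(/[GC]/, "", seq)      # gsub returns how many G/C bases it removed
        }
        END {
            if (bases == 0) exit 1           # empty or malformed input
            printf "reads\t%d\n", reads
            printf "gc_percent\t%.1f\n", 100 * gc / bases
        }
    ' "$1"
}
```

A library passing criteria a) and e) would report more than 20 million reads and a gc_percent between roughly 45 and 55. Per-position base bias, quality scores and adapter content (criteria b to d) still need FastQC.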

4- Mapping RNA-Seq libraries using STAR (Fluidigm Libraries)

Raw RNA-Seq files were mapped to the genome index files (refer to section 2, Creating a genome index for STAR mapper) using STAR. The following script was used for mapping the RNA-Seq libraries.

{/Path/To/STAR --runThreadN 10 --genomeDir /Path/To/Directory --readFilesIn /Path/To/R1.fastq /Path/To/R2.fastq --outFileNamePrefix /Path/To/Mapped/File --outFilterMismatchNmax 3 --outFilterMultimapNmax 500 --outSAMtype BAM SortedByCoordinate}

/Path/To/STAR denoted the location of the STAR executable.

--runThreadN 10 indicated that 10 CPU threads were used for mapping.

--genomeDir /Path/To/Directory/ indicated the directory in which the genome index files were located.

--readFilesIn /Path/To/R1.fastq /Path/To/R2.fastq denoted the paths to the paired-end fastq files. (For single-end libraries, give the path to one file only.)

--outFileNamePrefix /Path/To/Mapped/File was used to choose the location and name prefix for saving the mapped files.

--outFilterMismatchNmax 3 was used to keep only alignments with at most 3 mismatches to the genome. Alignments with more than 3 mismatches were filtered out.

--outFilterMultimapNmax 500 was used to retain reads that map to up to 500 loci in the genome; reads mapping to more loci are treated as unmapped. This parameter was set high to retain reads that mapped to more than one locus in the genome. (If you are not interested in repeat elements, lower this number to 2.)

--outSAMtype BAM SortedByCoordinate was used to make the output a coordinate-sorted BAM file (Appendix 3 in my thesis).

### STAR can also trim reads if the 5’ or 3’ ends are of low quality or heavily contaminated with adaptors. Add the option --clip5pNbases or --clip3pNbases, followed by the number of bases to clip, chosen according to the position of the low-quality bases.

### I prefer to set --outFilterScoreMinOverLread and --outFilterMatchNminOverLread to 0.25. This improves the mapping percentages, especially for single-cell libraries.
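To compare mapping percentages across runs (for example before and after changing the options above), the values can be pulled out of the Log.final.out file that STAR writes next to the mapped files. This is a sketch under assumptions: `star_mapping_summary` is a hypothetical helper, and the two label strings it searches for are the ones found in the STAR logs I am familiar with, so check them against your own log if nothing is printed.

```shell
# Extract unique and multi-mapping percentages from a STAR Log.final.out file.
# Assumes the log contains the lines "Uniquely mapped reads % |" and
# "% of reads mapped to multiple loci |"; hypothetical helper, not part of STAR.
# Usage: star_mapping_summary /Path/To/Mapped/FileLog.final.out
star_mapping_summary() {
    awk -F '|' '
        /Uniquely mapped reads %/            { gsub(/[ \t%]/, "", $2); unique = $2 }
        /% of reads mapped to multiple loci/ { gsub(/[ \t%]/, "", $2); multi = $2 }
        END {
            if (unique == "") exit 1         # expected label not found
            printf "unique_percent\t%s\n", unique
            printf "multi_percent\t%s\n", multi
            printf "total_percent\t%.2f\n", unique + multi
        }
    ' "$1"
}
```

With --outFileNamePrefix /Path/To/Mapped/File the log is written to /Path/To/Mapped/FileLog.final.out. The total here counts only unique plus multi-mapped reads; reads assigned to too many loci are not included.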

5- Gene Counting using Cuffquant (Fluidigm Libraries)

Gene counting can be performed using the cuffquant program of the Cufflinks software suite (Trapnell et al. 2010).

{cuffquant -o /Path/to/Output/ -p 10 -b /Path/To/Genome.fasta -u /Path/To/Gene/Annotations.gtf /Path/to/Mapped/RNA-Seq.bam}

cuffquant called the program.

-o /Path/To/Output/ informed the program of the location to save the output file (.cxb file).

-p 10 ran the program using 10 CPU threads.

-b /Path/To/Genome.fasta informed the software where the FASTA file of the genome was located. This helped the software to perform a more accurate estimation of transcripts in each library sample. Make sure it is the same FASTA file used to create the genome index for mapping the libraries. I now highly recommend GenCode assemblies.

-u informed the software to run an initial estimation step so that reads mapping to more than one locus are weighted more accurately between those loci.

/Path/To/Gene/Annotations.gtf indicated the location in which the gene annotation file was stored. The same GTF file was used during the creation of the genome index files.

/Path/to/Mapped/RNA-Seq.bam indicated the location of the mapped file.

#### Add --library-type fr-firststrand before the path to the GTF file if the libraries were prepared using a strand-specific library preparation kit. (Note: if your libraries were prepared using an unstranded library preparation kit, omit this option. Fluidigm single-cell RNA-Seq libraries are unstranded, hence you do not need this option for them.)

### I highly recommend using the GenCode assemblies. They were shown to give more accurate analyses.
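Single-cell experiments usually involve many BAM files, so it can help to generate the cuffquant commands rather than type them. The sketch below only prints the commands and runs nothing. `print_cuffquant_cmds` is a hypothetical helper, and it assumes that the BAM files carry STAR's Aligned.sortedByCoord.out.bam suffix, that each sample gets its own output directory, and that paths contain no spaces.

```shell
# Print one cuffquant command per STAR BAM file found in a directory.
# Nothing is executed: review the printed lines, then pipe them to sh to run them.
# Hypothetical helper; assumes paths without spaces.
# Usage: print_cuffquant_cmds BAM_DIR OUT_DIR GENOME_FASTA GTF
print_cuffquant_cmds() {
    bam_dir=$1; out_dir=$2; fasta=$3; gtf=$4
    for bam in "$bam_dir"/*Aligned.sortedByCoord.out.bam; do
        [ -e "$bam" ] || continue            # no match: the glob stays literal
        # Strip the STAR suffix to recover the sample prefix
        sample=$(basename "$bam" Aligned.sortedByCoord.out.bam)
        echo "cuffquant -o $out_dir/$sample -p 10 -b $fasta -u $gtf $bam"
    done
}
```

For example, `print_cuffquant_cmds /Path/To/Mapped /Path/to/Output /Path/To/Genome.fasta /Path/To/Gene/Annotations.gtf` lists the commands, and appending `| sh` runs them one after another. Writing each sample to its own directory keeps the .cxb outputs from overwriting each other.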

6- Obtaining normalized values using cuffnorm (Fluidigm Libraries)

An FPKM-normalized matrix can be obtained using the cuffnorm program of the Cufflinks software suite (Trapnell et al. 2010).

{cuffnorm -o /Path/To/Output/Directory/ -p 15 --library-norm-method classic-fpkm -L label1,label2...,label(n) /Path/To/Gene/Annotations.gtf /Path/To/Cuffquant(1).cxb /Path/To/Cuffquant(2).cxb ... /Path/To/Cuffquant(n).cxb}

cuffnorm called the program.

-o /Path/To/Output/Directory/ informed the program of the location to save the output files.

-p 15 ran the program using 15 CPU threads.

-L label1,label2...,label(n) indicated the label given to each library sample. Each cxb output of cuffquant should be given one label.

/Path/To/Gene/Annotations.gtf indicated the location in which the gene annotation file was stored. The same GTF file was used during the creation of the genome index files.

/Path/To/Cuffquant(1).cxb /Path/To/Cuffquant(2).cxb ... /Path/To/Cuffquant(n).cxb indicated the locations of the cuffquant outputs belonging to each library. They should be placed in the same order in which the labels were given.

The normalized matrix can be found in the genes.fpkm_table file in the output directory.

This normalized matrix can be used for clustering the libraries using Seurat.

### I highly recommend using the GenCode assemblies. They were shown to give more accurate analyses.
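Because the labels and the .cxb files must be listed in the same order, building both lists in a single loop avoids mismatches when there are many libraries. This is a sketch, not the author's script: `print_cuffnorm_cmd` is a hypothetical helper, and it assumes cuffquant was run with one output directory per sample (each containing abundances.cxb), that the directory name is a suitable label, and that paths contain no spaces.

```shell
# Print a cuffnorm command whose -L labels follow the same order as the .cxb files.
# Assumes the layout CUFFQUANT_DIR/<sample>/abundances.cxb; the directory name
# becomes the label. Hypothetical helper; nothing is executed.
# Usage: print_cuffnorm_cmd CUFFQUANT_DIR NORM_OUT_DIR GTF
print_cuffnorm_cmd() {
    quant_dir=$1; norm_dir=$2; gtf=$3
    labels=""; files=""
    for cxb in "$quant_dir"/*/abundances.cxb; do
        [ -e "$cxb" ] || continue            # no match: the glob stays literal
        label=$(basename "$(dirname "$cxb")")
        labels="${labels:+$labels,}$label"   # comma-separated, as -L expects
        files="${files:+$files }$cxb"        # space-separated, same order as the labels
    done
    [ -n "$labels" ] || return 1             # no cuffquant outputs found
    echo "cuffnorm -o $norm_dir -p 15 --library-norm-method classic-fpkm -L $labels $gtf $files"
}
```

The printed line can be inspected and then run by piping it to `sh`. Since each label and its file are appended in the same loop iteration, label i always corresponds to file i.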

7- Using Seurat for clustering the libraries using the FPKM table (Fluidigm Libraries)

8- Using RCA for clustering the libraries using the FPKM table (Fluidigm Libraries)

Related files

1- Obtaining proper GTF for UCSC assemblies.doc

2- Creating a genome index for STAR mapper.doc

3- Quality Control of the fastq files.doc

4- Mapping RNA-Seq libraries using STAR (Fluidigm Libraries).doc

5- Gene Counting using Cuffquant (Fluidigm Libraries).doc

6- Obtaining normalized values using cuffnorm (Fluidigm Libraries).doc

7- Using Seurat for clustering the libraries using the FPKM table (Fluidigm Libraries).doc

8- Using RCA for clustering the libraries using the FPKM table (Fluidigm Libraries).doc

How to cite:

Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:

El Farran, C. (2021). Computational analysis. Bio-protocol Preprint. bio-protocol.org/prep957.
Xing, Q. R., Farran, C. A. E., Gautam, P., Chuah, Y. S., Warrier, T., Toh, C. X. D., Kang, N. Y., Sugii, S., Chang, Y. T., Xu, J., Collins, J. J., Daley, G. Q., Li, H., Zhang, L. F. and Loh, Y. H. (2020). Diversification of reprogramming trajectories revealed by parallel single-cell transcriptome and chromatin accessibility sequencing. Science Advances 6(37). DOI: 10.1126/sciadv.aba1190