Illumina MiSeq data analysis

François Kroll; Gareth T Powell; Marcus Ghosh; Gaia Gestri; Paride Antinucci; Timothy J Hearn; Hande Tunbak; Sumi Lim; Harvey W Dennis; Joseph M Fernandez; David Whitmore; Elena Dreosti; Stephen W Wilson; Ellen J Hoffman; Jason Rihel

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Illumina MiSeq data analysis

FK François Kroll

GP Gareth T Powell

MG Marcus Ghosh

GG Gaia Gestri

PA Paride Antinucci

TH Timothy J Hearn

HT Hande Tunbak

SL Sumi Lim

HD Harvey W Dennis

JF Joseph M Fernandez

DW David Whitmore

ED Elena Dreosti

SW Stephen W Wilson

EH Ellen J Hoffman

JR Jason Rihel

This method is extracted from research article: eLife, Jan 2021

A simple and effective F0 knockout method for rapid screening of behaviour and other complex phenotypes

DOI: 10.7554/eLife.59683

Request a Protocol

Ask a question

Favorite

Illumina MiSeq data was received as two fastq files for each well, one forward and one reverse. The paired-end reads were aligned to the reference amplicon with the package bwa v0.7.17 and the resulted bam alignment file was sorted and indexed with samtools v1.9 (Li et al., 2009). To keep only high-quality reads, any read shorter than 140 bp, with a Phred quality score below 40, or with more than 20% of its length soft-clipped were discarded from the bam file before analysis. Whenever necessary, bam alignment files were visualised with IGV v2.4.10. The resulting filtered bam file was converted back to a forward and a reverse fastq file using bedtools v2.27.1 (Quinlan and Hall, 2010). The filtered fastq files were used as input to the R package ampliCan (Labun et al., 2019), together with a csv configuration file containing metadata information about the samples. AmpliCan was run with settings min_freq = 0.005 (any mutation at a frequency below this threshold was considered as a sequencing error) and average_quality = 25; other parameters were left as default. AmpliCan detected and quantified mutations in the reads and wrote results files that were used for subsequent analysis. Reads from uninjected or scrambled-injected controls were used to normalise the mutation counts, i.e. any mutation present in the control embryo was not counted as a Cas9-induced mutation in the injected ones. Downstream of ampliCan, any samples with less than 30× paired-end (60× single-read) coverage were excluded from further analysis.

Figure 2A plots the proportion of mutated reads and the proportion of reads with a frameshift mutation at each locus, as computed by ampliCan. If a read contained multiple indels, ampliCan summed them to conclude whether the read had a frameshift mutation or not. This frameshift prediction may be inaccurate in some rare cases where an indel was in an intron or disrupts an exon/intron boundary.

There may be cases where an indel at a downstream locus restored the correct frame, which had been shifted at one of the upstream targets. The model of biallelic knockout by frameshift (Figure 1B) assumes this situation also lead to a knockout.

The following refers to Figure 1E,F and Figure 2D. If a single locus is targeted, the proportion of frameshift alleles ( $p_{s h i f t}$ ) was equal to the proportion of reads with a frameshift mutation, as counted by ampliCan in the MiSeq data. If only the first locus is targeted, the proportion of non-frameshift alleles is equal to the proportion of reads that did not have a frameshift mutation at this locus (either not mutated or total indel was of a length multiple of three), $p_{n o t s h i f t_{1}}$ . If a second locus is targeted, proportion of frameshift alleles so far is $p_{s h i f t_{1 \to 2}} = 1 - p_{s h i f t_{2}} \times p_{n o t s h i f t_{1}}$ , and proportion of non-frameshift alleles so far is $p_{n o t s h i f t_{1 \to 2}} = 1 - p_{s h i f t_{1 \to 2}}$ , and so on. At locus $l$ ;

This was done for each animal individually.

The order of the loci at each gene (locus $1, 2, \dots, l$ in Equation 1) follows the ranking of crRNAs in IDT’s database, i.e. the alphabetical order of the locus names in Figure 2A.

Equation 1 assumes that genotypes at each locus of the allele were randomly assigned, i.e. that finding indel $x$ at the first locus does not make it more or less likely to find indel $y$ at the second locus of the same allele. While mutations at each locus may be independent events initially, some alleles might be disproportionately replicated across cell divisions, therefore it is an approximation. Equation 1 also assumes that reads at each locus were randomly sampled from the pool of alleles.

This refers to Figure 2B. Reads from control larvae were not used in this analysis. From each sample, the top 10 most frequent indels were extracted. Any sample with less than 10 indels in total was discarded for this analysis. Pairwise intersections were then performed between each sample’s top 10, each time counting the number of common indels, defined as having the same start and end positions, the same type (deletion or insertion) and in the case of insertion, the same inserted sequence. Positions are given by ampliCan in relation to the protospacer adjacent motif (PAM) position, which is set at 0. The results were then grouped whether the intersection was performed between two samples from different loci (e.g. slc24a5, locus D, fish 1 vs csnk1db, locus A, fish 4; n = 6286 intersections) or two samples from the same locus but different fish (e.g. slc24a5, locus D, fish one vs slc24a5, locus D, fish 4; n = 155 intersections).

This refers to Figure 2C. Reads from control larvae were not used in this analysis. Only unique mutations in each sample were considered here. Duplicates were defined as any indel from the same sample, at the same positions, of the same length and type (insertion or deletion), and in the case of insertion with the same inserted sequence. This was to control as far as possible for coverage bias, i.e. the mutations from a sample with a particularly high coverage would be over-represented in the counts. Considering only unique mutations approximated the probability of each indel length after a double-strand break repair event. For example, a mutation occurring early, for instance at the two-cell stage, would then be replicated many times across cell divisions. The proportion of such a mutation in the final dataset would be high but would not necessarily reflect how likely this indel length was to occur during the repair of the Cas9-induced double-strand break. The counts of unique mutations from all samples were pooled then tallied by length. The frequencies in Figure 2C are the proportions of unique indels of these lengths in the final dataset. There is likely a modest bias against large indels, as they may be missed by the short MiSeq reads or may disrupt a PCR primer binding site and therefore not be amplified. Conversely, the frequency of the larger deletions may be slightly over-estimated due to PCR length bias (Dabney and Meyer, 2012).

This refers to Figure 2—figure supplement 1. Reads from control larvae were not used in this analysis. Only unique deletions in each sample were considered here. For each indel, ampliCan provides the start and end positions in relation to the PAM position, i.e. the PAM nucleotide adjacent to the gRNA binding site is set at 0. If the gRNA binding site is on the positive strand, negative positions are on the 5'-side of the first PAM nucleotide and positive positions to the 3'-side of the first PAM nucleotide. If the gRNA binding site is on the negative strand, negative positions are on the 3'-side of the first PAM nucleotide and positive positions to the 5'-side of the first PAM nucleotide.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol