Illumina MiSeq data analysis

FK François Kroll
GP Gareth T Powell
MG Marcus Ghosh
GG Gaia Gestri
PA Paride Antinucci
TH Timothy J Hearn
HT Hande Tunbak
SL Sumi Lim
HD Harvey W Dennis
JF Joseph M Fernandez
DW David Whitmore
ED Elena Dreosti
SW Stephen W Wilson
EH Ellen J Hoffman
JR Jason Rihel
request Request a Protocol
ask Ask a question
Favorite

Illumina MiSeq data was received as two fastq files for each well, one forward and one reverse. The paired-end reads were aligned to the reference amplicon with the package bwa v0.7.17 and the resulted bam alignment file was sorted and indexed with samtools v1.9 (Li et al., 2009). To keep only high-quality reads, any read shorter than 140 bp, with a Phred quality score below 40, or with more than 20% of its length soft-clipped were discarded from the bam file before analysis. Whenever necessary, bam alignment files were visualised with IGV v2.4.10. The resulting filtered bam file was converted back to a forward and a reverse fastq file using bedtools v2.27.1 (Quinlan and Hall, 2010). The filtered fastq files were used as input to the R package ampliCan (Labun et al., 2019), together with a csv configuration file containing metadata information about the samples. AmpliCan was run with settings min_freq = 0.005 (any mutation at a frequency below this threshold was considered as a sequencing error) and average_quality = 25; other parameters were left as default. AmpliCan detected and quantified mutations in the reads and wrote results files that were used for subsequent analysis. Reads from uninjected or scrambled-injected controls were used to normalise the mutation counts, i.e. any mutation present in the control embryo was not counted as a Cas9-induced mutation in the injected ones. Downstream of ampliCan, any samples with less than 30× paired-end (60× single-read) coverage were excluded from further analysis.

Figure 2A plots the proportion of mutated reads and the proportion of reads with a frameshift mutation at each locus, as computed by ampliCan. If a read contained multiple indels, ampliCan summed them to conclude whether the read had a frameshift mutation or not. This frameshift prediction may be inaccurate in some rare cases where an indel was in an intron or disrupts an exon/intron boundary.

There may be cases where an indel at a downstream locus restored the correct frame, which had been shifted at one of the upstream targets. The model of biallelic knockout by frameshift (Figure 1B) assumes this situation also lead to a knockout.

The following refers to Figure 1E,F and Figure 2D. If a single locus is targeted, the proportion of frameshift alleles (pshift) was equal to the proportion of reads with a frameshift mutation, as counted by ampliCan in the MiSeq data. If only the first locus is targeted, the proportion of non-frameshift alleles is equal to the proportion of reads that did not have a frameshift mutation at this locus (either not mutated or total indel was of a length multiple of three), pnotshift1. If a second locus is targeted, proportion of frameshift alleles so far is pshift12=1pshift2×pnotshift1, and proportion of non-frameshift alleles so far is pnotshift12=1 pshift12, and so on. At locus l;

This was done for each animal individually.

The order of the loci at each gene (locus 1, 2, , l in Equation 1) follows the ranking of crRNAs in IDT’s database, i.e. the alphabetical order of the locus names in Figure 2A.

Equation 1 assumes that genotypes at each locus of the allele were randomly assigned, i.e. that finding indel x at the first locus does not make it more or less likely to find indel y at the second locus of the same allele. While mutations at each locus may be independent events initially, some alleles might be disproportionately replicated across cell divisions, therefore it is an approximation. Equation 1 also assumes that reads at each locus were randomly sampled from the pool of alleles.

This refers to Figure 2B. Reads from control larvae were not used in this analysis. From each sample, the top 10 most frequent indels were extracted. Any sample with less than 10 indels in total was discarded for this analysis. Pairwise intersections were then performed between each sample’s top 10, each time counting the number of common indels, defined as having the same start and end positions, the same type (deletion or insertion) and in the case of insertion, the same inserted sequence. Positions are given by ampliCan in relation to the protospacer adjacent motif (PAM) position, which is set at 0. The results were then grouped whether the intersection was performed between two samples from different loci (e.g. slc24a5, locus D, fish 1 vs csnk1db, locus A, fish 4; n = 6286 intersections) or two samples from the same locus but different fish (e.g. slc24a5, locus D, fish one vs slc24a5, locus D, fish 4; n = 155 intersections).

This refers to Figure 2C. Reads from control larvae were not used in this analysis. Only unique mutations in each sample were considered here. Duplicates were defined as any indel from the same sample, at the same positions, of the same length and type (insertion or deletion), and in the case of insertion with the same inserted sequence. This was to control as far as possible for coverage bias, i.e. the mutations from a sample with a particularly high coverage would be over-represented in the counts. Considering only unique mutations approximated the probability of each indel length after a double-strand break repair event. For example, a mutation occurring early, for instance at the two-cell stage, would then be replicated many times across cell divisions. The proportion of such a mutation in the final dataset would be high but would not necessarily reflect how likely this indel length was to occur during the repair of the Cas9-induced double-strand break. The counts of unique mutations from all samples were pooled then tallied by length. The frequencies in Figure 2C are the proportions of unique indels of these lengths in the final dataset. There is likely a modest bias against large indels, as they may be missed by the short MiSeq reads or may disrupt a PCR primer binding site and therefore not be amplified. Conversely, the frequency of the larger deletions may be slightly over-estimated due to PCR length bias (Dabney and Meyer, 2012).

This refers to Figure 2—figure supplement 1. Reads from control larvae were not used in this analysis. Only unique deletions in each sample were considered here. For each indel, ampliCan provides the start and end positions in relation to the PAM position, i.e. the PAM nucleotide adjacent to the gRNA binding site is set at 0. If the gRNA binding site is on the positive strand, negative positions are on the 5'-side of the first PAM nucleotide and positive positions to the 3'-side of the first PAM nucleotide. If the gRNA binding site is on the negative strand, negative positions are on the 3'-side of the first PAM nucleotide and positive positions to the 5'-side of the first PAM nucleotide.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A