SARS-CoV-2 variant analysis

Isabelle Q. Phan
Sandhya Subramanian
David Kim
Michael Murphy
Deleah Pettie
Lauren Carter
Ivan Anishchenko
Lynn K. Barrett
Justin Craig
Logan Tillery
Roger Shek
Whitney E. Harrington
David M. Koelle
Anna Wald
David Veesler
Neil King
Jim Boonyaratanakornkit
Nina Isoherranen
Alexander L. Greninger
Keith R. Jerome
Helen Chu
Bart Staker
Lance Stewart
Peter J. Myler
Wesley C. Van Voorhis

A total of 7,948 high-coverage sequences from human isolates were downloaded from GISAID on 04/20/2020, representing 6,797 unique DNA sequences. These were annotated as ‘sequences with < 1% Ns and < 0.05% unique amino acid mutations (not seen in other sequences in the database) and no insertion/deletion unless verified by the submitter’. The unique sequences were aligned to the reference genome with MAFFT experimental version 7.463 using options ‘--auto --addfragments’ [24,25]. The misaligned 3.2 kb fragment EPI_ISL_426413 was identified as Hepatitis B virus isolate JRC-HB01 by MEGABLAST against the NCBI nr database and was removed from the alignment. Further, a total of 40 inserts in the alignment that introduced gaps in the reference sequence were deleted as likely sequencing errors. The 11 largest inserts ranged from 99 to 155 bases, and the next largest were 3 bases long. The genome submitter confirmed that the large inserts were likely assembly artefacts that will be corrected. Of the smaller inserts, a three-base ‘TTT’ in-frame insertion was observed at nsp6 position 298 in 17 sequences and is presumably real, although it lies in a repetitive region of 8 consecutive Ts and is therefore at the limit of accurate PCR amplification of mono-thymidine repeats [26]. Gaps in aligned sequences were likewise treated as likely sequencing errors and replaced with the unknown base to keep translation in frame with respect to the reference sequence, taking into account the ribosomal frameshift in the nsp12 coding sequence. This process left just 95 premature terminations in over 160,000 protein sequences; 43 of these were found consistently after amino acid 125 of nsp10 and likely represent true truncations. The others were distributed across the sequences of 11 protein families and are likely erroneous.
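The insert removal and gap replacement described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `clean_alignment` and the use of plain strings for aligned sequences are assumptions made for illustration, and the sketch does not handle the nsp12 ribosomal frameshift or the translation step.

```python
def clean_alignment(reference, aligned):
    """Clean a nucleotide alignment relative to an aligned reference.

    Hypothetical helper, for illustration only:
    - columns where the reference has a gap ('-') are treated as inserts
      and dropped from every sequence;
    - gaps remaining in the other sequences are replaced with the unknown
      base 'N' so that every sequence keeps the reference reading frame.
    """
    # Indices of columns that exist in the reference (no gap in the reference).
    keep = [i for i, base in enumerate(reference) if base != "-"]
    cleaned_reference = "".join(reference[i] for i in keep)

    cleaned = []
    for seq in aligned:
        if len(seq) != len(reference):
            raise ValueError("all aligned sequences must match the reference length")
        cleaned.append("".join("N" if seq[i] == "-" else seq[i] for i in keep))
    return cleaned_reference, cleaned


if __name__ == "__main__":
    ref, seqs = clean_alignment("ATG--CAT", ["ATGTTCAT", "A-G--CAT"])
    print(ref)
    print(seqs)
```

Because every cleaned sequence has the same length as the ungapped reference, the reference coding coordinates can be applied directly to each sequence before translation.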
The 27 protein alignments thus obtained, one for each of the SARS-CoV-2 proteins selected for this study, were transformed into an n by m matrix, where n is the length of the reference protein and m the length of the alignment, and processed column-wise. At each position in the alignment, the variant count was obtained by counting unique amino acids that differed from the reference; variability was quantified using the Shannon entropy, calculated as $SE = -\sum_{i=1}^{n} P_i \log_2 P_i$, where n here is the number of amino-acid types and $P_i$ is the fraction of amino acid type i at that position [27,28]. Codes for unknown amino acid (X) and translation termination (*) were ignored in all calculations.
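As a rough illustration of the column-wise calculation, the following Python sketch computes the variant count and the Shannon entropy at each alignment position. The function names `column_stats` and `alignment_stats` are hypothetical, and the sketch assumes the protein alignment is a list of equal-length strings with the same length as the reference; it is not the authors' implementation.

```python
import math
from collections import Counter

# Codes excluded from all calculations, as stated in the text:
# unknown amino acid (X) and translation termination (*).
IGNORED = {"X", "*"}


def column_stats(reference_residue, column):
    """Return (variant_count, shannon_entropy) for one alignment column."""
    counts = Counter(aa for aa in column if aa not in IGNORED)
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0

    # Variant count: unique amino acids that differ from the reference.
    variant_count = len(set(counts) - {reference_residue})

    # Shannon entropy: SE = -sum(P_i * log2(P_i)) over observed residue types.
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return variant_count, entropy


def alignment_stats(reference, sequences):
    """Apply column_stats to every position of an aligned protein set."""
    return [
        column_stats(ref_aa, [seq[i] for seq in sequences])
        for i, ref_aa in enumerate(reference)
    ]


if __name__ == "__main__":
    for position, (variants, entropy) in enumerate(
        alignment_stats("MK", ["MK", "MR", "MX", "LK"]), start=1
    ):
        print(position, variants, round(entropy, 4))
```

Residue types that are absent from a column contribute nothing to the sum, so iterating only over observed residues gives the same entropy as summing over all amino-acid types.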
