We downloaded 9683 aligned NS3 genotype 1a protein sequences (coverage ≥ 99%) from the HCV-GLUE database, http://hcv.glue.cvr.ac.uk (refs. 5, 6). To remove outlying sequences, we conducted principal component analysis (PCA) of the pair-wise similarity matrix (9683 × 9683) constructed from the sequence data (ref. 84), which identified 148 outliers. Briefly, a sequence was considered an outlier if it lay more than 3 scaled median absolute deviations (MAD) from the median of either the first or second PC (ref. 85). The scaled MAD is given by $c \cdot \mathrm{median}(|A - \mathrm{median}(A)|)$, where A is the first or second PC, $c = -1/(\sqrt{2}\,\mathrm{erfcinv}(3/2)) \approx 1.4826$, and erfcinv() is the inverse complementary error function. To avoid unnecessary patient bias that can compromise model predictive ability (Supplementary Figure 7), we excluded 2167 sequences that were not associated with any patient. These filtering procedures resulted in M = 7370 sequences (accession numbers listed in Supplementary Data 2) from W = 4773 patients. Next, we excluded from these data 116 fully conserved residues, i.e., residues where no mutation was observed in any sequence. This step excluded residue 156, which was fully conserved in our data; DRMs associated with it were therefore not investigated in our work. The final multiple sequence alignment (MSA) comprised M = 7370 sequences and N = 515 residues.
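The two filtering rules described above (the scaled-MAD outlier criterion on the first two PCs, and the removal of fully conserved residues) can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the toy PC scores, and the toy alignment are hypothetical, and the PCA step that produces the scores is assumed to have been run already. The constant c is computed through the standard normal quantile function, using the identity -1/(√2 · erfcinv(3/2)) = 1/Φ⁻¹(0.75).

```python
from statistics import NormalDist, median

# Scaling constant c = -1 / (sqrt(2) * erfcinv(3/2)).
# This equals 1 / Phi^{-1}(0.75), roughly 1.4826, so only the stdlib is needed.
SCALE = 1.0 / NormalDist().inv_cdf(0.75)


def scaled_mad(values):
    """Scaled median absolute deviation: c * median(|A - median(A)|)."""
    med = median(values)
    return SCALE * median(abs(v - med) for v in values)


def outlier_flags(values, threshold=3.0):
    """True where a value lies more than `threshold` scaled MADs from the median."""
    med = median(values)
    limit = threshold * scaled_mad(values)
    return [abs(v - med) > limit for v in values]


def flag_outlier_sequences(pc1, pc2, threshold=3.0):
    """A sequence is flagged if it is an outlier on either the first or second PC."""
    flags1 = outlier_flags(pc1, threshold)
    flags2 = outlier_flags(pc2, threshold)
    return [a or b for a, b in zip(flags1, flags2)]


def variable_columns(msa):
    """Indices of alignment columns in which more than one residue is observed."""
    length = len(msa[0])
    return [j for j in range(length) if len({seq[j] for seq in msa}) > 1]


# Hypothetical PC scores for six sequences (illustrative values only).
toy_pc1 = [0.1, -0.2, 0.0, 0.3, -0.1, 8.0]
toy_pc2 = [0.0, 0.1, -0.1, 0.2, -5.0, 0.1]
flags = flag_outlier_sequences(toy_pc1, toy_pc2)

# Hypothetical three-sequence alignment; column 0 is fully conserved.
toy_msa = ["MAT", "MAS", "MGT"]
kept = variable_columns(toy_msa)
```

In this toy example the last two sequences are flagged (one on each PC), and only columns 1 and 2 of the alignment are retained. In the actual data, the analogous column filter is what reduces the alignment to N = 515 residues.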