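This protocol assumes that the user has some experience working with shell commands on a Linux-based operating system and has superuser privileges. The installation commands use apt-get and therefore assume Ubuntu or another Debian-based distribution.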
The individual steps involved in this protocol and the Augur modules used in each step are summarized in Figure 1.
-
Install Docker Engine
Docker is an open-source technology based on virtualization, which is used for developing and running software applications in the form of containers. The Docker Engine can be installed using the following commands:
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
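To activate and test the Docker installation, execute the following commands. The first three add the current user to the docker group so that Docker can be run without sudo, and the last one runs a test container: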
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world

Figure 1. The different steps described in this protocol and the Augur modules used in each of the analysis steps
-
Install Anaconda
Anaconda is an open-source distribution of Python that simplifies the management of Python packages and environments. To install Anaconda, use the following commands:
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
bash Anaconda3-2020.02-Linux-x86_64.sh
Proceed with the installation by following the on-screen instructions. You can find the anaconda3 folder in the directory shown in the installer script. You can activate and test your installation by running the following commands:
source ~/.bashrc
conda list
-
Install Nextstrain-CLI
Nextstrain is available as a Python package and can be installed using pip.
python3 -m pip install nextstrain-cli
To check whether Nextstrain has been successfully installed, use the following command:
nextstrain version
The version number shown in the output should be 1.16.1 or higher.
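The version check can also be scripted. The following is a minimal sketch that compares version numbers as tuples of integers, so that, for example, 1.9.0 is correctly treated as older than 1.16.1. The helper names are hypothetical, and the example string "nextstrain.cli 1.16.1" is an assumption about how the output of the version command is formatted; the parser only looks for the first dotted number in whatever text it is given.

```python
import re

MINIMUM_VERSION = (1, 16, 1)  # minimum Nextstrain CLI version stated in this protocol


def parse_version(text):
    """Return the first dotted version number found in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(text, minimum=MINIMUM_VERSION):
    """True if the version found in `text` is at least `minimum`."""
    return parse_version(text) >= minimum


# Assumed output format; paste the actual output of `nextstrain version` here.
print(meets_minimum("nextstrain.cli 1.16.1"))
```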
-
Install Augur
Augur is the toolkit provided by Nextstrain for phylogenetic analysis. Augur is also available as a Python package and can be installed using the following command:
python3 -m pip install nextstrain-augur
-
Install MAFFT
MAFFT (Multiple Alignment using Fast Fourier Transform) is required by Augur to perform multiple-sequence alignments. To download and install this tool, use the following command:
sudo apt-get install mafft
-
Install IQ-TREE
IQ-TREE is an open-source tool for constructing maximum-likelihood trees using phylogenetic data. IQ-TREE is required by Augur for constructing a phylogenetic tree from sequence data. To install IQ-TREE use the following command:
sudo apt-get install iqtree
It is recommended to use IQ-TREE version 1.6.1 (default version installed for Ubuntu 18.04 LTS) or higher.
-
Download the SARS-CoV-2 sequence dataset
The Global Initiative on Sharing All Influenza Data (GISAID) is the most up-to-date public repository of SARS-CoV-2 genome sequences. For this phylogenetic clustering protocol, we downloaded the dataset of ~15,000 complete SARS-CoV-2 genome sequences available from GISAID as of 1 May 2020. The database can be accessed by registering for a GISAID account. Once the account has been activated, the sequence dataset can be downloaded by logging into the GISAID EpiCoV™ database (https://www.epicov.org/epi3/frontend) and navigating to the Browse option.
To create the metadata file required by Augur, you will also need to download the Acknowledgment Table for all submissions provided by GISAID, which can also be found on the Browse page.
-
Download the SARS-CoV-2 reference genome
Before proceeding with the analysis, you also need to download the reference genome for SARS-CoV-2 from NCBI in GenBank format (.gb). For this analysis, we downloaded the genome with the accession number MN908947.3.
-
Preparing input files
To use Nextstrain for phylogenetic analysis and visualization, you need to prepare the following input files (Table 1):
Table 1. List of input files required to run the different steps in the analysis pipeline

-
sequences.fasta
A single FASTA file containing a collection of pathogen sequences to be analyzed. For this analysis, we used the sequence dataset downloaded from GISAID. Each sequence in the FASTA file should have the strain ID of the virus as the sequence header. A sample sequence record for the FASTA file is shown in Figure 2.

Figure 2. Sample record for the hCoV-19/India/1-27/2020 SARS-CoV-2 strain in the sequences.fasta format
-
metadata.tsv
A tab-delimited metadata file that describes the sequences given in the FASTA file. The various fields to be included in the metadata file are as follows:
-
Required fields: Strain, Virus, Date
For each strain ID in the sequences.fasta file, there should be an entry under the strain column in the metadata file.
-
Additional fields (if using published data): Accession, Authors, URL, Title, Journal, Paper_URL.
-
To infer ancestral traits, additional information fields such as region, country, state, and city need to be included in the metadata file.
The information for the various fields in the metadata file can be taken from the Acknowledgment Table downloaded from GISAID. A sample metadata spreadsheet is linked here as Supplementary Data 1.
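Before running Augur, it can help to confirm that the two files agree. The following is a minimal sketch, not part of Augur, that reports required columns missing from the metadata and FASTA headers that have no matching metadata row. The function names are hypothetical, column names are matched case-insensitively as an assumption, and the example records use placeholder values (the virus name and the ambiguous dates are illustrative, not taken from GISAID).

```python
import csv
import io


def fasta_ids(handle):
    """Collect the header text (without '>') of every record in a FASTA file."""
    return [line[1:].strip() for line in handle if line.startswith(">")]


def check_metadata(fasta_handle, metadata_handle, required=("strain", "virus", "date")):
    """Return (missing_columns, missing_strains) for a FASTA/metadata pair."""
    ids = fasta_ids(fasta_handle)
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    fieldnames = reader.fieldnames or []
    present = {name.lower() for name in fieldnames}
    missing_columns = [column for column in required if column not in present]
    # Locate the strain column regardless of capitalization.
    strain_key = next((name for name in fieldnames if name.lower() == "strain"), None)
    strains = {row[strain_key] for row in reader} if strain_key else set()
    missing_strains = [seq_id for seq_id in ids if seq_id not in strains]
    return missing_columns, missing_strains


# Placeholder data for illustration only.
EXAMPLE_FASTA = ">hCoV-19/India/1-27/2020\nACGT\n>hCoV-19/Wuhan/WH01/2019\nACGT\n"
EXAMPLE_METADATA = "strain\tvirus\tdate\nhCoV-19/India/1-27/2020\tncov\t2020-XX-XX\n"

columns, strains = check_metadata(io.StringIO(EXAMPLE_FASTA), io.StringIO(EXAMPLE_METADATA))
print("Missing columns:", columns)
print("Strains without metadata:", strains)
```

With real files, the two io.StringIO objects would be replaced by open("sequences.fasta") and open("metadata.tsv", newline="").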
-
clades.tsv
This file is required for the addition of clade labeling to the phylogenetic tree. The file specifies the mutations (amino acid or nucleotide) specific to a particular clade of the virus (Figure 3). The clades.tsv file should contain the following fields:
-
clade: To describe the name of a clade.
-
gene: The name of the gene in which the mutation lies (for nucleotide changes, the gene name should be ‘nuc’).
-
site: The position of the mutation within the genome.
-
alt: The mutated amino acid or nucleotide found at that position.
For this analysis, we used the clades definition for SARS-CoV-2 genomes defined by Nextstrain ( https://github.com/nextstrain/ncov ).

Figure 3. Summary screenshot of the clades.tsv file provided by Nextstrain for SARS-CoV-2 genomes
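If you write or edit a clades.tsv file by hand, a quick structural check can catch typos before they reach Augur. The sketch below assumes a simple file with a header row containing the four fields described above and no comment lines; the function name is hypothetical, and the two example rows are invented placeholders rather than real clade definitions.

```python
import csv
import io

REQUIRED_FIELDS = ("clade", "gene", "site", "alt")


def validate_clades(handle):
    """Return a list of human-readable problems found in a clades.tsv file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [name for name in REQUIRED_FIELDS if name not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        site = (row["site"] or "").strip()
        alt = (row["alt"] or "").strip().upper()
        if not site.isdigit() or int(site) < 1:
            problems.append(f"line {line_no}: site must be a positive integer, got {site!r}")
        if row["gene"] == "nuc" and alt not in {"A", "C", "G", "T"}:
            problems.append(f"line {line_no}: nucleotide alt must be A, C, G, or T, got {alt!r}")
        if not (row["clade"] or "").strip():
            problems.append(f"line {line_no}: clade name is empty")
    return problems


# Invented placeholder rows, for illustration only.
EXAMPLE_CLADES = (
    "clade\tgene\tsite\talt\n"
    "ExampleClade\tnuc\t100\tT\n"
    "ExampleClade\tExampleGene\t10\tL\n"
)
print(validate_clades(io.StringIO(EXAMPLE_CLADES)))
```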
-
auspice_config.json
This file is needed to set various display options for visualization. A sample config file is linked here as Supplementary Data 2.
-
lat_longs.tsv
A tab-separated file containing latitudes and longitudes for all regions, countries, states, and cities in the dataset (Figure 4). This file will be used to display geographic traits during visualization.

Figure 4. Summary screenshot of the lat_longs.tsv file required by Nextstrain for visualizing geographic traits
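The following sketch parses a lat_longs.tsv file and checks that every coordinate is in range. It assumes the layout used by the default file distributed with Augur, namely four tab-separated columns without a header (geographic resolution, place name, latitude, longitude); check this against Figure 4 before relying on it. The function name is hypothetical, and the coordinates in the example row are approximate and illustrative.

```python
import io


def parse_lat_longs(handle):
    """Map (resolution, name) to (latitude, longitude), validating each row."""
    coordinates = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 tab-separated fields, got {len(fields)}")
        resolution, name, latitude, longitude = fields
        latitude, longitude = float(latitude), float(longitude)
        if not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
            raise ValueError(f"line {line_no}: coordinates out of range for {name!r}")
        coordinates[(resolution, name)] = (latitude, longitude)
    return coordinates


# Approximate, illustrative coordinates.
EXAMPLE_LAT_LONGS = "country\tIndia\t20.59\t78.96\n"
print(parse_lat_longs(io.StringIO(EXAMPLE_LAT_LONGS)))
```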
-
Quality assessment
In the visualization, we also wanted to distinguish high-quality FASTA sequences in the dataset from low-quality ones. Accordingly, we added a field, ‘quality,’ to the metadata file. A sequence is considered high quality if it meets all of the following criteria:
-
Percentage identity to the reference genome after pairwise alignment: >99%
-
Percentage of gaps in the alignment: <1%
-
Percentage of N (unknown nucleic acid residue) bases in the sequence: <1%
-
No degenerate bases in the sequence
Based on the above criteria, the ‘quality’ metadata field can hold the values, ‘High,’ ‘Low,’ and ‘Not Assessed.’
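The criteria above can be applied programmatically. The sketch below is one possible implementation under stated assumptions, not the authors' script: the percentage identity and gap percentage are taken as inputs (they must come from a pairwise alignment performed separately), and a sequence is labeled ‘Not Assessed’ when those alignment metrics are unavailable, which is our assumption about how that value is used. All function names are hypothetical.

```python
# IUPAC degenerate nucleotide codes other than N, which is counted separately.
DEGENERATE_BASES = set("RYSWKMBDHV")


def percent_n(sequence):
    """Percentage of N bases in the sequence."""
    sequence = sequence.upper()
    return 100.0 * sequence.count("N") / len(sequence)


def has_degenerate_bases(sequence):
    """True if the sequence contains any degenerate base other than N."""
    return any(base in DEGENERATE_BASES for base in sequence.upper())


def classify_quality(sequence, identity_pct=None, gap_pct=None):
    """Return 'High', 'Low', or 'Not Assessed' for one sequence."""
    if not sequence or identity_pct is None or gap_pct is None:
        return "Not Assessed"  # assumption: no alignment metrics means no assessment
    is_high = (
        identity_pct > 99
        and gap_pct < 1
        and percent_n(sequence) < 1
        and not has_degenerate_bases(sequence)
    )
    return "High" if is_high else "Low"


print(classify_quality("ACGT" * 100, identity_pct=99.9, gap_pct=0.1))
```

The returned label can then be written to the ‘quality’ column of the metadata file for the corresponding strain.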
To visualize the quality assessment, we created an additional configuration file ‘colors.tsv,’ a tab-delimited file containing hex codes for each value of the sequence quality field that you want to represent. In this analysis, high-quality is shown in green, low-quality in red, and unassessed sequences in yellow by specifying the corresponding hex codes for the required colors in the ‘colors.tsv’ file (Figure 5).

Figure 5. Summary screenshot of the colors.tsv file created for visualizing sequence quality
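A colors.tsv file of this kind can be generated with a few lines of code. The sketch below assumes the three-column, header-less layout (metadata field, value, hex code) accepted by the --colors option of augur export. The hex codes are illustrative pure green, red, and yellow and may differ from those shown in Figure 5; the function name is hypothetical.

```python
import re

# Illustrative hex codes: green, red, and yellow.
QUALITY_COLORS = {
    "High": "#00FF00",
    "Low": "#FF0000",
    "Not Assessed": "#FFFF00",
}


def colors_tsv(field, colors):
    """Return colors.tsv content with one 'field<TAB>value<TAB>hex' line per value."""
    lines = []
    for value, hex_code in colors.items():
        if not re.fullmatch(r"#[0-9A-Fa-f]{6}", hex_code):
            raise ValueError(f"invalid hex code for {value!r}: {hex_code!r}")
        lines.append(f"{field}\t{value}\t{hex_code}")
    return "\n".join(lines) + "\n"


print(colors_tsv("quality", QUALITY_COLORS), end="")
```

Writing the returned string to a file named colors.tsv gives a file that can be passed to the export step.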
Due to legibility and performance constraints, Nextstrain can only handle ~3,000 sequences in a single view. Since we are working with a set of ~15,000 genome sequences, we subsampled our data and analyzed them by focusing on an individual geographic region (i.e., India).
-
Filter sequences
The augur filter command filters the input sequence set according to specified criteria and subsamples it. The following command filters the SARS-CoV-2 sequences based on the dates recorded in the metadata file and groups them by country, year, and month. All sequences dated prior to 2013 or possessing a missing date record will be dropped. The global data will also be subsampled to 100 sequences per country per year per month.
augur filter --sequences <sequences.fasta> --metadata <metadata.tsv> --output <filtered_ncov.fasta> --group-by country year month --sequences-per-group 100 --min-date 2013
To focus on a particular geographic region, the filter command also contains parameters that help to include or exclude certain sequences from the analysis:
--include <include_file> This constraint can be used to include sequences regardless of other subsampling criteria. For this analysis, the include_file will contain the line hCoV-19/Wuhan/WH01/2019, since we will be using this genome as the root in the phylogenetic tree. The names of any other sequences that you want to include in your analysis can be added to this file.
--exclude-where <CONDITION> This constraint will be used for focusing the analysis on a particular region.
To subsample the dataset for a single geographic region, use the following command:
augur filter --sequences <sequences.fasta> --metadata <metadata.tsv> --output <filtered_ncov_india.fasta> --exclude-where country!=India --include <include_file>
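To make the grouping logic of the first filter command concrete, the following sketch reproduces its effect on a list of metadata records: records without a usable date or dated before the minimum year are dropped, the rest are grouped by country, year, and month, and each group is capped at 100 records. This is an illustration of the idea only, not Augur's implementation; in particular, it assumes complete YYYY-MM-DD dates and does not handle the ambiguous dates that Augur supports. The function name and the seed parameter are hypothetical.

```python
import random
from collections import defaultdict


def subsample(records, per_group=100, min_year=2013, seed=0):
    """Keep at most `per_group` records per (country, year, month) group."""
    groups = defaultdict(list)
    for record in records:
        parts = (record.get("date") or "").split("-")
        if len(parts) < 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # missing or unusable date: drop the record
        year, month = int(parts[0]), int(parts[1])
        if year < min_year:
            continue  # older than the minimum date: drop the record
        groups[(record["country"], year, month)].append(record)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    kept = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) > per_group:
            members = rng.sample(members, per_group)
        kept.extend(members)
    return kept


# Synthetic records for illustration only.
records = [{"strain": f"s{i}", "country": "India", "date": "2020-03-15"} for i in range(150)]
records.append({"strain": "old", "country": "India", "date": "2012-05-01"})
records.append({"strain": "undated", "country": "India", "date": ""})
print(len(subsample(records)))
```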
-
Alignment to the reference genome
Augur uses MAFFT to perform multiple-sequence alignments. To create an alignment file using Augur use the following command:
augur align --sequences <filtered_ncov.fasta> --reference-sequence <MN908947.gb> --output <aligned_ncov.fasta> --nthreads <2> --remove-reference --fill-gaps
For the geographic region-focused analysis, use the following command:
augur align --sequences <filtered_ncov_india.fasta> --reference-sequence <MN908947.gb> --output <aligned_ncov_india.fasta> --nthreads <2> --remove-reference --fill-gaps
-
Constructing the phylogenetic tree
Augur uses IQ-TREE as the default software to construct a phylogenetic tree from the multiple-sequence alignment file. The branch lengths in the tree are a measure of nucleotide divergence. The following command will generate a phylogenetic tree in Newick format (.nwk):
augur tree --alignment <aligned_ncov.fasta> --output <raw_tree_ncov.nwk> --nthreads <4>
For the geographic region-focused analysis, use the following command:
augur tree --alignment <aligned_ncov_india.fasta> --output <raw_tree_ncov_india.nwk> --nthreads <4>
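The global and region-focused commands differ only in the suffix of the file names, so they can be generated from a single template. The sketch below builds the align and tree commands shown above as argument lists and prints them for inspection; it does not run Augur. The helper names are hypothetical, the file names follow the placeholders used in this protocol, and shlex.join requires Python 3.8 or later.

```python
import shlex


def file_names(suffix=""):
    """File names for one dataset; suffix is '' (global) or e.g. '_india'."""
    return {
        "filtered": f"filtered_ncov{suffix}.fasta",
        "aligned": f"aligned_ncov{suffix}.fasta",
        "raw_tree": f"raw_tree_ncov{suffix}.nwk",
    }


def align_and_tree_commands(suffix="", reference="MN908947.gb",
                            align_threads=2, tree_threads=4):
    """Return the augur align and augur tree commands as argument lists."""
    names = file_names(suffix)
    align = [
        "augur", "align",
        "--sequences", names["filtered"],
        "--reference-sequence", reference,
        "--output", names["aligned"],
        "--nthreads", str(align_threads),
        "--remove-reference", "--fill-gaps",
    ]
    tree = [
        "augur", "tree",
        "--alignment", names["aligned"],
        "--output", names["raw_tree"],
        "--nthreads", str(tree_threads),
    ]
    return [align, tree]


for argv in align_and_tree_commands("_india"):
    print(shlex.join(argv))
```

Each argument list can be passed to subprocess.run(argv, check=True) once Augur is installed, which stops the workflow at the first failing step.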
-
Refining the phylogenetic tree
The raw tree constructed in the previous step can be further processed by Augur using TreeTime to adjust the branch lengths according to the sampling dates of the sequences. For this analysis, we specified the root of the tree by giving the sequence name hCoV-19/Wuhan/WH01/2019 explicitly with the --root parameter of the refine command. The --clock-rate parameter was used to run the analysis using a fixed evolutionary rate to produce a robust time-resolved phylogeny, and the --clock-filter-iqd parameter filters out genomes that do not follow the evolutionary rate or molecular clock. For SARS-CoV-2 genomes, this rate is fixed at 0.0008 or 8 × 10⁻⁴ substitutions per site per year. To produce a time-resolved tree, use the following command:
augur refine --tree <raw_tree_ncov.nwk> --alignment <aligned_ncov.fasta> --metadata <metadata.tsv> --output-tree <refined_ncov_tree.nwk> --output-node-data <branch_lengths_ncov.json> --root hCoV-19/Wuhan/WH01/2019 --timetree --clock-rate 0.0008 --clock-std-dev 0.0004 --coalescent skyline --date-inference marginal --divergence-unit mutations --date-confidence --no-covariance --clock-filter-iqd 4
For the geographic region-focused analysis, use the following command:
augur refine --tree <raw_tree_ncov_india.nwk> --alignment <aligned_ncov_india.fasta> --metadata <metadata.tsv> --output-tree <refined_ncov_tree_india.nwk> --output-node-data <branch_lengths_ncov_india.json> --root hCoV-19/Wuhan/WH01/2019 --timetree --clock-rate 0.0008 --clock-std-dev 0.0004 --coalescent skyline --date-inference marginal --divergence-unit mutations --date-confidence --no-covariance --clock-filter-iqd 4
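To give a sense of scale for the fixed clock rate, the following short calculation converts it from substitutions per site per year to substitutions per genome. It uses the length of the MN908947.3 reference genome (29,903 nucleotides) and the --clock-rate and --clock-std-dev values from the commands above; the variable and function names are ours.

```python
GENOME_LENGTH = 29903   # length of the MN908947.3 reference genome in nucleotides
CLOCK_RATE = 0.0008     # substitutions per site per year (--clock-rate)
CLOCK_STD_DEV = 0.0004  # standard deviation of the rate (--clock-std-dev)


def expected_substitutions(rate, sites, years=1.0):
    """Expected substitutions across `sites` positions over `years` years."""
    return rate * sites * years


per_year = expected_substitutions(CLOCK_RATE, GENOME_LENGTH)
per_month = per_year / 12
spread = expected_substitutions(CLOCK_STD_DEV, GENOME_LENGTH)

print(f"{per_year:.1f} substitutions per genome per year")
print(f"{per_month:.1f} substitutions per genome per month")
print(f"standard deviation: {spread:.1f} substitutions per genome per year")
```

At this rate, a genome accumulates roughly 24 substitutions per year, or about two per month.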
-
Annotating ancestral traits
Augur can use the time tree to infer the region and country of all internal nodes. The ancestral traits for all nodes can be annotated using the following command:
augur traits --tree <refined_ncov_tree.nwk> --metadata <metadata.tsv> --output <ncov_traits.json> --columns region country --confidence --sampling-bias-correction 2.5
For the geographic region-focused analysis, use the following command:
augur traits --tree <refined_ncov_tree_india.nwk> --metadata <metadata.tsv> --output <ncov_traits_india.json> --columns city --confidence --sampling-bias-correction 2.5
-
Inferring ancestral sequences and nucleotide mutations
The following command will identify the nucleotide mutations of the branches of the tree and infer the ancestral strain of each node:
augur ancestral --tree <refined_ncov_tree.nwk> --alignment <aligned_ncov.fasta> --output-node-data <ncov_nt_muts.json> --inference joint --infer-ambiguous
For the geographic region-focused analysis, use the following command:
augur ancestral --tree <refined_ncov_tree_india.nwk> --alignment <aligned_ncov_india.fasta> --output-node-data <ncov_nt_muts_india.json> --inference joint --infer-ambiguous
-
Inferring amino acid mutations
The following command will identify the amino acid mutations using the reference genome and ancestral sequences:
augur translate --tree <refined_ncov_tree.nwk> --ancestral-sequences <ncov_nt_muts.json> --reference-sequence <MN908947.gb> --output <ncov_aa_muts.json>
For the geographic region-focused analysis, use the following command:
augur translate --tree <refined_ncov_tree_india.nwk> --ancestral-sequences <ncov_nt_muts_india.json> --reference-sequence <MN908947.gb> --output <ncov_aa_muts_india.json>
-
Identifying clades
The following command will label clades within the dataset using the nucleotide and amino acid mutations specified in the clades.tsv file:
augur clades --tree <refined_ncov_tree.nwk> --mutations <ncov_aa_muts.json> <ncov_nt_muts.json> --clades <clades.tsv> --output-node-data <ncov_clades.json>
For the geographic region-focused analysis, use the following command:
augur clades --tree <refined_ncov_tree_india.nwk> --mutations <ncov_aa_muts_india.json> <ncov_nt_muts_india.json> --clades <clades.tsv> --output-node-data <ncov_clades_india.json>
-
Exporting output files for visualization
The following command will export all output files generated in the previous steps of the analysis as a single JSON file to visualize the data using Nextstrain:
augur export v2 --tree <refined_ncov_tree.nwk> --metadata <metadata.tsv> --node-data <branch_lengths_ncov.json> <ncov_aa_muts.json> <ncov_nt_muts.json> <ncov_traits.json> <ncov_clades.json> --auspice-config auspice_config.json --lat-longs lat_longs.tsv --colors colors.tsv --output auspice/COVID_global.json
For the geographic region-focused analysis, use the following command:
augur export v2 --tree <refined_ncov_tree_india.nwk> --metadata <metadata.tsv> --node-data <branch_lengths_ncov_india.json> <ncov_aa_muts_india.json> <ncov_nt_muts_india.json> <ncov_traits_india.json> <ncov_clades_india.json> --auspice-config auspice_config.json --lat-longs lat_longs.tsv --colors colors.tsv --output auspice/COVID_india.json
-
Viewing the data
To visualize the output, use the following command:
nextstrain view auspice/ --allow-remote-access
This command will start the Auspice server on port 4000. The output can then be visualized through a browser by navigating to http://127.0.0.1:4000/ or using the IP address of the machine on which the Auspice service is running and navigating to http://IP_ADDRESS_OF_MACHINE:4000/. The different subsampled datasets can be found under the ‘Dataset’ dropdown menu (Figure 6).
Note: The links above point to a server that Auspice runs locally, so they work only after the steps in this protocol have been completed and the server has been started. The phylogeny can then be viewed in a browser on the user's own system.

Figure 6. Screenshot of the visualization produced by Nextstrain for the COVID_global and COVID_india datasets