We will first make the build step-by-step using an example data set. Then we will see how to automate this stepwise process by defining a pathogen build script which contains the commands we will run below.
If you haven’t already worked through the Quickstart, you may want to look through that before before continuing with this tutorial.
Note that we will not use all the bioinformatics commands possible with Nextstrain. After running this tutorial, you may want to read more about the bioinformatics commands offered by Nextstrain [external link].
Nextstrain builds typically require the following steps:
However, the commands that make up each step can vary depending on your pathogen and your analysis. Here, we’ll follow these steps:
Prepare pathogen sequences and metadata
Construct a phylogeny
Annotate the phylogeny
Export the final results into auspice-readable format
The data in this tutorial is public and is a subset of the data from Lee et al.’s 2015 paper Population genomics of Mycobacterium tuberculosis in the Inuit. As location was anonymized in the paper, location data provided here was randomly chosen from the region for illustrative purposes.
First, download the Tuberculosis (TB) build which includes example data and a pathogen build script. Then enter the directory you just cloned.
git clone https://github.com/nextstrain/tb.git cd tb
Next, if you’re using the Nextstrain CLI tool, use it to enter the Nextstrain build environment by running:
nextstrain shell .
Note the dot (
.) as the last argument; it is important and indicates that your current directory (
tb/) is the build directory.
Your command prompt will change to indicate you are in the build environment.
(If you want to leave the build environment, run the command
If instead you installed Nextstrain components using Conda, remember to activate your environment by running:
conda activate nextstrain
A Nextstrain build with VCF file input starts with:
There are other files you will need if you want to perform certain steps, like masking. If you are also working with TB sequences, you may be able to use the files provided here on your own data. Otherwise, you’ll need to provide files specific to your pathogen.
Here, our input file is compressed with gzip - you can see it ends with
augur can take gzipped or un-gzipped VCF files.
It can also produce either gzipped or un-gzipped VCF files as output.
Here, we’ll usually keep our VCF files gzipped, by giving our output files endings like
.vcf.gz, but you can specify
All the data you need to make the TB build is in the
Sometimes you may want to exclude certain sequences from analysis.
You may also wish to downsample your data based on certain criteria.
filter lets you do this.
For more information on all the ways you can filter data, see the bioinformatics commands reference [external link]
For this example, we’ll just exclude sequences in the file
First, we’ll make a folder to hold all of our results from each step:
mkdir -p results
Now run filter:
augur filter \ --sequences data/lee_2015.vcf.gz \ --metadata data/meta.tsv \ --exclude config/dropped_strains.txt \ --output results/filtered.vcf.gz
There may be regions in your pathogen sequences that are unreliable. For example, areas that are hard to map because of repeat regions. Often, these are excluded from analysis so that incorrect calls in these areas don’t influence the results. The areas to be masked are specified in a BED-format file.
augur mask \ --sequences results/filtered.vcf.gz \ --mask config/Locus_to_exclude_Mtb.bed \ --output results/masked.vcf.gz
Now our sequences are ready to start analysis.
With VCF files, we’ll do this in two steps that are slightly different from FASTA-input.
You can use different tree-building programs to build your initial tree, and specify some parameters.
You can see all the options for
tree in the bioinformatics commands reference [external link]
Here, we pass in the VCF file and the reference it was mapped to.
We also pass in a list of sites that we’d like to exclude from building the topology.
These are sites associated with drug-resistance mutations that can influence the topology.
We exclude them here, but they’ll be allowed to influence branch length and be included in ancestral sequence reconstruction later.
Finally, we use
iqtree as the method to build the tree here.
augur tree \ --alignment results/masked.vcf.gz \ --vcf-reference data/ref.fasta \ --exclude-sites config/drm_sites.txt \ --method iqtree \ --output results/tree_raw.nwk
Now we’ll use the topology from
tree, but get more accurate branch lengths and a time-resolved tree.
This adjusts branch lengths in the tree to position tips by their sample date and infer the most likely time of their ancestors, using TreeTime.
There are many options that can be specified here in
refine to help you get a good tree.
You can read about them in the bioinformatics commands reference [external link]
refine will produce as output:
augur refine \ --tree results/tree_raw.nwk \ --alignment results/masked.vcf.gz \ --vcf-reference data/ref.fasta \ --metadata data/meta.tsv \ --timetree \ --root residual \ --coalescent opt \ --output-tree results/tree.nwk \ --output-node-data results/branch_lengths.json
In addition to assigning times to internal nodes, the
refine command filters tips that are likely outliers.
Branch lengths in the resulting Newick tree measure adjusted nucleotide divergence.
All other data inferred by TreeTime is stored by strain or internal node name in the JSON file.
Now that we have an accurate tree and some information about the ancestral sequences, we can annotate some interesting data onto our phylogeny. TreeTime can infer ancestral sequences and ancestral traits from an existing phylogenetic tree and metadata to annotate each tip of the tree.
We can reconstruct the ancestral sequences for the internal nodes on our phylogeny and identify any nucleotide mutations on the branches leading to any node in the tree.
You can read about all the options for
ancestral in the bioinformatics commands reference [external link]
For VCF runs,
ancestral will produce another VCF that contains entries for the reconstructed sequence of all the internal nodes, as well as a JSON-format file that contains nucleotide mutation information for each node.
augur ancestral \ --tree results/tree.nwk \ --alignment results/masked.vcf.gz \ --vcf-reference data/ref.fasta \ --inference joint \ --output results/nt_muts.json \ --output-vcf results/nt_muts.vcf
translate we can identify amino acid mutations from the nucleotide mutations and a GFF file with gene coordinate annotations.
The resulting JSON file contains amino acid mutations indexed by strain or internal node name and by gene name.
translate will also produce a VCF-style file with the amino acid changes for each gene and each sequence, and FASTA file with the translated ‘reference’ genes which the VCF-style file ‘maps’ to.
Because of the number of genes in TB, we will only translate some genes to save time. We can pass in a list of genes to translate (genes associated with drug resistance) using
Note that the
--reference-sequence option is how you pass in the GFF file with the gene coordinates.
augur translate \ --tree results/tree.nwk \ --ancestral-sequences results/nt_muts.vcf \ --vcf-reference data/ref.fasta \ --genes config/genes.txt \ --reference-sequence config/Mtb_H37Rv_NCBI_Annot.gff \ --output results/aa_muts.json \ --alignment-output results/translations.vcf \ --vcf-reference-output results/translations_reference.fasta
traits can reconstruct the probable ancestral state of traits like location and host (or others).
This is done by specifying a column or columns in the meta-data file.
--confidence will give confidence estimates for the reconstructed states.
The output will be a JSON file with the state (and confidence, if specified) information for each node.
augur traits \ --tree results/tree.nwk \ --metadata data/meta.tsv \ --columns location \ --confidence \ --output results/traits.json
In the original paper, the authors identified ‘sublineages’ within the dataset.
We can add these to our dataset as ‘clades’ by defining the sublineages with amino-acid or nucleotide mutations specific to that sublineage, given here in the file
As clades, these sublineages will be labeled and we’ll be able to color the tree by them.
augur clades \ --tree results/tree.nwk \ --mutations results/nt_muts.json results/aa_muts.json \ --clades config/clades.tsv \ --output results/clades.json
sequence-traits can identify any trait associated with particular nucleotide or amino-acid mutations, not just drug resistance mutations.
This dataset doesn’t actually contain any drug resistance mutations, but identifying such mutations is often of interest to those working on tuberculosis. Here, we’ll run this step as an example, even though it won’t add anything to the tree for this dataset.
augur sequence-traits \ --ancestral-sequences results/nt_muts.vcf \ --vcf-reference data/ref.fasta \ --translations results/translations.vcf \ --vcf-translate-reference results/translations_reference.fasta \ --features config/DRMs-AAnuc.tsv \ --count traits \ --label Drug_Resistance \ --output results/drms.json
Finally, collect all node annotations and metadata and export it all in auspice’s JSON format. The resulting tree and metadata JSON files are the inputs to the auspice visualization tool.
augur export \ --tree results/tree.nwk \ --metadata data/meta.tsv \ --node-data results/branch_lengths.json \ results/traits.json \ results/aa_muts.json \ results/nt_muts.json \ results/clades.json \ results/drms.json \ --auspice-config config/config.json \ --colors config/color.tsv \ --lat-longs config/lat_longs.tsv \ --output-tree auspice/tb_tree.json \ --output-meta auspice/tb_meta.json
If you entered the Nextstrain build environment using
nextstrain shell at the beginning of this tutorial, leave it now using the
exit command and then use
nextstrain view to visualize the TB build output in
# Leave the shell you entered earlier. exit # View results in your auspice/ directory. nextstrain view auspice/
If you’re not using the Nextstrain CLI shell, then copy the
auspice/*.json files into the
data directory of your local auspice installation and start auspice from there.
You can use the commands below (adjusted if necessary), or copy them using a graphical file explorer.
# Copy files into auspice data directory. Adjust # paths if auspice isn't installed in ~/src/auspice/. mkdir ~/src/auspice/data/ cp auspice/*.json ~/src/auspice/data/
Then enter your
auspice directory and start auspice.
# Navigate into auspice. cd ~/src/auspice/data/ # Start auspice. npm run dev
When auspice is running, navigate to http://localhost:4000/local/tb in your browser to view the results.
To stop auspice and return to the command line when you are done viewing your data, press CTRL+C.
While it is instructive to run all of the above commands manually, it is more practical to automate their execution with a single script.
Nextstrain implements these automated pathogen builds with Snakemake by defining a
Snakefile like the one in the TB repository you downloaded.
First delete the output from the manual steps above.
(Be sure to navigate into the
tb-tutorial/ directory first.)
rm -rf results/ auspice/
Then, if you’re using the Nextstrain CLI tool, run:
nextstrain build .
to run the automated pathogen build.
If you’re not using the Nextstrain CLI tool, run:
The automated build runs all of the manual steps above up through the auspice export. View the results the same way you did before to confirm it produced the same TB build you made manually.
Note that automated builds will only re-run steps when the data changes.
This means builds will pick up where they left off if they are restarted after being interrupted.
If you want to force a re-run of the whole build, first remove any previous output with
nextstrain build . clean or
This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. Special thanks to Kristian Andersen, Allison Black, David Blazes, Peter Bogner, Matt Cotten, Ana Crisan, Gytis Dudas, Vivien Dugan, Karl Erlandson, Nuno Faria, Jennifer Gardy, Becky Garten, Dylan George, Ian Goodfellow, Nathan Grubaugh, Betz Halloran, Christian Happi, Jeff Joy, Paul Kellam, Philippe Lemey, Nick Loman, Sebastian Maurer-Stroh, Louise Moncla, Oliver Pybus, Andrew Rambaut, Colin Russell, Pardis Sabeti, Katherine Siddle, Kristof Theys, Dave Wentworth, Shirlee Wohl and Nathan Yozwiak for comments, suggestions and data sharing.
© 2015-2019 Trevor Bedford and Richard Neher