Introducing nextstrain run

Thomas Sibley

One of our planned directions for 2025 is to facilitate running pathogen workflows with user data and user config. At the beginning of May we took a big step toward reaching that goal, with the introduction of the new nextstrain run command in the version 10.0.0 release of our Nextstrain CLI package. This new command was accompanied by new supporting features in the existing nextstrain setup, nextstrain update, and nextstrain version commands.

nextstrain run is a new way of running our pathogen workflows without mixing your input and output files with the workflow's own source code.

Indeed, you don't even need to download the pathogen repository source code yourself or manage updates to it with Git at all. (Git is a very useful tool for software development, but it's a huge source of confusion and frustration for people who just want to run the workflows.)

Instead, you provide an analysis directory containing your input files. For example, your input files might include a config.yaml file to adjust filtering and subsampling parameters and a pair of metadata.tsv and sequences.fasta files containing additional private data. This analysis directory is used as the working directory for the workflow run and is where all output files will be created as well. For example, for a phylogenetic workflow, the results/ directory containing alignments and the auspice/ directory containing dataset JSONs will both be within your analysis directory. Multiple separate analysis directories—for example, with different configs, input data, etc.—may be used for concurrent runs of the same pathogen workflow without conflict, allowing for independent outputs and analyses.
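As a rough sketch of the layout described above, the following shell snippet creates two independent analysis directories with placeholder input files. The directory names, the file contents, and the metadata column names are illustrative assumptions. Whether and how a given workflow picks up config.yaml, metadata.tsv, or sequences.fasta depends on that workflow. The snippet only prints the nextstrain run commands rather than executing them.

```shell
#!/bin/sh
set -eu

# Scratch location for the example; any writable directory would do.
base="$(mktemp -d)"

for name in baseline with-private-data; do
    dir="$base/$name"
    mkdir -p "$dir"
    # Placeholder config: real keys depend on the pathogen workflow.
    printf '# config overrides for the %s analysis go here\n' "$name" > "$dir/config.yaml"
done

# Hypothetical private inputs for the second analysis only
# (header-only metadata with assumed column names, and an empty FASTA file).
printf 'strain\tdate\n' > "$base/with-private-data/metadata.tsv"
: > "$base/with-private-data/sequences.fasta"

# Each directory is its own working directory, so outputs stay separate.
for name in baseline with-private-data; do
    echo "nextstrain run measles phylogenetic $base/$name"
done
```

Because each run writes only into its own analysis directory, the two printed commands could be started side by side without their results/ and auspice/ outputs colliding.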

Getting started with a pathogen's workflows no longer involves Git as it did in the past. Instead, you initially set up the pathogen with nextstrain setup (e.g. nextstrain setup measles). This downloads the pathogen's files into an isolated and automatically managed location. (And while you’re not expected to ever need to dig into that location, it’s not hidden from you either.) The latest version of the pathogen is set up by default, but multiple specific versions may be set up and run independently without conflict, allowing for comparisons of output across versions.
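To make the multiple-versions idea concrete, here is a minimal sketch that sets up the default version and one specific version side by side. The version name v1 is hypothetical, and the name@version argument form is an assumption based on the measles@main notation the CLI prints. The analysis directory path is likewise just an example. The run helper prints each command instead of executing it unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Print each command instead of running it; set DRY_RUN=0 to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pathogen="measles"
released="v1"  # hypothetical released version name

# Latest version (the default), then a specific version alongside it.
run nextstrain setup "$pathogen"
run nextstrain setup "$pathogen@$released"

# Assumed form: run the specific version into its own analysis directory,
# so its output can be compared against a run of the default version.
run nextstrain run "$pathogen@$released" phylogenetic "/tmp/example-analysis-$released"
```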

Over time, you can update the pathogen with nextstrain update (e.g. nextstrain update measles). Pathogens have both released versions (e.g. v1), which don't change after release, and unreleased development versions (e.g. main), which change continually. Updates move between released versions (e.g. v1 → v2) and happen in place for development versions.

You can see what pathogens (and what versions) you have set up with nextstrain version --pathogens. (Adding the --verbose flag shows more details.)
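A periodic maintenance routine built from these two commands might look like the sketch below. It assumes both measles and zika were previously set up, and it prints the commands by default rather than running them. The run helper and the DRY_RUN variable are conveniences of this example, not part of Nextstrain CLI.

```shell
#!/bin/sh
set -eu

# Print each command instead of running it; set DRY_RUN=0 to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Pathogens named in this post as supporting nextstrain run.
for pathogen in measles zika; do
    run nextstrain update "$pathogen"
done

# List what is set up afterwards, with extra detail.
run nextstrain version --pathogens --verbose
```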

Compared to nextstrain build, this new nextstrain run command is a higher-level interface to running pathogen workflows that provides benefits like concurrent independent runs, versioning, and updates while not requiring knowledge of Git or management of pathogen repositories and source code. For active authorship and development of workflows, the nextstrain build command remains more suitable for many tasks. (For now!)

It's still early days: right now, only measles and zika support nextstrain run, and even for those the overall configurability is limited. Still, you can try the basics out yourself by following the example shell session at the end of this post.

Support in more pathogens is coming, along with increased configurability and ease of adding your own data. We're also working on standardizing and documenting the workflow interfaces (e.g. config and inputs).

All of this work is part of our broader “workflows as programs” endeavor where we’re making our complex pathogen workflows easier to run with your data and more like typical, mature bioinformatics programs. nextstrain run is a big step forward in that direction, but there's still lots of work to do. You can follow along with much of the nitty-gritty of development in our public tracking issue. If you have questions or comments, please feel free to post to our discussion forum, where many members of the Nextstrain team participate.


$ nextstrain setup measles
Setting up measles@main…
[…]
All good!  Set up of measles@main complete.

$ nextstrain version --pathogens
Nextstrain CLI 10.2.0 (standalone)

Pathogens
  measles
    measles@main (default)

$ mkdir /tmp/example-analysis
$ nextstrain run measles phylogenetic /tmp/example-analysis
Running the 'phylogenetic' workflow for pathogen measles@main
[…]

$ tree /tmp/example-analysis
/tmp/example-analysis
├── auspice
│   ├── measles_genome.json
│   ├── measles_genome_tip-frequencies.json
│   ├── measles_N450.json
│   └── measles_N450_tip-frequencies.json
├── data
│   └── […]
└── results
    ├── genome
    │   ├── aa_muts.json
    │   ├── aligned.fasta
    │   ├── nt_muts.json
    │   ├── tree.nwk
    │   └── […]
    └── N450
        └── […]

6 directories, 28 files

$ nextstrain view /tmp/example-analysis

——————————————————————————————————————————————————————————————————————————————
    The following datasets and/or narratives should be available in a moment:
       • http://127.0.0.1:4000/measles/genome
       • http://127.0.0.1:4000/measles/N450
——————————————————————————————————————————————————————————————————————————————
[…]

All source code is freely available under the terms of the GNU Affero General Public License. Screenshots may be used under a CC-BY-4.0 license and attribution to nextstrain.org must be provided.

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. Special thanks to Kristian Andersen, Josh Batson, David Blazes, Jesse Bloom, Peter Bogner, Anderson Brito, Matt Cotten, Ana Crisan, Tulio de Oliveira, Gytis Dudas, Vivien Dugan, Karl Erlandson, Nuno Faria, Jennifer Gardy, Nate Grubaugh, Becky Kondor, Dylan George, Ian Goodfellow, Betz Halloran, Christian Happi, Jeff Joy, Paul Kellam, Philippe Lemey, Nick Loman, Duncan MacCannell, Erick Matsen, Sebastian Maurer-Stroh, Placide Mbala, Danny Park, Oliver Pybus, Andrew Rambaut, Colin Russell, Pardis Sabeti, Katherine Siddle, Kristof Theys, Dave Wentworth, Shirlee Wohl and Cecile Viboud for comments, suggestions and data sharing.
