Standardized Multiple Inputs

The Nextstrain team

The Nextstrain team is continuing our push to make it easier to run pathogen workflows with user data and user config, as outlined in our planned directions for 2025. We have standardized the configuration parameters for defining multiple inputs to phylogenetic workflows and have updated our pathogen-repo-guide with the latest guidance.

Our phylogenetic workflows define the default inputs for Nextstrain builds under the inputs parameter. These usually link to Nextstrain-curated data produced by our ingest workflows. Custom builds are then expected to add their own inputs under additional_inputs in the config.yaml.

inputs:
  - name: nextstrain
    metadata: "s3://nextstrain-data/files/workflows/<pathogen>/metadata.tsv.zst"
    sequences: "s3://nextstrain-data/files/workflows/<pathogen>/sequences.fasta.zst"

additional_inputs:
  - name: private
    metadata: "data/private_metadata.tsv"
    sequences: "data/private_sequences.fasta"

If you would like to run the phylogenetic workflow without the Nextstrain inputs, you can set the inputs parameter yourself to override them completely.

inputs:
  - name: private
    metadata: "data/private_metadata.tsv"
    sequences: "data/private_sequences.fasta"
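To make the difference between the two parameters concrete, here is a minimal Python sketch of how a workflow might combine them. It is an illustration only, not the actual Nextstrain implementation: the function name resolve_inputs and the duplicate-name check are assumptions, and a real workflow would go on to download and merge the files listed in each entry.

```python
# Hypothetical sketch: resolve_inputs and its duplicate-name check are
# assumptions for illustration, not part of any Nextstrain workflow.

def resolve_inputs(config: dict) -> list[dict]:
    """Return the entries in `inputs` followed by those in `additional_inputs`."""
    combined = list(config.get("inputs") or []) + list(config.get("additional_inputs") or [])
    names = [entry["name"] for entry in combined]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"Duplicate input names: {', '.join(duplicates)}")
    return combined


# The default Nextstrain input, as in the workflow's own config.
nextstrain_input = {
    "name": "nextstrain",
    "metadata": "s3://nextstrain-data/files/workflows/<pathogen>/metadata.tsv.zst",
    "sequences": "s3://nextstrain-data/files/workflows/<pathogen>/sequences.fasta.zst",
}
# A custom input supplied by the user.
private_input = {
    "name": "private",
    "metadata": "data/private_metadata.tsv",
    "sequences": "data/private_sequences.fasta",
}

# Case 1: keep the defaults and add private data via additional_inputs.
extended = {"inputs": [nextstrain_input], "additional_inputs": [private_input]}
extended_names = [entry["name"] for entry in resolve_inputs(extended)]

# Case 2: override inputs entirely, so only private data is used.
overridden = {"inputs": [private_input]}
overridden_names = [entry["name"] for entry in resolve_inputs(overridden)]

print(extended_names)
print(overridden_names)
```

In this sketch, the first case yields both the nextstrain and private inputs, while the second yields only the private input, because the user's inputs list replaces the default one rather than extending it.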

We will be updating our existing pathogen workflows to use the standardized parameters, and you can track our progress in our public tracking issue. Pathogen workflows that currently support multiple inputs with the standard parameters are avian influenza, West Nile virus, and Zika. The input configuration for the SARS-CoV-2 workflow already supports multiple inputs with additional features, so we will not be updating it to conform to the new standard.

If you have questions or comments, please feel free to post to our discussion forum or create an issue in a pathogen GitHub repository.

All source code is freely available under the terms of the GNU Affero General Public License. Screenshots may be used under a CC-BY-4.0 license and attribution to nextstrain.org must be provided.

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. Special thanks to Kristian Andersen, Josh Batson, David Blazes, Jesse Bloom, Peter Bogner, Anderson Brito, Matt Cotten, Ana Crisan, Tulio de Oliveira, Gytis Dudas, Vivien Dugan, Karl Erlandson, Nuno Faria, Jennifer Gardy, Nate Grubaugh, Becky Kondor, Dylan George, Ian Goodfellow, Betz Halloran, Christian Happi, Jeff Joy, Paul Kellam, Philippe Lemey, Nick Loman, Duncan MacCannell, Erick Matsen, Sebastian Maurer-Stroh, Placide Mbala, Danny Park, Oliver Pybus, Andrew Rambaut, Colin Russell, Pardis Sabeti, Katherine Siddle, Kristof Theys, Dave Wentworth, Shirlee Wohl and Cecile Viboud for comments, suggestions and data sharing.
