Using Pathoplexus as a data source for Nextstrain builds

Nextstrain team

Pathoplexus (PPX) is a new database for timely sharing of viral sequence data that was launched one year ago (disclosure: some of us at Nextstrain are involved in Pathoplexus). The goal of Pathoplexus is to facilitate the sharing of virus genomes in a way that ensures that those involved in collecting the samples and generating the sequences get the recognition they deserve, while maximizing the utility of the data to advance public health. Sharing data via PPX is very easy, and submitters can choose to share their data under either "OPEN" or "RESTRICTED" use terms. Open sequences are submitted to INSDC (ENA/NCBI/DDBJ) immediately. Restricted-use data are made available on Pathoplexus right away for public health and surveillance purposes, but publishing using these data is generally prohibited for up to one year (see Data use terms for details).

Pathoplexus provides a modern API to the data, meaning that it is straightforward and fast to retrieve (and submit) data in an automated way. Nextstrain has started using Pathoplexus as the data source for our automatically updated analyses of RSV. The information on each sample in our trees now includes a link to that sample on Pathoplexus and an explicit statement of the data use terms associated with it, along with a link to the data use terms on Pathoplexus, as required by those terms.


The curated metadata and sequence files that we reshare on Nextstrain contain only the subset of the data that is available under "OPEN" data use terms. In most cases this subset is the same as the data available in NCBI Virus (synchronization delays can lead to small, temporary differences). The metadata table contains additional columns with the Pathoplexus accessions and links back to the data source on Pathoplexus. For the time being, we advise users interested in "RESTRICTED" use data to obtain them from Pathoplexus directly and to carefully consult the data use terms.
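As an illustration of working with such a metadata table, here is a minimal Python sketch that keeps only the rows marked as open. The column names (strain, ppx_accession, data_use_terms) and the example values are hypothetical placeholders chosen for this sketch, not the actual schema of the Nextstrain files; check the header of the real metadata file before adapting it.

```python
import csv
import io

# Hypothetical metadata excerpt. The column names and values are
# illustrative assumptions, not the real schema of the Nextstrain files.
EXAMPLE_TSV = (
    "strain\tppx_accession\tdata_use_terms\n"
    "sample_A\tPP_EXAMPLE1.1\tOPEN\n"
    "sample_B\tPP_EXAMPLE2.1\tRESTRICTED\n"
    "sample_C\tPP_EXAMPLE3.1\tOPEN\n"
)


def open_records(tsv_text, terms_column="data_use_terms"):
    """Return the rows of a TSV table whose data use terms are OPEN."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if row[terms_column].strip().upper() == "OPEN"
    ]


open_rows = open_records(EXAMPLE_TSV)
for row in open_rows:
    # Print the sample name next to its (hypothetical) accession.
    print(row["strain"], row["ppx_accession"])
```

A check like this can serve as a safeguard in a downstream workflow: if a row with other data use terms ever appears in the input, it is dropped before any further analysis.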

We are excited about the new ways of sharing pathogen sequence data that Pathoplexus provides and are looking into using Pathoplexus as a data source for other analyses, such as mpox or metapneumovirus.

All source code is freely available under the terms of the GNU Affero General Public License. Screenshots may be used under a CC-BY-4.0 license and attribution to nextstrain.org must be provided.

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. Special thanks to Kristian Andersen, Josh Batson, David Blazes, Jesse Bloom, Peter Bogner, Anderson Brito, Matt Cotten, Ana Crisan, Tulio de Oliveira, Gytis Dudas, Vivien Dugan, Karl Erlandson, Nuno Faria, Jennifer Gardy, Nate Grubaugh, Becky Kondor, Dylan George, Ian Goodfellow, Betz Halloran, Christian Happi, Jeff Joy, Paul Kellam, Philippe Lemey, Nick Loman, Duncan MacCannell, Erick Matsen, Sebastian Maurer-Stroh, Placide Mbala, Danny Park, Oliver Pybus, Andrew Rambaut, Colin Russell, Pardis Sabeti, Katherine Siddle, Kristof Theys, Dave Wentworth, Shirlee Wohl and Cecile Viboud for comments, suggestions and data sharing.
