2025-07-24

New Resources for Rubella

John SJ Anderson

We are pleased to announce the availability of a Rubella phylogenetic analysis and dataset on Nextstrain.org, as well as a Rubella Nextclade dataset. We provide phylogenetic trees based on both whole genome analysis, as well as analysis of the E1 region used in the WHO genotype reference. These trees are automatically rebuilt with new Rubella sequences as they are deposited in NCBI. We welcome feedback from Rubella researchers about any aspect of either the Nextclade dataset or the phylogenetic trees on Nextstrain.org.

# Phylogenetic analysis

Rubella virus (Rubivirus rubellae) is a positive-strand RNA virus with a 9,762 base genome. It causes rubella, a formerly common respiratorily-transmitted illness, primarily infecting children. When rubella infects a pregnant person and spreads to the fetus, it causes congenital rubella syndrome (CRS), which can lead to a wide variety of congenital defects. The earlier in pregnancy infection happens, the more severe these defects tend to be. Rubella has been effectively eradicated in many parts of the world due to successful vaccination programs. Available vaccines provide lifelong immunity after a single dose, usually administered as part of the tri-valent "MMR" vaccine, which also protects against measles and mumps.

The Rubella genome has two open reading frames (NSP and SP), the latter of which encodes three structural proteins, C (capsid), E1, and E2 (envelope proteins). E1 is the source of hemagglutin activity for the virus, and is the only region that is sequenced for many samples. It is also the basis of the WHO genotype reference, which we used for genotype definition in the Nextclade dataset. The WHO nomenclature classifies Rubella into have two large overarching clades (1 and 2), split into specific genotypes (1A-1J and 2A-2C) based on clades observed in trees built with reference samples. (See WHO genotype reference for further specifics.)

Figure 1. Phylogenetic tree of rubella virus whole genome. The phylogenetic tree is colored by genotype, based on reference strains from the WHO genotype reference. Available at nextstrain.org/rubella.

The Nextstrain phylogenies support coloring by both genotypes assigned via the Nextclade dataset, as well as genotypes extracted from parsing strain names and other metadata present in GenBank records. These two methods are broadly, but not 100%, congruent with each other. We also have included a coloring for the specific WHO reference strains, to facilitate visualizing how these strains are positioned in the broader genomic and E1 gene trees.

# Nextclade dataset

We developed an initial version of the Nextclade dataset based strictly on the 32 reference samples from the WHO genotype reference. This initial dataset did not do an adequate job of assigning genotypes to the full trees, so we revised it with the addition of 32 other samples. The genotypes for these samples were based on both annotations in GenBank, and their position in the phylogenetic trees relative to the WHO strains. In general, we tried to select additional samples that were more basal to the WHO samples in the tree, with the intention of capturing additional genotype diversity.

The 1A genotype, as noted in the original WHO nomenclature publication, is polyphyletic. In particular, we observed a clade of approximately 22 samples in the whole genome tree, which appear to reflect more recent vaccine-derived 1A strains, which have a more distal placement in the tree relative to most of the original 1A reference strains. We have elected to make the Nextclade dataset's definition of the 1A genotype based on this more derived clade, rather than the more basal original samples, which do not group with any more recent samples in either the whole genome or E1 region specific trees.

# Nextstrain resources

We curate sequence data and metadata from NCBI as the starting point for our analyses. Curated sequences and metadata are available as flat files at:

# Acknowledgments & request for comments

We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Additionally, we welcome comments or suggestions from rubella researchers on how to improve these Nextstrain datasets for their use case.

Blog Archives

Joint response to GISAID regarding termination of SARS-CoV-2 data feeds (2026-02-24)
Norovirus resources (2026-01-14)
Tuberculosis resources (2025-11-17)
Interruption to GISAID-based SARS-CoV-2 analyses (2025-11-06)
Standardized Multiple Inputs (2025-09-29)
Mumps resources (2025-09-24)
PPX and Nextstrain (2025-08-28)
Rubella resources (2025-07-24)
Upgrade to Snakemake v9 (2025-07-17)
Introducing nextstrain run (2025-06-13)
Conda runtime back in sync (2025-05-12)
Conda runtime out-of-sync (2025-04-22)
Annual Update March 2025 (2025-03-31)
Rabies, Lassa, and Yellow Fever Virus Resources (2025-02-21)
HMPV Phylogenetic Analysis and Resources (2025-01-09)
Notable changes in Auspice (2024-12-16)
Externally-caused Auspice performance issues (2024-12-10)
Oropouche Phylogenetic Analysis and Resources (2024-10-22)
Augur v25 features (2024-09-03)
H5N1 Cattle Outbreak Analysis and Resources (2024-06-18)
New Resources for Measles (2024-06-12)
Annual Update March 2024 (2024-03-27)
SARS-CoV-2 clade naming 2022 (2022-04-29)
Ncov Open Announcement (2021-07-08)
Updated SARS-CoV-2 Clade Naming (2021-01-06)
SARS-CoV-2 Clade Naming (2020-06-02)
Using Narratives To Explain West Nile Virus Spread (2019-10-31)
Auspice v2 (2019-10-21)
New Nextstrain Website (2018-05-14)

Resources

Tools

Support

About

Hadfield et al., Nextstrain: real-time tracking of pathogen evolution, Bioinformatics (2018)

The core Nextstrain team is

Please see the team page for more details.

All source code is freely available under the terms of the GNU Affero General Public License. Screenshots may be used under a CC-BY-4.0 license and attribution to nextstrain.org must be provided.

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. Special thanks to Kristian Andersen, Josh Batson, David Blazes, Jesse Bloom, Peter Bogner, Anderson Brito, Matt Cotten, Ana Crisan, Tulio de Oliveira, Gytis Dudas, Vivien Dugan, Karl Erlandson, Nuno Faria, Jennifer Gardy, Nate Grubaugh, Becky Kondor, Dylan George, Ian Goodfellow, Betz Halloran, Christian Happi, Jeff Joy, Paul Kellam, Philippe Lemey, Nick Loman, Duncan MacCannell, Erick Matsen, Sebastian Maurer-Stroh, Placide Mbala, Danny Park, Oliver Pybus, Andrew Rambaut, Colin Russell, Pardis Sabeti, Katherine Siddle, Kristof Theys, Dave Wentworth, Shirlee Wohl and Cecile Viboud for comments, suggestions and data sharing.

Nextstrain is supported by