Updated Nextstrain SARS-CoV-2 clade naming strategy

Trevor Bedford, Emma B Hodcroft, Richard A Neher

The emerging 501Y.V1 and 501Y.V2 variants have pushed the Nextstrain team to revise our strategy for Nextstrain clade labels. Here, we propose a backwards-compatible update to make clades more adaptable to the continuing pandemic situation, and more useful to people working on the pandemic today.

In June we put forth an initial Nextstrain clade naming strategy. This basic strategy of flat "year-letter" names was born out of work with seasonal influenza, where nested names such as 3c2.A1b can become unwieldy. In the “year-letter” scheme, the year makes it easy to know what's being discussed in ~5 years when, for example, clade 20A is referenced. Our June strategy called for naming a clade when it reached >20% global frequency for more than 2 months.

However, as the pandemic progressed, reduced international travel meant that no clades beyond the initial clades 20A, 20B and 20C made it past 20% global frequency. Instead, we've seen "regional" clades that hit appreciable frequency in different continent-level regions of the world. One example is 20A.EU1, which has risen to high frequency in Europe in particular. When clusters like 20A.EU1 and 20A.EU2 were originally described in October, it seemed that labeling based on regional vs global circulation was of benefit. However, the emergence of fast-spreading "variants" has made it clear that a region-based naming system has drawbacks when spread rapidly becomes more global.

Additionally, we recognize that complex, unintuitive names lead to geographic-based terms like "UK variant", which can be harmful to the country involved. Therefore, having relatively simple, intuitive official names without geography is important. Finally, if 20A.EU1 did expand to >20% global frequency, it would be confusing to relabel it from 20A.EU1 to, for example, 20E.

Consequently, we propose an updated strategy, where major (year-letter) clades are named when any of the following criteria are hit:

  1. A clade reaches >20% global frequency for 2 or more months
  2. A clade reaches >30% regional frequency for 2 or more months
  3. A VOC ('variant of concern') is recognized (applies currently to 501Y.V1 and 501Y.V2)
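The three criteria above can be read as a simple decision rule. The following is a minimal sketch of that reading, not the Nextstrain implementation: the function names, the use of monthly frequency series, and the interpretation of "2 or more months" as consecutive months are all assumptions made for illustration.

```python
from typing import Dict, Sequence

GLOBAL_THRESHOLD = 0.20    # criterion 1: >20% global frequency
REGIONAL_THRESHOLD = 0.30  # criterion 2: >30% regional frequency
MIN_MONTHS = 2             # "2 or more months" (assumed to mean consecutive)


def sustained_above(monthly_freqs: Sequence[float], threshold: float,
                    min_months: int = MIN_MONTHS) -> bool:
    """Return True if frequency exceeds threshold for min_months months in a row."""
    run = 0
    for freq in monthly_freqs:
        run = run + 1 if freq > threshold else 0
        if run >= min_months:
            return True
    return False


def qualifies_as_major_clade(global_freqs: Sequence[float],
                             regional_freqs: Dict[str, Sequence[float]],
                             is_voc: bool) -> bool:
    """Apply the three criteria in turn; any single one is sufficient."""
    if is_voc:  # criterion 3: recognized variant of concern
        return True
    if sustained_above(global_freqs, GLOBAL_THRESHOLD):  # criterion 1
        return True
    # criterion 2: any one region is enough
    return any(sustained_above(freqs, REGIONAL_THRESHOLD)
               for freqs in regional_freqs.values())
```

For instance, under these assumptions a clade that stays below 20% globally but holds 35% and then 40% in one region over two successive months would qualify through the regional criterion alone.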

This results in the updated clade definitions, and the resulting Nextstrain outputs can be seen at:

There are 9 major clades identified for 2020, 20A through 20I: the original 20A, 20B and 20C plus 6 new clades. Ordering is based on estimated TMRCA following our original proposal: "we propose to name major clades by the year they are estimated to have emerged and a letter, e.g. 19A, 19B, 20A." Clade 20E (EU1) is the elevated clade 20A.EU1, where the "EU1" parenthetical is retained to help connect these labels.
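As a rough illustration of how year-letter names follow from TMRCA ordering, the sketch below assigns letters within each year in order of estimated emergence. The function name and input shape are hypothetical, and any dates passed in are placeholders supplied by the caller; no actual TMRCA estimates are implied.

```python
from datetime import date
from string import ascii_uppercase
from typing import Dict


def year_letter_names(tmrca_by_clade: Dict[str, date]) -> Dict[str, str]:
    """Map arbitrary clade ids to year-letter names, ordered by estimated TMRCA."""
    names: Dict[str, str] = {}
    letters_used: Dict[int, int] = {}  # two-digit year -> letters already assigned
    for clade_id, tmrca in sorted(tmrca_by_clade.items(), key=lambda kv: kv[1]):
        yy = tmrca.year % 100
        index = letters_used.get(yy, 0)
        # Raises IndexError beyond 26 clades in one year; this sketch does not handle that.
        names[clade_id] = f"{yy:02d}{ascii_uppercase[index]}"
        letters_used[yy] = index + 1
    return names
```

Note that this only covers the initial assignment. It deliberately ignores the case discussed above, where an existing label such as 20A.EU1 is elevated and a parenthetical is kept to connect old and new names.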

Fig 1. Nextstrain 'global' run with the new Nextstrain major clades labelled.

Importantly, we propose to dual label major clades if they correspond to an emerging “variant of concern” (VOC), so that we have 20H/501Y.V2 and 20I/501Y.V1. We believe it's useful and informative to have the genetic short-hand for these, as it is self-documenting. These variants are labeled by the relevant spike mutation along with V1, V2, etc. to disambiguate. These disambiguating numbers are assigned in order of identification or announcement. As a hypothetical example, a future VOC bearing a hallmark spike 484K mutation would be labeled 484K.V1.

Along these lines, we had previously recommended labeling subclades within a major clade by specific nucleotide mutations, where, for example, 20G/1927C refers to viruses bearing nucleotide change 1927C within the 20G clade. We previously avoided using amino acids as this can draw attention to mutations whose effect is unknown. However, as the pandemic has progressed, it has become clear that spike mutations have become a normalized way to identify clusters. Given that amino acid mutations are often of greater familiarity and importance, we would like to extend this nomenclature to label subclades based on amino acid changes, where, for example, 20B/S.484K refers to viruses bearing amino acid change 484K in the spike protein within the 20B clade.
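The label forms mentioned so far (a bare major clade such as 20A, a VOC dual label such as 20H/501Y.V2, a nucleotide subclade such as 20G/1927C, and an amino acid subclade such as 20B/S.484K) are regular enough to parse mechanically. The sketch below shows one way to do so; the regular expressions, the `CladeLabel` type and the `parse_label` function are illustrative assumptions inferred from those four examples, not an official grammar.

```python
import re
from typing import NamedTuple, Optional


class CladeLabel(NamedTuple):
    major: str             # year-letter clade, e.g. "20H"
    kind: str              # "major", "voc", "amino_acid" or "nucleotide"
    detail: Optional[str]  # text after the slash, if any


MAJOR = r"(?P<major>\d{2}[A-Z])"

# Order matters: the more specific forms are tried first.
PATTERNS = [
    ("voc", re.compile(MAJOR + r"/(?P<detail>\d+[A-Z]\.V\d+)")),
    ("amino_acid", re.compile(MAJOR + r"/(?P<detail>[A-Za-z0-9]+\.\d+[A-Z*])")),
    ("nucleotide", re.compile(MAJOR + r"/(?P<detail>\d+[ACGT])")),
    ("major", re.compile(MAJOR)),
]


def parse_label(label: str) -> CladeLabel:
    """Split a clade label into its major clade and optional qualifier."""
    for kind, pattern in PATTERNS:
        match = pattern.fullmatch(label)
        if match:
            return CladeLabel(match.group("major"), kind,
                              match.groupdict().get("detail"))
    raise ValueError(f"unrecognized clade label: {label!r}")
```

Parenthetical forms such as 20E (EU1) are not handled here and would raise `ValueError`; a fuller parser would need an explicit rule for them.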

Moving forward, we aim to avoid relabeling. Making clusters of interest major clades more quickly avoids getting into another situation where we'd need to elevate 20A.EU1 to 20E, which adds confusion. However, we recognize that we may not be able to predict the importance of rising variants perfectly, and will retain flexibility in the system (for example parenthetical use) for situations where complete renaming might cause additional confusion. Our new proposal does require a relabeling of 20B/501Y.V1 to 20I/501Y.V1 and a relabeling of 20C/501Y.V2 to 20H/501Y.V2, but given they are primarily referred to by the last part of their names, we feel this is acceptable and will not cause undue confusion.

Going forward, we'll commit to keeping these major clades fresher. In this system we estimate that 7 clades are currently at >5% global frequency and that each region has between 2 and 5 clades currently circulating at >5% regional frequency. We feel this is a good resolution to capture important dynamics without being too overwhelming. Additionally, we hope this modified strategy will prevent us from having to relabel in the future.

At this moment, major clades from 2020 onwards are:

  • 20A: basal pandemic lineage bearing S 614G that's globally distributed
  • 20B: derived from 20A bearing N 203K, N 204R and ORF14 50N, also globally distributed
  • 20C: derived from 20A bearing ORF3a 57H and ORF1a 265I, also globally distributed
  • 20D: derived from 20B bearing ORF1a 1246I and ORF1a 3278S, concentrated in South America, southern Europe and South Africa
  • 20E: derived from 20A bearing N 220V, ORF10 30L, ORF14 67F and S 222V, concentrated in Europe
  • 20F: derived from 20B bearing ORF1a 300F and S 477N, concentrated in Australia
  • 20G: derived from 20C bearing ORF1b 1653D, ORF3a 172V, N 67S and N 199L, concentrated in the United States
  • 20H/501Y.V2: derived from 20C bearing S 80A, S 215G, S 484K, S 501Y, S 701V, concentrated in South Africa
  • 20I/501Y.V1: derived from 20B bearing S 501Y, S 570D, S 681H, ORF8 27*, concentrated in the United Kingdom
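The "derived from" relationships in this list form a small tree, which can be captured as a parent map. The sketch below encodes exactly the relationships stated in the list; the `PARENT` dictionary, the helper names, and the choice to treat 20A as the root (since the list describes it as basal) are assumptions of this illustration.

```python
from typing import Dict, List, Optional

# Parent of each 2020 major clade, as stated in the list above.
# 20A is treated as the root here because the list describes it as basal.
PARENT: Dict[str, Optional[str]] = {
    "20A": None,
    "20B": "20A",
    "20C": "20A",
    "20D": "20B",
    "20E": "20A",
    "20F": "20B",
    "20G": "20C",
    "20H": "20C",
    "20I": "20B",
}


def ancestry(clade: str) -> List[str]:
    """Return the path from a clade back to the root, starting with the clade itself."""
    path = [clade]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def is_descendant(clade: str, ancestor: str) -> bool:
    """True if `ancestor` lies strictly above `clade` in the tree."""
    return ancestor in ancestry(clade)[1:]
```

With this map, `ancestry("20I")` walks 20I, 20B, 20A, which matches the description of 20I/501Y.V1 as derived from 20B, itself derived from 20A.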

Fig 2. Schematic showing hierarchical relationships among clades. An interactive version of this diagram is available here.

All source code is freely available under the terms of the GNU Affero General Public License. Screenshots may be used under a CC-BY-4.0 license and attribution to nextstrain.org must be provided.

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. Special thanks to Kristian Andersen, Josh Batson, David Blazes, Jesse Bloom, Peter Bogner, Anderson Brito, Matt Cotten, Ana Crisan, Tulio de Oliveira, Gytis Dudas, Vivien Dugan, Karl Erlandson, Nuno Faria, Jennifer Gardy, Nate Grubaugh, Becky Kondor, Dylan George, Ian Goodfellow, Betz Halloran, Christian Happi, Jeff Joy, Paul Kellam, Philippe Lemey, Nick Loman, Duncan MacCannell, Erick Matsen, Sebastian Maurer-Stroh, Placide Mbala, Danny Park, Oliver Pybus, Andrew Rambaut, Colin Russell, Pardis Sabeti, Katherine Siddle, Kristof Theys, Dave Wentworth, Shirlee Wohl and Cecile Viboud for comments, suggestions and data sharing.
