Fine-scale patterns of SARS-CoV-2 spread from identical pathogen sequences

Data sources and preprocessing

Sequence data and metadata

We analysed 116,791 SARS-CoV-2 sequences from Washington state genomic sentinel surveillance system⁴⁵ sampled between 1 March 2021 and 31 December 2022. Sequence metadata are collated by the Washington State Department of Health and include sample collection date, symptom onset date, de-identified patient ID, county of home location, postcode of home location, age group (0–9 years, 10–19 years, 20–29 years, 30–39 years, 40–49 years, 50–59 years, 60–69 years, 70–79 years and 80 years and above) and vaccination status upon positive test. For patients with multiple sequences in the database (2,309 out of 114,306 patients), we restricted our analysis to the earliest sequence collected. Among these 114,306 sequences, the age information is missing for 1 sequence, the county information for 659 sequences and the postcode information for 1,011 sequences.

Consensus sequences are extracted from the GISAID EpiCoV database^46,47 and curated using the Nextstrain nCoV ingest pipeline⁴⁸. We discard sequences with undefined Nexstrain clade assignments (8 sequences out of 114,306). This leaves us with 114,298 sequences, with 114,297 sequences gathered from patients with known age, 113,639 sequences gathered from patients with known county of home location and 113,287 sequences gathered from patients with known postcode of home location. In total, 96% of sequences have coverage in greater than 90% of the genome. We match postcodes to zip code tabulation areas (ZCTAs). For postcodes that do not have a ZCTA with the same name, we manually match them by looking at ZCTA boundaries. All analyses at the postcode level use ZCTA metadata information. We extracted the postcodes of WA prison facilities online⁴⁹.

Pairwise genetic distances

We compute pairwise genetic distances between Washington state sequences with the ape R package⁵⁰ using Hamming distances. To avoid unnecessary computational costs, we compare only sequences belonging to the same Nextstrain clade⁵¹ and generate one distance matrix per clade. We do not expect the clade definition to impact the pairs of identical sequences that we identified as pairs of identical sequences should always belong to the same clade.

Generating pairs of identical sequences from a sequence data file is the most computationally expensive step in this analysis. To provide context, generating a distance matrix from 1,000 sequences takes 33 s, while 10,000 sequences takes 1 h 37 min, on 1 core of an Apple M2 chip. Generating the full distance matrix for the analysis set of 113,287 sequences took around 96 hours of compute time readily parallelized across a compute cluster. More efficient software tools can significantly bring that compute time down (for example, 1.14 h with pairsnp⁵²).

Workflow data

We use data describing the daily number of commuters between each WA county from the American Community Survey (2016–2020)²². This dataset provides the number of directed commuting flows between residence and workplace counties. We use the number of commuting flows between counties to compute the RR of commute between two regions (see below).

Mobile phone location data

We obtain mobile device location data from SafeGraph (https://safegraph.com/), a data company that aggregates anonymized location data from 40 million devices, or approximately 10% of the US population, to measure foot traffic to over 6 million physical places (points of interest (POIs)) in the United States. Following a previous study⁵³, we estimate movement within and between counties in Washington from January 2020 to June 2022, using SafeGraph’s Weekly Patterns dataset, which provides weekly counts of the total number of unique devices visiting a point of interest (POI) from a particular home census block group (CBG). POIs are fixed locations, such as businesses or attractions. A visit indicates that a device entered the building or spatial perimeter designated as a POI. The home location of a device is defined as its common nighttime (18:00–07:00) CBG for the past 6 consecutive weeks. We restrict our dataset to POIs that have been tracked by SafeGraph since December 2019. To measure movement within and between counties, we extract the home CBG of devices visiting POIs in each week and limit the dataset to devices with home CBGs in the county of a given POI (within-county movement) or with home CBGs in counties outside of a given POI’s county (between-county movement). To adjust for variation in SafeGraph’s device panel size over time, we divide each county’s census population size by the number of devices in SafeGraph’s panel with home locations in that county each month and multiply the number of weekly visitors by that value. For each mobility indicator, we sum adjusted weekly visits across POIs from March 2021 to June 2022. We use the number of visits between counties to compute the RR of movement from mobile phone data between two regions (see below).

To explore potential geographical biases in the mobility data, we divided the weekly number of devices residing in each county by the weekly number of devices residing in WA state (observed proportions) and compared these values to expected proportions based on county and state census population sizes during 2020–2022. SafeGraph’s panel consistently captured 2–5% of each county’s population, with strong correlations between device counts and census population sizes (Spearman’s ρ = 0.99; Supplementary Fig. 20). We estimated county-level bias as the observed proportion of devices tracked by SafeGraph in individual counties relative to WA state minus the expected proportion based on census population sizes. Annual bias estimates for individual counties ranged from −2.2% to 1.7%, with no clear trend of over- or under-representation by population size or urban–rural classification (Supplementary Fig. 21). Although the most populous counties in WA state tend to have greater absolute bias, large counties are both under-represented and over-represented in the SafeGraph dataset (Supplementary Fig. 21). For example, the most populous county in WA, King County, was slightly under-represented each year (−2.2% to −1.6% bias; green negative outlier in Supplementary Fig. 21), while three of the other top five largest counties (Clark, Pierce and Spokane) were slightly over-represented (1.1% to 1.7% bias; pink, blue and goldenrod positive outliers in Supplementary Fig. 21). Our method for estimating geographical bias is based on SafeGraph’s Google Co-Lab Notebook on Quantifying Bias⁵⁴.

Social-contact data

We use synthetic social contact data for WA generated previously³⁴ based on reconstructing synthetic populations of interacting individuals using WA population demographics. They describe the per-capita probability for an individual of age i of interacting with someone of age j during a day.

Quantifying connectivity between groups

From genetic data

To quantify connectivity from genetic data, we compute the RR for sequences separated by a given genetic distance of being in given subgroups of the population. Let n denote the number of sequences included in the study and H_i,j the Hamming distance between sequences indexed by i and j. Let S_i denote the subgroup of sequence i. We introduce $n_A,B^d$ the number of Hamming distance matrix elements (excluding the diagonal) equal to d where sequence i belongs to group A and sequence j belongs to group B.

$$n_A,B^d=\mathop\sum \limits_i=1^n\mathop\sum \limits_j=1^n1_\i\ne j\\,1_\H_i,j=d\\,1_\S_i=A\\,1_\S_j=B\$$

where X → 1_X is the indicator function which is equal to 1 if X is true and 0 otherwise.

Let $n_A,\bullet ^d=\sum _B\,n_A,B^d$ and $n_\bullet ,\bullet ^d=\sum _A\,\sum _B\,n_A,B^d$.

We derive the RR $RR_A,B^d$ for sequences separated by a genetic distance d of being observed in subgroups A and B compared to what is expected from the sequencing effort in the different subgroups of the population as:

$$RR_A,B^d=\fracn_A,B^d/n_A,\bullet ^dn_B,\bullet ^d/n_\bullet ,\bullet ^d=\fracn_A,B^d\times n_\bullet ,\bullet ^dn_A,\bullet ^d\times n_B,\bullet ^d$$

(1)

The numerator $n_A,B^d/n_A,\bullet ^d$ corresponds to the proportion of the pairs where the subgroup of sequence i is A that are occurring with the B group.

The denominator $n_B,\bullet ^d/n_\bullet ,\bullet ^d$ is a normalization factor quantifying the contribution of group B to the total number of pairs separated by a Hamming distance of d. The ratio between these two quantities therefore quantifies the extent to which pairs of sequences observed in groups A and B are enriched compared with the number of sequences observed in these groups.

We used a subsampling strategy to compute CIs around these RRs. Bootstrapping (random sampling with replacement) would result in comparing sequences with themselves and therefore lead to biased upwards RRs of observing identical sequences in the same group. To avoid this, we used a subsampling strategy (random sampling without replacement) with a 80% subsampling rate (1,000 replicate subsamples).

We provide the tools to compute this RR metric from user-provided sequence and metadata files in the GitHub repository associated with this Article^55,56.

From mobility data

To quantify connectivity from mobility data, we compute the RR of movement between two geographical locations. Both the mobile phone and commuting mobility data provide directed flows between WA counties. Let w_A→B denote the number of commuters reported in the commuting data (respectively the number of visits for the mobile phone mobility data) whose home residence is in county A and who work in county B (respectively for which a visit in county B is reported). We compute the total movement flow between counties A and B as:

$$w_A,B=w_A\to B+w_B\to A$$

We then calculate the RR $RR_A,B^\rmm\rmo\rmb\rmi\rml\rmi\rmt\rmy$ of movement between counties A and B as:

$$RR_A,B^\rmm\rmo\rmb\rmi\rml\rmi\rmt\rmy=\fracw_A,B\times w_\bullet ,\bullet w_A,\bullet \times w_B,\bullet $$

where w_X,_• = ∑_Yw_X,Y and w_•,• = ∑_X,Yw_X,Y.

We compute a similar statistic by aggregating counties at the regional level (Supplementary Fig. 8).

From social contact data

To quantify connectivity from social contact data, we compute the RR of contact between two age groups. Mistry et al.³⁴ estimated the average daily number of contacts M_i,j that individuals of age i have with individuals of age j (considering one-year age bins). As we are interested in the age groups available in the sequence metadata, we reconstruct the average daily number of contacts c_A,B that individuals within age group A have with individuals in age group B as:

$$c_A,B=\frac\sum _i\in A\sum _j\in BM_i,j\times n_i\sum _i\in An_i$$

where n_i is the number of individuals of age i. We can then derive the total daily number of contacts between age groups A and B as Γ_A,B = c_A,B × N_A where N_A is the number of individuals in age group A. We then compute the RR $RR_A,B^\rmc\rmo\rmn\rmt\rma\rmc\rmt\rms$ for a contact of occurring between age groups A and B compared to what we expect if contacts were occurring at random in the population as:

$$RR_A,B^\rmc\rmo\rmn\rmt\rma\rmc\rmt\rms=\frac\Gamma _A,B\times \Gamma _\bullet ,\bullet \Gamma _A,\bullet \times \Gamma _B,\bullet $$

where Γ_A,• is the total daily number of contacts involving individuals within age group A and Γ_•,• is the total daily number of contacts in the population.

Studying directionality in transmission

We use the timing of sequence collection to understand directionality in transmission.

From sequence collection dates

We introduce t_x as the time at which the sequence x was collected. Let I_A,B denote the ensemble of pairs of identical sequences observed in groups A and B.

$$I_A,B=\ $$

therefore denotes the subset of these pairs with different sequence collection dates. We compute the proportion p_A→B of pairs consistent with the transmission direction A → B as:

$$p_A\to B=\frac\#(\ t_a < t_b\)\#(I_A,B)$$

where #(X) is the cardinal of X.

We also report 95% binomial CIs around these proportions.

From symptom onset dates

The delay between infection and sequence collection can be impacted by healthcare-seeking behaviours and access to testing, which might vary across age groups, geographical locations and time periods. If the distribution of the delay until testing differs between two subgroups A and B, the proportion of pairs of identical sequences p_A→B that are first collected in group A will both reflect the timing of infections and healthcare-seeking behaviours. When available, symptom onset dates should be less impacted by healthcare-seeking behaviours.

Among the 113,638 SARS-CoV-2 sequences with associated age group and county of home location information, symptom onset dates are available for 34,167 of them (30%). The availability of symptom onset information is susceptible to be impacted by individual demographic profiles (such as age), which could result in sequences with symptom onset information not being representative of all the sequences available. To avoid this, we impute missing symptom onset dates based on the empirical delay distribution between symptom onset and sequence collection (computed from individuals with known symptom onset dates) stratified by age group, time period and EWA/WWA region (Supplementary Fig. 22). Out of the sequences with known symptom onset dates, the absolute value of the delay between sequence collection and reported symptom onset is strictly greater than 30 days for 192 of them (<0.6%). We discarded these sequences in the computation of the symptom onset to sequence collection delay and considered that they were equivalent to sequences with missing symptom onset information (and therefore imputed their symptom onset dates). We generate 1,000 imputed datasets. For each of these imputed datasets, we compute the proportion $p_A\to B^\rmsympto$ of pairs with symptom onset dates occurring first in group A among pairs of identical sequences in groups A and B with distinct symptom onset dates. We then report the median across these 1,000 imputed datasets. We also generate a measure of uncertainty by computing on each of the imputed datasets 95% binomial CIs around the proportion $p_A\to B^\rmsympto$. We then report an uncertainty range around these proportion by using the minimum lower bound of the 95% CI and the maximum upper bound of the 95% CI across the imputed datasets.

Spatial analyses

Spatial extent of clusters

We reconstruct clusters of identical sequences from the pairwise genetic distance matrices¹³. Supplementary Fig. 23 depicts the typical size and duration of these clusters of identical sequences throughout the study period. To assess the spatial and temporal signal in clusters of identical sequences, we evaluate how the spatial extent of a cluster (summarized by its radius) evolves over time. For each cluster, we define primary sequences as the cluster’s earliest collected sequence. We then define the cluster’s primary ZCTAs as the ZCTAs of its primary sequences. We exclude clusters with ambiguous primary ZCTA (several primary ZCTAs) from this analysis. We define the radius of a cluster at a given time as the maximum distance between the primary ZCTA and the ZCTAs of the sequences collected by that time. We also compute the time required for sequences to be collected outside the primary ZCTA and primary county (using a similar definition as for the primary ZCTA). We report the mean cluster radius and the proportion of clusters remaining within the same geographical unit (ZCTA and county) as a function of the time since collection of the first sequence within a cluster. We generate 95% CI using a bootstrap approach with 1,000 replicates.

We compare the observed cluster radius and the observed proportion of clusters remaining within the same geographical unit to those expected from a null distribution assuming no spatial dependency between sequences within a cluster of identical sequences. We simulate a null distribution by randomly permuting the geographical locations of the WA sequences and recomputing our statistics of interest (cluster radius, proportion of clusters within the same county and proportion of clusters within the same ZCTA).

Impact of counties’ adjacency status

We compare the RRs of observing identical sequences between two counties depending on counties’ adjacency status (within the same county, between adjacent counties and between non-adjacent counties) using Wilcoxon signed-rank tests.

Impact of distance between counties

We examine how the RRs of observing identical sequences in two distinct counties compare with the geographical distance between counties’ centroids. We summarize this trend by reporting the LOESS curves with 95% CIs between the log RRs and the distance in kilometres.

Mapping using MDS

We evaluate the extent to which patterns of association obtained when looking at the location of pairs of identical sequences are consistent with global spatial structure. To do so, we performed non-metric MDS (NMDS) based on the matrix of RR of observing pairs of identical sequences between two counties. We restrict our analysis to the subset of counties for which there was always at least five pairs of identical sequences observed with the other counties in the subset. This is done to remove potential noise associated with low number of pairs observed. As the NMDS algorithm requires a measure of similarity between counties, we define the similarity s_A,B between counties A and B as:

$$s_A,B=\rme^-RR_A,B^$$

We perform two-dimensional NMDS using the vegan R package⁵⁷.

Transmission direction: EWA and WWA

We evaluate whether the timing of identical sequence collection is consistent with transmission rather occurring from WWA to EWA or EWA to WWA. We define four time periods corresponding the four epidemic waves experienced by WA during our study period (Supplementary Fig. 24): wave 4, March 2021–June 2021; wave 5, July 2021–November 2021; wave 6, December 2021–February 2022; and wave 7, March 2022–August 2022. For each of these time periods, we compute the proportion of pairs of identical sequences first collected in EWA among pairs of identical sequences observed in both EWA and WWA that were not collected on the same day. We report 95% binomial proportion CI around these proportions.

To examine whether our conclusions could be explained by differences in testing behaviours between EWA and WWA, we conduct a sensitivity analysis by imputing the date of symptom onset.

Mobility analyses

Relationship with mobility data

We compute the Spearman correlation coefficient between the RR of observing identical sequences between two counties and the RR of movement between two counties (both from mobile phone derived and commuting mobility data) as well as the geographical distance between counties’ centroids. We determine the percentage of variance in the genetic data explained by the mobility data by fitting GAMs predicting the RR of observing identical sequences based on the RR of movement between two counties, both on a logarithmic scale, using a thin-plate smoothing spline with 5 knots. For the GAM analyses, we remove pairs of counties for which the number of pairs of identical sequences or the total mobility flow is equal to 0, which ensures that both the RR of observing identical sequences and the RR of movement are strictly positive. We also fit a GAM between the RR of observing identical sequences between two counties (on a logarithmic scale) and the distance between counties centroids. We repeat these analyses at the regional level, instead of at the county level.

Outliers from that relationship

We define outliers in the relationship between genetic and mobility data as pairs of counties for which the absolute value of the scaled Pearson residuals of the GAM is greater than 3. As we expect RRs computed from a low number of pairs of identical sequences to be more noisy, we focus on pairs of counties for which at least 100 pairs of identical sequences are observed throughout the study period.

Spread between male prison postcodes

Centrality analysis

We characterize transmission between the ten postcodes with male state prisons by performing a network centrality analysis. We consider a network with ten nodes corresponding to these different postcodes. We define the weight of each edge as the RR of observing identical sequences between the two postcodes that define the nodes connected by this edge. This results in a fully connected network. For each node (postcode with a male state prison), we compute eigenvector centrality scores using the R igraph package. This centrality score measures a node’s influence in the network: nodes have higher scores when they are connected to other influential nodes.

Shared big clusters between postcodes

We define large clusters of identical sequences in the prison networks as clusters of identical SARS-CoV-2 sequences (1) that are observed in at least two postcodes with male prisons and (2) with at least 15 sequences in prison postcodes.

Age analyses

Relationship with social contact data

We quantify the association between the RR $RR_A,B^$ of observing identical sequences between two age groups A and B and the RR of contacts $RR_A,B^\rmc\rmo\rmn\rmt\rma\rmc\rmt\rms$ between these two groups using a GAM on a logarithmic scale. We report the percentage of variance in the RR of identical sequences explained by the RR of contact from the GAM. We also report the Spearman correlation coefficient between $RR_A,B^$ and $RR_A,B^\mathrmcontacts$.

Age-specific transmission across scales

To understand how age-specific transmission patterns vary across spatial scales, we compare the RR of observing identical sequences between age groups using all pairs of identical sequences, using only pairs of identical sequences in different postcodes and using only pairs of identical sequences in different counties.

Typical direction of spread between ages

We use sequence collection dates to explore transmission direction between age groups across four periods (March 2021–June 2021, July 2021–November 2021, December 2021–February 2022 and March 2022–August 2022). To facilitate the interpretation of these results, we introduce an earliness score that measures the contribution of a given age group to transmission to other age groups. For an age group A, this score is equal to the proportion of pairs of identical sequences first observed in age group A among all pairs of identical sequences observed in age groups A. We also report the 95% binomial CI around this score.

To explore whether our conclusions could be explained by differences in testing behaviours between age group, we conduct a sensitivity analysis by imputing the dates of symptom onset and using symptom onset dates instead of sequence collection ones (Supplementary Fig. 15). We also compute earliness scores on each of the 1,000 datasets with imputed symptom onset dates using the same definition as that based on sequence collection dates. We then report the median earliness score across all 1,000 datasets as well as an uncertainty range defined as the minimum lower bound and the maximum upper bound of the 95% binomial CI around this score for each of the imputed dataset.

RR between vaccination groups

Available matched patient information include details regarding the individuals’ vaccination statuses upon positive test: No valid vaccination record (denoted unvaccinated); completed primary series (denoted vaccinated); and completed primary series with an additional dose (denoted boosted).

Here, we use this information to quantify mixing between groups characterized by their vaccination status. We focus on the mixing between vaccination groups within age groups to avoid biases coming from age group and vaccination status being correlated. Among sequences collected within each period (4 waves) and age groups in decade, we compute the RR of observing identical sequences between vaccination groups. We included the boosted vaccination group for wave 6 (Omicron BA.1 wave) only for age groups older than 10 years, and included the boosted vaccination group for wave 7 for the 0–9 year age group. We included the 0–9 year age group in our analysis only from wave 6 (Omicron BA.1 wave) and the 10–19 year age group from wave 5 (Delta wave).

To quantify the tendency of individuals of transmitting to individuals with the same vaccination status, we compute for each vaccination groups (V₁, V₂) the ratio $RR_V_1,V_2/RR_V_1,V_1$. Values lower than 1 indicate that the enrichment of pairs of identical sequences is greater within the same vaccination group than between different vaccination groups. Such values suggest assortativity in mixing patterns between vaccination groups.

Sensitivity analysis: direction analysis

In the former paragraphs, we describe an approach based on the timing of pairs of identical sequences to better understand the typical transmission direction between groups. The interpretation of this pair-based analysis is complicated by several factors. First, clusters of identical sequences can span more than two groups. Second, even in instances where clusters only span two groups, counting pairs could improperly capture transmission direction, for example if local transmission of the cluster is occurring within the two groups. We implemented this pair-based approach as an intuitive exploration of whether the timing of sampling of identical sequences might provide some signal about transmission direction. This pair-based approach is crude but yet interesting as we do expect groups that tend to be the source to more often be observed first within clusters or pairs of identical sequences. As a sanity check, we perform a sensitivity analysis relying on clusters of identical sequences that are observed only within two groups. We define the source group as the group of the earliest collected sequence within the cluster of identical sequences. We remove ambiguous clusters, meaning clusters with two potential source groups, from the analysis. For clusters observed in groups A and B, we compute the proportion of clusters with source group A. We refer to this proportion as ‘proportion from clusters’ to distinguish it from the ‘proportion from pairs’ that we use in our main analysis. We compute the 95% CI around the proportion from clusters. This proportion from clusters should be more robust that the proportion from pairs but tends to be noisier as we are computing the proportion from less observations. We then compare the proportion obtained from pairs and from clusters. We compute the Spearman correlation coefficient between these two proportions using all pairs of groups or only pairs of groups for which the CIs do not contain 50% for the two proportions.

Link between mutations and generations

In this section, we derive the probability distribution of the number of mutations M_AB separating the consensus genomes of two infected individuals A and B conditional on the number of transmission generations G_AB separating them.

Generation time distribution

We assume that the generation time (that is, the average duration between the infection time of an infector and an infectee) follows a Gamma distribution of shape α and scale β. The time between g generations then follows a Gamma distribution of shape g × α and scale β assuming independence of successive transmission events. Let f_α,β(⋅) denote the probability density function of a Gamma distribution of shape α and scale β.

Mutations events

Let M_AB denote the number of mutations separating their infecting viruses. Let μ denote the mutation rate of the virus (in mutations per day). Let $T_AB^\rmevo$ denote the evolutionary time separating A and B (in days).

Assuming a Poisson process for the occurrence of mutations, we have:

$$M_AB \sim \mathcalP(\mu \times T_AB^\rmevo)$$

Probability distribution

Let G_AB denote the number of generations separating two infected individuals A and B belonging to the same transmission chain.

$$\beginarraylP[M_AB=m| G_AB=g]=\int _t_AB^\rmevo=0^+\infty P[M_AB=m| G_AB=g,T_AB^\rmevo=t_AB^\rmevo]\\ \,\,\,\,\times p(t_AB^\rmevo| G_AB=g)\,\rmdt_AB^\rmevo\\ \,\,\,=\int _t_AB^\rmevo=0^+\infty P[M_AB=m| T_AB^\rmevo=t_AB^\rmevo]\times f_\alpha g,\beta (t_AB^\rmevo)\,\rmdt_AB^\rmevo\\ \,\,\,=\int _t_AB^\rmevo=0^+\infty \frac(\mu t_AB^\rmevo)^m\times \rme^-\mu t_AB^\rmevom!\times \frac\beta ^\alpha g\times (t_AB^\rmevo)^\alpha g-1\times \rme^-\beta t_AB^\rmevo\Gamma (\alpha g)\,\rmdt_AB^\rmevo\\ \,\,\,=\,\frac\mu ^m\,\beta ^\alpha gm!\,\Gamma (\alpha g)\int _t_AB^\rmevo=0^+\infty (t_AB^\rmevo)^m+\alpha g-1\times \rme^-(\mu +\beta )\times t_AB^\rmevo\,\rmdt_AB^\rmevo\\ \,\,\,=\,\frac\mu ^m\,\beta ^\alpha g\,\Gamma (m+\alpha g)m!\,\Gamma (\alpha g)\,(\mu +\beta )^m+\alpha g\int _t_AB^\rmevo=0^+\infty f_m+\alpha g,\mu +\beta (t_AB^\rmevo)\,\rmdt_AB^\rmevo\\ \,\,\,=\,\frac\Gamma (m+\alpha g)m!\,\Gamma (\alpha g)\times \left(\frac\beta \beta +\mu \right)^\alpha g\times \left(\frac\mu \beta +\mu \right)^m\endarray$$

which is the probability mass function of a negative binomial distribution of parameters:

$$\beginarraylr=\alpha g\\ p=\frac\beta \beta +\mu \endarray$$

Application to SARS-CoV-2

We compute these probabilities for SARS-CoV-2 considering a mutation rate μ = 8.98 × 10⁻² substitutions per day (32.76 substitutions per year)⁵⁸. We assume that the generation time is Gamma distributed with a mean 5.9 days and s.d. of 4.8 days (ref. ⁵⁹).

Performance of the RR framework

We conduct a simulation study to evaluate how our RR framework performs under different sequencing scenarios. We also compare the results obtained from a phylogeographical analysis.

Simulating synthetic sequence data

We use ReMASTER⁶⁰ to simulate an SEIR epidemic in a structured population with 5 demes, each populated with 100,000 inhabitants. We simulate an epidemic characterized by a basic reproduction number R₀ of 2 with a daily time-step. We initiate the simulations by introducing a single infected individual (compartment I) in the population (group of index 0). We consider a pathogen with a 3,000 kb genome evolving following a Jukes–Cantor evolution model with a substitution rate of 3 × 10⁻⁵ substitutions per site per day. After infection, infected individuals enter an exposed (E) compartment during which they are not infectious yet and that they exit at a rate of 0.33 per day. They then enter an infectious (I) compartment where they are infectious that they exit at a rate of 0.33 per day. Sequencing occurs after exit of the I compartment. Given that our RR does not account for directionality in transmission, we considered a scenario with symmetric migration rates. We draw migration rates between demes from a log-uniform distribution of parameters (10⁻³, 10⁻¹).

We then explore two sequencing scenarios. In an unbiased scenario, we assume that each individual has the same probability of being sequenced in each deme. In a biased scenario, we assume that the sequencing probability varies between demes. We draw deme-specific relative sequencing probabilities from a log-uniform distribution of parameters (10⁻³, 10⁻¹). In the unbiased scenario, we fix sequencing probability to the mean of the sequencing probabilities across demes in the biased scenario. We explore different sequencing intensities by scaling these probabilities by different multiplicative factor (Supplementary Table 1): a scaling factor of 0.1 resulting in a mean sequencing probability of 0.43% and a dataset of around 1,700 sequences (used for the DTA analyses); a scaling factor of 0.5 resulting in a mean sequencing probability of 2.16% and a dataset of around 8,600 sequences (used for the RR and the DTA analyses); and a scaling factor of 2 resulting in a mean sequencing probability of 8.66% and a dataset of around 34,500 sequences (used for the RR analyses).

DTA

We conduct phylogeographical inference using symmetric DTA¹⁴ using the Bayesian stochastic search variable selection (BSSVS) model implemented in BEAST (v.1.10.4)⁶¹ applied to the synthetic data simulated in our two sequencing scenarios. To isolate the accuracy and precision of the phylogeographical reconstruction, we run our DTA using an empirical tree that is generated directly from ReMASTER simulations. Directly inputting such a tree is not possible in real-world scenarios in which the genealogical tree must be (noisily) estimated from empirical sequence data. In this case, it serves a demonstration of the power of DTA when provided perfect genealogical signal. The empirical tree approach also requires substantially less computation and therefore enabled us to analyse datasets of thousands of sequences using DTA in acceptable time.

Two independent Markov chain Monte Carlo (MCMC) procedures are run for 2.5 × 10⁸ iterations and sampled every 1,000 iterations. The resulting posterior distributions are combined after discarding initial 10% of sampled trees as burn-in from each of them. We used Tracer (v.1.7)⁶² to assess convergence and to estimate the effective sampling size (ESS), ensuring ESS values greater than 200 for each migration rate estimate. We adjust the estimated migration rates by the estimated rate scalar to calculate the per-day rates of transition between demes.

To evaluate how estimating the genealogical tree from empirical sequence data impacts both results and computing times, we perform an additional phylogeographical analysis based on simulated but this time jointly inferring the genealogical tree and migration rates. We run this model for 24 days (corresponding to 475,733,000 MCMC steps), until each migration rate parameter has an ESS greater than 200.

RR analysis

We compute the RR of observing identical sequences between two demes i and j and compare these RR to the daily probability p_i,j of migration between these two demes which is computed as:

$$\beginarraylp_i,j=1-\exp (-m_i,j)\quad j\ne i\\ p_i,i=1-\sum _k\ne ip_i,k\endarray$$

where m_i,j is the migration rate between demes i and j. We generate 95% subsampling CIs around the RRs using an 80% subsampling rate.

Expected relationship with RR of contact

We perform a simulation study to characterize the expected relationship between the RR of observing identical sequences between groups and the RR of contacts between these groups. We illustrate this by looking at transmission between age groups but we expect a similar relationship between the RR of observing identical sequences between regions and the RR of movement between these regions.

To do so, we generate clusters of identical sequences including the age group of the corresponding infected individuals assuming a probability that an infector and an infectee have the same consensus sequences p of 0.7 (close to the value we previously estimated for SARS-CoV-2¹³), a reproduction number R of 1.2 and a sequencing fraction p_seq of 0.1.

We use a contact matrix estimated previously³⁴ to characterize disease transmission between age groups. We assume that the probability p_A,B for a contact from an infected individual in age group A to occur with a susceptible individual in age group B is equal to:

$$p_A,B=\fracc_A,B\sum _B^\prime c_A,B^\prime $$

where c_A,B is the mean daily number of contacts that an individual in age group A has with individuals in age group B. We introduce an age-specific reproduction number (R_A) describing the average number of secondary cases infected by a single primary case within age group A. As different age groups have different average daily total number of contacts, the age-specific reproduction number varies between age groups. It can be derived as:

$$R_A=\frac\sum _B^\prime c_A,B^\prime \rho (C)\times R$$

where ρ(C) is the maximum eigenvalue of the matrix C = (c_A,B)⁶³. We then simulate individual clusters of identical sequences using the following steps. First, we initialize clusters by drawing the age of the primary case A_primary from a uniform distribution. Second, we simulate clusters as successive infections with identical genomes. At each generation, for each individual infected at the previous generation, let A denote the age of this infectious individual. We use the following procedure:

1.

Draw the number of infections with the same genotype: $I\sim \mathcalP((p\times R_A)^-1)$ (Poisson distribution).
2.

Draw the age of these individuals: $(A_1,…,A_I)\sim \mathcalM(I,n_\rmage,(p_A,1,…,$$p_A,n_\rmage))$ where $\mathcalM(n,k,(p_1,…,p_k))$ is a multinomial distribution with n trials, k possible events with probabilities (p₁, …, p_k).
3.

Draw the sequencing status of each new cluster member from a Bernoulli distribution of parameter p_seq.

We end simulations after ten generations to minimize computational costs.

Dataset size required to compute RRs

We implement a downsampling strategy to understand the amount of sequencing data required to compute RR estimates. We consider genome datasets of the following sizes: 1 × 10², 2 × 10², 3 × 10², 4 × 10², 5 × 10², 6 × 10², 7 × 10², 8 × 10², 9 × 10², 1 × 10³, 2 × 10³, 3 × 10³, 4 × 10³, 5 × 10³, 6 × 10³, 7 × 10³, 8 × 10³, 9 × 10³, 1 × 10⁴, 2 × 10⁴, 3 × 10⁴, 4 × 10⁴, 5 × 10⁴, 6 × 10⁴, 7 × 10⁴, 8 × 10⁴, 9 × 10⁴, 1 × 10⁵. For each of these dataset sizes, we generated 100 downsampled datasets from our WA sequencing data. For each of these downsampled datasets, we compute the RR of observing identical sequences between age groups (Supplementary Fig. 25). To understand how the number of groups studied impacts the amount of data required, we also compute RR of observing identical sequences between aggregated age groups:

0–39 years and over 40 years for 2 age groups;
0–29 years, 30–59 years and over 60 years for 3 age groups;
0–19 years, 20–39 years, 40–59 years and over 60 years for 4 age groups;
0–9 years, 10–19 years, 20–29 years, 30–39 years, 40–49 years, 50–59 years, 60–69 years, 70–79 years and over 80 (standard definition used throughout the paper) for 9 age groups.

We compute the error between the RR obtained from a subsampled dataset RR^d and the RR from the full dataset RR^f as:

$$\epsilon =\fracRR^d-RR^fRR^f$$

For each pair of age groups, we then compute the number of pairs of identical sequences required for the error to be below 10%.

Extending the Hamming distance threshold

In this work, we assess how pairs of infected individuals whose infected genome is separated by 0 mutations can help to characterize population-level transmission patterns. We apply this method to SARS-CoV-2 sequences from WA but our approach should be broadly applicable for epidemics caused by pathogens where the timescale of mutation events is similar to that of transmission events. In this section, we describe a simulation approach to understand how a pathogen’s mutation rate could impact the optimal Hamming distance threshold to apply our RR framework.

We implement the same simulation framework as the one described in the section ‘Performance of the RR framework’. We consider that each infection has the same probability of being sequenced (equal to 4.33%, corresponding to a sequencing probability scaling factor of 1). We explore a range of scenarios for the pathogen’s mutation rate. To do so, we introduce a multiplicative scaling factor for the baseline pathogen’s mutation rate (3 × 10⁻⁵ substitutions per site per day) with values ranging between 0.1 and 10. For each multiplicative scaling factor, we perform 100 replicate simulations.

For each simulated epidemic, we count the number of pairs separated by less than d mutations between two regions (for d varying between 0 and 10). In certain scenarios (for example, those characterized by a high mutation rate and a low Hamming distance threshold d), we sometimes do not observe any pairs less than d mutations away in a specific group. To be able to compute the RR even in those scenarios, we report a modified version of the RR of observing sequences separated by less than d mutations between two regions:

$$\mathopRR\limits^\sim _A,B^d=\frac(n_A,B^d+1)\times (n_\bullet ,\bullet ^d+1)(n_A,\bullet ^d+1)\times (n_B,\bullet ^d+1)$$

with the same notations as the ones used in the definition of the RR of observing sequences less than d mutations away in groups A and B.

We then compute the Spearman correlation coefficient between the RR for pairs of sequences less than d mutations apart of being in two regions and the daily migration probability between these regions. In simulations in which the s.d. of the RR is equal to 0 (all modified RRs have the same value), we assume that the Spearman correlation coefficient is equal to 0 (RRs are not informative about migration rates).

Ethics approval

The Washington State and University of Washington Institutional Review Boards determined this project to be surveillance activity and exempt from review; the need for informed consent was waived through this determination. Under Washington State IRB Exempt Determination 2020-102, symptom onset date, age group, residence county, residence postcode and vaccination history was provided by the Washington Department of Health from the Washington Disease Reporting System for individuals with linked sequenced SARS-CoV-2 samples from 1 March 2021 to 31 December 2022. Sequencing and analysis of samples from the Seattle Flu Study was approved by the Institutional Review Board (IRB) at the University of Washington (protocol STUDY00006181). Sequencing of remnant clinical specimens at UW Virology Lab was approved by the University of Washington Institutional Review Board (protocol STUDY00000408).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.