SPRTA and aBayes
Given an estimated rooted phylogenetic tree T and data D in the form of a multiple sequence alignment, our aim is to assign confidence scores to branches b of T.
We take inspiration from the approximate Bayes (aBayes) approach17. aBayes assigns to b a probability score Pr(b∣D, T\b) based on the ratio of the likelihood Pr(D∣T) of the estimated binary tree T versus the likelihoods of the trees \(T_i^b\) obtained using nearest neighbour interchange36 (NNI) moves centred around b (which comprise \(T=T_1^b\) itself in addition to two tree topologies not containing the clade in T descending from b):
$$\rmaBayes(b)=\Pr (b| D,T\backslash b)=\frac T) T_i^b).$$
(2)
These NNI moves perform small changes to T adjacent to branch b. One of the appeals of aBayes is that it can score not only T, but also the two alternative topologies considered, \(T_2^b\) and \(T_3^b\), not containing the clade defined by b:
$$\Pr (T_j^b| D,T\backslash b)=\frac\Pr (D T_i^b).$$
(3)
aBayes has a topological focus, with score aBayes(b) for branch b interpreted as the support for the existence of the clade of T containing all descendants of b. aBayes(b) is in effect an approximate Bayesian posterior score for this clade, where a flat tree prior is assumed. In contrast to typical Bayesian phylogenetics, however, instead of integrating over branch lengths we define \(\Pr (D| T_i^b)\) as the maximum-likelihood score of topology \(T_i^b\) over all possible branch lengths, and we only consider the alternative topologies obtainable through a single NNI move17 starting from T. Although computationally much faster than Felsenstein’s bootstrap, aBayes can still be too demanding for large genomic epidemiological datasets if implemented within classical maximum-likelihood phylogenetic methods (for example, see Fig. 2). Also, aBayes is defined based only on NNI moves, an approach insufficiently comprehensive for pandemic-scale data9: owing to the existence of many phylogenetic topologies with similar likelihood29, the two alternative topologies obtainable through NNI moves, \(T_2^b\) and \(T_3^b\), might represent only a very small subset of plausible alternative topologies not containing the clade defined by b.
Here we address these limitations and define a new measure of branch support, SPRTA, that is particularly useful in the context of large-scale genomic epidemiology, but is also applicable more generally in phylogenetics. First, to address the problem of computational demand, we consider trees estimated using methods suitable for pandemic-scale datasets, such as MAPLE9. In the following, we assume that the likelihood of alternative tree topologies is also calculated with MAPLE or a similar method.
We define the SPRTA support of branch b by considering a certain number Ib of possible alternative topologies \(T_i^b\) (1 ⩽ i ⩽ Ib), obtained by performing single subtree prune and regraft36 (SPR) moves that relocate Sb, the subtree of T containing all descendants of b, as a descendant of other parts of T (Fig. 1). Again, we assume for simplicity that \(T_1^b=T\) corresponds to the null SPR move. Compared to NNI moves, SPR moves are much more numerous, and can cause long-range changes to the tree; as such, SPR moves create a much more comprehensive set of alternative evolutionary histories than NNI moves. For each branch b, the number of possible SPR moves is linear in the number of sequences in the considered dataset, which means if we do not select which of these SPR moves we focus on, Ib can become too large, and the calculation of SPRTA scores too computationally demanding. However, to obtain an accurate evaluation we only need to consider topologies that have non-negligible likelihood scores compared to T.
For this reason, we first perform an initial, approximate evaluation of alternative tree topologies obtained through SPR relocations of Sb using fixed branch lengths (we leave the length of the placement branch equal to the length of b, and we assess placements only halfway along branches). From this, we retain only topologies with an initial log-likelihood score difference from T within a threshold corresponding approximately to one extra mutation event (the natural logarithm of the genome length, which for SARS-CoV-2 is about 10.3). The likelihoods of all Ib topologies passing this threshold are then more deeply evaluated by optimizing branch lengths. This two-step approach is similar to the ‘baseball’ heuristic of the metagenomic query mapper pplacer19. More precisely, when we consider an alternative placement of Sb on branch \(b^^\prime \), let us use \(n^^\prime \) to denote the new node at the conjunction of \(b^^\prime \) with Sb. For this placement, the three branches whose length we optimize are the one connecting \(n^^\prime \) with Sb (length l1), the one within \(b^^\prime \) below \(n^^\prime \) (length l2), and the one within \(b^^\prime \) above \(n^^\prime \) (length l3), similarly to the ‘lazy subtree rearrangement’ approach of RAxML37 (see Extended Data Fig. 10a). \(\Pr (D| T_i^b)\) is defined as the maximum likelihood obtained by optimizing these three branches, while leaving all other model parameters and branch lengths unaltered. This approximation of \(\Pr (D| T_i^b)\) is the same one made by MAPLE, and is justified by the fact that, at the low levels of divergence typically considered in genomic epidemiology, changes in topology and branch lengths usually only affect nodes near those directly affected by the changes9.
The likelihood score \(\Pr (D| T_i^b)\) of an SPR move is calculated by MAPLE (up to a normalizing factor) by comparing the partial likelihoods at node B (the root of Sb), informed by sequence data within Sb, against the partial likelihoods at the new placement nodes Ai, informed by the sequence data within T\Sb (ref. 9). The partial likelihoods at node B represent genetic sequence uncertainty for the ancestor represented by this node. In genomic epidemiology, due to dense sampling and short divergence, this uncertainty is often negligible, and these partial likelihoods will often support only one genome. Likewise, partial likelihoods at Ai will often also only support a single genome. In these cases, the score \(\Pr (D| T_i^b)\) (up to a normalizing factor) is used as an approximation of the probability that the genome in B evolved from the genome in Ai. In case uncertainty of the ancestral genetic sequence at these nodes is not negligible, we consider all possible ancestral genomes, weighted according to their likelihood, and \(\Pr (D| T_i^b)\) will approximate the probability that any of the possible genomes in B evolved from any of the possible genomes in Ai.
Due to the initial filtering of plausible alternative topologies, Ib depends on branch b. These topologies are the same ones typically considered during the final stage of standard tree search in MAPLE.
SPRTA defines the support probability of b as
$$\rmSPRTA(b)=\Pr (b| D,T\backslash b)=\frac T) T_i^b)$$
(4)
where T \ b represents the two subtrees obtained by removing b from T. We can similarly define \(\Pr (T_i^b| D,T\backslash b)\) for the alternative placements \(T_i^b\ne T\) (2 ⩽ i ⩽ Ib) for any subtree Sb considered. Note that SPRTA typically evaluates a much larger number of alternative topologies than aBayes (equation (2)): up to quadratic in the number of genomes analysed for SPRTA, but only linear for other local support measures such as aBayes or aLRT.
When evaluating possible SPR placements, placements resulting in the same topology are equivalent, and so we only consider them once. In particular, we take into account multifurcations caused by branches of length 0, and consider placements into points of the tree with a null distance between them (based on considering optimized placement-specific branch lengths as just described) as equivalent.
Because relative likelihood calculations of alternative topologies are performed as a standard component of tree inference in MAPLE, assessing SPRTA support probabilities adds negligible computational overhead.
MAPLE infers rooted phylogenetic trees, and as such we have defined and implemented SPRTA to assess rooted tree inference. The same principles could however also be applied to the evaluation of unrooted trees. In this case, to calculate the score of a branch b, we would not only consider SPR moves representing alternative placements of subtree Sb within T \ Sb, but also SPR moves representing alternative placements of T \ Sb within Sb, so increasing the number of SPR moves to be considered for each branch b, but leaving the rest of the approach unaltered.
Benchmarking of SPRTA support
Branch support measures are often used to assess the expected accuracy of phylogenetic inference. However, how do we assess the accuracy of these measures of accuracy? We approach this problem using simulations, for which we have a ground truth against which to compare estimates. First, we simulate a tree and a set of genomes (Extended Data Fig. 10b); second, we estimate a tree from these simulated genomes (Extended Data Fig. 10c). Some branches of the inferred tree will be correct and some will be wrong: therefore, third, we assess branch support measures according to their ability to give higher support scores to correctly estimated branches, and lower support scores to wrongly estimated branches (Extended Data Fig. 10d).
We benchmark branch support methods on simulated SARS-CoV-2 genome data (Methods, ‘Simulated genomes’) using our ‘mutational’ focus rather than a traditional ‘topological’ focus. We define phylogenetic correctness in terms of the genome evolutionary history implied by a phylogenetic tree. From each individual simulated dataset (see graphical example in Extended Data Fig. 10b), we first infer in MAPLE a maximum-likelihood phylogenetic tree T and mutation events by marginal posterior mutation mapping38 conditional on T (Extended Data Fig. 10c). Only mutation events inferred with >0.5 probability by MAPLE are considered here. We define estimated mutation events as pairs (G, m) where G is a whole genome sequence and m = (n1, p, n2) is a single-nucleotide substitution at position p of G from nucleotide n1 (contained in G at position p) to nucleotide n2 (Extended Data Fig. 10d). An inferred mutation event (G, m) is considered correct if it is also present in the simulated tree, otherwise it is considered as an estimation error (Extended Data Fig. 10d). Finally, we assign the support score of branch b (estimated by SPRTA or any other method) to all the mutations inferred to have occurred on b. We consider a branch support measure as more accurate if in our simulations it assigns higher support to correctly estimated mutation events, and lower support to erroneously inferred ones.
Evaluating the accuracy of SPRTA using inferred mutations might seem counter-intuitive, since our definition of SPRTA scores does not consider explicit mutational histories. However, by analysing the levels of support for different placements of subtree Sb associated with branch b, different possible mutation histories immediately ancestral to Sb are implicitly considered and evaluated via the likelihood of alternative subtree placements. We thus interpret SPRTA scores as the support for the hypothesis that the profile B at the lower end of b evolved from profile A at the upper end of b through mutations on branch b (see ‘Accuracy’).
In the scenario of short branches considered here, there is extremely low uncertainty in most mutation events and ancestral genomes implied by a given topology, and so we can often interpret the correctness of the mutational history inferred on b as the correctness of b itself. This does not mean that in this case b will be topologically correct—for example, the placement of rogue taxa within or outside the subtree Sb defined by b can make b topologically uncertain without causing uncertainty in the inferred mutational history or the placement of Sb.
Implementation and usage of SPRTA
We ran SPRTA as implemented within MAPLE v.0.6.8 (https://github.com/NicolaDM/MAPLE). While SPRTA can be run in MAPLE at the same time as tree inference with minimal additional computational cost (Methods ‘SPRTA and aBayes’), to aid comparability of computational performance with other approaches here we have considered its use to assess a pre-estimated input phylogenetic tree provided with the option –inputTree. We used options –numTopologyImprovements 0 –doNotImproveTopology to perform a shallow SPR search in MAPLE. We also used options –model UNREST –rateVariation to use an UNREST model39 with rate variation21, and option –estimateMAT to infer mutation events.
Other branch support methods
All other branch support measures considered here were calculated using IQ-TREE v.2.1.340 with options –seqtype DNA –seed 1 -m GTR+F+G4 –quiet -nt 1. As with SPRTA, we always use the tree estimated by MAPLE as a starting tree (supplied via the option -t) since on these datasets IQ-TREE will typically not converge to a tree with likelihood as high as MAPLE9. We used the following additional IQ-TREE options:
-
-B 1000 (1,000 bootstrap replicates) for UFBoot27
-
–fast -b 100 (100 bootstrap replicates and fast tree search) for Felsenstein’s bootstrap2
-
–fast -b 100 –tbe (100 bootstrap replicates and fast tree search) for TBE10
-
–fast –alrt 1000 (1,000 bootstrap replicates and fast tree search) for aLRT-SH16
-
–fast –alrt 0 (fast tree search) for aLRT15
-
–fast –abayes (fast tree search) for aBayes17
-
–fast –lbp 1000 (1,000 bootstrap replicates and fast tree search) for LBP14.
The number of bootstrap replicates has very limited impact on the computational demand of UFBoot2 and LBP, hence these were set to 1,000 for reduced stochasticity with minimal computational cost.
SARS-CoV-2 genome datasets
Viridian genome dataset
We applied SPRTA to a SARS-CoV-2 dataset containing 2,072,111 genomes collected up to February 2023. The consensus sequences of these genomes were consistently called with Viridian, a tool that prevents common reference biases in genomic regions of low sequencing coverage23. Furthermore, we filtered out potentially contaminated samples, and masked alignment columns affected by recurrent sequence errors21. We estimated a phylogenetic tree using MAPLE v.0.6.8 with an UNREST substitution model, rate variation, and deep SPR phylogenetic search. For a full description of data preparation and phylogenetic inference, see ref. 21. We ran SPRTA on this alignment and tree using MAPLE v.0.6.9, with the options described in ‘Implementation and usage of SPRTA’, and additionally with option –supportFor0Branches to evaluate the support of all genome placements, even those not involving mutations (see ‘Uncertainty of SARS-CoV-2 evolution’).
Simulated genomes
For benchmarking, we simulated SARS-CoV-2 genomes evolving along a known (‘true’) background phylogeny. The background tree we used was the publicly available 26 October 2021 global SARS-CoV-2 phylogenetic tree from http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/41 representing the evolutionary relationship of 2,250,054 SARS-CoV-2 genomes, as inferred using UShER8.
We used phastSim v.0.0.342 with options –treeFile public-latest.all.nwk –scale 0.00003344 –reference MN908947.3.fasta –alpha 0.2 –createNewick to simulate sequence evolution along this tree according to SARS-CoV-2 non-stationary neutral mutation rates22, using the SARS-CoV-2 Wuhan-Hu-1 genome43 as root sequence, and with gamma-distributed (α = 0.2) rate variation44 (similar to that estimated from real data21). These simulations of complete genomes were used for Figs. 2 and 3, and Extended Data Figs. 1e, 2 and 3.
We also created a second set of simulations mimicking the distribution of genome incompleteness from real data as in ref. 9. In each simulated sequence we included N and gap (–) characters copied in number and location from a randomly selected paired sequence from the real SARS-CoV-2 genome dataset considered in ref. 9. This step simulates the distribution of missing sequence data due to low sequencing depth at certain specific genome regions. Additionally, in each simulated sequence we masked a number of randomly selected SNPs (differences with respect to the reference genome) equal in number to the isolated ambiguous characters in the paired randomly sampled real sequence. This additional step mimics the pattern caused by mixed infections and contamination, in which phylogenetically informative positions are selectively masked in consensus genomes due to within-sample heterozygosity (see ref. 9 for more detail). This second set of simulations was used to create Extended Data Figs. 1a–d,f, 4 and 5.
Assessing the impact on mutation rates
To assess the impact of phylogenetic uncertainty on estimates of mutation patterns, we mimicked typical studies of mutation rate inference in SARS-CoV-222,45,46. We consider real SARS-CoV-2 data and the corresponding inferred tree and mutation events as described in ‘Viridian genome dataset’. We then created three datasets: one containing all inferred mutations, one containing only mutations on branches with at least 50% SPRTA support (representing a mildly conservative approach, discarding highly uncertain mutations), and one with only mutations on branches with at least 90% SPRTA support (representing a highly conservative approach, removing any moderately uncertain mutation).
Equilibrium frequencies (Extended Data Fig. 6a) represent the equilibrium nucleotide distribution of the Markov chain defined by the genome-wide mutation rate matrix47 inferred from the mutation counts as
$$q_ij\propto \fracn_ij\rm\pi _i,$$
(5)
where nij is the count of mutations from nucleotide i to nucleotide j and πi is the frequency of nucleotide i in the reference genome.
We assess the impact of phylogenetic uncertainty on site-specific mutation counts by calculating, for every genome position, the ratio of the most conservative mutation counts (those on branches with at least 90% support) to the least conservative mutation counts (those on all branches) (Extended Data Fig. 6b). To avoid high variance in values of this ratio at sites with low numbers of substitutions, only sites with at least 50 mutations of the given type over all branches were included.
Assessing the impact on Pango lineages
To assess the impact of phylogenetic uncertainty on the definition and the inference of the origin of Pango lineages, we mapped 1,542 Pango lineage consensus genomes (as of 28 February 2023; https://github.com/corneliusroemer/pango-sequences) onto our SARS-CoV-2 phylogenetic tree (‘Viridian genome dataset’) using MAPLE v.0.7.3. This mapping associates Pango lineages with nodes in our tree. Similarly to our SARS-CoV-2 alignment, we masked regions of recurrent sequence errors from these consensus genomes (see ref. 21). Of these genomes, 1,127 mapped onto our tree with ≤1 mutation separating them from the tree, and with ≥95% SPRTA placement support score. We discarded consensus genomes that were further removed (>1 mutation separating them from the tree), as they typically represent more-recent lineages that did not exist at the time our dataset was collected. Consensus genomes with low SPRTA placement support were discarded since they could not be uniquely associated with a single node in our tree (this uncertainty can be caused, for example, by incomplete genome sequences in our dataset). In total, these 1,127 genomes mapped onto 1,117 distinct tree branches, and represent the Pango lineages that we can place on the tree with high confidence.
We used these 1,127 consensus genome placements to assign a Pango lineage to each branch and sample in our tree: each was assigned to the lineage represented by the closest ancestral consensus genome placement (the first one met moving from the considered node towards the tree root). We used this assignment of Pango lineages to assess uncertainty in the lineage assignment of the genomes in our dataset. For each of the 2,072,111 genomes in our alignment, we considered the SPRTA scores of their current and alternative placements. To each considered placement, we assigned the Pango lineage of the placement branch (alternative placements on the same branch as the placement of a Pango consensus genome were ignored).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

