Assessing phylogenetic confidence at pandemic scales

November 6, 2025

30

SPRTA and aBayes

Given an estimated rooted phylogenetic tree T and data D in the form of a multiple sequence alignment, our aim is to assign confidence scores to branches b of T.

We take inspiration from the approximate Bayes (aBayes) approach¹⁷. aBayes assigns to b a probability score Pr(b∣D, T\b) based on the ratio of the likelihood Pr(D∣T) of the estimated binary tree T versus the likelihoods of the trees $T_i^b$ obtained using nearest neighbour interchange³⁶ (NNI) moves centred around b (which comprise $T=T_1^b$ itself in addition to two tree topologies not containing the clade in T descending from b):

$$\rmaBayes(b)=\Pr (b| D,T\backslash b)=\frac T) T_i^b).$$

(2)

These NNI moves perform small changes to T adjacent to branch b. One of the appeals of aBayes is that it can score not only T, but also the two alternative topologies considered, $T_2^b$ and $T_3^b$, not containing the clade defined by b:

$$\Pr (T_j^b| D,T\backslash b)=\frac\Pr (D T_i^b).$$

(3)

aBayes has a topological focus, with score aBayes(b) for branch b interpreted as the support for the existence of the clade of T containing all descendants of b. aBayes(b) is in effect an approximate Bayesian posterior score for this clade, where a flat tree prior is assumed. In contrast to typical Bayesian phylogenetics, however, instead of integrating over branch lengths we define $\Pr (D| T_i^b)$ as the maximum-likelihood score of topology $T_i^b$ over all possible branch lengths, and we only consider the alternative topologies obtainable through a single NNI move¹⁷ starting from T. Although computationally much faster than Felsenstein’s bootstrap, aBayes can still be too demanding for large genomic epidemiological datasets if implemented within classical maximum-likelihood phylogenetic methods (for example, see Fig. 2). Also, aBayes is defined based only on NNI moves, an approach insufficiently comprehensive for pandemic-scale data⁹: owing to the existence of many phylogenetic topologies with similar likelihood²⁹, the two alternative topologies obtainable through NNI moves, $T_2^b$ and $T_3^b$, might represent only a very small subset of plausible alternative topologies not containing the clade defined by b.

Here we address these limitations and define a new measure of branch support, SPRTA, that is particularly useful in the context of large-scale genomic epidemiology, but is also applicable more generally in phylogenetics. First, to address the problem of computational demand, we consider trees estimated using methods suitable for pandemic-scale datasets, such as MAPLE⁹. In the following, we assume that the likelihood of alternative tree topologies is also calculated with MAPLE or a similar method.

We define the SPRTA support of branch b by considering a certain number I_b of possible alternative topologies $T_i^b$ (1 ⩽ i ⩽ I_b), obtained by performing single subtree prune and regraft³⁶ (SPR) moves that relocate S_b, the subtree of T containing all descendants of b, as a descendant of other parts of T (Fig. 1). Again, we assume for simplicity that $T_1^b=T$ corresponds to the null SPR move. Compared to NNI moves, SPR moves are much more numerous, and can cause long-range changes to the tree; as such, SPR moves create a much more comprehensive set of alternative evolutionary histories than NNI moves. For each branch b, the number of possible SPR moves is linear in the number of sequences in the considered dataset, which means if we do not select which of these SPR moves we focus on, I_b can become too large, and the calculation of SPRTA scores too computationally demanding. However, to obtain an accurate evaluation we only need to consider topologies that have non-negligible likelihood scores compared to T.

For this reason, we first perform an initial, approximate evaluation of alternative tree topologies obtained through SPR relocations of S_b using fixed branch lengths (we leave the length of the placement branch equal to the length of b, and we assess placements only halfway along branches). From this, we retain only topologies with an initial log-likelihood score difference from T within a threshold corresponding approximately to one extra mutation event (the natural logarithm of the genome length, which for SARS-CoV-2 is about 10.3). The likelihoods of all I_b topologies passing this threshold are then more deeply evaluated by optimizing branch lengths. This two-step approach is similar to the ‘baseball’ heuristic of the metagenomic query mapper pplacer¹⁹. More precisely, when we consider an alternative placement of S_b on branch $b^^\prime $, let us use $n^^\prime $ to denote the new node at the conjunction of $b^^\prime $ with S_b. For this placement, the three branches whose length we optimize are the one connecting $n^^\prime $ with S_b (length l₁), the one within $b^^\prime $ below $n^^\prime $ (length l₂), and the one within $b^^\prime $ above $n^^\prime $ (length l₃), similarly to the ‘lazy subtree rearrangement’ approach of RAxML³⁷ (see Extended Data Fig. 10a). $\Pr (D| T_i^b)$ is defined as the maximum likelihood obtained by optimizing these three branches, while leaving all other model parameters and branch lengths unaltered. This approximation of $\Pr (D| T_i^b)$ is the same one made by MAPLE, and is justified by the fact that, at the low levels of divergence typically considered in genomic epidemiology, changes in topology and branch lengths usually only affect nodes near those directly affected by the changes⁹.

The likelihood score $\Pr (D| T_i^b)$ of an SPR move is calculated by MAPLE (up to a normalizing factor) by comparing the partial likelihoods at node B (the root of S_b), informed by sequence data within S_b, against the partial likelihoods at the new placement nodes A_i, informed by the sequence data within T\S_b (ref. ⁹). The partial likelihoods at node B represent genetic sequence uncertainty for the ancestor represented by this node. In genomic epidemiology, due to dense sampling and short divergence, this uncertainty is often negligible, and these partial likelihoods will often support only one genome. Likewise, partial likelihoods at A_i will often also only support a single genome. In these cases, the score $\Pr (D| T_i^b)$ (up to a normalizing factor) is used as an approximation of the probability that the genome in B evolved from the genome in A_i. In case uncertainty of the ancestral genetic sequence at these nodes is not negligible, we consider all possible ancestral genomes, weighted according to their likelihood, and $\Pr (D| T_i^b)$ will approximate the probability that any of the possible genomes in B evolved from any of the possible genomes in A_i.

Due to the initial filtering of plausible alternative topologies, I_b depends on branch b. These topologies are the same ones typically considered during the final stage of standard tree search in MAPLE.

SPRTA defines the support probability of b as

$$\rmSPRTA(b)=\Pr (b| D,T\backslash b)=\frac T) T_i^b)$$

(4)

where T \ b represents the two subtrees obtained by removing b from T. We can similarly define $\Pr (T_i^b| D,T\backslash b)$ for the alternative placements $T_i^b\ne T$ (2 ⩽ i ⩽ I_b) for any subtree S_b considered. Note that SPRTA typically evaluates a much larger number of alternative topologies than aBayes (equation (2)): up to quadratic in the number of genomes analysed for SPRTA, but only linear for other local support measures such as aBayes or aLRT.

When evaluating possible SPR placements, placements resulting in the same topology are equivalent, and so we only consider them once. In particular, we take into account multifurcations caused by branches of length 0, and consider placements into points of the tree with a null distance between them (based on considering optimized placement-specific branch lengths as just described) as equivalent.

Because relative likelihood calculations of alternative topologies are performed as a standard component of tree inference in MAPLE, assessing SPRTA support probabilities adds negligible computational overhead.

MAPLE infers rooted phylogenetic trees, and as such we have defined and implemented SPRTA to assess rooted tree inference. The same principles could however also be applied to the evaluation of unrooted trees. In this case, to calculate the score of a branch b, we would not only consider SPR moves representing alternative placements of subtree S_b within T \ S_b, but also SPR moves representing alternative placements of T \ S_b within S_b, so increasing the number of SPR moves to be considered for each branch b, but leaving the rest of the approach unaltered.

Benchmarking of SPRTA support

Branch support measures are often used to assess the expected accuracy of phylogenetic inference. However, how do we assess the accuracy of these measures of accuracy? We approach this problem using simulations, for which we have a ground truth against which to compare estimates. First, we simulate a tree and a set of genomes (Extended Data Fig. 10b); second, we estimate a tree from these simulated genomes (Extended Data Fig. 10c). Some branches of the inferred tree will be correct and some will be wrong: therefore, third, we assess branch support measures according to their ability to give higher support scores to correctly estimated branches, and lower support scores to wrongly estimated branches (Extended Data Fig. 10d).

We benchmark branch support methods on simulated SARS-CoV-2 genome data (Methods, ‘Simulated genomes’) using our ‘mutational’ focus rather than a traditional ‘topological’ focus. We define phylogenetic correctness in terms of the genome evolutionary history implied by a phylogenetic tree. From each individual simulated dataset (see graphical example in Extended Data Fig. 10b), we first infer in MAPLE a maximum-likelihood phylogenetic tree T and mutation events by marginal posterior mutation mapping³⁸ conditional on T (Extended Data Fig. 10c). Only mutation events inferred with >0.5 probability by MAPLE are considered here. We define estimated mutation events as pairs (G, m) where G is a whole genome sequence and m = (n₁, p, n₂) is a single-nucleotide substitution at position p of G from nucleotide n₁ (contained in G at position p) to nucleotide n₂ (Extended Data Fig. 10d). An inferred mutation event (G, m) is considered correct if it is also present in the simulated tree, otherwise it is considered as an estimation error (Extended Data Fig. 10d). Finally, we assign the support score of branch b (estimated by SPRTA or any other method) to all the mutations inferred to have occurred on b. We consider a branch support measure as more accurate if in our simulations it assigns higher support to correctly estimated mutation events, and lower support to erroneously inferred ones.

Evaluating the accuracy of SPRTA using inferred mutations might seem counter-intuitive, since our definition of SPRTA scores does not consider explicit mutational histories. However, by analysing the levels of support for different placements of subtree S_b associated with branch b, different possible mutation histories immediately ancestral to S_b are implicitly considered and evaluated via the likelihood of alternative subtree placements. We thus interpret SPRTA scores as the support for the hypothesis that the profile B at the lower end of b evolved from profile A at the upper end of b through mutations on branch b (see ‘Accuracy’).

In the scenario of short branches considered here, there is extremely low uncertainty in most mutation events and ancestral genomes implied by a given topology, and so we can often interpret the correctness of the mutational history inferred on b as the correctness of b itself. This does not mean that in this case b will be topologically correct—for example, the placement of rogue taxa within or outside the subtree S_b defined by b can make b topologically uncertain without causing uncertainty in the inferred mutational history or the placement of S_b.

Implementation and usage of SPRTA

We ran SPRTA as implemented within MAPLE v.0.6.8 (https://github.com/NicolaDM/MAPLE). While SPRTA can be run in MAPLE at the same time as tree inference with minimal additional computational cost (Methods ‘SPRTA and aBayes’), to aid comparability of computational performance with other approaches here we have considered its use to assess a pre-estimated input phylogenetic tree provided with the option –inputTree. We used options –numTopologyImprovements 0 –doNotImproveTopology to perform a shallow SPR search in MAPLE. We also used options –model UNREST –rateVariation to use an UNREST model³⁹ with rate variation²¹, and option –estimateMAT to infer mutation events.

Other branch support methods

All other branch support measures considered here were calculated using IQ-TREE v.2.1.3⁴⁰ with options –seqtype DNA –seed 1 -m GTR+F+G4 –quiet -nt 1. As with SPRTA, we always use the tree estimated by MAPLE as a starting tree (supplied via the option -t) since on these datasets IQ-TREE will typically not converge to a tree with likelihood as high as MAPLE⁹. We used the following additional IQ-TREE options:

-B 1000 (1,000 bootstrap replicates) for UFBoot2⁷
–fast -b 100 (100 bootstrap replicates and fast tree search) for Felsenstein’s bootstrap²
–fast -b 100 –tbe (100 bootstrap replicates and fast tree search) for TBE¹⁰
–fast –alrt 1000 (1,000 bootstrap replicates and fast tree search) for aLRT-SH¹⁶
–fast –alrt 0 (fast tree search) for aLRT¹⁵
–fast –abayes (fast tree search) for aBayes¹⁷
–fast –lbp 1000 (1,000 bootstrap replicates and fast tree search) for LBP¹⁴.

The number of bootstrap replicates has very limited impact on the computational demand of UFBoot2 and LBP, hence these were set to 1,000 for reduced stochasticity with minimal computational cost.

SARS-CoV-2 genome datasets

Viridian genome dataset

We applied SPRTA to a SARS-CoV-2 dataset containing 2,072,111 genomes collected up to February 2023. The consensus sequences of these genomes were consistently called with Viridian, a tool that prevents common reference biases in genomic regions of low sequencing coverage²³. Furthermore, we filtered out potentially contaminated samples, and masked alignment columns affected by recurrent sequence errors²¹. We estimated a phylogenetic tree using MAPLE v.0.6.8 with an UNREST substitution model, rate variation, and deep SPR phylogenetic search. For a full description of data preparation and phylogenetic inference, see ref. ²¹. We ran SPRTA on this alignment and tree using MAPLE v.0.6.9, with the options described in ‘Implementation and usage of SPRTA’, and additionally with option –supportFor0Branches to evaluate the support of all genome placements, even those not involving mutations (see ‘Uncertainty of SARS-CoV-2 evolution’).

Simulated genomes

For benchmarking, we simulated SARS-CoV-2 genomes evolving along a known (‘true’) background phylogeny. The background tree we used was the publicly available 26 October 2021 global SARS-CoV-2 phylogenetic tree from http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/⁴¹ representing the evolutionary relationship of 2,250,054 SARS-CoV-2 genomes, as inferred using UShER⁸.

We used phastSim v.0.0.3⁴² with options –treeFile public-latest.all.nwk –scale 0.00003344 –reference MN908947.3.fasta –alpha 0.2 –createNewick to simulate sequence evolution along this tree according to SARS-CoV-2 non-stationary neutral mutation rates²², using the SARS-CoV-2 Wuhan-Hu-1 genome⁴³ as root sequence, and with gamma-distributed (α = 0.2) rate variation⁴⁴ (similar to that estimated from real data²¹). These simulations of complete genomes were used for Figs. 2 and 3, and Extended Data Figs. 1e, 2 and 3.

We also created a second set of simulations mimicking the distribution of genome incompleteness from real data as in ref. ⁹. In each simulated sequence we included N and gap (–) characters copied in number and location from a randomly selected paired sequence from the real SARS-CoV-2 genome dataset considered in ref. ⁹. This step simulates the distribution of missing sequence data due to low sequencing depth at certain specific genome regions. Additionally, in each simulated sequence we masked a number of randomly selected SNPs (differences with respect to the reference genome) equal in number to the isolated ambiguous characters in the paired randomly sampled real sequence. This additional step mimics the pattern caused by mixed infections and contamination, in which phylogenetically informative positions are selectively masked in consensus genomes due to within-sample heterozygosity (see ref. ⁹ for more detail). This second set of simulations was used to create Extended Data Figs. 1a–d,f, 4 and 5.

Assessing the impact on mutation rates

To assess the impact of phylogenetic uncertainty on estimates of mutation patterns, we mimicked typical studies of mutation rate inference in SARS-CoV-2^22,45,46. We consider real SARS-CoV-2 data and the corresponding inferred tree and mutation events as described in ‘Viridian genome dataset’. We then created three datasets: one containing all inferred mutations, one containing only mutations on branches with at least 50% SPRTA support (representing a mildly conservative approach, discarding highly uncertain mutations), and one with only mutations on branches with at least 90% SPRTA support (representing a highly conservative approach, removing any moderately uncertain mutation).

Equilibrium frequencies (Extended Data Fig. 6a) represent the equilibrium nucleotide distribution of the Markov chain defined by the genome-wide mutation rate matrix⁴⁷ inferred from the mutation counts as

$$q_ij\propto \fracn_ij\rm\pi _i,$$

(5)

where n_ij is the count of mutations from nucleotide i to nucleotide j and π_i is the frequency of nucleotide i in the reference genome.

We assess the impact of phylogenetic uncertainty on site-specific mutation counts by calculating, for every genome position, the ratio of the most conservative mutation counts (those on branches with at least 90% support) to the least conservative mutation counts (those on all branches) (Extended Data Fig. 6b). To avoid high variance in values of this ratio at sites with low numbers of substitutions, only sites with at least 50 mutations of the given type over all branches were included.

Assessing the impact on Pango lineages

To assess the impact of phylogenetic uncertainty on the definition and the inference of the origin of Pango lineages, we mapped 1,542 Pango lineage consensus genomes (as of 28 February 2023; https://github.com/corneliusroemer/pango-sequences) onto our SARS-CoV-2 phylogenetic tree (‘Viridian genome dataset’) using MAPLE v.0.7.3. This mapping associates Pango lineages with nodes in our tree. Similarly to our SARS-CoV-2 alignment, we masked regions of recurrent sequence errors from these consensus genomes (see ref. ²¹). Of these genomes, 1,127 mapped onto our tree with ≤1 mutation separating them from the tree, and with ≥95% SPRTA placement support score. We discarded consensus genomes that were further removed (>1 mutation separating them from the tree), as they typically represent more-recent lineages that did not exist at the time our dataset was collected. Consensus genomes with low SPRTA placement support were discarded since they could not be uniquely associated with a single node in our tree (this uncertainty can be caused, for example, by incomplete genome sequences in our dataset). In total, these 1,127 genomes mapped onto 1,117 distinct tree branches, and represent the Pango lineages that we can place on the tree with high confidence.

We used these 1,127 consensus genome placements to assign a Pango lineage to each branch and sample in our tree: each was assigned to the lineage represented by the closest ancestral consensus genome placement (the first one met moving from the considered node towards the tree root). We used this assignment of Pango lineages to assess uncertainty in the lineage assignment of the genomes in our dataset. For each of the 2,072,111 genomes in our alignment, we considered the SPRTA scores of their current and alternative placements. To each considered placement, we assigned the Pango lineage of the placement branch (alternative placements on the same branch as the placement of a Pango consensus genome were ignored).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Assessing phylogenetic confidence at pandemic scales

SPRTA and aBayes

Benchmarking of SPRTA support

Implementation and usage of SPRTA

Other branch support methods

SARS-CoV-2 genome datasets

Viridian genome dataset

Simulated genomes

Assessing the impact on mutation rates

Assessing the impact on Pango lineages

Reporting summary

The Nature Podcast highlights of 2025

Seeding opportunities for Black atmospheric scientists

Our top 5 climate stories of 2025

Most Popular

Abortion At VA Banned By Trump Administration

FBI Uncovers More Than One Million Additional Epstein Docs

Italy Introduces Duty on Ultra-Fast Fashion Products

The year data centers went from backend to center stage

Recent Comments

ABOUT US

POPULAR POSTS

Abortion At VA Banned By Trump Administration

FBI Uncovers More Than One Million Additional Epstein Docs

Italy Introduces Duty on Ultra-Fast Fashion Products

POPULAR CATEGORY