Gene ancestries reveal diverse microbial associations during eukaryogenesis

June 11, 2026

55

Data retrieval, curation and proteome database reconstruction

eTOLDB construction

To guide dataset selection and LECA reconstruction, we used the phylogenetic backbone of the eTOL and the nine eukaryotic supergroups classification depicted in Supplementary Fig. 1a. This represented a consensus between recent eTOL reconstructions^{54,55,56,57,58} and an operational definition of two stems and nine supergroups that largely reflected previous classifications⁵⁹. Further details are provided in the Supplementary Methods.

We manually selected and retrieved 276 eukaryotic proteomes (Supplementary Table 1; see Zenodo repository²⁴ and Supplementary Information) downloaded from NCBI⁶⁰, EukProt⁶¹, P10K⁶²,UniProt⁶³, Ensembl⁶⁴ and SGD⁶⁵, aiming to cover all supergroups represented in our reference phylogenetic backbone (Supplementary Fig. 1). We aimed for a balanced representation of divisions and prioritized the most complete and highest-quality genomes for each division on the basis of information available in the hosting databases. For each proteome, we kept only the longest isoform per gene, as indicated in the GFF files or FASTA headers. The isoform-free proteomes were then analysed for completeness using BUSCO v.5.5 (ref. ⁶⁶) with default parameters and eukaryotic_odb10, and we extracted basic statistics using seqkit stats v.2.8.2 (ref. ⁶⁷). To assess for the presence of uncertain amino acids (Xs) and low-complexity regions in the proteomes, we ran SEG⁶⁸ with default parameters. We used an in-house script (‘Code availability’) to trim the low-complexity ends of the proteins and calculated the coverage of low-complexity regions for each protein. If the coverage of low-complexity regions accounted for more than 25% of a protein, we excluded that protein from downstream analyses. Moreover, we calculated the proportion of low-complexity proteins in the whole proteome as a measure of the quality of the protein set and also discarded proteins that were larger than 10,000 amino acids or smaller than 50 amino acids. This resulted in a set of clean proteomes. To remove redundancy and recent duplications (with respect to eukaryotic evolution), we clustered each proteome individually using MMseqs v.2 (ref. ⁶⁹) with strict thresholds: minimum amino acid identity percentage of 80%, minimum target coverage of 0.5 and coverage mode 2 (–min-seq-id 0.8 -c 0.5–cov-mode 2). The representative sequences of the clusters constituted the final proteome. To assess the effects of the cleaning and clustering steps, we again ran BUSCO v.5.5 and seqkit stats v.2.8.2. We discarded proteomes that showed a combined reduction in number of proteins greater than 50%, those with a proportion of low-complexity proteins greater than 50% and those with BUSCO completeness reduction greater than 30%. This resulted in exclusion of 30 proteomes, with 256 proteomes remaining.

We built three eTOL datasets of 100 proteomes each, which we named eTOLDBA, eTOLDBB and eTOLDBC. For this, we first classified the 256 proteomes on the basis of taxonomy and distributed them among 9 different supergroups and 28 divisions (Supplementary Fig. 1). When fewer than four species were available for a division, we included all in the three eTOLDB databases. For the remaining divisions, we selected different, partially overlapping species sets for each eTOLDB. eTOLDBA contained, whenever possible, the most complete proteomes and the fewest parasites. More incomplete proteomes were added in turn to eTOLDBB and in eTOLDBC. Proteomes for model species Homo sapiens, Arabidopsis thaliana and Saccharomyces cerevisiae were included in the three databases despite being part of large taxonomic groups. This was done for later annotation purposes. The list of chosen species for each dataset can be found in Supplementary Table 1. Finally, a total of 185 species were present in the three eTOLDBs with overlaps of 58 species between eTOLDBA and eTOLDBB, and 53 species between eTOLDBA and eTOLDBC and between eTOLDBB and eTOLDBC. This resulted in an average protein overlap of 46% (Supplementary Fig. 1).

BroadDB construction

To ensure even sampling of the prokaryotic genome repertoire while minimizing the impact of intradomain HGT as a confounding factor of prokaryotic ancestry, we downsampled available prokaryotic genomes using a pangenome approach. For this, we took all species representatives in GTDB r207 (ref. ²²) (62,291 bacteria and 3,412 archaea; see Zenodo repository²⁴) and selected one genome per genus (sensu GTDB), choosing the one with the maximal CheckM⁷⁰ completeness and, in cases of ties, minimal CheckM contamination. The median CheckM completeness was 94.50%, and the median contamination was 0.990%. The proteomes of each genus representative were then grouped at order level and clustered using mmseqs2 easy-linclust⁶⁹ (–min-seq-id 0.3 -c 0.5–cov-mode 0–cluster-mode 2 -e 0.001). We kept a sequence representative for clusters present in 15% or more of genus representatives in that order, as 15% is the cutoff for the ‘cloud’ component of the pangenome⁷¹. To this dataset we added representative sequences from 1,319,927 viral protein clusters at 95% identity percentage available at RVDB v.25.0 (ref. ²³) (see Zenodo repository²⁴). For taxonomic assignment of the sequences, we followed the GTDB and RVDB taxonomies.

Reconstruction of LECA

LECA-OG reconstruction

We ran OrthoFinder v.2.5.5 (ref. ⁷²), using BLAST v2.15.0+⁷³ for the homology search with the filter for low complexity (-seg yes). Before the clustering step, BLAST results were filtered to remove hits with an e-value greater than 1 × 10⁻⁵ and an overlap less than 0.5. We also removed hits with disparate sizes (when the hit was more than 2.5 longer or less than 0.5 shorter than the query) to avoid artefactual clustering of families due to domain sharing. The clustering part of OrthoFinder was executed using two different inflation parameters (I = 1.5 and I = 3.0). Comparison between the two datasets showed that I = 3.0 gave 30,000 more groups on average than I = 1.5. When we filtered to retain only LECA groups, there was an increase of 990 groups on average between I = 3.0 and I = 1.5. Given the small difference, we chose to continue with the larger groups (I = 1.5), as they could be split on the basis of tree topology at a later point.

OGs as defined by OrthoFinder were filtered to select those likely to be present in the LECA (LECA-OG) on the basis of their distribution across the eTOL. LECA-OGs had to comprise proteins from at least five species and to have representatives of the two stems and three or more different supergroups for a relaxed definition of LECA and five or more supergroups for a strict definition of LECA. The same criteria for identification of the LECA were used in downstream analyses.

Contamination assessment and filtering of LECA-OG sequences

To assess potential non-eukaryotic contaminant proteins, we used taxonomy-based prok-euk contamination detection. First, we built a broad database to assess contamination. This database contained all prokaryotes from the reference release of GTDB r222 (ref. ²²) and all eukaryotic proteomes in UniProt⁶³, EukProt⁶¹ and a selection of the P10K⁶² project proteomes. To reduce redundancy and the size of the database, we clustered the proteomes of each class using mmseqs v.2 (ref. ⁷⁴) with the following thresholds: 75% identity and 66% coverage. To flag proteins as potential contaminants, we followed the approach described in ref. ⁵⁸. We performed a diamond blastp v.2.1.9 (ref. ⁷⁵) search against a broad clustered database using an e-value threshold of 1 × 10⁻³, retrieving a maximum of 15 sequences. We considered a protein to be a contaminant if 85% of its first 15 hits were prokaryotic and they showed an identity higher than 85%.

To assess the possibility of eukaryotic contamination, we used two decontamination strategies: taxonomy-based and identity-based euk-euk contamination detection. For the taxonomy-based strategy, we built a database using the eTOLDB sequences and performed an all-versus-all diamond blastp v.2.1.9 search. We retrieved the 15 first hits with identity higher than 85% and considered as contaminants sequences with 85% of their first hits from the opposite stem (following the taxonomic framework in Supplementary Fig. 1). In the identity-based strategy, sequences from different stems with a blast based identity above 95% were considered contaminants, as we could not establish the directionality of the contamination.

Finally, to remove putative contaminants within the same LECA-OG, we performed a multiple sequence alignment of the OG and then calculated pairwise identities between all the sequences of the OG belonging to different stems. Sequences with at least 85% global pairwise identity were considered contaminants. We removed all the sequences considered to be contaminants from the LECA-OGs and reconsidered which LECA-OGs passed the LECA criteria with the remaining sequences.

Expanding LECA-OGs with closest non-eukaryotic homologues

We aligned the sequences with MAFFT v.7.525 (ref. ⁷⁶) and built a profile with HMMer’s HMMbuild v.3.3.2 (ref. ⁷⁷). We ran hmmsearch v.3.3.2 with this profile against the broadDB described above. Results were filtered using an e-value threshold of 0.1 for both the complete sequence and the first domain found. Then, to ensure that the coverage requirements were maintained, we performed a BLASTP v.2.15.0+⁷³ search for each resulting non-eukaryotic sequence against the database of eukaryotic sequences from the OG. BLAST results were filtered with an e-value threshold of 1 × 10⁻⁵ and a coverage threshold of 0.5 over the full length of the non-eukaryotic sequence, ensuring that the target sequence was not 2.5 times longer or 2.5 times shorter than the query sequence. The top hits that passed these filters were considered to be the closest non-eukaryotic homologues and were added to the sequences of the corresponding LECA-OG for computation of the phylogenies. The number of non-eukaryotic sequences in the trees was limited in the following way: if there were fewer than 400 eukaryotic sequences, the non-eukaryotic sequences were added until there were 500 sequences. When more than 400 eukaryotic sequences were present, up to 100 non-eukaryotic sequences were added. Eukaryotic groups with more than 500 sequences were processed differently. First, genes were assigned a KO (see below); then, sequences from the same OG with the same KO were grouped together and processed as separate OGs.

Phylogenetic tree reconstructions

We used the PhylomeDB tree reconstruction pipeline⁷⁸ with slight modifications. In brief, we computed a consensus alignment using MergeAlign v.3 (ref. ⁷⁹) from alignments obtained using three different alignment programs (MUSCLE v.5.2 (ref. ⁸⁰), MAFFT v.7.525 (ref. ⁷⁶) and Kalign v.3.4.0 (ref. ⁸¹)) in forward and reverse orientations. This alignment was then trimmed using trimAl v.1.4.rev15 (ref. ⁸²) with the -gappyout option. Finally, a maximum likelihood tree was reconstructed using IQ-TREE v.3.0.1 (ref. ⁸³) and parameters –mset DCmut, JTTDCMut, VT, WAG, LG –madd LG4X, LG4M, LG + C10 + R4, LG + C20 + R4, LG + C30 + R4, LG + C40 + R4 –mrate E, G, R, I, G + I, R + I –mfreq FU, F -B 1000 –bnni –alrt 1000 –boot-trees –wbtl. Among the available substitution models, we selected single matrix models DCmut⁸⁴, JTTDCMut⁸⁴, VT⁸⁵, WAG⁸⁶ and LG⁸⁷ and mixture models LG4X, LG4M⁸⁸, LG + C10 + R4, LG + C20 + R4, LG + C30 + R4 and LG + C40 + R4 (ref. ⁸⁹). We included different rate heterogeneity parameters (lack of heterogeneity (E), gamma distributed rates (G), the FreeRate model (R), and their combinations with the proportion of invariant sites (I)). We further added two different equilibrium frequency parameters, that provided by the model (FU) and empirical frequencies (F). ModelFinder was used to select from the combinations of all the models and parameters. Branch support was assessed with 1,000 ultrafast bootstrap replicates optimized using nearest-neighbour interchange correction (the –bnni option)⁹⁰ and the Shimodaira–Hasegawa approximate likelihood ratio test⁹¹.

mLECA-OGs

For each LECA-OG, we reconstructed a phylogeny following the phylogenetic reconstruction pipeline detailed above. We annotated the tree with the taxonomy of each group and retrieved the largest monophyletic group containing only eukaryotic sequences and fulfilling the relaxed LECA criterion (five species, three supergroups and two stems). This was considered to be the ‘starting LECA node’. We rooted the phylogeny in the non-eukaryotic node farthest from the starting LECA node. This rooting method was shown to be the one least likely to have a negative impact on the selection of the LECA group and its sister owing to the lack of a proper outgroup (Supplementary Methods). We next iterated through the sister clades of the starting LECA node and incorporated them into the LECA clade if:

1.

the sister was mostly eukaryotic: that is, the fraction of non-eukaryotic sequences was less than or equal to 0.5, for sister clades with up to 20 sequences, or less than or equal to 0.25 if the sister had more than 20 tips.
2.

If the first sister was mainly non-eukaryotic, but the second sister was mostly eukaryotic (according to the criteria above), the first sister was considered to be a transfer from the eukaryotic clade. In this case, we incorporated the first sister if all the following conditions were met:

a.

the first non-eukaryotic sister had a number of sequences less than 75% of the number of sequences in the LECA clade;
b.

the number of non-eukaryotic phyla was less than 3 (assuming low diversity of the first non-eukaryotic sister); and
c.

the fraction of non-eukaryotic sequences in the second sister was less than 25%.

Once the LECA clade had been defined using the above criteria, we removed it from the tree and performed the search iteratively until no more starting LECA nodes could be found in the tree.

For each LECA node, we selected only the eukaryotic sequences (mLECA-OG hereafter). Finally, we aligned these eukaryotic filtered sequences to compute a profile-based search of non-eukaryotic homologues using the abovementioned criteria. We reconstructed the tree of the resulting set of eukaryotic and non-eukaryotic sequences using the phylogenetic reconstruction described above and identified LECA nodes in this tree following the described procedure. This resulted in a final set of mLECA-OGs that was a refinement of the previous one. Only this final set was used in subsequent analyses.

mLECA-OGs were scanned for the possibility that they emerged owing to HGT events between eukaryotic supergroups. For this, we located the LECA node in the phylogenetic tree, which by definition was ancestral to sequences from stem 1 and stem 2. We then inspected the two clades defined by the two first children nodes of the LECA node to determine whether either of the two nodes contained species from both stem 1 and stem 2 and thus contained the root of the eukaryotic tree. In 97% of the trees, at least one of the children nodes satisfied this condition, making the LECA node unlikely to be the result of an HGT event between diverging supergroup taxa. Given that 29% of the LECA nodes were duplication nodes according to the species overlap algorithm, we searched for the first speciation node within the LECA node and repeated the analysis. In this case, 96% of the trees had children nodes that mapped to the root of eukaryotes, validating the results.

mLECA-OGs were also analysed to assess the possibility of artificial splitting by OrthoFinder by clustering with MCL mLECA-OGs sharing at least 50% of their sister groups (Supplementary Methods); this resulted in minimal difference in our downstream inferences.

Functional annotations of mLECA-OGs

We annotated all the proteins with KofamScan v.1.3.0 (https://github.com/takaram/kofam_scan)⁹² and hmmsearch against the COG database (https://ftp.ncbi.nlm.nih.gov/pub/COG/COG2024/data/). Results from KofamScan were filtered in the following way. First, we selected the hits considered significant by KofamScan. When multiple hits were found, we selected the hit with the lowest e-value and highest hmmsearch score. In addition, for proteins with no hits passing the significance threshold of KofamScan, we selected the hit with lowest e-value and highest hmmsearch score. To benchmark the accuracy of our predictions, we used 36 proteomes in our dataset that had been obtained from UniProt and had mappings to genes in KEGG. This enabled us to establish KOs for a total of 154,248 proteins, independent of KofamScan. For those proteins, we calculated the average correspondence of KO predictions per each species. We obtained 95% correct predictions on average, validating our KO annotation method. We further assigned each protein to a COG family by performing hmmsearch using the COG profiles and selecting the first hit on the basis of bit score. We kept those COGs with a bit score higher than the 90th percentile. Using the NCBI COG⁹³ definitions table (https://ftp.ncbi.nlm.nih.gov/pub/COG/COG2024/data/cog-24.def.tab), we assigned a COG category to each protein.

Once all proteins had been functionally annotated, we assigned a KO and a COG for each set of sequences (LECA-OGs, mLECA-OGs or donor group, which in virus-mediated acquisitions would be the first prokaryotic sister) using the same approach used to functionally annotate eggNOG OGs. This approach takes into account the proportions of the KO and COG within the set of sequences and their overrepresentation with respect to the background KO distribution (the whole set of annotated proteins). We chose a single KO and COG or multiple KOs and COGs in cases of ties by minimizing the sum of the KO rank in the proportions list and the overrepresentation list. To annotate the mLECA-OGs, we used the common KOs assigned to the LECA clade and its sister if they agreed, and that of the LECA clade if they disagreed (see Supplementary Methods and Supplementary Fig. 9 for details of the algorithm for assignation of KOs to a mLECA-OG). The mLECA-OGs with no non-eukaryotic homologues (innovations) with KOs found only in non-eukaryotic genomes were left unannotated. This can happen owing to local homology between a non-eukaryotic KO and a eukaryotic protein.

Inference of the LECA proteomes

For each eTOLDB version (A, B and C) and each LECA criterion (three supergroups and five supergroups), we obtained the set of mLECA-OGs that fulfilled the criteria and inferred their origins and KO annotations. We used these proteomes to infer the general proportion of acquisitions and innovations, as well as the general metabolism of the LECA. Furthermore, we calculated a consensus proteome with the KO annotations that were pervasive and well supported across the datasets to make inferences about LECA features and metabolism. For this consensus, we considered KOs if they were present in two of the three TOLDB datasets or supported by at least five supergroups in any TOLDB dataset.

Inference of LECA metabolism and features

We inferred the presence of metabolic pathways and their completeness for each LECA proteome (one per eTOLDB version and LECA criterion) using the anvi-estimate-metabolism module of the anvi’o v.8 (ref. ⁹⁴) package. Finally, we plotted a reconstructed metabolic map using ggKEGG v.1.2.3 (ref. ⁹⁵) (Supplementary Fig. 10).

For other molecular components that were not necessarily associated with a single KEGG map, such as the complexes that form the flagellum, the spliceosome or the cytoskeleton, we collected the set of genes associated with these components as depicted by KEGG BRITE, complemented them on the basis of the literature when necessary, and then checked the assigned origins of the different genes. We assessed the completeness of these features by extracting the subset of KOs that were general to eukaryotes (to minimize the effects of clade-specific KOs in the inference of completeness) and determined the proportions at which they were present in our reconstruction of the LECA by any database (for the origins) and the consensus proteome (for presence and absence of the protein).

Comparison with FLUPs, FLUOs and FLUAs

To benchmark the inferred LECA proteome, we compared it to the core and individual proteomes of selected unicellular, heterotrophic eukaryotic organisms. For this, we selected FLUPs, FLUAs and FLUOs from the eTOLDB proteomes, the P10K project⁶² and specific project repositories, (Supplementary Table 5). After cleaning and performing quality checks on the proteomes (‘eTOLDB construction’), we functionally annotated the proteomes following the same protein annotation pipeline as for the eTOLDB genomes (‘mLECA-OGs’).

On the basis of the KO predictions for each genome of the selected FLUPs, FLUOs and FLUAs, we calculated the percentage of each genome that was shared with the LECA in terms of number of KOs (number of shared KOs as a percentage of the number of KOs predicted in the eukaryotic proteome). We also obtained the frequency with which each COG functional category appeared in each genome in the three datasets as the number of genes that had a COG functional category divided by the number of genes with at least one COG category assigned.

Tracing mLECA-OG ancestries

Identifying the origins of mLECA-OGs

First, we analysed each tree by identifying the type of gene family and assigning it to one of three categories: innovations, acquisitions and unknown. We considered mLECA-OGs without non-eukaryotic homologues to be innovations, whereas those that had the LECA group nested within non-eukaryotic sequences were considered acquisitions. Among the acquisitions, we identified two subcategories: viral-mediated acquisitions, which were those pointing to a viral first sister followed by prokaryotic sisters; and direct acquisitions, which were inferred to have acquired the gene directly from prokaryotes. We established another category for those trees whose directionality was not clear (unknown-origin mLECA-OGs). We classified trees as having an unknown origin when they had only viral homologues, fewer than three non-eukaryotic orders or fewer than five sequences. It is important to note, with respect to these thresholds, that the breadth of prokaryotic sequences was heavily downsampled by the pangenomic approach. The sequences present in the sisters do not represent only themselves but also a cluster of other prokaryotic sequences. Therefore, each LECA clade may have been deeper within the prokaryotic group and nested, although we only detected sister-to relationships.

To detect the non-eukaryotic groups that transferred a given gene family to the protoeukaryote, we analysed the first sister of the mLECA-OG in the case of direct acquisitions and the first non-viral sister for virus-mediated acquisitions. First, we discarded a non-eukaryotic origin for the mLECA-OGs for those trees in which the sister clade contained more than 25% eukaryotic sequences, as although the sister would be mainly non-eukaryotic, the presence of eukaryotic sequences could have been the result of secondary transfers from eukaryotes to a broad set of bacteria; this would hamper identification of a confident sister. For the trees that passed this initial filter, we retrieved the proportion of each non-eukaryotic group present in the sister clade. Finally, to assign a taxon to the sister clade, we selected the most abundant group in the first sister (that is, the one with the maximum proportion). In the case of more than one non-eukaryotic group with maximum proportion in the first sister, we assumed that the clade was mixed and calculated its common ancestor; this could be Bacteria, Archaea, Viruses or a mixture of them.

Trees in which the first sister to the mLECA-OG was assigned to Nucleocytoviricota were assessed in more detail to assess the directionality of the transfer or the presence of secondary transfers from other groups. For this, we scanned the trees to search for other, non-viral sequences. For each tree with a sister in Nucleocytoviricota, we located the point at which the LECA and its closest sister were located and went back in the tree towards the root to locate the first sister for which the largest signal did not belong to viruses. When such cases were found, the taxonomy of that sister was assessed and considered as the potential source of genes transferred to the LECA through the viral lineage. Trees with only viruses as sisters and no other non-eukaryotic signal in the tree were classified as of unknown origin, as it was difficult to assess whether they were the result of a more recent HGT event from eukaryotes to viruses.

‘Stress test’ for identification of main non-eukaryotic contributors

To assess the congruence of the LECA and its relationship to its sister, we checked the ultrafast bootstrap support (UfBS) values of the branch separating the LECA and its first sister (Supplementary Fig. 11). However, the evolutionary depth of this project required a bootstrap support measure that accounted for broader taxonomic patterns rather than specific sequences moving through different nodes. Therefore, we designed the taxonomic bootstrap, a measure of the support of a given taxonomic affiliation for a certain node of the tree. This version of the bootstrap calculates the proportion of ultrafast bootstrap trees that support the same taxonomic assignment for the sister of the maximum likelihood tree regardless of the congruence of the sequences. In this case, we analysed the taxonomic bootstrap of the clade sister to LECA. To gain further insights about the assigned donor, we checked its proportion in the first sister and its presence in the second sister.

We used these measures to obtain the set of non-eukaryotic non-negligible donors beyond the assumed contributors (Alphaproteobacteria and Asgard archaea). We first established a set of parameters that were important for establishing reliable donors: the UfBS, the taxonomic bootstrap described above, the LECA criteria (3–9 supergroups) and the donor certainty (the proportion of the donor in the first sister). We then scanned the whole set of trees to obtain the thresholds for the parameters that maximized the proportion of trees assigned to Alphaproteobacteria and Asgard archaea, assuming that the selected thresholds properly removed noise and kept only the reliable signals. Then, we checked which other donors kept a substantial number of trees using the same thresholds. Although the maximum Alphaproteobacteria and Asgard archaea proportion was given when using an UfBS threshold lower than 75%, we searched for the next maximum with at least 75% UfBS (Supplementary Fig. 4). The resulting thresholds that maximized the proportion of assignments to Alphaproteobacteria and Asgard archaea were: (i) a UfBS higher than 75%, (ii) a taxonomic bootstrap support higher than 95%, (iii) 7 supergroups criterion for identification of the LECA, and (iv) a donor certainty higher than 95%. This resulted in identification of three contributors beyond Alphaproteobacteria and Asgardarchaeota: two bacterial clades (Planctomycetota and Myxococcota) and a viral phylum (Nucleocytoviricota).

Verticality test

To assess the possibility that bacterial signals emerged through vertical evolution of alphaproteobacterial donors, we repeated the whole LECA prediction process as described above. First, we collected from the GTDB species representatives a set of 35 representative alphaproteobacterial proteomes. We selected a representative proteome per order (maximum CheckM completeness and, in case of ties, minimal CheckM contamination) and then randomly selected 35 orders out of this set. Then, we filtered the pangenome to exclude all alphaproteobacterial protein sequences. Tree reconstruction and identification of ancestral alphaproteobacterial groups were done as described above. An ancestral group was defined by the presence of five Alphaproteobacteria species, depending on the dataset. Then, we identified the sister groups to those ancestral groups and classified them into the groups derived from the GTDB bacterial tree (for instance, S1 was represented by Gammaproteobacteria + Magnetococcia + Zetaproteobacteria). Then, we counted the number of trees that had a sister in each of the categories (Supplementary Fig. 4b). These values served as the baseline vertical signal for each of the main donor groups and represent the expected proportion of trees associated with secondary donors rather than the main alphaproteobacterial donor (Supplementary Fig. 4c).

Inferring the acquisition waves

We defined a LECA gene acquisition wave as a set of genes potentially acquired from the same donor during the same period of time. To identify such acquisition waves, we computed the normalized stem lengths as in ref. ⁴ and plotted the distributions of these branch length values with respect to the taxonomic affiliation obtained according to the above procedure. For duplications occurring during eukaryogenesis and resulting in more than one LECA group, we considered the LECA node with the shortest branch length to the non-eukaryotic sister, as in ref. ²¹. This resulted in distributions with different modes for each taxonomic affiliation (Supplementary Fig. 12), whereas applying this approach to trees for OGs annotated to different functions resulted in overlapping distributions with similar modes (Supplementary Fig. 13). We inferred the acquisition module using the density function; the genes that were around the peak of the distribution were selected as members of the acquisition module. To obtain these genes, we set the threshold to the first minimum after the first maximum or, in its absence, the first inflexion point of the density curve (dashed lines in Supplementary Fig. 12). Genes that transferred before the threshold (that is, those that were in the peak) were included in the module. By contrast, those with longer stem lengths were discarded, as we assumed that they were transferred posteriorly or the values resulted from heterotachy or ghost lineages^28,50.

To infer the relative timing of the transfers and their timeline, we inferred the mode of the stem length distribution using a Bayesian framework²⁸. In brief, this process uses a Markov chain Monte Carlo algorithm (JAGS v.4.3.2) to infer a posterior distribution for the parameters, mode and variance of a gamma distribution, which has been shown to properly fit the empirical distribution of the stem lengths²⁸. We used a pairwise comparison of the posterior distributions of the mode from different donors (assumed as relative ages) to establish a probabilistic timeline for the different transfers to the protoeukaryote.

Inferring the metabolism and features of the donors

To infer the ancestral features of the organisms that transferred genes to the LECA, for a given donor group, we first selected from its genus representatives the genomes sharing at least 50% of the genes transferred from that group to the LECA in its acquisition wave and annotated the genomes (Supplementary Fig. 14). These genomes constituted the ‘donor-like’ genomes set. Second, we calculated the frequency of each KO present in these genomes. Then, we ran anvi’o v.8 (ref. ⁹⁴) for each donor-like genome. For both the metabolic modules and the KOs, we calculated their prevalence (Supplementary Fig. 15), that is, the proportion of genomes containing a given KO or module. In the case of the metabolic modules, we transformed the prevalence by multiplying it by the mean module completeness in all the donor’s descendant genomes to take into account the completeness of the module. Using this metabolic reconstruction, we plotted the inferred donor’s metabolism using ggKEGG v.1.2.3 (ref. ⁹⁵). We further annotated each protein of the donor-like genome using hmmsearch against the COG database and related the results to the COG categories. Using the annotated COG categories, we calculated the overrepresentations of each COG category proportion for a given donor with respect to the LECA consensus proteome. In this case, overrepresentation against the LECA consensus proteome would mean functional enrichment of the donor’s contribution to the LECA. Using the modules and the KO prevalence, we assessed the presence and absence patterns of features relevant to the different eukaryogenesis models for all the donors. No statistical contrast for the enrichment was used, as the sample sizes were non-homogeneous and therefore did not provide enough statistical power, although we could study the general trends.

Inferring eukaryotic signature proteins

To define a new set of eukaryotic signature proteins, as defined in ref. ²⁶, we selected the KOs from the consensus proteome that were exclusively found as innovations.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Gene ancestries reveal diverse microbial associations during eukaryogenesis

Data retrieval, curation and proteome database reconstruction

eTOLDB construction

BroadDB construction

Reconstruction of LECA

LECA-OG reconstruction

Contamination assessment and filtering of LECA-OG sequences

Expanding LECA-OGs with closest non-eukaryotic homologues

Phylogenetic tree reconstructions

mLECA-OGs

Functional annotations of mLECA-OGs

Inference of the LECA proteomes

Inference of LECA metabolism and features

Comparison with FLUPs, FLUOs and FLUAs

Tracing mLECA-OG ancestries

Identifying the origins of mLECA-OGs

‘Stress test’ for identification of main non-eukaryotic contributors

Verticality test

Inferring the acquisition waves

Inferring the metabolism and features of the donors

Inferring eukaryotic signature proteins

Reporting summary

Daily briefing: The science of The Odyssey

Inside the high-stakes race to develop an Ebola vaccine for an escalating outbreak

The town heated by one big battery

Most Popular

The Edit: J.Crew’s August Drop

Listen to King Gizzard & the Lizard Wizard’s New Song “Alien Metal”

Shipt Inks First Footwear Retail Partnership With DSW

3 Reasons Innovation Deserves a Line in Your Budget

Recent Comments

ABOUT US

POPULAR POSTS

The Edit: J.Crew’s August Drop

Listen to King Gizzard & the Lizard Wizard’s New Song “Alien Metal”

Shipt Inks First Footwear Retail Partnership With DSW

POPULAR CATEGORY