Identification of SBP genes
Nineteen candidate SBP genes in the genome of Ca. P. ubique strain HTCC1062 were identified through a search of the TransportDB 2.0 database59 (http://membranetransport.org; accessed 22 January 2020). One of these genes, SAR11_0371, was annotated as a ‘possible transmembrane receptor’ in UniProt and showed a non-canonical predicted domain structure consisting of a short SBP-like domain (170 amino acids) followed by a coiled coil domain and unidentified C-terminal domain. Additionally, genome context analysis showed that, unlike the other ABC SBP genes in Ca. P. ubique HTCC1062, SAR11_0371 was not colocalized with genes encoding the membrane permease or ATP-binding cassette components of an ABC transport system. Thus, SAR11_0371 was considered not to represent the SBP component of an SBP-dependent transport system and was excluded from the analysis. We also attempted to identify additional SBP genes through a search of the UniProt database for proteins in Ca. P. ubique belonging to Pfam clans CL0177 (PBP; periplasmic binding protein) and CL0144 (Periplas_BP; periplasmic binding protein like); however, this search did not return any additional candidate genes.
Cloning
The protein sequence of each SBP from Ca. P. ubique HTCC1062 was obtained from the UniProt database. Signal sequences were predicted using the SignalP 5.0 server60 and removed. The protein sequences were then back-translated and codon-optimized for expression in E. coli, and the resulting genes were obtained as synthetic DNA from Twist Bioscience or Integrated DNA Technologies. The synthetic genes were cloned into the NdeI/XhoI site of the pET-28a(+) expression vector by In-Fusion cloning using the In-Fusion HD Cloning Kit (Takara Bio), yielding expression constructs with an N-terminal hexahistidine tag and thrombin tag. Correct assembly of each expression vector was confirmed by Sanger sequencing (FASMAC). The putative csiD gene, SAR11_1354, and several homologues of the Ca. P. ubique HTCC1062 SBPs (Supplementary Table 8) were cloned similarly into the pET-28a(+) vector, except that the thrombin tag was removed from the constructs of SAR11_1354, SAR11_0266 (Fub), and SAR11_1290 (SAR324). The sequences of oligonucleotides and synthetic genes used in this study are listed in Supplementary Table 9.
Optimization of protein expression
Protein expression was initially tested in E. coli BL21(DE3) cells grown in Luria-Bertani (LB) and Terrific Broth (TB) media at 30 °C and 17 °C. SAR11_0655 showed optimal soluble expression in LB medium at 17 °C, SAR11_1203 showed optimal soluble expression in TB medium at 30 °C, and eight proteins (SAR11_0797, SAR11_0807, SAR11_0864, SAR11_1068, SAR11_1179, SAR11_1210, SAR11_1238, and SAR11_1361) showed optimal soluble expression in TB medium at 17 °C. Next, the remaining proteins were tested for expression in E. coli SHuffle T7 cells (New England Biolabs) in TB medium at 17 °C; this strain expresses the disulfide bond isomerase DsbC, which can increase soluble recombinant expression of cytoplasmic proteins by promoting correct formation of disulfide bonds. Soluble expression of SAR11_0769, SAR11_0953, SAR11_1302, and SAR11_1336 was achieved under these conditions. Due to the lack of soluble expression for the remaining four proteins (SAR11_0266, SAR11_0271, SAR11_1290 and SAR11_1346), we also tested expression of one or two close homologues of each protein (Supplementary Table 8). The SAR11_0271 homologue from ‘Ca. Pelagibacter’ sp. HIMB1321 (denoted SAR11_0271*) could be expressed in soluble form in SHuffle T7 cells in TB medium at 17 °C, while the SAR11_1346 homologue from the same species (denoted SAR11_1346*) could be expressed in soluble form in BL21(DE3) cells in TB medium at 17 °C. SAR11_0271* and SAR11_1346* share 91.4% and 88.9% sequence identity, respectively, with the corresponding proteins from Ca. P. ubique HTCC1062, and the binding site residues are completely conserved (Supplementary Fig. 5), indicating that the functions and properties of the homologous SBPs are likely to be identical. None of the homologues of SAR11_0266 or SAR11_1290 could be expressed in soluble form in BL21(DE3) or SHuffle T7 cells. Expression of SAR11_0266 and SAR11_1290 without His6 or thrombin tags also yielded insoluble protein.
Protein expression was typically evaluated by SDS–PAGE analysis as follows. Cells transformed with the relevant expression vector by electroporation were spread from a frozen glycerol stock onto an LB agar plate containing 0.2% (w/v) glucose and 25 µg ml⁻¹ kanamycin and incubated at 30 °C overnight. The cells were then scraped into a small volume of LB medium and used to inoculate 3 ml of the relevant growth medium containing 25 µg ml⁻¹ kanamycin in a 10 ml round bottom tube at a starting OD600 of 0.05. The culture was incubated at 37 °C with shaking at 220 rpm until the OD600 reached 0.5. One-millilitre aliquots were transferred to clean round bottom tubes and isopropyl β-d-1-thiogalactopyranoside (IPTG) was added to a final concentration of 0.5 mM. The induced cultures were incubated with shaking at 220 rpm at 17 °C overnight or 30 °C for 3 h. A 500-µl aliquot of each culture was resuspended in lysis buffer (20 mM Tris, 0.5 M NaCl, 1% (v/v) Triton X-100, pH 8.0) and incubated at room temperature for 10 min. The cell lysate was centrifuged at 21,000g for 5 min (4 °C). The soluble fraction of the cell lysate was transferred to a tube containing 30 µl cOmplete His-Tag purification Ni-NTA resin (Roche) suspended in 500 µl buffer A (8 M urea, 20 mM Tris, 0.5 M NaCl, pH 8.0), while the insoluble fraction of the cell lysate was dissolved in 500 µl buffer A, centrifuged at 21,000g for 5 min, and then transferred to a tube containing 30 µl Ni-NTA resin suspended in 500 µl buffer A. In both cases, the resin was incubated at room temperature for 10 min, washed twice with 500 µl buffer A, and then eluted by incubation with 50 µl buffer B (8 M urea, 20 mM Tris, 0.5 M NaCl, 0.5 M imidazole, pH 8.0) at room temperature for 5 min. Fifteen microlitres of supernatant was mixed with 5 µl of 4× SDS–PAGE sample loading buffer and heated at 90 °C for 10 min, then loaded onto a 4–15% pre-cast SDS–PAGE gel (Bio-Rad). The gel was run at 200 V for 30 min and visualized with Coomassie Blue.
Large-scale protein expression and purification
For expression and purification of the Ca. P. ubique SBPs, E. coli BL21(DE3) or SHuffle T7 cells transformed with the relevant expression vector were spread from a frozen glycerol stock onto an LB agar plate containing 0.2% (w/v) glucose and 25 µg ml⁻¹ kanamycin, and incubated at 30 °C overnight. The cells were then scraped into 3 ml LB medium, and 500 µl of the resulting cell suspension was used to inoculate 500 ml LB or TB medium supplemented with 25 µg ml⁻¹ kanamycin in a 2 l or 3 l flask, preheated at 37 °C. The culture was incubated at 37 °C with shaking at 220 rpm until the OD600 reached 0.5, then cooled briefly in an ice-water bath until the temperature reached ~25 °C. IPTG was added to a concentration of 0.5 mM, and the culture was incubated at 17 °C with shaking at 220 rpm for a further 16 h. Cells were pelleted by centrifugation (3,300g, 15 min, 4 °C) and frozen at −20 °C until use. For protein purification, cells were thawed on ice, resuspended in 100 ml Ni binding buffer (20 mM Tris, 500 mM NaCl, 20 mM imidazole, pH 8.0), and lysed by sonication. After addition of 500 U Benzonase Nuclease (Sigma-Aldrich) to digest DNA, the cell lysate was centrifuged at 10,000g for 1 h (4 °C). The supernatant was filtered through a 0.45-µm syringe filter and then loaded onto a 1 ml HisTrap HP column (Cytiva) equilibrated with Ni wash buffer using an ÄKTA Pure FPLC system (Cytiva). For purification under native conditions, the column was washed with 10 ml Ni binding buffer followed by 10 ml Ni wash buffer (20 mM Tris, 500 mM NaCl, 44 mM imidazole, pH 8.0), and then the target protein was eluted in 10 ml Ni elution buffer (20 mM Tris, 500 mM NaCl, 500 mM imidazole, pH 8.0).
For purification under denaturing conditions, the column was washed with denaturing Ni binding buffer (8 M urea, 20 mM Tris, 250 mM NaCl, 20 mM imidazole, pH 8.0) at 1 ml min⁻¹ for 30 min after loading of the clarified cell lysate, and the target protein was eluted with 10 ml denaturing Ni elution buffer (8 M urea, 20 mM Tris, 250 mM NaCl, 250 mM imidazole, pH 8.0). Proteins purified under native conditions were concentrated to 400 µl using a 10 kDa molecular weight cut-off (MWCO) Amicon Ultra-4 centrifugal spin concentrator (Merck-Millipore) and purified by size-exclusion chromatography using a Superdex 200 Increase 10/300 column (Cytiva), eluting in DSF buffer (20 mM HEPES, 0.3 M NaCl, pH 7.50). For storage, proteins were concentrated to a volume of 0.5–2 ml and glycerol was added to a concentration of 10% (v/v). The protein was then flash-frozen in 100–200-µl aliquots in liquid nitrogen and stored at −80 °C until use. ArgT from S. enterica was expressed from a pETMCSIII plasmid and purified as described previously61.
Protein refolding
In most cases, protein purified under denaturing conditions was diluted to a concentration of 0.5 mg ml⁻¹ and volume of 10–30 ml in denaturing Ni binding buffer (8 M urea, 20 mM Tris, 250 mM NaCl, 20 mM imidazole, pH 8.0) and transferred to 10 kDa MWCO SnakeSkin dialysis tubing (Thermo Scientific). The protein was then dialysed against 2 l dialysis buffer (20 mM Tris, 150 mM NaCl, pH 8.0) at 4 °C with three buffer changes over a period of 24 h. The protein was collected and exchanged into DSF buffer using a 10 kDa MWCO Amicon Ultra-15 centrifugal concentrator, then concentrated to 400 µl and purified by size-exclusion chromatography as described above. For SAR11_1346*, an improved yield of monomeric protein was obtained using rapid dilution for refolding: 2 ml of denatured protein (5 mg ml⁻¹ in denaturing Ni binding buffer) was added dropwise with stirring to 40 ml pre-chilled refolding buffer (20 mM Tris, 150 mM NaCl, 10% (v/v) glycerol, pH 8.0) and incubated at 4 °C with stirring for 20 h. The protein was then concentrated and purified by size-exclusion chromatography as above.
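As a rough illustration of the rapid-dilution step, the following sketch computes the protein and residual urea concentrations after 2 ml of denatured protein is added to 40 ml of refolding buffer. It assumes that volumes are simply additive and that the refolding buffer contains no urea; the function name is hypothetical.

```python
def dilute(stock_conc, stock_volume_ml, diluent_volume_ml):
    """Concentration after mixing a stock into a diluent that lacks the solute.

    Assumes the two volumes are additive.
    """
    return stock_conc * stock_volume_ml / (stock_volume_ml + diluent_volume_ml)


# 2 ml of 5 mg/ml protein in 8 M urea, added to 40 ml refolding buffer
protein_mg_ml = dilute(5.0, 2.0, 40.0)
urea_molar = dilute(8.0, 2.0, 40.0)
print(f"protein: {protein_mg_ml:.3f} mg/ml, residual urea: {urea_molar:.2f} M")
```

Under these assumptions the protein ends up at roughly 0.24 mg ml⁻¹ in about 0.4 M urea.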
Differential scanning fluorimetry
DSF experiments were performed using a StepOnePlus Real-Time PCR System and StepOne software (Applied Biosystems) based on literature protocols62,63. Reaction mixtures were prepared in twin.tec Real-Time PCR Plates (Eppendorf) and contained 5× SYPRO Orange (Sigma-Aldrich), 2.5 µM protein, and 2 µl 10× ligand in a total volume of 20 µl DSF buffer. The plate was sealed with optically clear sealing film and centrifuged at 2,000g for 1 min before loading into the real-time PCR instrument. The temperature was ramped at a rate of 1% (approximately 1.33 °C min⁻¹), typically over a 60 °C window centred on the melting temperature (TM) of the target protein. Fluorescence was monitored using the ROX channel. TM values were determined by taking the derivative of fluorescence intensity with respect to temperature and fitting the resulting data to a quadratic equation in a 6 °C window in the vicinity of the TM in R software.
Proteins were initially screened for binding to metabolites in four Phenotype MicroArray plates, PM1 to PM4 (Biolog). The contents of each well were dissolved in 50 µl (PM1 to PM3) or 20 µl (PM4) sterile filtered water, giving a concentration of approximately 10–20 mM in each well63. The plates were then sealed with aluminium sealing films and stored at −80 °C. Prior to use, the plates were thawed at room temperature and then shaken at 30 °C until the compounds had redissolved. Two microlitres of each compound was added to 18 µl reaction mixture prepared as described above. A 2 °C increase in TM compared with the median value across the plate was taken as indicative of binding63,64.
For screening of individual compounds and confirmatory assays, compounds were dissolved at a concentration of 100 mM in ligand buffer (0.1 M HEPES pH 7.5), and the pH was adjusted with 1 M NaOH or 1 M HCl if necessary (specifically, if the pH of a 10 mM solution of the compound diluted in DSF buffer fell outside the range 6.5–8.0). These stock solutions were stored at −20 °C. Two microlitres of each compound was directly added to 18 µl reaction mixture, giving a final concentration of 10 mM, or first diluted 10-fold or 100-fold in DSF buffer to give final concentrations of 1 mM or 0.1 mM in the assay. A list of chemicals used for screening, including the supplier and catalogue number, is provided in Supplementary Table 3. Sodium (R)- and (S)-2,3-dihydroxypropane-1-sulfonate were synthesized from (R)- and (S)-3-chloro-1,2-propanediol following a literature protocol65 and verified by 1H and 13C NMR.
In the case of the TRAP and TTT SBPs, SAR11_0864 and SAR11_1203, we hypothesized that a metal ion might be required for high-affinity binding. For SAR11_0864, the melting curve observed in the presence of isethionate in Biolog screening experiments was biphasic, suggesting the presence of a mixture of active and inactive protein; for SAR11_1203, there was a discord between the highly charged ligand and the largely uncharged binding site of the SBP. Therefore, we tested the effect of the addition of metal ions (Mg2+, Ca2+, K+, Zn2+, Mn2+, Co2+, Ni2+, Fe2+ and Fe3+) on binding of isethionate to SAR11_0864 and citrate to SAR11_1203 by DSF (Supplementary Fig. 6). DSF experiments were performed using refolded protein as described above, with the addition of 1 mM metal ion and 1 mM ligand. Based on these results, and considering the concentration of each metal ion in seawater66, 10 mM CaCl2 (SAR11_0864) or 53 mM MgSO4 (SAR11_1203) was included in subsequent DSF and ITC binding experiments for these SBPs.
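The TM determination described above (derivative of fluorescence with respect to temperature, then a quadratic fit in a 6 °C window) can be sketched as follows. This is an illustrative Python re-implementation under stated assumptions, not the R code used in the study: the derivative is taken by central differences, the window is centred on the grid point with the largest derivative, and the input data are synthetic.

```python
import math


def derivative(temps, fluor):
    """Central-difference dF/dT at the interior temperature points."""
    d_t, d_f = [], []
    for i in range(1, len(temps) - 1):
        d_t.append(temps[i])
        d_f.append((fluor[i + 1] - fluor[i - 1]) / (temps[i + 1] - temps[i - 1]))
    return d_t, d_f


def fit_quadratic(xs, ys):
    """Least-squares fit y = a*x^2 + b*x + c via the 3x3 normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for i in range(3):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [u - f * v for u, v in zip(m[r], m[i])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def melting_temperature(temps, fluor, window=6.0):
    """Vertex of a quadratic fitted to dF/dT within `window` degrees of its peak."""
    d_t, d_f = derivative(temps, fluor)
    peak = d_t[max(range(len(d_f)), key=d_f.__getitem__)]
    pts = [(t - peak, d) for t, d in zip(d_t, d_f) if abs(t - peak) <= window / 2]
    a, b, _ = fit_quadratic([p[0] for p in pts], [p[1] for p in pts])
    return peak - b / (2 * a)


# Synthetic two-state melting curve with a midpoint at 55.2 degrees C
temps = [25 + 0.5 * i for i in range(121)]
fluor = [1 / (1 + math.exp(-(t - 55.2) / 2)) for t in temps]
tm = melting_temperature(temps, fluor)
print(f"estimated TM: {tm:.2f}")
```

Centring the x-values on the peak before fitting keeps the normal equations well conditioned; the fitted vertex refines the TM beyond the sampling interval of the temperature ramp.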
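The hit criterion above (a TM at least 2 °C above the plate median) can be expressed as a short sketch. The well names and TM values below are invented for illustration only.

```python
from statistics import median


def call_hits(tm_by_well, threshold=2.0):
    """Return wells whose TM exceeds the plate median by at least `threshold`."""
    plate_median = median(tm_by_well.values())
    return {
        well: round(tm - plate_median, 2)
        for well, tm in tm_by_well.items()
        if tm - plate_median >= threshold
    }


# Hypothetical TM values (degrees C) for five wells of one plate
example_plate = {"A1": 50.1, "A2": 50.0, "A3": 49.9, "A4": 56.3, "A5": 50.2}
hits = call_hits(example_plate)
print(hits)
```

Using the plate median as the reference makes the criterion robust to a small number of strongly stabilizing compounds on the same plate.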
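The final assay concentrations quoted above follow from simple dilution arithmetic, sketched here. The function name and default arguments are illustrative; the numbers (100 mM stock, 2 µl added to a 20 µl reaction, optional 10-fold or 100-fold pre-dilution) come from the text.

```python
def final_conc_mm(stock_mm, predilution_fold, added_ul=2.0, total_ul=20.0):
    """Final ligand concentration (mM) after optional pre-dilution and addition."""
    return stock_mm / predilution_fold * added_ul / total_ul


final_concs = [final_conc_mm(100.0, fold) for fold in (1, 10, 100)]
print(final_concs)
```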
Isothermal titration calorimetry
ITC experiments were performed using a MicroCal PEAQ-ITC system (Malvern Panalytical). Protein samples were refolded and freshly purified (not frozen), and protein and ligand samples were prepared in the same batch of DSF buffer used for size-exclusion chromatography to minimize the heat of dilution. For SAR11_0864 and SAR11_1203, calcium chloride (final concentration 10.3 mM) or magnesium sulfate (final concentration 53 mM), respectively, was added to the protein and ligand samples. Experiments were performed at 25 °C with stirring at 700 rpm and 10 µcal s⁻¹ reference power. Titration parameters were varied depending on the protein yield, the fraction of active protein, and the affinity and enthalpy of the interaction. In a typical titration, 35 µM protein was titrated with 1 × 0.4-µl and 19 × 1.6-µl injections of ligand, with the ligand concentration chosen to give >1.5-fold molar excess of ligand to active protein at the end of the titration. ITC experiments were generally performed at least in duplicate.
For simple 1:1 binding interactions, the association constant (Ka), enthalpy (ΔH), and stoichiometry (n) of the interaction were determined by fitting the data to the one-set-of-sites model in MicroCal PEAQ-ITC analysis software. In the case of the SAR11_0769 + d-glucose interaction, thermodynamic parameters were estimated through Bayesian fitting to a modified competitive binding model, which incorporated an additional parameter to account for the fraction of the ligand in each anomeric form, and a two-sets-of-sites model implemented in pytc software67; the latter model is equivalent to the two-sets-of-sites model in the MicroCal software, except without the minor correction for heat associated with the displaced volume for each injection (for consistency with the other models in pytc). Thermodynamic parameters for the SAR11_0953 + l-glutamate, SAR11_1203 + citrate, SAR11_1210 + l-arginine, SAR11_1336 + glycine betaine, and SAR11_1346* + l-leucine interactions were determined through competitive displacement experiments68, in which l-phenylalanine, cis-aconitate, d-octopine, glycine, or l-serine (respectively) were included at a fixed concentration in the cell to reduce the apparent binding affinity for the ligand of interest. The data for these competitive binding experiments were analysed by Bayesian fitting to the competitive binding sites model in pytc software. To confirm the high affinity of the SAR11_1210 + l-arginine interaction, a competitive binding experiment was performed where SAR11_1210 and ArgT from S. enterica (which has a Kd of 15 nM for l-arginine) were included in the cell together at the same concentration (28 µM) and titrated with l-arginine. Similarly, for the SAR11_1210(E108A) + l-arginine interaction, a mixture of SAR11_1210(E108A) and SAR11_1210 (35 µM each) was titrated with l-arginine.
For these titrations, the data were fitted to a two-sets-of-sites binding model as described above to obtain thermodynamic parameters for both protein–ligand interactions. For all analyses, the heat of dilution was assumed to be a small constant value and included as a fitted parameter in the model. The validity of this assumption was confirmed for each ligand by performing a control titration where the ligand was injected into DSF buffer.
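To illustrate how a syringe concentration might be chosen for the typical titration above, the sketch below estimates the minimum ligand concentration that reaches a 1.5-fold molar excess over active protein by the final injection. The 200 µl cell volume and the active fraction of 1.0 are assumptions made for this example (neither is stated in the text), and the calculation ignores the volume displaced from the cell during injections.

```python
def min_syringe_conc_um(protein_um, active_fraction, cell_volume_ul,
                        injections_ul, target_ratio=1.5):
    """Lowest ligand concentration (µM) giving `target_ratio` ligand:active protein.

    Simplified mass balance: total ligand injected divided by total active
    protein in the cell, with no correction for displaced volume.
    """
    injected_ul = sum(injections_ul)
    return target_ratio * protein_um * active_fraction * cell_volume_ul / injected_ul


# 1 x 0.4 µl followed by 19 x 1.6 µl injections, as in the typical titration
injections = [0.4] + [1.6] * 19
ligand_um = min_syringe_conc_um(35.0, 1.0, 200.0, injections)  # cell volume assumed
print(f"total injected: {sum(injections):.1f} µl, ligand >= {ligand_um:.0f} µM")
```

If only part of the refolded protein is active, lowering `active_fraction` reduces the ligand concentration needed to reach the same excess over active protein.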
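The rationale for the competitive displacement experiments can be illustrated with the standard relation for a competitor present in large excess, where the apparent dissociation constant of the ligand of interest is raised by a factor of (1 + [B]/Kd,B). The sketch below is not the Bayesian fitting procedure used in pytc; the numerical values are hypothetical and the free competitor concentration is approximated by its total concentration.

```python
def apparent_kd(kd_ligand, competitor_conc, kd_competitor):
    """Apparent Kd of a tight ligand in the presence of a weaker competitor.

    All quantities in molar units; assumes 1:1 binding at a single site and
    competitor in large excess over protein.
    """
    return kd_ligand * (1.0 + competitor_conc / kd_competitor)


def true_kd(kd_apparent, competitor_conc, kd_competitor):
    """Invert the relation to recover the ligand Kd from the apparent value."""
    return kd_apparent / (1.0 + competitor_conc / kd_competitor)


# Hypothetical example: 1 nM ligand, 1 mM competitor with a Kd of 10 µM
kd_app = apparent_kd(1e-9, 1e-3, 1e-5)
print(f"apparent Kd: {kd_app:.2e} M")
```

Shifting the apparent affinity into a weaker range in this way brings an interaction that would otherwise be too tight to resolve into the measurable window of the titration.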
Spectrophotometric analysis of iron(iii) binding
Binding of iron(iii) to SAR11_1238 was analysed using a spectrophotometric assay based on literature protocols69,70. UV–vis spectra were recorded at room temperature (25 °C) in a 96-well plate from 300 nm to 630 nm with 1 nm bandwidth using a Multiskan GO spectrophotometer (Thermo Scientific). An initial protein concentration of 100 µM and an initial volume of 200 µl were used for all spectrophotometric assays. First, purified SAR11_1238 was thawed and exchanged into 50 mM Tris, 200 mM NaCl buffer (pH 8.0) using a centrifugal concentrator, and the spectrum of the resulting protein sample was recorded. To prepare unliganded protein for iron-binding assays, the protein was exchanged into 50 mM Tris, 200 mM NaCl, 20 mM sodium citrate buffer (pH 8.0) by three rounds of 30-fold dilution and concentration, allowing chelation and removal of the metal ligand. Citrate was then removed by four rounds of 30-fold dilution and concentration with 50 mM Tris, 200 mM NaCl buffer (pH 8.0). Binding assays were performed by titrating the unliganded protein (200 µl of 100 µM solution) with 8 × 5-µl or 10 × 5-µl injections of 800 µM iron(iii) solution, which was prepared from iron(iii) chloride and a 2.5-fold molar excess of trisodium citrate (which ensures that the iron(iii) remains soluble) in ultrapure water. To confirm that SAR11_1238 binds iron(iii) rather than the iron(iii)–citrate complex, the protein was also titrated under the same conditions with 800 µM ammonium iron(ii) sulfate; under the aerobic conditions of the assay, iron(ii) is rapidly oxidized to iron(iii)69. UV–vis spectra were recorded 1 min (iron(ii)) or 15 min (iron(iii)) after each injection. Finally, a competitive binding assay with citrate was used to estimate the affinity of SAR11_1238 for iron(iii).
The protein was saturated with a twofold molar excess of iron(iii) solution, diluted to a volume of 1 ml, and then dialysed against 500 ml of 50 mM Tris, 200 mM NaCl buffer (pH 8.0) at 4 °C overnight to remove excess iron(iii) and citrate. The protein was then concentrated to 100 µM and titrated with 5-µl injections of 8 twofold serial dilutions of 500 mM sodium citrate (adjusted to pH 8.0 in 50 mM Tris, 200 mM NaCl buffer). The absorbance at 440 nm was recorded 5 min after each addition. The data were fitted to a hyperbolic curve, yielding an apparent Kd of 9.0 mM for citrate. Given that citrate has a Kd of ~10⁻¹⁷ M for iron(iii), this implies that SAR11_1238 has a Kd for iron(iii) on the order of ~10⁻¹⁹ M, similar to previously characterized iron(iii)-binding proteins70,71.
X-ray crystallography
For the SAR11_0769/d-glucose and SAR11_1210/l-arginine structures, the proteins were first expressed and purified by nickel affinity chromatography under native conditions as described above. After addition of a 20-fold molar excess of d-glucose (SAR11_0769) or l-arginine (SAR11_1210), the protein was purified further by size-exclusion chromatography on a HiLoad 26/600 Superdex 75 pg column (Cytiva), eluting in 3× crystallization buffer (60 mM HEPES, 150 mM NaCl, pH 7.5). Fractions containing the target protein were collected, and d-glucose (SAR11_0769) or l-arginine (SAR11_1210) was added to a concentration of 30 µM. The protein was concentrated to a volume of ~500 µl, diluted threefold in water to reduce the NaCl concentration to 50 mM, and then concentrated further to 12 mg ml⁻¹. For the SAR11_0769/d-galactose and SAR11_0655/l-pyroglutamate structures, the proteins were expressed and purified in the same way, except that no ligands were added. Protein crystals were obtained using the vapour diffusion method in hanging drops at 20 °C, then cryoprotected and flash-frozen in liquid nitrogen. Crystallization and cryoprotection conditions for each protein are given in Supplementary Methods. X-ray diffraction data were collected on beamline BL32XU at the SPring-8 synchrotron (Harima, Japan), using the ZOO suite for automated data collection72. The data were automatically indexed, integrated, scaled and merged in XDS73 using KAMO74. The structure was solved by molecular replacement in Phaser75 or MOLREP76. For SAR11_1210, the structure of an opine-binding protein from Agrobacterium fabrum (PDB ID 5OT8) was used as a search model; in the remaining cases, an AlphaFold2 model was used77. The structures were then refined by iterative real-space and reciprocal-space refinement in REFMAC78, Phenix79, and COOT80. Data collection and refinement statistics are given in Supplementary Table 10 and Supplementary Table 11. Structures were visualized in PyMOL.
Gas chromatography–mass spectrometry
SBPs purified under native conditions were exchanged into 200 mM ammonium acetate using a PD-10 desalting column (Cytiva) and concentrated to ~1 mM. A 10-nmol aliquot of protein was mixed with 10 µl of 300 µM α-methylglucopyranoside (as an internal control) and 200 µl methanol. The mixture was agitated at 1,500 rpm at 24 °C for 10 min and then centrifuged at 21,000g for 20 min at 4 °C. The supernatant was evaporated to dryness using a vacuum evaporator, redissolved in 20 µl anhydrous pyridine, and derivatized by addition of 30 µl N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA) containing 1% trimethylchlorosilane (Supelco) followed by incubation at 70 °C for 1 h. In the case of SAR11_1361, the dried sample was instead dissolved in 20 µl of 20 mg ml⁻¹ methoxyamine hydrochloride in anhydrous pyridine and incubated at 37 °C for 90 min with agitation at 750 rpm before addition of the MSTFA mixture. The derivatized samples were injected immediately onto an Agilent 7890A GC System (Agilent Technologies) equipped with a PAL COMBI-XT autosampler (CTC Analytics) and connected to a PEGASUS 4D GC×GC TOF-MS instrument (LECO) operating in one-dimensional mode. The GC was fitted with a DB-1MS column (Agilent Technologies) with 30 m length, 0.25 mm internal diameter, and 0.25 µm film thickness. The instrument was operated in pulsed split mode with a split ratio of 2 and injection volume of 1 µl. The inlet temperature was 250 °C. Helium was used as the carrier gas with a flow rate of 1 ml min⁻¹. The GC oven temperature was held at 70 °C for 5 min, then raised at 12 °C min⁻¹ to 300 °C, and finally held at 300 °C for 10 min. Mass spectrometry data were collected from 50 to 500 m/z after a 6.5-min solvent delay. The ion source and transfer line temperatures were 250 °C and the ionization energy was 70 eV. Data analysis and spectral database searches against the NIST database were performed using ChromaTOF software (LECO).
Protein-derived samples were analysed before control samples to prevent carryover.
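For planning purposes, the length of the GC oven programme described above can be worked out from the hold times and ramp rate. The sketch below is a simple arithmetic illustration; the function name is hypothetical.

```python
def oven_program_minutes(start_c, hold_start_min, ramp_c_per_min, end_c, hold_end_min):
    """Total duration of a hold / linear ramp / hold oven programme in minutes."""
    ramp_min = (end_c - start_c) / ramp_c_per_min
    return hold_start_min + ramp_min + hold_end_min


# 70 C for 5 min, 12 C/min to 300 C, then 10 min at 300 C
run_minutes = oven_program_minutes(70.0, 5.0, 12.0, 300.0, 10.0)
print(f"oven programme length: {run_minutes:.1f} min")
```

With these settings the ramp takes just over 19 min, so each run lasts about 34 min, of which the first 6.5 min fall within the solvent delay.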
Biogeographical analysis
Biogeographical analysis was performed using the Ocean Gene Atlas v2.0 server33. Abundance data for each SBP gene from Ca. P. ubique HTCC1062 in the Tara Oceans OM-RGC_v2_metaG and OM-RGC_v2_metaT datasets were obtained through a BLAST search with a stringent e-value threshold of 10⁻³⁰. To avoid inclusion of homologous SBPs with different transport functions, hits with a sequence identity of less than 40% (for ABC SBPs) or 55% (for TRAP and TTT SBPs) compared with the corresponding HTCC1062 SBP were excluded from the analysis.
To estimate the total abundance of SBP transcripts, abundance data for each of the 38 Pfam families in CL0177 (PBP; periplasmic binding protein) and CL0144 (Periplas_BP; periplasmic binding protein like), excluding the transferrin family (PF00405) and any families that contain solely enzymes or transcription factors (PF00800, PF01379, PF01634, PF02621, PF03466, PF09084), were obtained using a hmmer search of the OM-RGC_v2_metaT dataset with an e-value threshold of 10⁻¹⁰. Hits were obtained for 26 out of 31 Pfam families. For each Pfam family, the corresponding hidden Markov model (HMM) was obtained from the InterPro database81. The protein sequences from the hmmer search were then aligned to this HMM using hmmalign and used to construct a new HMM using hmmbuild in HMMER 3.4 (http://hmmer.org). A second hmmer search of the OM-RGC_v2_metaT dataset, with a less stringent e-value threshold of 10⁻⁵, was then conducted using the resulting HMM. The hits from all 52 searches were combined and redundant hits were removed, resulting in a total of 211,222 unique SBP genes. The two-step search recovered 94% of the 23,879 genes identified as homologues of the Ca. P. ubique HTCC1062 SBPs in the BLAST analysis before application of a sequence identity threshold; the remaining 1,267 genes were also added to the list of SBP genes. Finally, the total abundance of SBP genes at each site was calculated.
To estimate the percentage of SAR11 bacteria at a site containing a given SBP from Ca. P. ubique HTCC1062, we used the recruitment values of 159 SAR11 genomes in the Tara Oceans metagenome dataset calculated by Haro-Moreno et al.34. The presence of a homologue of each SBP in each of the corresponding genomes was determined by BLAST using a 50% sequence identity and 50% coverage threshold. The relative abundance of SAR11 bacteria containing a given SBP homologue was then calculated for each station. Plots were generated using R and GraphPad Prism.
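The filtering rule described above can be sketched as follows. The record layout (dictionaries with `evalue` and `identity` keys) and the example hits are hypothetical; only the thresholds are taken from the text.

```python
EVALUE_CUTOFF = 1e-30
# Minimum percent identity to the HTCC1062 SBP, by transporter family
IDENTITY_CUTOFF = {"ABC": 40.0, "TRAP": 55.0, "TTT": 55.0}


def keep_hit(hit, sbp_family):
    """True if a BLAST hit passes both the e-value and identity thresholds."""
    return (hit["evalue"] <= EVALUE_CUTOFF
            and hit["identity"] >= IDENTITY_CUTOFF[sbp_family])


# Hypothetical hits for one ABC-type SBP query
example_hits = [
    {"gene": "hit_1", "evalue": 1e-80, "identity": 72.5},
    {"gene": "hit_2", "evalue": 1e-35, "identity": 38.0},  # identity too low
    {"gene": "hit_3", "evalue": 1e-12, "identity": 60.0},  # e-value too high
]
kept = [h["gene"] for h in example_hits if keep_hit(h, "ABC")]
print(kept)
```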
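The bookkeeping in the final steps above (combining hits from all searches, removing redundant hits, checking recovery of the BLAST homologues and adding any that were missed) amounts to set operations, as in this sketch. The gene identifiers and search labels are invented, and the function assumes a non-empty set of BLAST homologues.

```python
def merge_sbp_hits(hmm_hits_by_search, blast_homologues):
    """Combine HMM search hits, deduplicate, and add missed BLAST homologues.

    Returns the final gene set and the fraction of BLAST homologues that the
    HMM searches recovered on their own.
    """
    hmm_union = set().union(*hmm_hits_by_search.values())
    recovered = len(hmm_union & blast_homologues) / len(blast_homologues)
    return hmm_union | blast_homologues, recovered


# Toy example with two families, each searched twice
toy_searches = {
    ("PF_a", "round1"): {"g1", "g2"},
    ("PF_a", "round2"): {"g1", "g2", "g3"},
    ("PF_b", "round1"): {"g4"},
    ("PF_b", "round2"): {"g4", "g5"},
}
toy_blast = {"g2", "g3", "g5", "g9"}
all_genes, recovery = merge_sbp_hits(toy_searches, toy_blast)
print(len(all_genes), recovery)
```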
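One way to read the per-station calculation above is as a recruitment-weighted fraction: the summed recruitment of genomes that carry a homologue divided by the summed recruitment of all SAR11 genomes at that station. The sketch below implements this interpretation, which is an assumption; genome names and values are hypothetical.

```python
def fraction_with_sbp(recruitment, has_homologue):
    """Recruitment-weighted fraction of SAR11 genomes carrying an SBP homologue.

    `recruitment` maps genome -> recruitment value at one station;
    `has_homologue` maps genome -> bool (BLAST presence/absence call).
    """
    total = sum(recruitment.values())
    if total == 0:
        return 0.0
    with_sbp = sum(v for g, v in recruitment.items() if has_homologue.get(g, False))
    return with_sbp / total


# Hypothetical station with three genomes
station = {"genome_1": 30.0, "genome_2": 10.0, "genome_3": 60.0}
presence = {"genome_1": True, "genome_2": True, "genome_3": False}
frac = fraction_with_sbp(station, presence)
print(f"{100 * frac:.0f}% of recruited SAR11 carry the homologue")
```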
Phylogenetic analysis
Protein sequences homologous to the SBP of interest were identified via a BLAST search of the UniProtKB Reference Proteomes and Swiss-Prot databases82. The resulting sequences were filtered to remove a small number of unusually long sequences (>20% greater than mean length) and aligned in MUSCLE v3.8.3183. The alignment was trimmed in trimAl v1.2 using the automated1 option84 and then used to generate a maximum-likelihood phylogeny in FastTree v2.1.11, using LG + Γ20 as the substitution model85. For each protein sequence in the tree, the fraction of conserved binding site residues, compared with the corresponding protein from Ca. P. ubique HTCC1062, was estimated. The binding site residues were obtained from the crystal structure (SAR11_0769) or estimated from an AlphaFold2 model86,87. For this analysis, the following substitutions were treated as conservative: S/T, I/M, V/L, I/V, L/M, D/E, Q/N, A/V, F/Y, Y/W, F/W. Phylogenetic tree figures were generated using the ggtree package in R88. Figures showing taxonomic distribution (Extended Data Fig. 8b) were generated using Krona89.
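The binding-site conservation score described above can be sketched as a position-by-position comparison that counts identical residues and the listed conservative substitutions as conserved. The example residue strings are hypothetical, and treating alignment gaps as non-conserved is an assumption made for this sketch.

```python
# Substitution pairs treated as conservative in the analysis (unordered)
CONSERVATIVE = {frozenset(pair) for pair in
                ("ST", "IM", "VL", "IV", "LM", "DE", "QN", "AV", "FY", "YW", "FW")}


def conserved_fraction(reference_residues, query_residues):
    """Fraction of binding-site positions that are identical or conservative.

    Both arguments are equal-length strings of one-letter residue codes taken
    from aligned binding-site positions; '-' (gap) never counts as conserved.
    """
    if len(reference_residues) != len(query_residues):
        raise ValueError("residue strings must be aligned to the same length")
    conserved = sum(
        1 for ref, qry in zip(reference_residues, query_residues)
        if qry != "-" and (ref == qry or frozenset((ref, qry)) in CONSERVATIVE)
    )
    return conserved / len(reference_residues)


# Hypothetical binding sites: S->T, D->E and F->Y are conservative, W->A is not
score = conserved_fraction("SDFW", "TEYA")
print(score)
```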
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.