The genetic architecture of protein stability

September 27, 2024

120

Combinatorial mutagenesis library designs

Combinatorial library 1

Library 1 was designed using a computationally efficient greedy strategy to search for the largest number of single aa substitutions that, when combined, preserve both fold and function even in the highest-order mutants (Fig. 1b). The algorithm used previously published ddPCA data and thermodynamic modelling results for GRB2-SH3, including inferred single aa substitution free energy changes of folding and binding for this protein²³. We showed previously that this modelâwhich assumes that individual inferred folding and binding free energy changes (ÎÎG_f and ÎÎG_b) combine additively in multi-mutantsâaccurately predicts the effects of double aa substitutions²³. Therefore, this same additive model was used to make predictions about the energetic and phenotypic effects of higher-order mutants explored in the greedy search.

First, the set of candidate single aa mutations was restricted to those with confident free energy changes, defined as those with 95% confidence intervalsâ<â1âkcalâmol^â1 and whose effects were measured in at least 20 genetic backgrounds (that is, double aa mutations). Candidate mutations were further restricted to those reachable by single-nucleotide substitutions in the wild-type sequence to simplify synthesis of the resulting combinatorial mutagenesis library. The algorithm begins from an arbitrary starting mutation and iteratively selects further mutations at other residue positions until all residues in the protein have been mutated. The heuristic works by selecting further mutations at each step that maximize the fold and function of the current highest-order mutant combination, that is, the geometric mean of model-predicted AbundancePCA and BindingPCA growth rates. This procedure is then repeated for all possible starting mutations.

To visualize and compare the resulting solutions, we also simulated the median AbundancePCA and BindingPCA growth rates of all candidate combinatorial libraries, calculated using a random sample of 10,000 variants. Although the algorithm is not guaranteed to produce the optimal solution at each Hamming distance from the wild-type sequence, the greedy approach nevertheless achieves solutions in which both phenotypes are predicted to be preserved in variants with more than 30 mutations (Extended Data Fig. 1b), beyond which one or both phenotypes are lost. Defining viable libraries as those preserving both molecular phenotypes above 70% of the maximal value (that is, the geometric mean of simulated median AbundancePCA and BindingPCA growth rates) resulted in the largest candidate combinatorial library consisting of all combinations of 34 single aa mutations (Fig. 1 and Extended Data Fig. 1bâd).

Combinatorial library 2

We clustered the contact map (minimal side-chain heavy-atom distanceâ<â5âÃ) comprising all GRB2-SH3 surface residues (RSASAââ¥â0.25) existing in secondary structure elements (Extended Data Fig. 4) and selected the following four physically proximal residues for saturation combinatorial mutagenesis: H26, M28, A39 and T44 (see Extended Data Fig. 5).

Combinatorial library 3

This library was designed to include all combinations of 15 single aa substitutions with mild effects (within one-third of the AbundancePCA fitness interquartile range of the wild type²³) in close proximity in the primary sequence and reachable by single-nucleotide substitutions while avoiding mutations in binding interface residues (minimal side-chain heavy-atom distance to the ligandâ<â5âÃ). We used a sliding window approach to determine the number of candidate mutant residues in stretches of 20, 21 and 22 consecutive residues in GRB2-SH3 (Extended Data Fig. 4b). Only one window with a width of 22âaa (starting at residue position 10) includes 15 candidate positions (Extended Data Fig. 4b). The final library consisted of all combinations of the following randomly selected candidate mutations at these positions: D10N, P11A, D14N, G15E, G18C, R20S, R21Q, D23E, F24I, H26L, V27I, M28K, D29E, N30T and S31T (see Fig. 4).

Combinatorial library 4: SRC

This library was designed using the same greedy algorithm from data and thermodynamic modelling results for SRC⁵¹, including inferred single aa substitution free energy changes of folding and activity for this protein. The design includes 15 single aa substitutions reachable by single nt substitution in a 22âaa window, located in the N-lobe of the SRC kinase domain, avoiding mutations in the activation loop, subsetting folding and activity ddGs to confident energies (95% confidence intervalâ<â1âkcalâmol^â1) and associated with singles observed in at least seven backgrounds. The final library consisted of all combinations of the following randomly selected candidate mutations at these positions: V329G, G344S, F349V, K343M, E331K, V337A, E332A, M341K, S330N, I336L, T338S, S345T, L346V, P333T and Y340S (see Fig. 6).

Mutagenesis library construction and selection assays

Media and buffers used

LB: 10âgâl^â1 bacto-tryptone, 5âgâl^â1 yeast extract, 10âgâl^â1 NaCl. Autoclaved 20âmin at 120âÂ°C.
YPDA: 20âgâl^â1 glucose, 20âgâl^â1 peptone, 10âgâl^â1 yeast extract, 40âmgâl^â1 adenine sulphate. Autoclaved 20âmin at 120âÂ°C.
SORB: 1âM sorbitol, 100âmM LiOAc, 10âmM Tris pHâ8.0, 1âmM EDTA. Filter sterilized (0.2-mm nylon membrane, Thermo Scientific).
Plate mixture: 40% PEG3350, 100âmM LiOAc, 10âmM Tris-HCl pHâ8.0, 1âmM EDTA pHâ8.0. Filter sterilized.
Recovery medium: YPD (20âgâl^â1 glucose, 20âgâl^â1 peptone, 10âgâl^â1 yeast extract)â+â0.5âM sorbitol. Filter sterilized.
SC-URA: 6.7âgâl^â1 yeast nitrogen base without aa, 20âgâl^â1 glucose, 0.77âgâl^â1 complete supplement mixture drop-out without uracil. Filter sterilized.
SC-URA/MET/ADE: 6.7âgâl^â1 yeast nitrogen base without aa, 20âgâl^â1 glucose, 0.74âgâl^â1 complete supplement mixture drop-out without uracil, adenine and methionine. Filter sterilized.
Competition medium: SC-URA/MET/ADEâ+â200âÎ¼gâml^â1 methotrexate (Merck Life Science), 2% DMSO.
DNA extraction buffer: 2% Triton-X, 1% SDS, 100âmM NaCl, 10âmM Tris-HCl pHâ8.0, 1âmM EDTA pHâ8.0.

Plasmid construction

For libraries 1â3: GRB2 mutagenesis plasmid pGJJ286: wild-type GRB2-SH3 was digested from pGJJ046 (described previously²³) with the restriction enzymes AvrII and HindIII and cloned into the digested plasmid pGJJ191 (described previously²⁴) using T4 ligase (New England Biolabs). AbundancePCA pGJJ046 and pGJJ045 plasmids and BindingPCA pGJJ034 and pGJJ001 plasmids were previously described²³. For library 4: pTB043 plasmid containing full-length SRC was described previously⁵¹. pTB043 is based on the same backbone as the AbundancePCA plasmids. The difference is that full-length SRC is fused to the DHFR[3] fragment at its NÂ terminus and to the DHFR[1,2] fragment at its CÂ terminus, so DHFR is reconstituted following correct folding of SRC,Â whereas unfolded SRC genotypes result in degradation of the fusion protein.

Libraries construction

Libraries 1â3: libraries were constructed in two steps. First, an IDT primer containing the chosen combination of mutations was assembled by Gibson into the mutagenesis plasmid pGJJ286. Libraries were then cloned into the yeast plasmids AbundancePCA pGJJ045 and BindingPCA pGJJ001 by digestion/ligation. For the first step, the libraries into the mutagenesis plasmid were assembled by Gibson reaction (in-house preparation) of two fragments. The vector fragment was obtained by polymerase chain reaction (PCR) amplification of pGJJ286 with the oligos shown in Supplementary Tables 1 and 2, incubated with DpnI to remove the template and gel purified using QIAquick gel extraction kit (Qiagen). The insert fragment was obtained by mixing equimolar amounts of IDT mutation primer (Supplementary Tables 1 and 2) and a reverse elongation primer (Supplementary Tables 1 and 2) and incubating for one cycle of annealing/extension with Q5 polymerase (New England Biolabs). dsDNA product was then incubated with ExoSAP-IT (Applied Biosystems) to remove the remaining ssDNA and purified with MinElute columns (Qiagen). 100âng of vector in a molar ratio of 1:5 with the insert was incubated for 3âh at 50âÂ°C with a Gibson mix 2Ã prepared in-house. The reaction was desalted by dialysis with membrane filters (MF-Millipore) for 1âh and concentrated 4Ã using a SpeedVac concentrator (Thermo Scientific). DNA was then transformed into NEB 10-beta High Efficiency Electrocompetent E. coli. Cells were allowed to recover in SOC medium (NEB 10-beta Stable Outgrowth Medium) for 30âmin and later transferred to LB medium with spectinomycin overnight. A fraction of cells was also plated into spectinomycinâ+âLBâ+âagar plates to estimate the total number of transformants. 100âml of each saturated E. coli culture was collected the next morning to extract the mutagenesis plasmid library using the QIAfilter Plasmid Midi Kit (QIAGEN). To obtain the final libraries into the yeast plasmids, libraries in pGJJ286 plasmid were digested with NheI and HindIII, gel purified (MinElute Gel Extraction Kit, QIAGEN) and cloned into pGJJ045 or pGJJ034 digested plasmids with T4 ligase (New England Biolabs) by temperature-cycle ligation following the manufacturerâs instructions, 67âfmol of backbone and 200âfmol of insert in a 33.3-Î¼l reaction. The ligation was desalted by dialysis using membrane filters for 1âh, concentrated 4Ã using a SpeedVac concentrator (Thermo Scientific) and transformed into NEB 10-beta High Efficiency Electrocompetent E. coli cells.

Library 4: this library was constructed in one step by Gibson reaction of two fragments. The vector fragment was obtained by amplification of pTB043 plasmid with the oligos shown in the Supplementary Tables 1 and 2. The second fragment was obtained with ten cycles of PCR using mutated IDT primer as template (Supplementary Tables 1 and 2).

Methotrexate yeast selection assay

The yeast selection assay was previously described²³. The high-efficiency yeast transformation protocol described below (adjusted to a pre-culture of 200âml of YPDA) was scaled up or down, depending on the number of transformants for each library (Supplementary Table 2). Three independent pre-cultures of BY4742 were grown in 20âml of standard YPDA at 30âÂ°C overnight. The next morning, the cultures were diluted into 200âml of pre-warmed YPDA at an OD_600nmâ=â0.3 and incubated at 30âÂ°C for 4âh. Cells were then collected and centrifuged for 5âmin at 3,000g, washed with sterile water and SORB medium, resuspended in 8.6âml of SORB and incubated at room temperature for 30âmin. After incubation, 175âÎ¼l of 10âmgâml^â1 boiled salmon sperm DNA (Agilent Genomics) and 3.5âÎ¼g of plasmid library were added to each tube of cells and mixed gently. 35âml of plate mixture was added to each tube to be incubated at room temperature for a further 30âmin. 3.5âml of DMSO was added to each tube and the cells were then heat shocked at 42âÂ°C for 20âmin (inverting tubes from time to time to ensure homogenous heat transfer). After heat shock, cells were centrifuged and resuspended in approximately 50âml of recovery media and allowed to recover for 1âh at 30âÂ°C. Cells were then centrifuged, washed with SC-URA medium and resuspended in 200âml SC-URA. 10âÎ¼l was plated on SC-URA Petri dishes and incubated for about 48âh at 30âÂ°C to measure the transformation efficiency. The independent liquid cultures were grown at 30âÂ°C for about 48âh until saturation. Saturated cells were diluted again to OD_600nmâ=â0.1 in SC-URA/MET/ADE media and allowed to grow four generations until OD_600nmâ=â1.6 at 30âÂ°C and 200ârpm. A fraction of the culture was then used to inoculate 200âml of competition media containing methotrexate at a starting OD_600nmâ=â0.05 and the rest was collected and pellets frozen and stored as input. Cells in competition media were allowed to grow for 3â5 generations (Supplementary Table 2), collected and frozen and stored as output.

DNA extractions and plasmid quantification

The DNA extraction protocol used was previously described²³. The protocol below is for 100âml of collected culture at OD_600nmâââ1.6. Protocols were scaled up or down, depending on the library (Supplementary Table 2). Cell pellets (one for each experiment input/output replicate) were resuspended in 1âml of DNA extraction buffer, frozen by dry ice/ethanol bath and incubated at 62âÂ°C in a water bath twice. Subsequently, 1âml of phenol/chloro/isoamyl in a ratio of 25:24:1 (equilibrated in 10âmM Tris-HCl, 1âmM EDTA, pHâ8.0) was added, together with 1âg of acid-washed glass beads (Sigma Aldrich) and the samples were vortexed for 10âmin. Samples were centrifuged at room temperature for 30âmin at 4,000ârpm and the aqueous phase was transferred into new tubes. The same step was repeated twice. 0.1âml of NaOAc 3âM and 2.2âml of pre-chilled absolute ethanol were added to the aqueous phase. The samples were gently mixed and incubated at â20âÂ°C for at least 30âmin. After that, they were centrifuged for 30âmin at full speed at 4âÂ°C to precipitate the DNA. The ethanol was removed and the DNA pellet was allowed to dry overnight at room temperature. DNA pellets were resuspended in 0.6âml TE 1X and treated with 5âÎ¼l of RNase A (10âmgâml, Thermo Scientific) for 30âmin at 37âÂ°C. To desalt and concentrate the DNA solutions, the QIAEX II Gel Extraction Kit was used (50âÂµl of QIAEX II beads, QIAGEN). The samples were washed twice with PE buffer and eluted twice by 125âÂµl of 10âmM Tris-HCI buffer, pHâ8.5. Finally, plasmid concentrations in the total DNA extract (which also contained yeast genomic DNA) were quantified by quantitative PCR using the primer pair oGJJ152âoGJJ153 that binds to the ori region of the plasmids.

Sequencing library preparation

Libraries 1â3: this was shown in ref.â²³. Briefly, the sequencing libraries were constructed in two consecutive PCR assays. The first PCR (PCR1) was designed to amplify the mutated protein of interest and to increase the nucleotide complexity of the first sequenced bases by introducing frame-shift bases between the adapters and the sequencing region of interest (Supplementary Tables 1 and 2). The second PCR (PCR2) was necessary to add the remainder of the Illumina adapter and demultiplexing indexes. PCR2 reactions were run for each sample independently using Hot Start High-Fidelity DNA Polymerase. In this second PCR, the remaining parts of the Illumina adapters were added to the library amplicon. The forward primer (5â² P5 Illumina adapter) was the same for all samples (GJJ_1J), whereas the reverse primer (3â² P7 Illumina adapter) differed by the barcode index (Supplementary Table 3) to be subsequently pooled and demultiplexed after deep sequencing. All samples were pooled in an equimolar ratio and gel purified using the QIAEX II Gel Extraction Kit. The purified amplicon library pools were subjected to Illumina 150-bp paired-end NextSeq500 sequencing at the CRG Genomics Core Facility.

Library 4: the method for preparing the library for sequencing was the same as for the other libraries but in the second PCR step, we used a barcoded index in the forward primer as well (5â² P5 Illumina adapter). The purified amplicon library pool was sequenced with an Illumina paired-end NextSeq2000 machine this time.

Sequencing data processing

FastQ files from paired-end sequencing of all AbundancePCA and BindingPCA experiments were processed with DiMSum v1.3 (ref.â⁵²) using default settings with small adjustments (https://github.com/lehner-lab/DiMSum). Supplementary Table 4 contains DiMSum fitness estimates and associated errors for all experiments. Experimental design files and command-line options required for running DiMSum on these datasets are available on GitHub (https://github.com/lehner-lab/archstabms). Variants with fewer than ten input read counts in any replicate were discarded (âfitnessMinInputCountAllâ option), that is, only variants observed in all replicates above this threshold were retained. For library 1, we also included fitness estimates that derived from a subset of replicates whose input read counts exceeded this threshold (âfitnessMinInputCountAnyâ option; see Fig. 1).

For library 1, we also included a wild-type-only sample for sequencing using pGJJ046 as template to derive empirical estimates of sequencing error rates. The FastQ file for this sample was processed identically to those of the replicate input/output samples in the first-pass analysis with DiMSum with permissive base quality thresholds (âvsearchMinQual = 5â and âvsearchMaxee = 1000â). Read counts for all variants were then adjusted by subtracting the expected number of sequencing errors derived from the wild-type-only sample and proportional to the total sequencing library size of each sample. Finally, fitness estimates and associated errors for library 1 were then obtained from the resulting corrected variant counts with DiMSum (âcountPathâ option).

Thermodynamic modelling with MoCHI

We used MoCHI⁴³ (https://github.com/lehner-lab/MoCHI) to fit all thermodynamic models to combinatorial DMS data using default settings with small adjustments. The software is based on our previously described genotypeâphenotype modelling approach²³, with extra functionality and improvements for ease of use and flexibility^24,43. Models fit to shallow (double-mutant) libraries and used in the analyses described in this work (for example, combinatorial mutagenesis library designs) were obtained using the original software implementation²³.

We model protein folding as an equilibrium between two states: unfolded (u) and folded (f), and protein binding as an equilibrium between three states: unfolded and unbound (uu), folded and unbound (fu) and folded and bound (fb). We assume that the probability of the unfolded and bound state (ub) is negligible and free energy changes of folding and binding are additive, that is, the total binding and folding free energy changes of an arbitrary variant relative to the wild-type sequence is simply the sum over residue-specific energies corresponding to all constituent single aa substitutions.

We configured MoCHI parameters to specify a neural network architecture consisting of additive trait layers (free energies) for each biophysical trait to be inferred (folding or folding and binding for AbundancePCA or BindingPCA, respectively), as well as one linear transformation layer per observed phenotype. The specified nonlinear transformations âTwoStateFractionFoldedâ and âThreeStateFractionBoundâ derived from the Boltzmann distribution function relate energies to proportions of folded and bound molecules, respectively (see Figs. 2a and 4e,f). The target (output) data to fit the neural network comprise fitness scores for the wild-type and aa substitution variants of all mutation orders. The inclusion of both first-order and second-order (pairwise energetic coupling) model coefficients in the models was specified using the âmax_interaction_orderâ option.

A random 30% of aa substitution variants of all mutation orders was held out during model training, with 20% representing the validation data and 10% representing the test data. Validation data were used to evaluate training progress and optimize hyperparameters (batch size). Optimal hyperparameters were defined as those resulting in the smallest validation loss after 100 training epochs. Test data were used to assess final model performance.

MoCHI optimizes the parameters Î¸ of the neural network using stochastic gradient descent on a loss function ${\mathcal{L}}[\theta ]$ based on a weighted and regularized form of mean absolute error:

$${\mathcal{L}}[\theta ]=1/N\mathop{\sum }\limits_{n=0}^{N-1}\left|{y}_{n}-{\widehat{y}}_{n}\right|{\sigma }_{n}^{-1}+{\lambda }_{2}{\parallel \theta \parallel }^{2}$$

in which y_n and Ï_n are the observed fitness score and associated standard error, respectively, for variant n, Å·_n is the predicted fitness score, N is the batch size and Î»₂ is the L₂ regularization penalty. To penalize very large free energy changes (typically associated with extreme fitness scores), we set Î»₂ to 10^â6, representing light regularization. The mean absolute error is weighted by the inverse of the fitness error (${\sigma }_{n}^{-1}$) to downweight the contribution of less confidently estimated fitness scores to the loss. Furthermore, to capture the uncertainty in fitness estimates, the training data were replaced with a random sample from the fitness error distribution of each variant. The validation and test data were left unaltered.

Models were trained with default settings, that is, for a maximum of 1,000 epochs using the Adam optimization algorithm with an initial learning rate of 0.05 (except for library 1, for which we used an initial learning rate of 0.005). MoCHI reduces the learning rate exponentially (Î³â=â0.98) if the validation loss has not improved in the most recent ten epochs compared with the preceding ten epochs. Also, MoCHI stops model training early if the wild-type free energy terms over the most recent ten epochs have stabilized (standard deviationââ¤â10^â3).

Free energies are calculated directly from model parameters as follows: ÎG_bâ=âÎ¸_bRT and ÎG_fâ=âÎ¸_fRT, in which Tâ=â303âK and Râ= 0.001987âkcalâK^â1âmol^â1. We estimated the confidence intervals of model-inferred free energies using a Monte Carlo simulation approach. The variability of inferred free energy changes was calculated between ten separate models fit using data from: (1) independent random trainingâvalidationâtest splits and (2) independent random samples of fitness estimates from their underlying error distributions. Confident inferred free energy changes are defined as those with Monte Carlo simulation-derived 95% confidence intervalsâ<â1âkcalâmol^â1. Supplementary Table 5 contains inferred binding and folding free energy changes and energetic couplings from all second-order models.

Linear model to predict energetic coupling strength

We built a linear model to predict energetic coupling strength (absolute value of energetic coupling terms) from 12 features (see Fig. 3e), comprising five distance metrics for residue pairs or positions thereof in the protein structure: backbone distance (linear 1D distance separating residue pairs along the primary aa sequence), inter-residue distance (minimal side-chain heavy-atom distance in 3D space), number of core residues (0, 1 or both residues in the pair with RSASAâ<â0.25), number of binding interface residues (0, 1 or both with minimal side-chain heavy-atom distance to the ligandâ<â5âÃ), number of beta-sheet residues (0, 1 or both in beta strands) and seven features describing the number of chemical bonds or interactions between the atoms of pairs of residues as calculated using the GetContacts software tool (https://getcontacts.github.io/): backbone to backbone hydrogen bonds, side chain to backbone hydrogen bonds, side chain to side chain hydrogen bonds, piâcation interactions, pi-stacking interactions, salt-bridge interactions and van der Waals interactions. Before running GetContacts, we used PyMOL to fill missing hydrogens (âh_addâ command), FoldX⁵³ to restore the wild-type proline at position 54 that is mutated in the reference crystal structure (PDB: 2VWF; âPositionScanâ command) and removed GAB2 ligand atoms. The training dataset comprised energetic couplings inferred from library 1 and the test set comprised independently inferred energetic couplings from library 3 (see Fig. 3f).

Reporting summary

Further information on research design is available in theÂ Nature Portfolio Reporting Summary linked to this article.

The genetic architecture of protein stability

Combinatorial mutagenesis library designs

Combinatorial library 1

Combinatorial library 2

Combinatorial library 3

Combinatorial library 4: SRC

Mutagenesis library construction and selection assays

Media and buffers used

Plasmid construction

Libraries construction

Methotrexate yeast selection assay

DNA extractions and plasmid quantification

Sequencing library preparation

Sequencing data processing

Thermodynamic modelling with MoCHI

Linear model to predict energetic coupling strength

Reporting summary

Transmission of MPXV from fire-footed rope squirrels to sooty mangabeys

CSN5i-3 is an orthosteric molecular glue inhibitor of COP9 signalosome

Trump team’s new rule could make firing government scientists easier

Most Popular

Donna Kelce Says She Doesn’t Have to Sign NDA for Travis & Taylor Swift Wedding

Glossier Undergoes Restructuring, Cutting 30 Percent of Its Workforce

US FTC chair urges Tim Cook to review Apple News’ terms and curation after a study claimed the app promoted “leftist outlets” and suppressed...

Transmission of MPXV from fire-footed rope squirrels to sooty mangabeys

Recent Comments

ABOUT US

POPULAR POSTS

Donna Kelce Says She Doesn’t Have to Sign NDA for Travis & Taylor Swift Wedding

Glossier Undergoes Restructuring, Cutting 30 Percent of Its Workforce

US FTC chair urges Tim Cook to review Apple News’ terms and curation after a study claimed the app promoted “leftist outlets” and suppressed...

POPULAR CATEGORY