Multi-pass, single-molecule nanopore reading of long protein strands

September 12, 2024

Expression and purification of proteins

Plasmids for analyte proteins were constructed using gBlocks (Integrated DNA Technologies) inserted into the pET-49b(+) plasmid (Novagen), with a dihydrofolate reductase domain, a polyhistidine tag and a TEV cleavage site upstream of the sequence encoding an analyte protein. The NEBuilder HiFi DNA assembly and Q5 site-directed mutagenesis kits (New England Biolabs) were used for plasmid construction. Cloning was done using NEB 5-α-competent Escherichia coli cells. Plasmid sequences were verified by Sanger sequencing through Genewiz. Protein expression was induced overnight at 30 °C with BL21 (DE3) E. coli cells in Overnight Express Instant TB medium (Novagen). Proteins were purified by immobilized metal affinity chromatography (IMAC) with TALON metal affinity cobalt resin and its associated buffer set (Takara), following the manufacturer’s instructions. Proteins were cleaved with TEV protease (New England Biolabs) and further purified by reverse IMAC. Purified proteins were concentrated using ultracentrifugal filters with a 10 kDa cutoff (Amicon) and stored in the short term at 4 °C or in the long term at −80 °C until use.

A covalently linked hexamer of an N-terminal truncated ClpX variant (ClpX-ΔN₆)⁶⁰ was prepared using the BLR E. coli strain as described previously⁴³. In brief, cells were grown to an optical density at 600 nm (OD₆₀₀) of around 0.6 in LB medium and then incubated in the presence of 0.5 mM isopropyl β-D-1-thiogalactopyranoside (IPTG) at 23 °C for about 3 h to induce ClpX expression. ClpX was purified by IMAC and anion-exchange chromatography. Purified ClpX was stored at −80 °C in small aliquots until use. ClpP expression was induced at an OD₆₀₀ of around 0.6 with 0.5 mM IPTG at 30 °C for about 3 h⁴³. ClpP was purified by IMAC and stored at −80 °C until use.

PTM assays

For asparagine deamidation, protein (around 1 mg ml⁻¹) was incubated overnight in 100 mM sodium bicarbonate buffer (pH 9.6) at 25 °C to catalyse deamidation. For protein phosphorylation with kinase, protein was incubated with either 50,000 units per ml PKA (New England Biolabs) or 10,000 units per ml CKII (New England Biolabs) in a protein kinase buffer (10 mM MgCl₂, 0.1 mM EDTA, 2 mM DTT, 0.01% Brij 35, 260 µM ATP and 50 mM Tris-HCl, pH 7.5) at 30 °C. The protein solution was used for nanopore analysis immediately after the incubation without purification.

MinION experiments

All the experiments were done on the MinION platform using R9.4.1 flow cells. Run conditions were set with a custom MinKNOW script (available from Oxford Nanopore Technologies) at a temperature of 30 °C and a constant voltage of −140 mV with a 3 kHz sampling frequency, except for initial proteins P1–P4, for which runs were performed at a constant voltage of −180 mV with a 10 kHz sampling frequency. Using the priming port, flow cells were first washed with 1 ml cis running buffer (200 mM KCl, 5 mM MgCl₂, 10% glycerol and 25 mM HEPES–KOH, pH 7.6) and then loaded with 200 μl protein analyte in cis running buffer at a final concentration of 500 nM, unless otherwise specified. Following the observation of protein captures in the pores, flow cells were washed with 1 ml cis running buffer to remove uncaptured proteins and subsequently loaded with 75 μl cis running buffer supplemented with 4 mM ATP and 200 nM ClpX-ΔN₆ unless otherwise specified. The flow cell was washed about 4 min after analyte loading in the initial method, and around 6 min and 2 min after analyte loading at concentrations of 5 nM and 500 nM, respectively, in the optimized method (Extended Data Fig. 10a). For MinION runs in the high-salt condition (Extended Data Fig. 6b), a buffer containing 400 mM KCl, 5 mM MgCl₂ and 25 mM HEPES–KOH (pH 7.6) was used instead of standard cis running buffer to see if it would improve the signal-to-noise ratio.

Bulk degradation assays

The time-course degradation assay of the PASTOR-HDKER protein was performed in cis running buffer with 6 μM PASTOR-HDKER, 150 nM ClpX-ΔN₆, 300 nM ClpP₁₄ and an ATP-regeneration mix (4 mM ATP, 16 mM creatine phosphate and 7 units per ml creatine phosphokinase) at 30 °C. Incubation was stopped by denaturing samples in Laemmli buffer at 95 °C for 5 min. Samples were run on SDS–PAGE and stained with Coomassie blue to quantify the protein bands using the ImageJ software.

Nanopore signal analysis

Preprocessing

To help identify ClpX-mediated protein translocations, we established detection thresholds using specific statistical parameters (standard deviation, median value, standard deviation of the mean of windows, and the ratio of values relative to the open pore value) indicative of translocation to ionic current blockades preceding a return to the open channel state. This analysis was used to assist the process of manually checking traces for translocations, and translocations with particularly high noise or disruptions were discarded. PASTOR proteins were auto-segmented as described below, with the exception of those containing folded domains and PASTOR-rereads, which were segmented manually. PASTOR-reread rereads with a complete Y₂–Y₃–Y₄–Y₅–Y₂ signal were assumed to be full-length reads with a back-slipping distance of 310 amino acids. Partial rereads missing the signal(s) of the C-terminal Y₂, Y₃, Y₄ and Y₅ were assigned to have back-slipping distances of 250, 188, 125 and 61 amino acids, respectively. All figures with raw traces (those shown in pA) had a low-pass Bessel filter applied using SciPy with N = 10 and W_n = 0.025, except for those showing stepping analysis (Figs. 2c and 6c, Extended Data Fig. 3 and Supplementary Figs. 5 and 6), which had W_n = 0.7. Before use in data analysis, traces were smoothed by applying a low-pass Bessel filter with N = 10 and W_n = 0.03 with SciPy, and by applying average downsampling by a factor of 50 for proteins P1–4, 20 for the 8 PASTORs and 10 for the other proteins. Then, to scale, the segment was split into tenths, and the median of the minima of each tenth and the median of the maxima of each tenth were used as the min and max, respectively, to perform min–max scaling (Extended Data Fig. 2b). For PASTOR-phos, the signals were iteratively scaled. We first used this approach, then DTW-aligned traces to two canonical presegmented traces and selected the alignment with the lowest DTW distance. The max value of the N-terminal VR was multiplied by 1.4, and the max value of VR GLSARRL was multiplied by 1.2, and the minimal max was used as the max value for min–max scaling. This was repeated after realigning to the canonical traces and segmenting the VRs. Unless otherwise specified, ‘normalized’ refers to z-score normalization, as in ‘normalized current’ when comparing a model signal with experimental signals.
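The smoothing, downsampling and scaling steps described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the second-order-section form and the zero-phase `sosfiltfilt` call are assumptions (the text specifies only N = 10 and W_n = 0.03), and the synthetic trace stands in for a real current recording.

```python
import numpy as np
from scipy.signal import bessel, sosfiltfilt


def lowpass_bessel(x, order=10, wn=0.03):
    """Low-pass Bessel filter; wn is the cutoff as a fraction of Nyquist."""
    sos = bessel(order, wn, btype="low", output="sos")
    # Assumption: zero-phase (forward-backward) filtering.
    return sosfiltfilt(sos, x)


def average_downsample(x, factor):
    """Replace each block of `factor` samples by its mean (tail is dropped)."""
    n = (len(x) // factor) * factor
    return x[:n].reshape(-1, factor).mean(axis=1)


def robust_minmax_scale(x, n_chunks=10):
    """Min-max scale using the median of per-tenth minima and maxima."""
    chunks = np.array_split(x, n_chunks)
    lo = np.median([c.min() for c in chunks])
    hi = np.median([c.max() for c in chunks])
    return (x - lo) / (hi - lo)


def preprocess(raw, factor=10):
    x = np.asarray(raw, dtype=float)
    return robust_minmax_scale(average_downsample(lowpass_bessel(x), factor))


# Synthetic stand-in for a raw trace (values are illustrative only).
rng = np.random.default_rng(0)
raw = 60 + 10 * np.sin(np.linspace(0, 40, 6000)) + rng.normal(0, 2, 6000)
scaled = preprocess(raw, factor=10)
```

Because the scaling bounds are medians of per-tenth extremes rather than the global extremes, a few samples can fall slightly outside the 0 to 1 range, which makes the scaling less sensitive to isolated spikes.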

Signal alignment

To align signals, we used DTW⁶¹ and normalized the DTW distances by dividing by the sum of the lengths of the two signals. To describe the similarity of a set of traces, we computed the DTW distance between all pairs of traces. In t-distributed stochastic neighbor embedding (t-SNE) plots, we then clustered each trace on the vector of its DTW distances to all other traces. To create ensemble traces, we first identified the trace with the lowest mean DTW distance to all other traces and stretched it to create T_medoid = [t₁, t₂, …, t_n], where n is the mean length of all traces. We then DTW-aligned every other trace to T_medoid and created T_consensus = [median(alignments to t₁), median(alignments to t₂), …, median(alignments to t_n)]. Ensemble traces in Fig. 1c, Fig. 5b and Extended Data Fig. 9d show all traces aligned to the T_consensus, but do not plot T_consensus.
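As a rough illustration of the length-normalized DTW distance and the medoid selection described above, the following pure-Python sketch may help. The absolute difference as the local cost and the function names are assumptions made for this example; the text does not specify the implementation used.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with |x - y| as the local cost (assumed)."""
    inf = float("inf")
    m = len(b)
    prev = [0.0] + [inf] * m  # row for "no samples of a consumed yet"
    for x in a:
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(x - b[j - 1])
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


def normalized_dtw(a, b):
    """DTW distance divided by the sum of the two signal lengths."""
    return dtw_distance(a, b) / (len(a) + len(b))


def medoid_index(traces):
    """Index of the trace with the lowest mean normalized DTW distance to the rest."""
    n = len(traces)

    def mean_dist(i):
        others = (normalized_dtw(traces[i], traces[k]) for k in range(n) if k != i)
        return sum(others) / (n - 1)

    return min(range(n), key=mean_dist)


# Toy traces: the first three differ only by time warping, the last is an outlier.
traces = [
    [0, 1, 2, 1, 0],
    [0, 1, 1, 2, 1, 0],
    [0, 0, 1, 2, 1, 0],
    [5, 5, 5, 5],
]
medoid = medoid_index(traces)
```

The consensus trace would then be built by aligning every other trace to the stretched medoid and taking the median of the samples mapped to each medoid position.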

Protein sequence-to-signal model

To describe the amino acids, we used their volumes⁶² and their charges at pH 7.6, at which the histidine residue is assumed to be neutral. The volume of phosphoserine was estimated as 126.6 cm³ mol⁻¹, on the basis of a linear regression of molecular weight to volume of the other residues. The model signal, S = [S₁, S₂, …, S_(n−19)], of amino acid sequence [aa₁, aa₂, …, aa_n] is calculated by computing the signal for each of the n − 19 windows of width 20 (Extended Data Fig. 5a–d). The vector X_i describes the window starting at index i in the sequence. The j-th index in X_i is 1 + V_c × volume(aa_(i+j)) + P_c × PositiveCharge(aa_(i+j)) + N_c × NegativeCharge(aa_(i+j)), for 0 ≤ j < 20, where the functions PositiveCharge and NegativeCharge take 1 if the residue has a positive or negative charge, respectively, and 0 otherwise. The constants representing weights between charge and volume, V_c = −3.9 × 10⁻³, N_c = 4.08 × 10⁻¹ and P_c = −8.16 × 10⁻², were determined empirically to minimize the average post-DTW distance of a training subset of protein traces to the model of their sequences. To weight the values in X_i, we use a vector PW (parabolic weight) of length 20 containing values representing a negative, centrally positioned parabolic curve. The i-th index in S is then finally computed as the dot product of X_i and PW.
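The windowed model can be sketched in a few lines of Python. Several parts of this example are assumptions and not taken from the text: the exact shape and normalization of the parabolic weight vector (here a downward-opening parabola centred on the window and scaled to sum to 1), the treatment of K and R as positive and D and E as negative, and the residue volumes, which are placeholder numbers and not the values from ref. 62.

```python
V_C, N_C, P_C = -3.9e-3, 4.08e-1, -8.16e-2  # constants quoted in the text
WINDOW = 20

# Placeholder volumes for illustration only (NOT the published values).
EXAMPLE_VOLUMES = {"G": 40.0, "S": 60.0, "D": 70.0, "K": 100.0, "Y": 120.0}


def parabolic_weights(width=WINDOW):
    """Assumed form: downward-opening parabola centred on the window, sum = 1."""
    centre = (width - 1) / 2
    half = width / 2
    raw = [1 - ((j - centre) / half) ** 2 for j in range(width)]
    total = sum(raw)
    return [w / total for w in raw]


def residue_term(aa, volume, positive="KR", negative="DE"):
    """1 + V_c * volume + P_c * [positive] + N_c * [negative] for one residue."""
    return 1 + V_C * volume[aa] + P_C * (aa in positive) + N_C * (aa in negative)


def model_signal(seq, volume):
    """One value per window of width 20, so n - 19 values for n residues."""
    pw = parabolic_weights()
    terms = [residue_term(aa, volume) for aa in seq]
    return [
        sum(t * w for t, w in zip(terms[i:i + WINDOW], pw))
        for i in range(len(seq) - WINDOW + 1)
    ]


uniform = model_signal("G" * 25, EXAMPLE_VOLUMES)
mixed = model_signal("GSGSGSGSGSYYGSGSGSGSDKGSGS", EXAMPLE_VOLUMES)
```

With weights that sum to 1, a homopolymer gives a flat model signal equal to its per-residue term, which is a convenient sanity check on any implementation.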

ClpX step identification

For this analysis, the signals were not scaled or downsampled. They were filtered with a low-pass Bessel filter with N = 10 and W_n = 0.7. YY dips were extracted manually, including portions of the signal that would otherwise be considered part of the VR in this study, to best capture the entire portion for which the double tyrosines contribute to the signal. The number of residues per YY dip was calculated as pw/d, where p is the mean proportion of the total translocation dwell time spent in these regions (0.318; Extended Data Fig. 3a), w is the total number of reading windows in the sequence (359; Extended Data Fig. 1) and d is the number of YY dips per read (6). We primarily used a Bayesian-based algorithm⁶³ to identify steps, unless otherwise noted. When applying this algorithm, a minimum length of 10 observations and a threshold of 18 were used. A total of 776 YY-dip regions were analysed, comprising 45% of all the YY dips in the dataset, omitting dips affected by potential backstepping (non-monotonic steps) or excessive noise. This selection was made by excluding YY dips that did not follow the pattern of the mean of each segmented step monotonically decreasing to the minimum and then monotonically increasing. A secondary t-test-based algorithm⁶⁴, which was used in a different study of ClpX stepping behaviour⁶⁵, was also used to confirm the results of the stepping rate. When using the t-test-based algorithm, a minimum window length of 10 observations and a threshold P value of 5 × 10⁻⁵ were used, and a total of 456 dips were analysed.
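Two pieces of this procedure lend themselves to a short worked example: the pw/d estimate, which comes to about 19.0 residues per YY dip with the quoted values, and the monotonicity filter applied to the segmented steps. The sketch below is illustrative only; the function names are hypothetical, and the use of strict inequalities in the monotonicity check is an assumption.

```python
def residues_per_dip(p, w, d):
    """pw/d: dwell-time fraction times reading windows, divided by dips per read."""
    return p * w / d


def is_clean_dip(step_means):
    """True if step means strictly fall to the minimum and then strictly rise."""
    k = step_means.index(min(step_means))
    falling = all(a > b for a, b in zip(step_means[:k + 1], step_means[1:k + 1]))
    rising = all(a < b for a, b in zip(step_means[k:], step_means[k + 1:]))
    return falling and rising


estimate = residues_per_dip(0.318, 359, 6)  # values quoted in the text

clean = is_clean_dip([5.0, 3.0, 1.0, 2.0, 4.0])        # kept
backstepped = is_clean_dip([5.0, 3.0, 4.0, 1.0, 2.0])  # rise before the minimum
```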

YY segmentation

To identify the YY dips and VRs, a single PASTOR trace was segmented manually into each coloured section in Fig. 2a, and the remainder of the traces were aligned to it with DTW. The corresponding regions were assigned the label from the one manually segmented trace (Supplementary Fig. 4). For PASTOR-phos, two canonical traces were segmented manually, and the rest of the traces were aligned to both, and then labels were assigned according to the canonical trace with the lowest DTW distance.
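Label transfer through a DTW alignment can be sketched as below. This is a simplified illustration under stated assumptions: the local cost is the absolute difference, the function names are hypothetical, and when several reference samples map to one query sample the label of the last one is kept. None of these choices is specified in the text.

```python
def dtw_path(ref, query):
    """Return the optimal warping path as a list of (ref_index, query_index)."""
    inf = float("inf")
    n, m = len(ref), len(query)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(ref[i - 1] - query[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        )
    return path[::-1]


def transfer_labels(ref, ref_labels, query):
    """Give each query sample the label of the reference sample it aligns to."""
    labels = [None] * len(query)
    for i, j in dtw_path(ref, query):
        labels[j] = ref_labels[i]  # later path entries overwrite earlier ones
    return labels


# Toy reference with a manually labelled dip, and a time-warped query.
ref = [0, 0, 5, 5, 0, 0]
ref_labels = ["VR", "VR", "YY", "YY", "VR", "VR"]
query = [0, 0, 0, 5, 5, 5, 5, 0]
query_labels = transfer_labels(ref, ref_labels, query)
```

For PASTOR-phos, the same transfer would be run against both canonical traces, keeping the labels from whichever alignment has the lower DTW distance.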

VR classification

We used scikit-learn to develop and test classical machine learning models and PyTorch to develop and test convolutional neural-network models. The test set was composed of all current traces from a given set of experiments to create an out-of-sample test set. The set of test experiments was selected using linear programming (Python package PuLP) to ensure at least 12 VRs with each amino acid in the test set while minimizing the test set size. We decided to use 12 because it gave the closest to an 80–20 train–test split: 79.6% of the VRs were in the training set and 20.4% were in the testing set (full counts are shown in Extended Data Table 1a). In classification tasks for which only VRs corresponding to a subset of amino acids were used, the test set was composed of a subset of this test set. We performed hyperparameter tuning with scikit-optimize on the training set using 5-fold cross-validation. The optimal parameters were: n_estimators = 250, min_samples_leaf = 2, max_features = 'log2', max_depth = 20, ccp_alpha = 0.0001, class_weight = 'balanced_subsample' and criterion = 'gini'. All the results in Fig. 3b,c, Extended Data Fig. 6, Extended Data Table 2 and Supplementary Fig. 9 are from models evaluated on the test set. All the VRs containing an asparagine with a maximum transformed value above 1.3 had their labels changed to aspartate. In training all classical models, we upsampled minority classes, such that there was an equal representation of all classes in the training set. When training the convolutional neural network (CNN) in Extended Data Fig. 6c, we weighted the loss inversely proportional to each label’s class representation in the training set. To featurize each VR, we performed principal component analysis on the vector of its DTW distances to all VRs in the training set to reduce the size of the vector to 64. We also used the median, max, middle, mean, dip, mean absolute value of the derivative and median absolute value of the derivative of the transformed signals, as well as the standard deviation of the raw (unfiltered, unscaled) signal. The CNN had the transformed signal as input. It was trained with a stochastic gradient descent optimizer with a learning rate of 0.01, had four convolutional layers followed by a gated recurrent unit (GRU) and then a fully connected layer, and was initialized with Kaiming initialization. Max pooling and a ReLU activation function were applied after each convolutional layer. The dummy classifier was implemented with the scikit-learn dummy classifier with default parameters.
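The upsampling of minority classes mentioned above can be illustrated with a small pure-Python sketch. Sampling with replacement up to the size of the largest class is an assumption, as are the function name and the toy data; the text states only that all classes ended up equally represented in the training set.

```python
import random
from collections import Counter, defaultdict


def upsample_minority_classes(samples, labels, seed=0):
    """Resample each class with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Keep every original sample, then draw extras at random.
        extras = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extras)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy feature vectors labelled by the amino acid in the VR (illustrative only).
X = [[0.1], [0.2], [0.3], [0.9], [1.1], [2.0]]
y = ["A", "A", "A", "D", "D", "Y"]
X_bal, y_bal = upsample_minority_classes(X, y)
balanced_counts = Counter(y_bal)
```

Upsampling should be applied to the training split only, so that duplicated samples never leak into the held-out test experiments.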

Reread simulation

To collect the results shown in Extended Data Fig. 7d,e, we used a random forest without hyperparameter tuning and used 100 randomly selected 80–20 train–test splits. This was necessary to estimate the accuracy well enough with a large number of rereads, given the data limitation and the need to group samples in the test set.

Barcode error correction

To calculate the accuracy of barcode identification when using linear error-correcting codes, we started with our accuracy, p_VR, of identifying a VR given an alphabet size, a, of 2, 4, 8 or 16. For a given a and number of VRs, L, we calculated the number of bits, n = L × log₂(a), that could be encoded in a protein. We simulated the accuracy with error correction, p′, when n − k of the bits were allocated to linear error-correcting codes, for all integers k = 1 to n. We did this by conducting 50,000 trials of: first, encoding a random integer from 0 to 2^k with a generating matrix into a message of n bits; second, randomly and independently, with probability p_VR, changing each of the n/log₂(a) consecutive sets of log₂(a) bits in the encoded message (to a different set of bits of the same length) to simulate misclassifying one VR; and third, decoding the number with syndrome decoding. We calculated p′ to be the percentage of trials in which the decoded number was the same as the original random number.
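A single instance of this kind of simulation can be sketched with the Hamming(7,4) code, which is one specific linear code chosen here for illustration (n = 7, k = 4, binary alphabet, so each VR carries one bit). The choice of code, the function names and the explicit per-VR misclassification probability `p_error` are all assumptions of this sketch; the text does not say which generating matrices were used.

```python
import random

# Parity part of a systematic Hamming(7,4) generator matrix G = [I | P].
P = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
# Columns of the parity-check matrix H = [P^T | I], one per codeword bit.
H_COLUMNS = P + [(1, 0, 0), (0, 1, 0), (0, 0, 1)]


def encode(data):
    """Append three parity bits to four data bits."""
    parity = [sum(d * row[j] for d, row in zip(data, P)) % 2 for j in range(3)]
    return list(data) + parity


def decode(word):
    """Syndrome decoding: flip the bit whose H column equals the syndrome."""
    syndrome = tuple(
        sum(b * col[j] for b, col in zip(word, H_COLUMNS)) % 2 for j in range(3)
    )
    word = list(word)
    if syndrome != (0, 0, 0):
        word[H_COLUMNS.index(syndrome)] ^= 1
    return word[:4]


def simulate(p_error, trials=50_000, seed=0):
    """Fraction of trials in which the decoded data equal the original data."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        data = [rng.randint(0, 1) for _ in range(4)]
        sent = encode(data)
        # Each bit (one VR at alphabet size 2) is misread independently.
        received = [b ^ (rng.random() < p_error) for b in sent]
        correct += decode(received) == data
    return correct / trials


accuracy = simulate(p_error=0.05, trials=5_000)
```

This code corrects any single misread VR per barcode, so the simulated accuracy stays high as long as two or more errors in the same barcode are rare. Larger alphabets would need each misread to replace a whole group of log₂(a) bits.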

Phosphorylation detection

Each section (C-terminal linker, VR V, VR GLSARRL, VR A and N-terminal linker) was extracted with YY segmentation. For each section, the transformed current was aligned to the model of all possible phosphorylation states, shown in Supplementary Fig. 12. We determined the number of phosphorylations in each section by the number of phosphorylations in the best-matching (lowest DTW distance) phosphorylation-state model (Supplementary Table 2) to the actual trace. When describing the signal increase in VR GLSARRL caused by PKA (Extended Data Fig. 8a), only the portion of the section up to the (n/3)-th index, where n is the length of the YY-segmented VR GLSARRL, was used because that is where PKA causes the signal to increase, as seen in Fig. 6b.

Null-hypothesis tests

All PERMANOVA tests were done on the DTW distance matrix of signals using scikit-bio and 10⁶ permutations, unless we used a Bonferroni correction, in which case n × 10⁶ permutations were used, where n is the number of comparisons performed. Kruskal–Wallis, T and Mann–Whitney U tests were performed using SciPy. Reported P values were multiplied by n if we noted that we used a Bonferroni correction. All tests were two-sided unless stated otherwise, and P values were considered significant if P < 0.05.

Materials availability

Protein expression plasmids are available at Addgene.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
