Collective homeostasis of condensation-prone proteins via their mRNAs

GeRM algorithm

GeRM is calculated from a string of consecutive overlapping nucleotide sequences of length k (k-mers). From the set of n consecutive k-mers A, the GeRM score g of the central k-mer A₀ is calculated as follows:

$$g=\sum _i\in Be^-\lambda d(A_,A_i)\cdot \fracw$$

where B = −w,…,−k,k,…,w

where $w=\fracn+12$, the function d represents the Hamming distance, and λ represents a scaling constant.

In non-mathematical terms, the GeRM score is calculated by comparing a k-mer to all the other k-mers that surround it in a fixed window. For each of the surrounding k-mers, the sequence similarity to the central k-mer is calculated from the negative exponent of the Hamming distance, such that k-mers with identical sequences have a high score and those with unrelated sequences have a low score. The constant λ determines how quickly this similarity score decays as sequences become more dissimilar to the central k-mer. This sequence similarity score is multiplied by a distance score, which decays linearly from 1 to 0 with distance from the central k-mer. k-mers that overlap with the central k-mer are ignored. For k-mers at the edges of transcripts, where the window exceeds the end of the transcript, all positions that fall outside of the transcript are given a score of 0. The sum of all the distance-weighted sequence similarities is summed to give the GeRM score.

In this Article, we used a k-mer length (k) of 5, a window size (n) of 123 and a scaling factor (λ) of 1. We normalized the GeRM scores such that the minimum value in the transcriptome was 0 and the median was 1. Smoothing of GeRM scores was performed by taking the mean of values in a sliding window of size 123.

For the calculation of codon-shuffled multivalency, all codons for a given amino acid in each transcript were swapped such that the number of instances of each codon was preserved for each transcript, but their order was randomized. All GeRM scores for each transcript were calculated, and the mean GeRM score per position across ten shuffles was calculated.

The GeRM algorithm is available as an R package from GitHub (https://github.com/ulelab/germ).

Identification and classification of GeRM regions

For each protein-coding gene in GENCODE 29 or GENCODE M22, the transcript with the longest CDS was selected and ties were broken by the longest total transcript length. Only spliced transcripts were used for analysis. CDS GeRM regions were defined by taking all positions where the smoothed GeRM scores exceeded the 98th percentile of smoothed GeRM scores within the CDS, and GeRM regions were adjusted to contain all k-mers that fell within the smoothing window. GeRM regions with at least 41 nt of overlap were merged. GeRM regions for which at least one-third of the region fell within an untranslated region were excluded.

For each GeRM region, the relative proportion of the total multivalency of the region that was accounted for by each possible k-mer was calculated. For clustering, UMAP was used to reduce the dimensionality of the data to 4 dimensions, using 50 neighbours and a minimum distance of 0.001. The reduced data were clustered first using OPTICS with the minimum points defined by 1% of the total number of GeRM regions. The final clustering was performed with DBSCAN, yielding eight clusters. The reachability threshold for DBSCAN was defined by determining how the proportion of points below a given threshold changed as the reachability threshold decreased (from the 99th percentile to 0), then taking the point at which the first derivative was 40% of the maximum (the knee of the plot). GeRM regions falling outside of a cluster were excluded from further analysis. For presentation purposes, UMAP was used to reduce the original data to have two dimensions, with other parameters kept the same.

Codon shuffling was performed by shuffling the codons either within the entire transcriptome or within each transcript. Shuffling was performed five times, and GeRM scores were calculated from the shuffled sequences. The mean GeRM score per k-mer across the five shuffles was used for further analysis.

To generalize the scoring of identified multivalent types (that is, the GA-multivalency score and transcript multivalency), we summarized each GeRM cluster with the k-mers that contributed most to multivalency of regions in that cluster. The proportion of total multivalency accounted for by each 5-mer was calculated, and 5-mers were sorted in descending order based on this proportion. The cumulative sum was calculated, and all 5-mers that accounted for the first 50% of the multivalency in each GeRM cluster were selected as representative 5-mers for this multivalency type.

Low complexity and disorder

To define the complexity of amino acid sequences, information entropy was calculated using the R package HDMD in a sliding window of 41 amino acids along the translated CDSs from the set of transcripts used for GeRM scoring. LCDs were defined as regions with entropy in the bottom 2% of all entropy scores, with all amino acids in the sliding window being considered part of the LCD. LCDs that overlap were merged. AlphaFold⁵⁹ predicted local distance difference test (pLDDT) scores were obtained for the same set of proteins where available from UniProt.

GeRM codon ratios

To determine whether a given position in a sequence supported the local multivalency of that sequence, we defined a GeRM codon ratio. First, ‘mutable positions’ were defined as any position that could be synonymously mutated. Primarily, these are the final position of the codon, but the first positions of some leucine and arginine codons can also be synonymously mutated. For each mutable position, sequences were produced in which every possible synonymous mutation was made. The GeRM scores of all 5-mers that overlapped with the mutable position were calculated, and the maximum of these GeRM scores was taken for each possible mutation or the native sequence. Then, the maximum GeRM score associated with the 5-mers overlapping the native sequence was divided by the mean of the maximum GeRM scores associated with all possible mutations to create a ratio. When this ratio is positive, the native nucleotide at the mutable position is associated with higher GeRM than the other possible mutations that could exist at that position on average.

Conservation of GeRM

PhyloP conservation across 100 vertebrates and 470 mammals in hg38 coordinates were downloaded from the UCSC Golden Path. These values were converted into transcriptomic coordinates based on the single transcripts per genes that were used for GeRM calculation. Mutable positions were binned based on their GeRM codon ratio with the binning calculations being applied to each codon separately, such that each bin contained an equal proportion of each codon.

For raw conservation analysis, the average PhyloP conservation values of mutable positions in each GeRm codon ratio bin were used. For analysis of normalized conservation values, the PhyloP of mutable positions were normalized for each codon by subtracting the median conservation of that mutable position in that codon by the median conservation across all mutable positions of the same type. For example, the conservation value associated with one instance of the A in the codon CGA would be normalized by subtracting the median conservation of all As in CGA codons. Each normalized position was then further normalized by subtracting the PhyloP value of the middle position of the codon (in the previous example, the G in that specific instance of the CGA codon). Subtraction was used for normalization because PhyloP scores operate on a logarithmic scale.

Arginine codon usage

For analysis of arginine codon usage in R-LCDs, all human LCDs were filtered for those that obtain at least 20% arginine residues. The proportion of each arginine codon in the respective coding regions was calculated. Pairwise correlations between arginine codon usages using Spearman’s rank correlation. To define CG-rich and GA-rich codon groups, principal component analysis was performed based on the arginine codon usage for all R-LCDs, and the top and bottom third of LCDs based on the first principal component were defined as CG-rich and GA-rich R-LCDs, respectively.

For the definition of R-LCDs in different species, all CDSs for each species were downloaded from Ensembl and a single transcript per gene was identified according to the previously described approach. LCDs encoded by these sequences were defined as previously described, using the complexity threshold defined for the human proteome for each species. A full list of the species and annotation versions can be found in Supplementary Table 1.

Gene Ontology analysis

Gene Ontology analyses were conducted using the topGO R package and using the weight01 algorithm to account for the topology of the Gene Ontology graph.

Cell culture

HeLa cells were obtained from Cell Science at The Francis Crick Institute and cultured in DMEM + GlutaMAX (Thermo Scientific) with 10% FBS and split every 3–4 days using TrypLE Express (Thermo Scientific). Low-passage, wild-type mouse embryonic stem cells (mESCs; 129S8/B6 background) were obtained from the Genetic Modification Service at The Francis Crick Institute and cultured feeder free on Nunc cell culture dishes (Thermo Scientific) coated with 0.1% gelatin (ES-006-B, Millipore). mESCs were maintained in 2i media⁶⁰, fed every day and split every 2–3 days using Accutase (A6964, Sigma). All cell lines were sourced from, authenticated by and mycoplasma tested at The Francis Crick Institute.

All cell transfections were performed using Lipofectamine 3000 (Thermo Scientific) according to manufacturer’s instructions. Doxycycline induction was performed by changing cell culture media to fresh, pre-warmed media containing 100 ng ml⁻¹ doxycycline. CLK-IN-T3 (SML2649, Sigma) was dissolved to 5 mM in DMSO. In experiments using CLK-IN-T3 treatments, control cells were treated with the equivalent volume of DMSO.

Plasmid construction

Double-stranded DNA fragments were ordered as gBlocks or eBlocks from IDT. Primers were ordered from Sigma-Aldrich. All PCRs utilized Q5 High-Fidelity Master Mix (NEB) or Phusion High-Fidelity Master Mix (NEB). Backbone plasmids were linearized via PCR followed by treatment with DpnI (NEB) before assembly with fragments using NEBuilder HiFi DNA Assembly (NEB) according to the manufacturer’s protocol. Phosphorylations and ligations were performed using T4 PNK kinase (NEB) and T4 DNA ligase (NEB). Assembly reactions were purified using magnetic SPRI beads (Mag-bind TotalPure NGS; Omega-Bio-tek) before DNA transformation into 5α competent Escherichia coli (NEB). All sequences were confirmed via Sanger sequencing (Source Bioscience). pTwist-AMP high-copy plasmid was purchased from Twist Bioscience and was used as a vector for all overexpression via transient transfection. Sequences of all LUC7L3 constructs can be found in Supplementary Table 5.

For the three additional R-MCD plasmids, the following regions of LUC7L3_221–415, PRPF38B_251–534 and SRSF11_171–389 were fused to mGreenLantern and assembled into pTwist-CMV plasmid backbones.

Subcellular fractionation

Nuclear–cytoplasmic fractionation was carried out by trypsinizing whole cells and washing twice with PBS. Cells were then pelleted and resuspended in 200 µl cytoplasmic lysis buffer: 50 mM HEPES pH 7.5, 2 mM MgCl₂, 50 mM 2-mercaptoethanol, 0.025% NP-40, 0.05% saponin and 1X cOmplete EDTA-free protease inhibitors (Sigma). Cells were rotated at 4 °C for 10 min, pelleted and the supernatant was taken as the cytoplasmic fraction. The nuclear pellet was washed in 1 ml cytoplasmic lysis buffer for 3 min, then pelleted again. The supernatant was removed and the nuclei were lysed in 200 µl iCLIP lysis buffer⁶¹ to obtain nuclear fractions. All centrifugations were performed at 4 °C and at 300g for 3 min. All buffers were kept ice cold. For each fraction, 20 µl was used for western blotting to confirm the efficiency of the fractionation, which was stable across all samples (Extended Data Fig. 4a, Supplementary Fig. 3). In all western blots, vinculin was used as a loading control (1:2,000; 700062, Thermo Scientific). Anti-phosphoepitope SR proteins (1:1,000; MABE50, Sigma) were used to detect SR phosphorylation changes.

RNA extraction

RNA extractions were performed from cell pellets using the Maxwell RSC simplyRNA kit (Promega) using a Maxwell RSC Instrument (Promega).

3′ End sequencing

For each sample, 250 ng µl⁻¹ of RNA was fragmented in 10 mM Tris-HCl pH 7.5 and 10 mM MgCl₂ buffer for 5 min at 95 °C. Of fragmented RNA, 2 µl was mixed with 2 µl water, 0.5 µl of 10 mM dNTPs and 0.5 µl of 5 µM of oligo-dT reverse transcription primers containing Illumina P7 sequences, and primers were annealed by heating at 65 °C for 3 min then cooling to 42 °C at 1 °C per second. At this point, reverse transcription was carried out using SuperScript IV (Invitrogen) according to manufacturer’s instructions, with the addition of 0.25 µl of 40 µM of a template-switching oligo containing Illumina P5 sequences and unique molecular identifiers (UMIs). The reaction was incubated for 1 h at 42 °C. Following alkaline hydrolysis of RNA, the product was purified using magnetic SPRI beads and amplified using 0.5 µM i5/i7 Illumina indexed primers in Q5 High-Fidelity Master Mix (NEB).

Analysis of RNA-seq data

All sequencing data were mapped to the GRCh38.p12 genome using the GENCODE v29 basic genome annotation. All 3′ end sequencing data were trimmed using cutadapt (v4.4)⁶² to remove poly-A sequences and Illumina adapters. All data were processed using the nf-core RNA-seq Nextflow pipeline (v3.12.0)⁶³, and in the case of the 3′ end sequencing data, the additional option –noLengthCorrection was provided to Salmon to prevent length correction for gene expression.

Differential expression analysis was conducted using DESeq2 in R⁶⁴, with effect size shrinkage using the apeglm package⁶⁵. Changes in the nuclear:cytoplasmic ratio between two conditions were assessed using the interaction term of the models (fraction × time). Splicing analysis was conducted using rMATS (v4.1.2)⁶⁶ using the skipped exons calculated using the JCEC quantifications.

Transcriptome-wide GA-multivalency scoring

To classify genes based on their degree of GA multivalency, a GA-multivalency score was calculated for each gene based on the CDS of its primary transcript, defined by the longest annotated CDS, with ties settled by total transcript length. First, for each of the three previously identified GA-rich GeRM clusters (A-rich and G-rich AG, AG + C), representative k-mers were derived as described in the section ‘Identification and classification of GeRM regions’. Then, raw multivalency scores of 5-mers were scaled to all other instances of the same 5-mer, such that their mean score was 0 and their standard deviation was 1. For each transcript, the sum of all these scaled GeRM scores was summed across all representative 5-mers. On the basis of this metric, transcripts that have many instances of GA-rich 5-mers specifically in multivalent contexts are rewarded. All processed data from 3′ end sequencing experiments can be found in Supplementary Table 4.

pSILAC

Cells were grown for 48 h in lysine-free and arginine-free DMEM (Thermo Scientific), supplemented with 10% dialysed FBS (Thermo Scientific), 600 mg l⁻¹ proline and 100 mg l⁻¹ arginine and lysine (‘light’ media). Cells were induced with 500 ng ml⁻¹ doxycycline for 12 h before switching to either medium arginine ¹³C₆ (88210, Thermo Scientific) and lysine 4,4,5,5-D4 (DLM-2640, Cambridge Isotope Laboratories; ‘medium’ media) or heavy arginine ¹³C₆ ¹⁵N₄ (89990, Thermo Scientific) and lysine ¹³C₆ ¹⁵N₂ (88209, Thermo Scientific; ‘heavy’ media) for 8 h. Media were refreshed every 2 h, with doxycycline treatment maintained throughout.

Mass spectrometry

Cells were washed with PBS and lysed in 4% SDS, 100 mM Tris-HCl pH 7.5 buffer supplemented with protease inhibitors (Sigma). Samples were next sonicated, and their protein concentrations quantified via DC protein assay (Bio-Rad).

Lysates were reduced by the addition of dithiothreitol to a final concentration of 5 mM, vortexed briefly and incubated at 56 °C for 30 min. Samples were cooled to room temperature and alkylated with iodoacetamide to a final concentration of 20 mM. Following brief vortexing and incubation in the dark at room temperature for 20 min, residual iodoacetamide was quenched with 5 mM dithiothreitol for 10 min. Trypsin was added at a 1:100 enzyme:substrate ratio. Samples were vortexed, centrifuged for 1 min at 14,000 rpm and incubated overnight at 37 °C shaking at 750 rpm.

Peptide samples (1 µg on column) were analysed by liquid chromatography with tandem mass spectrometry (LC–MS/MS). Chromatographic separation was performed using the U3000 UHPLC NanoLC system (Thermo Scientific) and peptides were resolved by reversed phase chromatography on a 75-µm C18 Pepmap column (50 cm length) via a three-step linear gradient comprising 80% acetonitrile in 0.1% formic acid. The gradient was delivered to elute peptides at a flow rate of 250 nl min⁻¹ over 120 min, initially at 4% solvent at 0–10 min, then increasing to 30% solvent (10–75 min) and 40% solvent at 75–80 min. This was followed by a wash step with 99% solvent (85–90 min) and a final equilibration step with 4% solvent (90–120 min).

Peptides were ionized by electrospray via an Orbitrap Fusion Lumos mass spectrometer (Thermo Scientific) operating under Xcalibur (v4.3). Data were acquired in data-dependent acquisition mode using an Orbitrap-Ion Trap method with a 3-s cycle time between a full MS scan and MS/MS fragmentation by collision-induced dissociation. Orbitrap spectra were collected at 120,000 resolution over an m/z range of 350–1,600, with an automatic gain control of 4.0 × 10⁵ (100%) and maximum injection time of 35 ms. Monoisotropic precursor selection was enabled for charge states +2 to +5 with an intensity threshold of 5 × 10³–1 × 10²⁰ and dynamic exclusion of 35 s ± 10 ppm. MS2 precursor ions were isolated in the quadrupole with a mass width filter of 1.6 m/z. Ion trap fragmentation spectra were collected with an automatic gain control target setting of 1.0 × 10⁴ (100%) with a maximum injection time of 35 ms with collision-induced dissociation collision energy set at 35%.

Raw mass spectrometry data were processed into peak list files using Proteome Discoverer (v3.1; Thermo Scientific) and searched with the Sequest algorithm⁶⁷ against the Uniprot Human isoform database (188,476 entries; September 2024). This was concatenated with the mScarlet fluorescent protein sequence. Database searching was performed at a stringency of 1% FDR including a decoy search with precursor ion intensity quantification. Post-translational modifications to amino acid residues included carbamidomethylation (C) and variable oxidation (M), SILAC ¹³C₆ (R) and SILAC ²H₄2 (K).

The proportion of SILAC-labelled peptides for each protein was compared between doxycycline-treated and untreated samples to determine the change in translation or protein turnover. The mean log₂ fold change for all proteins belonging to different multivalency classes in each replicate was calculated and this per-replicate mean was used for pairwise significance testing. The proportion of the total proteome or total R-MCD abundance attributed to mScarlet–PPIG_LCD overexpression was calculated using the peptides assigned to mScarlet, and R-MCDs were classified as previously defined R-LCDs (those containing at least 20% arginine), with the additional constraint that the net charge must be positive and the total fraction of charged residues in the LCD must be greater than 40%.

Immunofluorescence

Immunofluorescence was performed by fixing cells in 4% paraformaldehyde and 0.1% glyoxal at room temperature for 10 min. Samples were permeabilized by incubating with PBS with 0.5% Triton-X for 5 min at room temperature. Samples were blocked using PBS with 3% BSA and 0.1% Tween (blocking solution). Primary antibody incubations were carried out in blocking solution for 1 h at room temperature or overnight at 4 °C. Samples were washed three times with PBS with 0.1% Tween for 5 min, and secondary antibody incubations were carried out in blocking solution for 1 h at room temperature in the dark. Samples were washed twice, then incubated with 200 ng ml⁻¹ DAPI for 15 min in the dark at room temperature. DAPI was washed out and samples were placed in glycerol mounting medium (90% glycerol, 20 mM Tris pH 8 and PBS).

The following antibodies were used at the following concentrations: rabbit anti-TRA2B (1:1,000; ab31353, abcam), rabbit anti-SON (1:1,000; HPA023535, Sigma) and mouse anti-SC35 (1:500; S4045, Sigma). Secondary antibodies to mouse and rabbit, conjugated to Alexa Fluor 488 or Alexa Fluor 647 were used at 1:500 dilution (Abcam). This included goat pAb anti-rabbit IgG Alexa Fluor 647 (ab150079), goat pAb anti-rabbit IgG Alexa Fluor 488 (ab150077), goat pAb anti-mouse IgG Alexa Fluor 488 (ab150113) and goat pAb anti-mouse IgG Alexa Fluor 647 (ab150115).

HCR-FISH

For each mRNA, HCR 12 probe pairs were designed with the B4 amplifier sequences using the HCR 3.0 Probe Maker⁶⁸. HCR 3.0 was performed on cells in µ-Slide eight-well glass bottom slides (80827, Ibidi) as described in the original protocol⁶⁹, with the following modifications. First, fixation and permeabilization was performed as described in the immunofluorescence section. After pre-hybridization, primary probe hybridization was performed with a probe concentration of 10 nM for 3 h. Overnight HCR amplification was performed in a volume of 125 µl with half the concentration of HCR hairpins (Molecular Instruments). Following HCR amplification, samples were washed according to the original protocol, and then immunofluorescence was performed as previously described, but with all buffers containing 2XSSC in place of PBS.

In all cases of R-MCD overexpression transfections, HCR-FISH was performed on HeLa cells 24 h post-transfection with 50 ng of mGreenLantern-fused R-MCD plasmid, in combination with 200 ng of pUC19 filler DNA.

oligo-d(T) FISH

Samples for oligo-d(T) FISH were fixed as described for immunofluorescence. Samples were incubated with HCR 3.0 hybridization buffer for 30 min at 37 °C, then in hybridization buffer containing 1 µg ml⁻¹ oligo-dT(25)-Cy5 for 2 h. Samples were washed twice in 5XSSC with 0.1% Tween, stained with 200 ng ml⁻¹ DAPI in the same buffer and washed once more before glycerol mounting medium was added.

Microscopy and image analysis

Microscope images were acquired using an Olympus IX3 Series (IX83) inverted microscope, equipped with a Yokogawa W1 spinning disk and a Hamamatsu Orca Fusion CMOS camera (pixel size of 6.5 μm, 2,304 × 2304, 5.3 megapixels).

All images were z-projected using a maximum projection. Segmentation of nuclei and cytoplasms was performed using Cellpose (v2.0)⁷⁰ using the cyto2 model. Segmentation of nuclear speckles was performed using CellProfiler using the Otsu thresholding algorithm, and nucleoplasms were considered all non-speckle regions of the nucleus. The mean signal intensity for HCR-FISH or immunofluorescence within speckles was compared with the mean intensity in the nucleoplasm for each cell to create a speckle enrichment ratio. Mean intensities in the nucleoplasm and nuclear speckle were calculated per cell, but statistical comparisons between groups counted each replicate as a single observation, taking the mean values across all cells in the replicate.

Features predicting nuclear retention of RNA

Using predictive modelling, we aimed to evaluate the contribution of CDS length, EJC density and different types of multivalent features on nuclear retention of endogenous transcripts in interstasis. To disentangle the contributions of the different multivalent features, we first cleaned the data to reduce collinearities. Representative k-mers were assigned for each of the eight GeRM clusters as described in the section ‘Identification and classification of GeRM regions’. Subsequently, k-mers belonging to multiple clusters were retained only in the cluster to which they contributed the highest proportion of multivalency, ensuring no k-mers overlapped between clusters.

Next, the raw k-mer multivalencies for all k-mers belonging to a cluster were summed across CDSs of representative transcripts, defined as the transcript with the longest CDS per gene, and normalized by CDS length. These length-normalized multivalency sums were then standard scaled across the transcriptome so that the mean transcriptome-wide multivalency of each GeRM cluster was zero. We assessed the correlations between the multivalencies of the eight GeRM clusters and found a high correlation among the C-rich, CUG-repeat, G-rich, G/C and GA + C multivalent classes. Given this high correlation and the similarity of their k-mers, we combined these clusters into a single GC-rich cluster for training. We repeated the scoring with the combined k-mers from all these clusters. This approach efficiently minimized collinearities between features, except for a moderate anti-correlation (Spearman’s r = −0.75) between the GC-rich and A-rich clusters. Features used for training are summarized in Supplementary Table 8.

Then, we trained a classification model to differentiate between mRNAs retained in the nucleus after PPIG_LCD induction and control mRNAs that did not show a significant change in their nuclear-to-cytoplasmic ratio under these conditions. mRNAs were classified as retained if they had a q < 0.05 and a log₂ fold change greater than 1 at 12 h post-PPIG_LCD induction, resulting in 275 transcripts. Control mRNAs were defined by a q ≥ 0.05 and a log₂ fold change between −0.1 and 0.1, yielding 1,697 transcripts.

Owing to significant imbalance between the positive (nuclear retained) and the negative (control) class, we used a BalancedRandomForestClassifier from imbalanced-learn module (v0.12.3)⁷¹, leveraging sampling of the majority class to balance the classes in the process of training (sampling_strategy = ‘not minority’). The classifier was configured with 100 trees (n_estimators = 100) and used bootstrap sampling with replacement to enhance generalization. Gini impurity criterion was used for node splitting and the minimum number of samples required at a leaf node was set to 15. For evaluation of the classifier’s performance, we used fourfold stratified cross-validation. This method splits the data into fourfolds, with each fold serving as a test set once while the remaining threefolds serve as the training set. Stratified k-fold cross-validation ensures that each of the folds maintains the same proportion of positive and negative samples as the entire dataset, thereby preventing skewed performance metrics. In this case, the model was trained four times, with each training set comprising 75% of the data and each test set comprising 25%, ensuring balanced class representation in both training and testing. Visualized AUROC metrics were averaged across folds to obtain a robust estimate of the classifier’s performance.

To identify the optimal set of input features and assess their effect on predictive performance, we used a stepwise feature-selection approach. Initially, we trained models with individual features and progressively expanded the feature set, by selecting the best-performing single feature or feature combination and adding the remaining features on top of it. This process was repeated until adding more features no longer enhanced the performance of the model. All models were evaluated using fourfold cross-validation, as described.

Reporter library design

The mScarlet–PPIG_LCD library was designed by concatenating the CDSs for mScarlet and PPIG_402–684. The concatenated sequence was then iteratively synonymously mutated such that the GC content of the sequence was maintained within 2% of the original sequence, choosing codons with fewer As and Gs until the number of AG-only 5-mers reached a minimum. Untranslated regions were added to this sequence. Seven equally spaced sequences of 28 nt were chosen along the sequence to be used as overlap sequences for future Gibson assembly. These overlap sequences were protected from future in silico mutagenesis.

Twenty-six constitutively spliced introns ranging from 83 nt to 384 nt in length were selected from housekeeping genes. We generated 25 different combinations by randomly choosing 8 of these introns, then performed the following iterative optimization algorithm to each of these 25 combinations to generate sequences with optimized predicted splicing.

First, introns were inserted in the reporter CDS in positions that were exactly between the previously defined overlap sequences or the start and end codons. The probability of donor and acceptor site usage was predicted using SpliceAI, and all probabilities were summed. Next, 100 coding sequences with 5 random synonymous mutations were generated, and a sequence was selected that best maintained GC content and kept GA 5-mer content low. After this, an intronic sequence was selected with a 80% chance to be mutated at 5% of its positions and a 50% chance to be shifted by up to 20 nt. The probability of donor and acceptor site usage in the new sequence was predicted with SpliceAI, and the sum of the probabilities was compared with the previous sequence. If the total probability of splice site usage was increased in the mutated sequence, then the mutated sequence was put forwards for another round of mutation, until 2,000 iterations were completed.

To create a GA-rich version of the reporter CDS, the iterative process was repeated, but with synonymous mutations that maintained GC content but maximized GA-rich 5-mer content. The sequence and position of the introns were kept constant for this process.

To determine which of the 25 possible combinations of 8 introns resulted in the best-predicted splicing in both GA-poor and GA-rich reporter sequences, we selected the intron set with the highest summed probability of predicted donor and acceptor site usage. This resulted in GA-poor and GA-rich codon-biased reporter sequences in which both splice sites for each of the eight introns was predicted to have at least a 98.5% probability of being spliced.

For both the GA-rich and the GA-poor sequence, eight sequence segments were generated that spanned from each overlap sequence to the next (or to the untranslated region) and that either contained an intronic sequence or not. Therefore, each of the eight segments of the final reporter gene varied in two possible ways: GA richness and the presence or absence of an intron, and each of these variable segments shared a fixed overlap sequence with the neighbouring segment.

Reporter library generation

All 32 possible segment variations were ordered as double-stranded DNA fragments from IDT. In addition to the overlap sequences that would be part of the final reporter transcript, additional overlap sequences were added for intermediate assembly steps. Specifically, an overlap sequence was added to the 3′ end of the first segment and to the 5′ of the last segment, and another overlap sequence was added to the 3′ ends of the first and fifth segments and the 5′ ends of the fourth and last segment. Then, two pools containing all variations of the first four segments or all variations of the last four segments were made, such that intron-containing variants had twice the molarity of intronless variants at each position, but all positions had the same overall molarity. A Gibson assembly reaction was carried out on each pool, and due to the additional overlaps on the first, fourth, fifth and final segments, circular pieces of double-stranded DNA were formed, each containing one of the four variations at each of the four positions used for that pool.

These circular pieces of DNA were then linearized and amplified by PCR, and the linearized fragments deriving from the first and second half of the reporter gene were again Gibson assembled together. Again, because of the additional overlap sequences attached to the first and last segments, the resulting assemblies were circular. Each assembly contained a 5′ UTR, a CDS encoding mScarlet–PPIG_LCD but with a variable amount of GA-rich 5-mers and a variable number of introns, and a 3′ UTR. These assemblies were then linearized and amplified by PCR and Gibson assembled into a plasmid backbone (pTwist-CMV) containing a CMV promoter and a barcoded 3′ UTR sequence with a poly-adenylation site. Therefore, the plasmids in the pool contained different versions of the reporter gene, each with a unique barcode in their 3′ UTR.

After each reaction, the product was purified using Mag-Bind TotalPure beads. All PCRs were performed using Q5 High-Fidelity Master Mix (NEB). All Gibson assemblies were performed using NEBuilder HiFi DNA Assembly (NEB).

To generate a doxycycline-inducible PiggyBac integration vector containing the reporter library (ePB-ExportReporter), reporter genes and their barcoded UTRs were amplified out of the original reporter plasmid library with a low number of PCR cycles, before being inserted into an all-in-one PiggyBac plasmid (ePB-Bsd-TT-NIL) containing a doxycycline-inducible promoter as well as a constitutive promoter driving the expression of the rtTA protein and blasticidin resistance gene. All designed fragments, as well as example reporter sequences and plasmids, are provided in Supplementary Table 2.

Long-read sequencing

For sequencing of the plasmid pool, the reporter genes were amplified from the plasmid DNA with a minimal number of PCR cycles. For sequencing of the reporter RNA, one well of a six-well plate of HeLa cells were transfected with 500 ng of the reporter plasmid pool and 2,000 ng of a filler pUC19 plasmid for 16 h. RNA was extracted and 500 ng of total RNA was used as input for a SuperScript IV reverse transcription (Invitrogen) reaction using an oligo-dT reverse transcription following the manufacturer’s instructions. The reaction was purified with magnetic beads, then amplified according to the same approach as for the plasmid DNA. The specificity of the amplicon was confirmed with an agarose gel. In both cases, the PCR product was purified with magnetic beads (Mag-bind TotalPure NGS, Omega-Bio-tek) and Nanopore adapters were added using the Nanopore Ligation Sequencing Kit (Oxford Nanopore), according to the manufacturer’s instructions. The libraries were sequenced separately using MinION R9.4.1 Flow Cells (Oxford Nanopore).

Raw Nanopore data were basecalled with Guppy (v6.0.1). The basecalled reads were mapped to a genome of possible reporter transcripts using minimap2 (ref. ⁷²), and barcodes were extracted by matching the upstream and downstream sequences using the R package stringdist⁷³.

Targeted sequencing and its analysis

Targeted sequencing libraries were prepared from 500 ng of RNA input per sample. A targeted reverse transcription was performed with SuperScript IV (Invitrogen) using a reverse transcription primer that was specific to the reporter 3′ UTR and contained a 4-nt experimental barcode, a 10-nt UMI and an Illumina p7 adapter sequence. The reaction was performed according to manufacturer’s instructions. Samples were multiplexed in groups of six at this point, and the RNA was removed by alkaline hydrolysis for 15 min, the pH was neutralized with hydrochloric acid, and the cDNA was purified with magnetic beads. A nested PCR was then performed in which ten cycles with one set of primers was performed. Of the reaction, 5% was spiked into the next six-cycle reaction, in which an Illumina p5 adapter sequence was added. The PCR was purified with magnetic beads (Mag-bind TotalPure NGS, Omega-Bio-tek), and Illumina i5 and i7 indices were added by a final PCR. Libraries were sequenced on a NovaSeq 6000 by the Advanced Sequencing Facility at The Francis Crick Institute. All primer sequences can be found in Supplementary Table 2.

Targeted sequencing libraries were analysed in R. Libraries were demultiplexed using the barcodes introduced by the reverse transcription primer, and the plasmid barcode was extracted by matching the flanking constant regions using fuzzy string matching using the R package stringdist. Reads containing the same barcode were deduplicated by their UMI sequences. Barcodes were counted and related to their reporter gene characteristics based on the long-read sequencing data. Barcode counts were normalized to the total number of assigned barcodes for each sample. Barcodes with low numbers of counts were excluded from further analysis. Nuclear-to-cytoplasmic ratios for each barcode were calculated for each replicate. All processed data for each experiment can be found in Supplementary Table 3.

Cell line generation

We cloned the reporter pool into a doxycycline-inducible PiggyBac vector and generated the PiggyBac cell lines by co-transfecting HeLa cells with equal amounts of the ePB-ExportReporter plasmid pool and a plasmid expressing the PiggyBac transposase. A transposase-negative control was also prepared. Cells were selected once with 5 µg ml⁻¹ blasticidin for 4 days, then expanded for 1 week, before being selected again with blasticidin until all cells in the transposase-negative control were dead. This pool of ePB-ExportReporter-positive cells was then expanded and stocks were frozen.

Calculating RBP-binding potential

To understand which RBPs have the potential to bind to GA-multivalent reporters with high affinity, we designed a ‘binding potential’ score, based on RBP-binding motifs obtained with the RNA-Bind-N-Seq method. We downloaded enrichment scores (R scores) for 78 RBPs published by Dominguez et al.³⁰, and for TRA2B³² from the ENCODE portal, yielding motifs for 79 distinct RBPs. Accession codes and relevant concentrations of RNA-Bind-N-Seq experiments are specified in the supplementary material of the original publication (see supplementary table 3 in ref. ³⁰). The R scores for the RBP concentration that yielded the highest motif enrichment were converted to z scores using mean and standard deviation. A tail test was then applied to compute P values, and motifs with a P < 0.05 were selected for further analysis.

To evaluate the reporter sequences in terms of RBP-binding potential, we divided their CDSs into 5-mers using a rolling window of 5 nt. We then counted the 5-mers that were labelled as significant for a given RBP and multiplied the count for each significant 5-mer by its z score. The values obtained were summed into a single score for each RBP (Fig. 5h). The binding potential scores for all evaluated RBPs and reporter sequences can be found in Supplementary Table 9.

Multivalency preference in public CLIP data

All public CLIP data were processed as previously described⁷⁴, with replicates merged. All crosslinks falling within CDS per sample were counted and the proportion of CDSs to total crosslinks was calculated. Samples in the bottom third were excluded. Remaining samples in the bottom third of raw CDS crosslink counts were also excluded, to remove low-quality datasets. To determine whether a protein tends to bind to RNA in multivalent contexts, the percentile of the GeRM score was calculated for each possible 5-mer in all CDSs. For each crosslink in each sample, the mean GeRM percentile of each crosslinked 5-mer was calculated. If a given 5-mer is crosslinked preferentially in multivalent contexts (relative to the average multivalency of that 5-mer), then the mean GeRM percentile will be high. The mean 5-mer GeRM percentiles for the top 50 most crosslinked 5-mers was compared with all other 5-mers, producing a ratio. This ratio represents the general bias of a protein to bind its preferred motifs in a multivalent context. In the examples shown in Extended Data Fig. 8i,j, this ratio is the mean y axis position for the 50 right-most points on the x axis over the mean y axis position of all other points. Only samples with a ratio of at least 1.1, implying a preference for multivalent motifs, were kept for further analysis.

Only transcripts in the top half of total crosslinks per nucleotide in the CDS across all samples were used for further analysis, as samples with a low crosslinking density typically did not contain enough crosslinks per GeRM region to produce meaningful results. For each GeRM region, the ratio of the number of crosslinks falling within that GeRM region compared with the rest of the CDS for that transcript was calculated. The mean of this ratio within each GeRM cluster was calculated for each sample. Samples were then hierarchically clustered and plotted as a heatmap (Fig. 5a).

TRA2B iCLIP and its analysis

iCLIP was performed according to the iiCLIP protocol⁶¹, using crosslinked cell lysate containing 1.5 mg of protein per sample. Two replicates were prepared for each CLIP target. For immunoprecipitation, 5 µg anti-TRA2B (ab31353, abcam) was used. Membrane transfer was performed overnight at 4 °C with reduced voltage (15 V). The entire lane was cut from 40 kDa upwards.

CLIP libraries were trimmed and demultiplexed using Ultraplex⁷⁵ and mapped to a small RNA genome containing all rRNA, small nuclear RNA, tRNA and small nucleolar RNA sequences from GENCODE vM22 using STAR (v2.7.0)⁷⁶. All unmapped reads were then mapped uniquely to the genome using STAR, with the transcriptome-mapping reads output using the flag –quantMode TranscriptomeSAM, and the 5′ end of the read was forced to be aligned using the flag –alignEndsType Extend5pOfRead1. Reads were demultiplexed using UMItools⁷⁷. Crosslink sites were defined as the position immediately upstream of the read.

Metaprofiles around different GeRM regions were created by taking the midpoint of each GeRM region and counting the transcriptome-mapped crosslinks in each position relative to the midpoint. All counts were summed across GeRM regions of the same cluster, then normalized by the number of GeRM regions within the cluster and the number of millions of unique transcriptome-mapping crosslinks in each sample.

TRA2 siRNA transfection

For siRNA-induced double TRA2 knockdowns, 100 nM of ON-TARGETplus siRNA pools, consisting of four siRNAs targeting both TRA2A (LQ-019480-00-0005, Dharmacon) and TRA2B (LQ-007278-00-0005, Dharmacon), were transfected using Lipofectamine RNAiMAX (Thermo Scientific) following the manufacturer’s reverse transfection protocol. Transfections with a 100 nM scrambled non-targeting control pool (D-001810-10-05, Dharmacon) were carried out in parallel. Thirty-six hours after transfection, doxycycline was added to the transfection media to induce mScarlet–PPIG_LCD expression over a period of 12 h. A total of 48 h post-transfection, cells were either fixed for HCR-FISH as previously described or collected for western blotting to confirm efficient knockdown of TRA2A (1:500; H00029896-B01P, Novus Biologicals) and TRA2B. This was quantified via ImageStudio 5.x Odyssey Clx (LI-COR).

Flow cytometry

For each sample, a single well of a six-well plate of ePB-ER HeLa cells was transfected with one of the mGreenLantern LUC7L3-LCD reporter plasmids in which the LUC7L3 sequence was used as a 3′ UTR but was either GA rich or GA poor. Doxycycline was added to the medium at the point of transfection. Four replicates were performed for each plasmid, as well as untransfected and uninduced controls. After 16 h, the cells were dissociated, washed twice with ice-cold PBS and kept on ice while the samples were measured with an LSR Fortessa (BD Biosciences). The intensities of mGreenLantern and mScarlet were measured using a 488-nm blue laser (Coherent Sapphire 100 mW) with a 530/30-nm filter, and a 561-nm yellow green laser (Coherent Sapphire 100 mW) with a 610/20-nm filter, respectively.

Gating of singlets was performed using FlowJo, and subsequent analysis was performed in R. For each plasmid, the mGreenLantern intensities were normalized to the mean intensity values of the uninduced control samples. Cells were binned according to their mScarlet intensity values, and the mean mGreenLantern intensity values for all cells in each bin were calculated per replicate.

Statistics

All statistics were performed in R (v4.2.0). All tests were two-tailed when possible. All pairwise comparisons were made using a Welch t-test unless otherwise stated. When appropriate, P values were corrected for multiple testing using the Benjamini–Hochberg procedure.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.