Expanding the human proteome with microproteins and peptideins

PeptideAtlas database construction and searching

The human non-HLA PeptideAtlas 2023-06 build contains 295 ProteomeXchange datasets (PXDs)⁵⁰ split into 1,172 different experiments that comprise a total of 3.5 billion MS/MS spectra. Sequence database searching was performed using MSFragger⁵¹ v.3.7 using search parameters appropriate for each dataset, depending on alkylation, labelling, fragmentation type, instrument, enrichment strategy and more. All datasets were searched with semi-enzymatic settings (typically semi-tryptic). The search database was 2023-02 THISP level 4 database⁵² (https://peptideatlas.org/thisp/), which included the 7,264 Ribo-seq ORFs from ref. ⁴ as well as other contributed sequences that might be translated. All datasets were searched with generic artifactual variable modifications methionine oxidation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid from glutamic acid or glutamine, and asparagine and glutamine deamidation. The alkylation modification was set as a fixed modification (typically carbamidomethylated cysteine).

Statistical validation of the results for each experiment was performed using the Trans-Proteomic Pipeline (TPP)^53,54 v.7.0 tools PeptideProphet⁵⁵, iProphet⁵⁶ and PTMProphet⁵⁷, and the results were mapped to the human proteome using ProteoMapper⁵⁸, taking known variants into account, and to the genome using the ENSEMBL⁵⁹ toolkit as previously described⁶⁰. A complete list of datasets used and a summary of the search results in each build are available online (https://peptideatlas.org/builds/human/non-hla/). Supplementary Table 15 provides FDR metrics at the PSM-, peptide- and protein-levels, as well as for certain subsets of proteins, including the neXtProt core proteome, the 7264 Ribo-seq ncORFs, as well as all CONTRIB sequences, many of which are putative ncORFs.

The Human HLA PeptideAtlas 2023-11 build comprises a set of 118 HLA immunopeptide-enriched publicly available PXDs, which we split into 592 separate experiments, containing 240 million MS/MS spectra from 9,776 MS runs (Supplementary Table 16). Sequence database searching was performed with MSFragger v.3.7 using search parameters appropriate for each dataset, depending on sample handling. All datasets were searched in no-enzyme mode. While some HLA peptides have a lysine or arginine on the C terminus, and therefore exhibit fragmentation patterns typical of tryptic peptides, many HLA peptides do not have such characteristics and their spectra may therefore have strong b ions and internal fragmentation ions, rather than strong y ions, which are customary in tryptic peptide spectra.

To estimate the FDR of ncORF detections, we used the target–decoy entrapment approach^61,62. We generate fake protein sequences at a 1:1 ratio with target sequences by scrambling the sequence of each target protein and adding it to the sequence database. This gives the search engine and post-processing software the opportunity to assign some spectra to decoy sequences (presumed to be wrong), and then the number of target-sequence mistaken assignments can be estimated as 1:1 with the number of decoy-sequence mistaken assignments. The entrapment part of the approach is to ensure that the software pipeline is not given knowledge of the decoys. The approach is not perfect as decoys do not model target sequences perfectly; in one direction, the similarity of some real protein sequences to one another may cause a small bias toward making errors to target sequences, while, in the other direction, scrambled sequences may yield an observable sequence (not already in the target database) occasionally by pure chance. Despite small imperfections, the technique is widely regarded as sufficiently accurate and is the standard in the community. Supplementary Table 16 provides FDR metrics at the PSM, peptide and protein levels, as well as for certain subsets of proteins, including the neXtProt core proteome, the 7,264 Ribo-seq ncORFs, as well as all CONTRIB sequences, many of which are putative ncORFs.

The search database was the 2023-07 THISP level 4 database⁵² (https://peptideatlas.org/thisp/), which included the 7,264 Ribo-seq ORFs from ref. ⁴ as well as other contributed sequences that might be translated. 299 common contaminants based on the list from ref. ⁶³ minus the human proteins are included in the search database (available at https://peptideatlas.org/thisp/). All datasets were searched with generic artifactual variable modifications methionine oxidation, cysteine cysteinylation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid from glutamic acid or glutamine, and asparagine and glutamine deamidation. Static carbamidomethylation of cysteine was set for experiments that used Iodoacetamide. For samples that were treated with tandem mass tag or SILAC or enriched for phosphorylated peptides, appropriate mass modifications were applied. Statistical validation was performed as described above by the TPP. A complete list of datasets used and a summary of the search results in each build are available online (https://peptideatlas.org/builds/human/hla/).

Protein identifications and categories

Peptides are preferentially mapped using ProteomeMapper⁵⁸ (in TPP v.7.0) to the 20,389 entries (core proteome) and their isoforms of the 2023 version of neXtProt⁶⁴ taking into account all single amino acid variants encoded in neXtProt. Proteins that have 2 or more uniquely mapping non-nested (contained completely within the other) peptides of length 9 or more amino acids, together covering at least 18 amino acids are categorized as canonical by PeptideAtlas. If a protein entry meets the above two-peptide criteria with peptides that cannot be mapped to the core proteome, they are termed non-core canonical. There are nine additional categories, including indistinguishable representative, indistinguishable, representative, marginally distinguished, subsumed, weak, insufficient evidence for various scenarios of ambiguous and redundant evidence. Finally, the categories ‘identical’ are assigned to entries that are sequence-identical to another entry, and proteins that have no peptide evidence whatsoever are categorized as not detected. See ref. ⁶⁵ for an extensive description of the PeptideAtlas protein categories. For reasons of integration with the HPP annual metrics^66,67,68, only sequence entries that belong to the core set of around 20,389 neXtProt⁶⁴ and UniProtKB/Swiss-Prot⁶⁹ protein-coding genes can achieve canonical status.

Manual inspection of ORF MS spectra

Despite extraordinary efforts to minimize false positives, both builds do contain some false positives, and they are most easily found mapping to proteins that are unlikely to be detected. For gene annotation purposes, manual inspection is therefore crucial to ensure that few false positives are reported for extraordinary detections, as described extensively previously¹⁵. We manually inspected each of the peptides corresponding to ncORFs and provided a manual categorization as well as a commentary. The manual categories are as follows: excellent (highly compelling evidence that the peptide identification is completely correct); good (the PSM is likely correct but lacks sufficient quality and coverage of the residues to provide highly compelling evidence); false positive; close but false positive (the PSM has many matching ions and is likely to be almost the correct peptidoform, but slight discrepancies indicate that the true identification is very close but not quite the listed sequence); low information (the ions that are detected are compatible with the identification, but coverage is too low to be compelling). The best peptide-spectrum match is also listed in the Supplementary Tables 2 and 6 as a USI that can be resolved and viewed online (https://proteomecentral.proteomexchange.org/usi/), or in cases in which a USI cannot be achieved, a direct URL for the spectrum in the PeptideAtlas web interface. In any case, all protein entries, peptides and spectra may be browsed using the PeptideAtlas web interface starting at the URLs provided above.

Procedure for manually validating PSMs

(1) Obtain a listing of PSMs for a given peptide in PeptideAtlas. (2) Examine PSMs until at least one PSM provides excellent evidence, and record its USI (Universal Spectrum Identifier) if available, or PeptideAtlas spectrum viewer URL if a USI is not available. For spectra without a PXD number associated with the dataset, a USI is usually not available. This is most common in Clinical Proteomic Tumor Analysis Consortium (CPTAC) datasets, for which a PXD has not been assigned. PSMs with USIs should be checked at https://proteomecentral.proteomexchange.org/usi/.

(3) Evaluate the PSM as follows. To obtain the ‘excellent’ rating: (i) the combination of b ion and y ion series must yield nearly complete coverage of the proposed peptidoform explanation. For tryptic or tryptic-like peptides (Arg or Lys residue on the C terminus), this will typically mean a nearly complete y ion series and a b ion series that begins at b2 and at least meets the y ion series. For the rules above and below for tryptic-like ions, swap y ion for b ion when there is a basic residue instead on the N terminus. (ii) If there are any prominent peaks beyond the last matching ion peak, suggesting that the sequence should extend with different residues, the PSM is not excellent. (iii) Any gaps in the y or b ion series must not have a plausible unannotated candidate in the gap, implying that the true identification is slightly different than the proposed identification. Such a plausible unannotated candidate must have a mass defect between the ions before and after the gap. (iv) Gaps should have a plausible explanation for low intensity, such as a y ion C-terminal to a proline. (v) For tryptic-like peptides, the y ions N-terminal to a proline should be more intense than surrounding ions, although confounding factors such loss of sensitivity at the high m/z end or other nearby prolines should also be considered. (vi) Strong b2 and corresponding a2 diketopiperazine ions are preferred in HCD spectra. There may be a gap at b1 ions as these are usually not visible unless there is an N-terminal mass modification. (vii) Internal fragmentation ions should be considered when annotating peaks, especially for peptides without basic residues at either terminus. (viii) Mass modifications should be kept to a minimum. (ix) For long peptides especially, there must not be a substantial region with no ions. (x) There should be no prominent unannotated peaks that suggest contamination or misassignment. Internal fragmentation ions and neutral losses should be considered for peaks that are not attributable to ordinary b and y ions.

A flowchart that describes the decision process and the outcome groups is shown in Extended Data Fig. 2e.

Gene annotation

The gene annotation work in this study has been carried out as part of the ongoing GENCODE project using existing workflows¹¹.

Annotating immunopeptidomics MS runs

All HLA-I MS runs were annotated for the source material (cancer versus non-cancer and cell line versus non-cell line) (Supplementary Table 8). These annotations were largely based on what was documented by PeptideAtlas but, for several instances, the category was changed based on data in the publication corresponding to the MS run. HLA typings of MS runs were determined by manually searching the publications corresponding to each MS run. For 4,879 MS runs, the full four-digit HLA typing could be retrieved.

Categorizing HLA peptides

Starting with the 865,922 peptides from the Human HLA PeptideAtlas 2023-11 build, 99 peptides starting with LLLLLLL, PPPPPPP or QQQQQQQ were filtered out. Mappings to entries starting with DECOY, CONTRIB_smORFs_Cui, CONTRIB_sORFs, CONTRIB_Fedor, CONTRIB_Bazz, CONTRIB_HLA, CONTRIB_GENCODE_nearcognate were ignored. All peptides with a length of at least 8 amino acids, mapping to UniProtKB/Swiss-Prot entries with at most 30 distinct mappings were considered to be derived from canonical proteins. These criteria are less strict than those used by PeptideAtlas to avoid mapping canonical protein derived peptides to ncORFs. Peptides with a length of at least eight amino acids, mapping to ncORFs and not canonical proteins, with at most ten distinct mappings were considered to be derived from ncORFs. For peptides with mappings against multiple ncORFs, one ncORF was selected based on the first one alphanumerically. All remaining peptides (those not assigned to canonical proteins or ncORFs) were put in the ‘other peptides’ category. This ‘other peptides’ category contains mainly peptides that map to more than 30 canonical proteins, but also includes among others peptides mapping to more than 10 ncORFs and peptides with a length of 7 amino acids.

ncORF expression in cancer tissues

To determine whether ncORFs were preferentially expressed in cancer or non-cancer tissues, each ncORF peptide was categorized to originate exclusively from MS runs from cancer samples, exclusively from MS runs from non-cancer samples or from both. Moreover, each ncORF (and corresponding peptides) was classified to originate from a cancer gene based on the Cancer Gene Census genes (accessed 4 January 2024)⁷⁰.

ncORF expression in enriched ubiquitination datasets

Public enriched datasets for ubiquitination generated from human cell lines using Thermo Scientific instruments were manually selected from the PRIDE database: PXD003936, PXD006201, PXD019692, PXD019854, PXD020909, PXD022367, PXD023218, PXD023889, PXD025890, PXD027328 and PXD037009. Datasets were reanalysed independently. The search database included the UniProt human reference proteome (one protein per gene, downloaded in April 2024), the 7,264 GENCODE Ribo-seq ORFs and cRAP protein sequences as a contaminants database. Peptide and protein identification was performed using the Comet search engine (v.2024). Default parameters were applied, with the following exceptions: semitryptic digestion, missed cleavages were set to 4 and variable modifications included ubiquitination (Gly–Gly enrichment method), oxidation of methionine and N-terminal protein acetylation. A reversed decoy database was also used to estimate the FDR. Postprocessing of the results was performed using PeptideProphet, iProphet and PTMProphet from the TPP (v.7.1.0). False-localization rates were estimated using a decoy amino acid approach (alanine)⁷¹.

HLA binding predictions

Binding predictions were performed using NetMHCpan (v.4.1)^72,73. Predictions were done for MS runs with a known four-digit HLA typing. For nine MS runs with A24:01, B43:01 or C12:01 as one of the alleles, no predictions could be made because these alleles were not known to NetMHCpan. Supplementary Table 8 shows an overview of MS runs for which binding predictions were made. Peptides were predicted to bind to an allele if the rank score was smaller than or equal to 2. If the HLA typing of an MS run consisted of multiple alleles, the peptide was assigned to the allele with the lowest predicted rank score, irrespective of whether this rank score was smaller than 2 or not.

Detectability determinants

Canonical proteins were categorized as detected and undetected based on whether they were detected by a single peptide. Canonical proteins shorter than 16 amino acids and proteins with amino acid symbol ‘U’ in their sequence were filtered out. ncORFs sequences were categorized similarly to the canonical proteins. Contrary to most other analyses, peptides were not exclusively assigned to a single ncORF, due to which the number of detected ncORFs was larger than in Extended Data Fig. 4b. For the ncORF analysis taking into account only the first or last 30% of the sequence, the requirement was that this 30% was again 16 amino acids long. Significance was determined using a two-sided Wilcoxon rank-sum test.

Hydrophobicity analysis

For the hydrophobicity analysis, all sequences were aligned by the C terminus. Starting at that position and moving towards the N terminus, the average hydrophobicity of the 15 previous amino acids across the sequences was determined. For every position, only sequences long enough to still contain 15 amino acids before the position were taken into account. A line was fit through measurements using local polynomial regression fitting. 95% confidence intervals were determined using a two-sided t-test.

Expression analysis

To compare the expression of detected and undetected ncORFs, we used data from GTEx⁷⁴. The mean FPKM of all genes per tissue (excluding testis) was used. Tissues from the same organ (for example, all brain-derived tissues) were grouped together. For each ncORF, the expression was determined using the gene IDs. Contrary to most other analyses, peptides were not exclusively assigned to a single ncORF, due to which the number of detected ncORFs was larger than in Extended Data Fig. 4b. However, for 326 ncORFs the associated gene ID was not present in GTEx, so these were excluded.

Tissue comparison

For comparing the expression of ncORFs in tissues, the data from the HLA Ligand Atlas (PXD019643) was used²². Tissue names were extracted from the MS run file names. For each tissue, the number of distinct ncORF and CDS peptides was determined, as well as the percentage of ncORF peptides. Statistical significance was determined using multiple Fisher’s exact tests and Holm–Bonferroni multiple testing correction. Gene expression levels were determined using mean FPKM values per gene across tissues from GTEx⁷⁴. Only genes that expressed ncORFs in the HLA Ligand Atlas that were present in GTEx were considered. A selection of GTEx tissues that showed resemblance to the HLA Ligand Atlas tissues was used. Resemblance was based on the similarity of the HLA Ligand Atlas and GTEx tissue names.

Analysis of Ribo-seq data

We manually inspected Ribo-seq data for 183 ncORFs with at least one peptide nominated in the non-HLA build and 699 ncORFs with at least one peptide nominated in the HLA build. We used the GWIPS-viz browser⁷⁵ to assess evidence of ncORF translation with a publicly accessible web portal that enables the research public to examine our assessment of these ncORFs independently. The GWIPS-viz parameters were as follows: the elongating ribosomes (A-site) with the global aggregate track on ‘full’, which reflects native Ribo-seq; and the initiating ribosomes (P-site) with the global aggregate track on ‘full’, which represents Ribo-seq signal from ribosomes enriched at initiation sites. We independently evaluated the native Ribo-seq and ‘initiation’ Ribo-seq data. For each of these data types, we classified the data as insufficient, sufficient or excellent for supporting the translation of a given ncORF. A given ncORF was considered to be verified at the level of Ribo-seq data if either the elongating ribosome or initiating ribosome track data were sufficient or excellent. We defined excellent if there were four sequential clearly identified Ribo-seq peaks in-frame within the first 100 nucleotides of the ncORF. We defined sufficient if there were three sequential clearly identified Ribo-seq peaks in-frame within the first 100 nucleotides of the ncORF. We defined insufficient if there were not clearly sequential in-frame reads. We additionally collated selected ncORFs in the GENCODE set that were first identified in refs. ^1,76 but considered insufficient in the GWIPS-viz database. As GWIPS-viz does not include the data for these two studies, we evaluated the raw data for these selected ncORFs in the primary datasets and categorized their support according to these data. We additionally calculated the percentage of in-frame ribosome footprints (PIF) and uniformity of ribosome coverage for each of the ncORFs supported by one or more peptides in each PeptideAtlas build, as observed in the human body map.

Subsequently, a second phase of manual inspection was conducted on the subset of candidates that were assigned tier 1A supporting evidence. The smaller scale of this second phase effort enabled a more nuanced assessment of Ribo-seq support for these candidates where, alongside GWIPS-viz, we explored each candidate in the transcriptome browser, Trips-Viz⁷⁷, and the more richly populated RiboCrypt (RiboCrypt.org), which contains over 3,600 ribo-seq libraries (available in the RiboSeq.Org portal)⁷⁸ aligned to the human genome. Both additional browsers apply dynamically calculated read-length-dependent offsets leading to periodic signals in the profiles that is clearer than that in GWIPS-viz where fixed offsets are applied for all lengths. These assessments looked to validate the support for the translation initiation and termination sites for each candidate while accounting for transcript isoform complexity and regions of poor genomic mappability. Ribo-seq area plots were obtained using RiboCrypt with the sliding-window function calculating moving average of length 9 of P-site coverage to make informative concise graphics. Zoomed-in regions were obtained with the ‘columns’ setting and no sliding window to maintain single-nucleotide resolution. Through this assessment, higher confidence can be placed in the supporting translation evidence for these tier 1 annotations.

Cross-species comparison of GMCL1 translation

To inspect the translation of GMCL1 uORF (c2riboseqorf47) and its downstream CDS across mammals, we analysed a total of 35 Ribo-seq datasets from the cardiac left ventricle, along with their corresponding RNA-seq datasets, across five mammalian species^1,79. Datasets were aligned to the appropriate Ensembl genomes and transcriptomes (release 98), including human (GRCh38/hg38), chimpanzee (Pan_tro_3.0/panTro5), macaque (Mmul_10/rheMac10), mouse (GRCm38/mm10) and rat (Rnor_6.0/rn6). Mapping was performed using STAR (v.2.7.3a)⁸⁰ with the standard settings and the following modified parameters: –outSAMtype BAM SortedByCoordinate, –outFilterMismatchNmax 2, –outFilterMultimapNmax 20, –alignSJDBoverhangMin 6, –alignSJoverhangMin 500, –outFilterType BySJout, –limitOutSJcollapsed 10000000, –limitIObufferSize 300000000 and –outFilterIntronMotifs RemoveNoncanonical. Later, RiboseQC⁸¹ was used to extract P-site positions from the Ribo-seq data. Ribo-seq reads, RNA-seq coverage, and in-frame P-site distributions over the GMCL1 locus were visualized using the UCSC Genome Browser⁸² for the five selected mammalian species.

Use of the tier classification system

We used the tier classification system for ncORFs initially proposed previously²⁵. Specifically, ncORFs were given an initial or provisional tier based on the information available from the large-scale MS search. After manual review of the nomination data, ncORFs were then assigned a final tier. Tiers were defined as follows:

Tier 1A: two non-nested peptides in MS proteome data, with or without HLA immunopeptidomics data, with Ribo-seq data
Tier 1B: two non-nested peptides in HLA immunopeptidomics data with Ribo-seq data
Tier 2A: one peptide in MS proteome data, with or without HLA immunopeptidomics data, with Ribo-seq data
Tier 2B: one peptide in HLA immunopeptidomics data with Ribo-seq data
Tier 3: any HLA immunopeptidomics and/or tryptic proteome LC–MS/MS evidence without Ribo-seq evidence
Tier 4: Ribo-seq evidence without proteomic evidence
Tier 5: in silico prediction of an ORF on an expressed transcript without any Ribo-seq or proteomic evidence.

MLP classifier model

A dataset comprising 677 ncORF peptide sequences of 9 amino acids, each annotated with 22 attributes, was used to develop a multilayer perceptron (MLP) classifier model. The implementation was carried out using Python 3 and the following software libraries: pandas, numpy, matplotlib and scikit-learn (v.1.5/1.6). The dataset was processed by separating the features from the target variable. The data were then split into training and testing sets, with 80% allocated for training and 20% for testing. To ensure reproducibility, the random state was set to 42 during the split. Before model fitting, the features were standardized using StandardScaler. This preprocessing step involved removing the mean and scaling the features to have unit variance, thereby normalizing the data. The MLP classifier model was initialized with a maximum of 8,000 iterations and a random state of 42 to ensure reproducibility. The model was then subjected to hyperparameter tuning using grid search with cross-validation. The hyperparameters explored included: hidden layer sizes: (280); activation function: ‘tanh’; and regularization parameter (alpha): 0.01. Grid search with cross-validation was used to systematically evaluate the performance of various hyperparameter combinations and identify the optimal configuration. The best-performing model, as determined by the grid search results, was selected and fitted to the training data. Subsequently, this model was used to make predictions on the test set, which had not been seen by the model during training. The performance of the model was assessed using standard evaluation metrics to determine its predictive capabilities.

TensorFlow Keras model

Due to 1,785 ncORFs being detected while 5,479 remain undetected, presenting an approximate ratio of 1:3, a balanced weight for imbalanced dataset was used to address the imbalance and a neural network analysis to build, train and evaluate a TensorFlow Keras model (v.2.18.0). The dataset included 7,264 ncORF amino acid sequences with a selection of 43 attributes. Using Python 3, we imported the necessary libraries including TensorFlow, Keras and various components of TensorFlow and Keras for building and evaluating the model. Using the train_test_split function from scikit-learn, we allocated 80% for training and 20% for testing after separating the training and testing sets from the target variable. The features were standardized using the StandardScaler. A sequential model consisting of multiple layers was built with an input of 16 neurons, ReLU activation, and L2 regularization. To prevent overfitting, we added batch normalization and dropout layers after each hidden layer. The output layer consisted of a single neuron with sigmoid activation for binary classification. We compiled the model using the Adam optimizer with a learning rate of 0.001, binary cross-entropy loss function, accuracy as the metric and it was trained on the training data. During training, it was run for a total of 60 epochs with a batch size of 12. The target variable for the test set was predicted by the model.

Evolutionary conservation and constraint (ORBL)

To compute the ORBLv conservation score of an ORF for the placental and primate clades, we extracted the local alignment of the ORF from the 120 mammal whole-genome alignments with the hg38 reference⁸³ downloaded from https://bds.mpi-cbg.de/hillerlab/120MammalAlignment/Human120way/, and restricted to the 116 placental mammal or 26 primate subsets, respectively. ORBL scores for additional clades, as included in Supplementary Table 11, used whole-genome alignments of 100 vertebrates (downloaded from https://hgdownload.gi.ucsc.edu/goldenPath/hg38/multiz100way/), 470 mammals (downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz470way) and 447 mammals (downloaded from https://cgl.gi.ucsc.edu/data/cactus/447-mammalian-2022v1.fix2.maf.gz). We considered the start and stop to be conserved in a species if the bases (possibly more than three) aligned to the start and stop of the ORF in the reference species included three consecutive bases that were ATG or any stop codon, respectively. We considered the reading frame to be conserved if the aligned bases between the aligned ATG and aligned stop were a multiple of three in total (even if there were insertions or deletions that shifted part of the reading frame), and did not include any in-frame stop codons or gaps in the alignment. We considered the ORF to be conserved in a species if the start, stop and reading frame were conserved. We defined ORBLv to be the phylogenetic branch length of the conserved species divided by the branch length of all 116 placental or 26 primate species in the whole-genome alignments (not just the ones present in the local alignment). In Extended Data Fig. 7b,c plotting ORBLv distributions of ncORFs and MANE Select CDS, we excluded selenoproteins and MANE Select CDS having non-ATG starts, as these would not be considered ORFs by our definition.

As our analysis of evolutionary constraint is strongly influenced by overlap with CDS, we redid the biotype determination for the ncORFs using GENCODE v42 annotations (the original biotypes from ref. ⁴ used v35), and applied strict criteria to determine ORFs with a ‘pure’ biotype (details are provided in the Supplementary Results).

There were 448 of the 7,264 ncORFs that did not satisfy any of these criteria, for example, ORFs that overlapped CDS from two different transcripts in different reading frames; we considered these to have ‘mixed’ biotype and did not compute an ORBLq constraint score for them.

The list of ‘untranslated’ ORFs from which matched ORFs were chosen for the computation of ORBLq was created as follows. We began with every ATG-initiated ORF of any length in any protein-coding or lncRNA transcript in GENCODE v42 that did not overlap a protein-coding CDS of any transcript in the same frame. We did not require them to be maximal, so, for example, if an ORF included a downstream ATG, the list would include another ORF starting at that downstream ATG and ending at the same stop codon. We then excluded ORFs that did not have a ‘pure’ biotype as defined in the Supplementary Results; that overlapped one of the 7,264 ncORFs, an mRNA on the opposite strand, or a pseudogene on either strand; or was a dORF that overlapped some 5′-UTR. We segregated uoORFs, intORFs and doORFs by the frame in which they overlapped the main ORF (+1 or +2) as constraint on the main frame amino acid sequence imposes different ORFness constraint on the two overlapping frames. We were left with 1,717,927 untranslated ORFs. The counts by biotype were: uORF (63,795), uoORF+1 (3,231), uoORF+2 (2,940), intORF+1 (320,823), intORF+2 (60,135), doORF+1 (16,910), doORF+2 (5,946), dORF (653,983) and lncRNA-ORF (590,164). We then computed the ORBLv conservation score for each of these untranslated ORFs.

To compute ORBLq of an ncORF, we selected a matched set of at least 1,000 untranslated ORFs having the same biotype as the ncORF (and the same frame of overlap for uoORFs, intORFs and doORFs), starting with the untranslated ORFs of the same length as the ncORF. If there were fewer than 1,000 of the same length and biotype (which was common for longer ncORFs and ones having less frequent biotypes such as uoORF), we added in all ORFs of the same biotype having length one more or one less than the ncORF, length two more or two less, and so on, until there were at least 1,000 matched ORFs. We then estimated the probability that a similar untranslated ORF would have at least as much conservation as our ncORF, a kind of P value, as

$$P=\,\frac(\rmN\rmu\rmm\rmb\rme\rmr\,\rmo\rmf\,\rmm\rma\rmt\rmc\rmh\rme\rmd\,\rmO\rmR\rmF\rms\,\rmh\rma\rmv\rmi\rmn\rmg\,\rmO\rmR\rmB\rmL\rmv\ge \rmO\rmR\rmB\rmL\rmv* )+1\rmN\rmu\rmm\rmb\rme\rmr\,\rmo\rmf\,\rmm\rma\rmt\rmc\rmh\rme\rmd\,\rmO\rmR\rmF\rms$$

where ORBLv* is the ORBLv of our ncORF, and the pseudocount of 1 is added to prevent P values of 0 (it was not added if the result would be more than 1). We then calculated the ORBL quantile, ORBLq, as 1 − P.

For some analyses, we used an information content measure,−log₁₀[1 − ORBLq], rather than ORBLq itself, to linearize values near 1.

Lentiviral transduction for CRISPR screens

Optimal infection conditions were determined in each cell line to achieve around 30–50% infection efficiency, corresponding to a multiplicity of infection (MOI) of about 0.3–0.5. Final transductions were performed in 12-well plate format with 3 × 10⁶ cells per plate using a polybrene concentration of 4 μg ml⁻¹ and different virus volumes to achieve the desired MOI. Approximately 24 h after infection, cells were trypsinized and approximately 2 × 10⁵ of A375 and A673 cells, 3 × 10⁵ of CADO-ES1, Jurkat and THP-1 cells, and 1 × 10⁶ of HepG2 and RD-ES cells from each infection were seeded in 12 wells of 6-well plates, each with complete medium, one supplemented with 1 μg ml⁻¹ of puromycin for A673, Jurkat, RD-ES and THP-1 cells, 1.5 μg ml⁻¹ of puromycin for A375 and HepG2 cells, and 3 μg ml⁻¹ of puromycin for CADO-ES1 cells. Cells were counted 4–5 days after selection to determine the infection efficiency, comparing survival with and without puromycin selection. Volumes of virus that yielded around 30–50% infection efficiency were used for screening.

CRISPR screening

We selected 2,196 ncORFs from the GENCODE Phase I catalogue⁴ and 1,245 ncORFs from a previous ncORF gRNA library⁷. ncORFs were selected according to the following rules: (1) there are a maximum of three ncORFs per gene selected. Exclude all intORFs as well as doORFs and uoORFs that have ≥25% overlap with the main CDS. (2) Minimum ncORF size of 12 amino acids to enable sufficient gRNA targeting sites. (3) ncORFs with <4 predicted targeting gRNAs were excluded. (4) The overall ncORF distribution in the final library was as follows: tier 1A (n = 4), tier 1B (n = 224), tier 2A (n = 13), tier 2B (n = 373), tier 3 (n = 13), tier 4 (n = 34) and no tier (n = 1,535).

The lentiviral barcoded library contained 27,464 sgRNAs, which were designed using the CRISPick program (https://portals.broadinstitute.org/gppx/crispick/public) from the Broad Institute Genomic Perturbation Platform, using settings for the reference genome Human GRCh38 (Ensembl v.108) for ‘CRISPRko’ with enzyme ‘SpyoCas9 (NGG)’ with the following modifications: (1) for ncORFs with ≥2 exons, all gRNAs could not come from the same exon. (2) A target of 6 gRNAs per ncORF were designed, and a target of 3 gRNAs per parental CDS were designed. (3) The spacing requirement for gRNA separation was reduced to 1% across the total target length for ORFs and maintained at 5% for parental CDSs. (4) A 2:1 on-target to off-target ratio was used. (5) gRNAs with ≥5 predicted mapping sites to the human genome were removed. (6) A previously published⁷ set of 471 pan-essential gRNAs, 503 non-targeting gRNAs without genome cutting and 497 non-targeting gRNAs with genome cutting were included.

Lentiviral infections were performed in biological triplicate with sufficient cells to achieve a representation of 500 cells per gRNA after puromycin selection. Then, 24 h after infection, cells were selected with 1.5 μg ml⁻¹ of puromycin for 7 days, and then approximately 1–2 × 10⁷ cells were collected for assessing the initial abundance of the library. Cells were passaged every 3–4 days and collected about 14 days after infection. For all genome-wide screens, genomic DNA (gDNA) was isolated using Midi or Maxi kits for the validation screens gDNA was isolated using Midi kits according to the manufacturer’s protocol (Qiagen). After PCR amplification of barcodes, the samples were sequenced on the HiSeq2000 or NextSeq (Illumina) system. For analysis, the read counts were normalized to reads per million and then log₂ transformed. The log₂[FC] of each sgRNA was determined relative to the initial timepoint for each biological replicate.

Analysis of CRISPR screening data: sgRNA mapping and hit calling

We began by standardizing the sgRNA nomenclature across each library. For sgRNA targeting locus assignment, we used a two-step computational approach to ensure accurate and comprehensive mapping of guide RNAs to their genomic targets. Initially, Bowtie2 v.(2.5.4) performed sequence alignment using stringent parameters (N=0, gbar=1, ma=2, mp=0,0), optimized for short RNA sequences. This allowed for up to 100 potential mapping locations per sgRNA while maintaining high specificity. Subsequently, SAMtools (v.1.20) and BEDTools (v.2.31.1) processed these alignments to generate genomic coordinates and intersect them with gene annotations. These mapping results were then integrated with GENCODE v45 gene annotations and GENCODE Phase 1 ORF definitions to create a harmonized dataset. During this integration, sgRNAs mapping to more than five genomic coordinates were filtered out, and ORFs with fewer than three guides were excluded, yielding the final mapping files. For multi-ORF sgRNAs, a hierarchical naming scheme prioritized primary targets, while shared transcript annotations were preserved.

We removed sgRNAs with low counts less than 3 s.d. below the total counts. Additional quality-control measures included evaluating replicate consistency and log-transformed fold change. Chronos (v.2.0.8) was used to process the raw counts from the CRISPR screen, incorporating copy-number variation (CNV) data. CNV information was retrieved from https://depmap.org; if not accessible, the CNV value was set to 1 (ref. ⁸⁴). To assess the quality of the CRISPR screen, the mean median absolute deviation (MMAD) was calculated for each library, measuring the distance between positive and negative controls. Chronos outputs fitness scores for targeted loci, with loss-of-function hits defined by scores greater than 0.5 or less than −0.5.

For more stringent hit selection, CRISPR screen results were cross-validated with RNA-seq and ribosome profiling data for the 8 overlapping cell lines. After data processing, only genes with expression levels ≥5 TPM in Ribo-seq and ≥10 TPM in RNA-seq were retained, reducing the likelihood of off-target effects and refining functional hit calls.

To further dissect the functional relevance of ncORFs, we computed normalized phenotypic effects by subtracting the Chronos score of the associated canonical coding sequence (CDS) from that of the ncORF. This normalization was performed only for ncORFs shown to be co-transcribed with a canonical ORF, ensuring that shared transcript context was preserved in the interpretation of fitness effects

Meta-analysis of CRISPR data

Three independent CRISPR screens (refs. ^7,28 and the current dataset generated for this study) were integrated by first mapping sgRNA sequences to the GRCh38 reference genome. Target annotations were derived using GENCODE v45 augmented with phase 1 ncORF annotations (https://www.gencodegenes.org/pages/riboseq_orfs/). sgRNAs mapping to more than five genomic loci or exhibiting poor alignment quality were excluded from further analysis. This was then used as the mapping design. Raw counts data were obtained from the original publications. Gene-level essentiality scores were calculated using Chronos⁸⁴. CNV data for each cell line were retrieved from depmap (release 25Q2). And for missing lines CNV was set to 1. Genes with a Chronos score below –0.5 were considered to be significant hits, following the authors’ recommended threshold. ORFs where 60% of the samples show essentiality were considered pan-essentiality.

Comparison of CRISPR–Cas9 with Cas13 data

The CRISPR–Cas13 dataset was retrieved from ref. ⁸⁵. Among the available cell lines, THP-1 was the only line overlapping with the Cas9 screens used in this study and was therefore selected for comparative analysis. Only targets shared between the Cas9 and Cas13 datasets were considered. For both platforms, normalized gene-level depletion scores were used to assess concordance, focusing on overlapping protein-coding and non-coding transcripts. Comparisons were restricted to high-confidence targets based on consistent guide performance and coverage in both datasets.

Analysis of CRISPRa data

Human CRISPRa Calabrese (P65 HSF) Pooled Libraries data were retrieved from ref. ³¹. Screens from Meljuso and A375 cell lines were reprocessed by first removing low-representation targets. Using the raw counts from the original manuscript and a harmonized mapping dictionary, fitness scores were calculated using Chronos⁸⁴.

CRISPR tiling analysis and functional enrichment score for ncORF

Two CRISPR tiling screens^7,28 were reprocessed by first standardizing sgRNA nomenclature across libraries. log₂[FC] values were calculated by comparing the late-timepoint to the input plasmid DNA or earliest time point. For each cell line, gRNA with low representation were excluded, then library depth was corrected using CPMs, using the geometric mean of guides targeting positive control genes (that is, pan-essential genes). Normalized log₂[FC] = (gRNA log₂[FC]/mean positive control log₂[FC]) × −1

This scaling anchors positive control dropout to –1, allowing consistent interpretation across experiments. We then integrated genomic mapping data (coordinates and annotations) with normalized log₂[FC] values to evaluate ncORF essentiality. ORF-level effects were denoised using empirical Bayes shrinkage in the ashr package (v.2.2.63), fitting a global prior distribution and shrinking individual ORF estimates accordingly. For ORFs with >6 sgRNAs, we applied median absolute deviation (MAD) filtering, removing guides with log₂[FC] values > 3 MAD units from the median to reduce the impact of outliers. To assess significance, we used a gene-specific permutation-based null model accounting for gene-level background of loss of effect. For each ncORF with n guides, we generated null distributions by resampling n guides from all guides targeting the same parental gene. Summary statistics (mean or median log₂[FC]) were calculated for each permutation (typically 1,000–4,000 iterations). One-sided P values were calculated as P = (1 + number of null statistics ≥ observed statistics)/(1 + B), where B is the number of permutations. These were then converted to two-sided P values using P_two-sided = 2 × min(P_one-sided, 1 − P_one-sided). This approach accounts for guide count variability and gene-specific effects, enabling robust detection of essential ncORFs within the hierarchical structure of overlapping gene models.

Pooled c10riboseqorf92 knockout

Pooled knockout screens in the PRISM cell line set were performed as previously described²⁸. This approach uses 486 barcoded human cancer cell lines, which are pooled and grown together in RPM1640 medium supplemented with 10% FBS. gRNAs used were non-cutting LacZ control (AACGGCGGATTGACCGTAAT), cutting control Chr2-2 (GGTGTGCGTATGAAGCAGTGG), c10riboseqorf92 1 (ACAGGGCACTGGTCTCCCAA) and c10riboseqorf92 2 (CAAGGCTGTATATTTCACCT). For pooled screening, 400,000 pooled cells per well were plated in a 6-well plate with a cell pellet collected for a no infection control. Then, 24 h later, cells were subjected to lentiviral infection with gRNA and Cas9 using an all-in-one plasmid. A lentiviral MOI of 10 was used, and transduction was performed with 4 μg ml⁻¹ polybrene. On day 4, the cell culture medium was changed to include 1 μg ml⁻¹ puromycin for 72 h, after which antibiotic-free medium was used. Cells were then passed every 72 h and a cell pellet (2 × 10⁶ cells) was collected for DNA on days 6, 10 and 15. Genomic DNA extraction of cell pellets was performed using the DNA Blood and Tissue Kit according to the manufacturer’s instructions (Qiagen). Cell line representation was determined by amplifying barcodes with universal primers. PCR products were pooled and purified with AMPur beads (Beckman Coulter). The DNA concentration was measured using Qubit fluorometric quantification (Thermo Fisher Scientific). DNA was sequenced on the NovaSeq (Illumina) system at the Genomics Platform at the Broad Institute.

Analysis of pooled c10riboseqorf92 knockout data

At day 15/20, 471 out of 486 cell lines were detectable and were used for data analysis. Cell line abundance was determined by RNA expression of each cell line’s barcode using RNA-seq as previously described²⁸. log₂-transformed fold changes in abundance were calculated by comparing each cell line’s day 15 abundance to that of the input plasmid pool. Replicates from days 15 and 20 were averaged. To identify outliers, linear regression was performed in R (v.4.4.2) to model sgRNA2 viability as a function of sgRNA1 viability at day 15, and residuals were computed from the fitted model. Cell lines with absolute residuals exceeding two standard deviations from the mean were flagged as outliers and excluded, along with those containing missing values (NaN). To assess phenotypic similarity between c10riboseqORF92 knockout and genome-wide perturbations, we correlated its viability profile with CRISPR screening data from the DepMap project (release 25Q2). Gene-level dependency scores were retrieved for all genes and cell lines, and only those overlapping with the c10riboseqORF92-KO dataset were retained. Spearman correlation coefficients were then computed between the c10riboseqORF92 profile and each gene’s profile across the matched cell lines.

siRNA and overexpression experiments

Four siRNAs targeting distinct regions of OLMALINC (siRNAs 10, 11, 12 and 13) were tested in A375 and A549 cell lines to evaluate knockdown efficiency. Cells were seeded into six-well plates and transfected when 80% of confluence was reached, using the Mirus TransIT-X2 transfection kit according to the manufacturer’s instructions. At 48 h after transfection, total RNA was extracted using the Zymo Total RNA Isolation Kit. The knockdown efficiency was assessed by quantitative PCR (qPCR) and analysed using QuantStudio Design & Analysis Software (v.1.6.1), using GAPDH and ACTB as the housekeeping genes. Expression levels of OLMALINC and the associated c10riboseqORF92 were quantified. The same transfection and RNA isolation protocol was used for generating the samples used in bulk RNA-seq analysis. qPCR primers were as follows: OLMALINC forward, AGGAACATCTTGCCAATTTCA; OLMALINC reverse, TGTGGATCTTCAGTTGCTTCA; GAPDH forward, TGCACCACCAACTGCTTAGC; GAPDH reverse, GGCATGGACTGTGGTCATGAG; ACTB forward, AAGGCCAACCGCGAGAAG; ACTB reverse, ACAGCCTGGATAGCAACGTACA; P1 sense primer for plasmid expression, TCTTGTGGAAAGGACGA; P2 antisense primer for plasmid expression, TTAAAGCAGCGTATCCACATAGCGT.

To assess the impact of OLMALINC knockdown on cell proliferation, 4 replicates of 2,500 cells per well were seeded into 24-well plates and transfected with siRNAs as described above. Cell proliferation was monitored in real-time using the Incucyte Live-Cell Analysis System. Images were acquired at 8 h intervals, and the cell confluence (percentage surface area occupied) was measured over time. The culture medium was refreshed every 24 h throughout the experiment. Proliferation curves were generated using the confluence data exported from the Incucyte software. Growth curves were fitted in R using the growthcurver (v.0.3.1) package, which models logistic growth and computes the AUC as a summary metric of proliferative capacity. AUC values were used for statistical comparison of proliferation across treatment conditions.

Bulk RNA-seq and data analysis

Cell lines were cultured in six-well plates until around 80% confluence. Cells were collected by gentle scraping, washed with 1× PBS, and pelleted by centrifugation. Cell pellets were lysed in RIPA buffer (Pierce 8900) supplemented with 1× protease inhibitor cocktail (cOmplete Mini, EDTA free, Roche) and homogenized on ice for 15 min. Total RNA was extracted using the Zymo Research Direct-zol RNA Miniprep Kit (R2052) according to the manufacturer’s protocol. The RNA concentration and purity were initially assessed using a NanoDrop spectrophotometer, and preliminary quality control was performed to assess potential contamination.

RNA-seq libraries were prepared by the University of Michigan Advanced Genomics Core using a poly(A) enrichment protocol. The fragment size distribution was evaluated using the Agilent 2100 Bioanalyzer (Agilent Technologies). Libraries were sequenced on either the Illumina NovaSeqX platform or Element Biosciences AVITI24 system, generating 300-cycle paired-end reads. RNA-seq count tables were obtained using Nextflow (v.25.04.2; https://www.nextflow.io), based on nf-core/rnaseq v.3.1.19. Raw FASTQ files were quality-checked with FastQC (v.0.11.9) and adapter trimming was performed using Trim Galore (v.0.6.7). Reads were aligned to the human reference genome (GRCh38.p14) using STAR, with the –twopassMode Basic option enabled. Gene-level and isoform-level expression quantification were performed using RSEM (v.1.3.3), referencing GENCODE v45 gene annotations. Differential gene expression analysis was conducted using DESeq2 (v1.46.0) in R (v.4.4.2). A multifactorial design formula was used to model the main effects of treatment (siCtrl versus siOLMALINC knockdown), genetic background (c10riboseqorf92 wild-type versus overexpression) and their interaction: design = ~ treatment + background + treatment:background. Genes with an adjusted P value (Benjamini–Hochberg FDR) < 0.05 and an absolute log₂[FC] > 0.5 were considered significantly differentially expressed. log₂[FC] shrinkage was applied using the apeglm method. Functional enrichment analysis was performed using the clusterProfiler R package (v.4.14.6), querying the MSigDB Hallmark gene sets (v2025.1.Hs). Enrichment was considered significant at adjusted P < 0.05.

Multiplexed single-cell transcriptional response

In total, 21 human cell lines expressing SpCas9 were individually cultured in RPMI-1640 medium supplemented with 10% FBS. Before lentiviral transduction, cell lines were grouped into two pools based on doubling times to ensure balanced growth dynamics. Each pool was transduced with four sgRNAs: a non-targeting cutting control (Chr2-2: GGTGTGCGTATGAAGCAGTGG), two targeting c10riboseqorf92 (sg 1: ACAGGGCACTGGTCTCCCAA; sg 2: CAAGGCTGTATATTTCACCT), and a positive control targeting KIF11 (CAGTATAGACACCACAGTTGG). Each virus was applied at four concentrations to empirically approximate a multiplicity of infection (MOI) of 1. Then, 24 h after infection, cells were selected with puromycin (5 µg ml⁻¹), and the selection medium was refreshed every 48 h. On day 7 after infection, cells were trypsinized; both adherent and suspension fractions were combined, pelleted and resuspended in cell capture buffer. The two pools were then merged, preserving equal representation of each original cell line. Before single-cell library preparation, cell viability (>90%) and total counts were confirmed. The samples were processed by the University of Michigan Advanced Genomics Core using the 10x Genomics Chromium platform with GEM-X On-chip Multiplexing and 3′ Gene Expression v4 chemistry. Target recovery was 5,000 cells per sample (approximately 200 cells per condition), with a sequencing depth of 30,000 reads per cell. Libraries were sequenced on the Illumina NovaSeq system using 10B 300-cycle kits.

Raw sequencing data were processed using Cell Ranger v.9.0.1 with the multi function and aligned to the GRCh38 reference (refdata-gex-GRCh38-2024-A). Cells were filtered using MAD-based outlier detection (5 MADs for log-transformed total counts, log-transformed gene counts and percentage of counts in top 50 genes; 3 MADs for mitochondrial percentage), with a hard mitochondrial cutoff of 20%. Genes detected in fewer than 20 cells were removed. Ambient RNA contamination was corrected using SoupX. Raw counts were normalized to 10,000 counts per cell and log-transformed. Highly variable genes (n = 2,000) were selected using the cell_ranger flavor with batch-aware selection across sample IDs. Principal component analysis was performed using 50 components. Cell cycle scores (S and G2M) were computed using Regev laboratory gene sets. Cell line identity deconvolution was performed using a combination of packages: demuxalot and dropulation, leveraging SNP profiles curated from DepMap (release 25Q2) and CellLineProject and genotype-free approach using scSplit⁸⁶. Cell lines with under 100 cells were filtered out from downstream analysis.

Differential gene expression analysis was performed using a pseudobulk approach. Raw counts were aggregated by cell line and treatment condition using Seurat’s AggregateExpression function, yielding one pseudobulk profile per cell line–condition combination. Low-expression genes were removed using edgeR’s filterByExpr, and library sizes were normalized with trimmed mean of M-values. Differential expression was tested using limma-voom with a design matrix including condition and cell line identity as covariates (~0 + condition + scsplit_assignment). The contrast c10riboseqorf92 KO versus control was tested using moderated t-statistics with empirical Bayes shrinkage (eBayes, trend = TRUE). P values were adjusted using the Benjamini–Hochberg method; genes with adjusted P < 0.05 and |log₂[FC]| > 0.5 were considered significant. Gene set enrichment analysis was performed on log[FC]-ranked gene lists using gseGO from clusterProfiler against GO Biological Process terms (minGSSize = 5, maxGSSize = 500, Benjamini–Hochberg-adjusted P < 0.01).

Gene co-expression modules were identified using the hdWGCNA package in Seurat v5. Metacells were constructed by aggregating cells via k-nearest neighbours (k = 15, max shared = 12, minimum cells = 30), grouped by cell line and treatment condition in UMAP space, and subsequently normalized. Genes expressed in at least 5% of cells were retained for network construction. A signed weighted correlation network was built using a soft-thresholding power selected as the first value achieving a scale-free topology fit (R² ≥ 0.8), with a minimum module size of 30 and a merge cut height of 0.25. Module eigengenes were computed per cell line and condition. For the consensus analysis, metacells from all cell lines were pooled into a single expression matrix and a shared network was constructed. Differential module eigengene (DME) analysis was performed per cell line using the Wilcoxon rank-sum test comparing control to c10riboseqorf92 KO metacells; modules with adjusted P < 0.05 and |average log₂[FC]| > 0.5 were considered treatment responsive. Modules showing significant and directionally consistent responses in three or more cell lines were classified as conserved. Module–trait correlations were computed per cell line against binary trait indicators using ModuleTraitCorrelation. Over-representation analysis was performed on module gene sets using clusterProfiler against GO Biological Process, GO Molecular Function, KEGG, Reactome and MSigDB Hallmark gene sets (Benjamini–Hochberg-adjusted P < 0.05, q < 0.2).

Perturbation distances for each cell line were calculated relative to the unperturbed population (that is, sgRNA Ch2-2) using the definition for E-distance as described previously⁸⁷. Cells were projected into PCA space (15 components), and pairwise energy distances (e-distances) were calculated between all cell line–condition groups, defined as 2× the mean between-group distance minus the mean within-group distance. Same-cell-line comparisons were extracted to quantify the transcriptional shift of each perturbation relative to the non-targeting control (sgRNA Chr2-2) within each genetic background. Hierarchical clustering of the full e-distance matrix was performed using Ward linkage. Finally, consensus non-negative matrix factorization was applied to the harmonized expression matrix to identify latent transcriptional programs. The number of factors (K) was evaluated from 5 to 29, using 5,000 highly variable genes and 50 iterations per K value.

Multiplexed PRM MS of ncORF targets

HEK293, HeLa S3 and K562 cells were obtained from the American Type Culture Collection (ATCC) and checked for mycoplasma prior to assays. HEK293 and K562 cells were cultured in RPMI-1640 medium supplemented with 10% FBS and HeLa cells were cultured in DMEM medium supplemented with 10% FBS. Cells were grown to confluence and were washed in PBS, pelleted and frozen at −80 °C before use.

PRM sample preparation

From each cell line, two equal-sized cell pellets were processed in parallel using two different sample preparation protocols as follows:

Protocol 1: one frozen pellet per cell line was lysed in 8 M guanidine hydrochloride with Tris-HCl (pH 8.5). Cell lysates were incubated at 75 °C for 10 min and homogenized using three 1.4 mm ceramic beads in a Precellys 24 homogenizer (Bertin) at 8,500 rpm 9 times for 20 s with resting for 30 s between cycles. The lysates were centrifuged to remove debris and recover the supernatant. The protein content was determined using the bicinchoninic acid assay (Thermo Fisher Scientific). An aliquot corresponding to 100 µg of protein was reduced with 5 mM tris(2-carboxyethyl)phosphine (TCEP, 20 min, 37 °C) and alkylated with 15 mM iodoacetamide (25 min, room temperature, darkness), with shaking at 900 rpm on a Thermomixer (Eppendorf). The samples were diluted with 0.1 M tetraethylammonium bromide (TEAB, pH 8.5) to ensure a final guanidine hydrochloride concentration of ≤0.5 M. Proteins were digested with trypsin (Trypsin-Gold, Pierce Thermo Fisher Scientific, 4 µg, 37 °C, 900 rpm, 16 h), quenched with 5% formic acid (final concentration), and desalted by solid-phase extraction using Atlas columns (Tecan). In brief, the columns were equilibrated sequentially with 100% acetonitrile, 70% acetonitrile/0.1% trifluoroacetic acid (TFA) and 0.1% TFA in water. The samples were applied, then columns were washed with 0.1% TFA in water, and peptides eluted in 70% acetonitrile and 0.1% TFA. The eluates were dried under centrifugal evaporation (Savant, Thermo Fisher Scientific).

Protocol 2: a second frozen pellet per cell line was subjected to acetonitrile-based small protein extraction (modified protocol for acetonitrile depletion of large proteins^88,89). Frozen cell pellets were resuspended in ice-cold 50 mM NaCl. An aliquot corresponding to 1 mg of protein was precipitated with acetonitrile and TFA for a final concentration of 76% acetonitrile, 0.1% TFA, and 50 mM NaCl, and the samples were vortexed and incubated for 1 h at room temperature with agitation (1,400 rpm). The samples were centrifuged (18,000g, 20 min at room temperature), the supernatants transferred to fresh microcentrifuge tubes and dried under centrifugal evaporation (Savant, Thermo Fisher Scientific). Next, the samples were resuspended in 0.1 M TEAB (pH 8.5), reduced with 5 mM TCEP (20 min, 37 °C) and alkylated with 15 mM iodoacetamide (25 min, room temperature, darkness), with shaking at 900 rpm. Proteins were digested with trypsin (Trypsin-Gold, Thermo Fisher Scientific, 1 µg, 37 °C, 900 rpm, 16 h) on a Thermomixer (Eppendorf). The samples were quenched, desalted with Atlas columns and dried under centrifugal evaporation as described above.

Each of the two samples per cell line was dissolved in 5% acetonitrile, 0.1% formic acid in Milli-Q H₂O. An aliquot of each sample was spiked with ncORF heavy-labelled synthetic peptide analogues and iRT peptides (Biognosys) for MS analysis. Specifically, the samples were spiked so that 1 µg digested cancer cell line peptide mixture, 10 fmol synthetic peptide and 25 fmol iRT peptide were injected onto the column for sample analysis using the Orbitrap Astral mass spectrometer (Thermo Fisher Scientific) and 250 ng digested peptide mixture, 6 fmol ncORF heavy-labelled synthetic peptide and 5 fmol iRT peptide for sample analysis with the ZenoTof 8600 (Sciex).

Synthetic heavy-isotope-labelled peptide standards

For each ncORF peptide, a synthetic peptide analogue was individually chemically synthesized as free amine at the N terminus and carboxylic acid at the C terminus (Synpeptide). Cysteine residues were incorporated as carboxyamidomethylated cysteine building blocks, and one heavy-isotope-labelled amino acid per peptide was incorporated as the terminal arginine as R[13C6, 15N4] or lysine as K[13C6, 15N2]. Each peptide was dissolved in 50% acetonitrile/1% formic acid/Milli-Q water to a concentration of 2 mg ml⁻¹. Peptides were further diluted and individually analysed in 5% acetonitrile/0.1% formic acid/Milli-Q water, and peptides were pooled and then further serially diluted for spiking into the prepared cell line samples.

PRM MS data collection and analysis

PRM data were analysed manually for each peptide and each cancer cell line processed using protocols 1 and 2 using both an Orbitrap Astral (Thermo Fisher Scientific) and ZenoTOF 8600 (Sciex) mass spectrometer with the following conditions:

Orbitrap Astral: peptide concentrations were measured using the Pierce Quantitative Fluorometric Peptide Assay kit (Thermo Fisher Scientific, 23290). The samples were analysed with an Orbitrap Astral mass spectrometer (Thermo Fisher Scientific) interfaced with a Vanquish Neo UHPLC (Thermo Fisher Scientific) equipped with a C18 column EV1109 (Evosep) over a 20 min acetonitrile gradient. The instrument was run in PRM mode with the following parameters: automatic gain controlled at 50%, quadrupole isolation window width (IsoWidth) at 2 Th, maximum injection time for MS2 acquisition (MaxIT) at 20 ms, radio frequency lens voltage (RFLens) at 45, and collision energy (CE) at 28%. Data acquisition was performed using a precursor list targeting both the endogenous and synthetic heavy labelled peptides. The Xcalibur 4.3 acquisition software was used.

ZenoTOF 8600: The samples were analysed using the ZenoTOF 8600 system (Sciex) equipped with an OptiFlow Pro ion source with nano interface (Sciex) and an ACQUITY UPLC M-Class system (Waters) in nanoflow mode. The system was operated with 99.9% water, 0.1% formic acid (v/v, buffer A), and 99.9% acetonitrile, 0.1% formic acid (v/v, buffer B). Peptides were loaded for 15 min at 2% B (loop load) and separated on a 15 cm × 75 μm inner diameter, 1.7 µm C18 Aurora Elite column (IonOptiks) using a stepwise gradient with 15–105 min from 2% to 35% B, 105–108 min from 35% to 45% B, 108–109 min from 45% to 80% B, 109–114 min held at 80% B, 114–115 min from 80% to 2% B, held to 135 min at 2% B at 250 nl min⁻¹. The source parameters were as follows: nano interface temperature, 240 °C; nano nebulizer gas, 15 psi; curtain gas, 35 psi; nano spray voltage, 2,300 V. Data were acquired with a TOF-MS scan in the m/z range of 200–2,000 Da with 200 ms accumulation time and QJet DP of 70, followed by MRMHR (PRM) with a TOF-MS/MS scan in the m/z range of 100–1,750 Da, Q1 resolution of Unit (0.7 ± 0.1 Da), Zeno pulsing on (on-demand), Zeno threshold of 1 × 10⁶ cps using a MRMHR precursor list targeting the endogenous and synthetic heavy labelled peptides. Acquisition software SCIEX OS 4.0 was used.

For analysis and positive detection of the endogenous ncORF peptide above signal to noise Skyline (64-bit, v.25.1.0.142)⁹⁰ was used.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.