Spatial distribution of the proteome in the human body and in cancers

June 18, 2026

32

Collection of tissue samples

The autopsy samples involved in this study were collected from the body donation centre of Dalian Medical University and Shanghai Jiao Tong University. The post-mortem interval was within 10 h. The project was approved by the Ethics Committee of Dalian Medical University (institutional review board (IRB): Dalian Medical 2019-05) and Shanghai Jiao Tong University (IRB: 2022-A-01, 2021-03). Written informed consent was obtained from the legally authorized representatives for all donors. To ensure anatomical accuracy, we used (1) macrodissection by two experienced anatomists (included in the author list) following a standardized protocol; and (2) histopathological validation through H&E-stained formalin-fixed paraffin-embedded (FFPE) sections reviewed independently by two pathologists (included in the author list) for tissue types that might not be clearly partitioned at the anatomical level.

Paired tumours and non-tumour samples from surgery were collected from Harbin Medical University Cancer Hospital (from December 2019 to December 2022) and Huashan Hospital (from January 2018 to December 2018). The project was approved by the Ethics Committee of Harbin Medical University Cancer Hospital (IRB: KY2019-08, KY2023-03 and KY2019-16) and Huashan Hospital (KY2021-064). Written informed consent for each patient was collected. The samples mentioned above were stored in FFPE format. We systematically compared the cancer types, number and types of samples of our pan-cancer cohort with those in TCGA¹⁵ and CPTAC⁴⁹. According to the NCI Genomic Data Commons⁵⁰, our study includes two rare malignancies not covered by either TCGA or CPTAC: GIST and fallopian tube carcinoma. Conversely, TCGA and CPTAC include several cancer types absent in our current cohort, including adrenal cortical carcinoma, bladder cancer, leukaemia, mesothelioma, skin cancers and melanoma. As regards sample size, our study profiles approximately 40 patients per cancer type (Supplementary Table 1), whereas TCGA typically includes around 200 cases per type and CPTAC includes about 160. With respect to normal control tissues, TCGA mainly uses blood samples for germline genomic analysis, whereas CPTAC often includes matched adjacent non-tumour tissues (either fully or partially paired). In our pan-cancer cohort, we incorporated matched adjacent non-tumour tissues for most patients, allowing tumour–non-tumour comparisons at the proteome level.

Autopsy specimens of fetuses, from abortions, were collected from Shenzhen Baoan District Maternal and Child Health Hospital (during 16 June 2020 to 16 September 2020). The post-mortem interval was within 10 h. The project was approved by the Ethics Committee of Shenzhen Baoan District Maternal and Child Health Hospital (IRB: LLSCHY 2019-10-36). Written informed consent was obtained from all donors of tissues. The fetal samples were stored in formalin at 4 °C before use.

Sample preparation for LC–MS

The sample preparation process was similar to that described previously^51,52,53. In brief, we punched each FFPE sample and weighed about 1.0 mg. Heptane was used for dewaxing, gradient (100%, 90%, 75%) ethanol for hydration and 0.1% formic acid (FA) for acid hydrolysis. The obtained product was then placed in pressure cycling technology (PCT) microtubes, and Tris-HCl (pH 10, freshly prepared) was added at 95 °C for 30 min for alkaline hydrolysis. After rapid cooling, lysis buffer (6 M urea, 2 M thiourea), Tris(2-carboxyethyl)phosphine (TCEP) and iodoacetamide (IAA) were added for reductive alkylation. PCT-assisted lysis was performed in the Barocycler NEP2320-Enhanced (Pressure BioSciences) for 90 cycles at 30 °C, with each cycle consisting of 30 s at high pressure (45,000 psi/~310 MPa) followed by 10 s at ambient pressure (AP). LysC (enzymatic substrate concentration ratio 1:80, Hualishi Tech) and trypsin (enzymatic substrate concentration ratio 1:20, sequencing grade, Hualishi Tech) were then added for digestion. The process was conducted with PCT for 120 cycles at 30 °C, with each cycle consisting of 50 s at high pressure (20,000 psi/~138 MPa) followed by 10 s at AP. After the enzymatic hydrolysis process was terminated by 10% trifluoroacetic acid (TFA), we used C18 spin columns (The Nest Group) for desalination.

The dried samples were resuspended with buffer A (high-performance liquid chromatography (HPLC)-grade water, pH 10). The fractionation was performed using an UltiMate 3000 RSLCnano system (Thermo Fisher Scientific) with an HPLC C18 column (diameter 4.6 mm, length 25 cm, particle size 300 Å). The flow rate was 0.5 μl per min. The gradient was 5% to 95% buffer B (98% acetonitrile (ACN), pH 10) in 60 min. A total of 60 fractions were collected. Then, 60 fractionations were combined into 10. The combined samples were evaporated using vacuum centrifugation (CentriVap, Labconco) at 45 °C.

For bone tissues, we cut a piece approximately one-fifth of the volume of a 2 ml centrifuge tube. The teeth were shattered with a clean hammer and placed into the 2 ml tube. After the dewaxing and alkaline hydrolysis described above, we added lysis buffer and grinding beads (about one-third of the volume of the 2 ml tube). The sample was frozen in liquid nitrogen, ground in a tissue grinder for 1 min and refrozen in liquid nitrogen. This procedure was repeated three to five times until the sample turned into a bone slurry (except for the teeth). We then added 1 ml of 10% FA solution and incubated the sample at 4 °C overnight for digestion. Afterwards, approximately 4 mg of the insoluble wet weight was transferred to a PCT tube, and 150 µl Tris-HCl (pH 10) was added to neutralize any excess acid, followed by removal of the supernatant. Then, 20 µl Tris-HCl (pH 10) and 30 µl 6 M urea–2 M thiourea lysis buffer were added, and subsequent PCT-assisted lysis and enzymatic digestion were performed as described above.

Hair samples (4 cm) or nail samples (1 mg) were immersed in 50% methanol or ethanol, chopped, and vortexed, then transferred to a PCT tube. To the tube, 30 µl lysis buffer (30% trifluoroethanol, Tris-HCl buffer, pH 8) and 2.5 µl of 200 mM TCEP solution were added, and the tube was sealed with a PCT pestle and placed in a PCT device. The samples underwent 180 cycles in the Barocycler (50 s at 45,000 psi (~310 MPa), 10 s at atmospheric pressure, 70 °C). After cooling to room temperature, 2.5 µl of 800 mM IAA solution was added, and the samples were incubated in the dark at room temperature for 30 min with gentle mixing (800 rpm). After dilution with 150 µl Tris-HCl buffer (pH 8), 1.25 µg of LysC and 5 µg of trypsin were added, and the tube was sealed and subjected to 120 cycles in the PCT (20,000 psi (~138 MPa) for 50 s, atmospheric pressure for 10 s, 30 °C). After the cycles, 15 µl of 10% TFA solution was added to stop the digestion, and the samples were desalted using C18 tips.

Whole blood was collected by venipuncture into commercial EDTA-containing sampling containers. It was separated into PRP1, erythrocyte, platelet-rich plasma (PRP), platelet-free plasma (PPP) and the pure platelet fraction, according to a previously published article⁵⁴. PRP1 was the supernatant in the first centrifuge at 200g for 10 min. PRP was the supernatant in the centrifuge of PRP1 at 200g for 10 min. Then, 1 ml whole blood was transferred from the anticoagulation tube into a 50 ml centrifuge tube, 10 ml 1× red blood cell lysate (Invitrogen, 00-4333) was added for 10 min at room temperature (strictly controlled time) and 25 ml phosphate-buffered saline (PBS; total volume 36 ml) was added to stop the reaction. It was then divided into 15 ml centrifuge tubes (9 ml in each tube) and centrifuged at 500g for 15 min at 4 °C, followed by aspirating and discarding the supernatant. At this time, 200 μl was allowed to remain in each 15 ml centrifuge tube. The remaining liquid and cell pellet from each tube were pipetted into a 1.5 ml centrifuge tube, which was centrifuged at 500g for 5 min at 4 °C, followed by aspirating the supernatant as much as possible, to leave the cell pellet. The cell pellet was washed with 1× PBS three times and centrifuged (500g for 5 min, 4 °C) to get a clean white blood cell pellet.

The first stage of the process was the depletion of the top 14 proteins in the blood, PPP, PRP, and PRP1: samples taken out from storage at −80 °C were immediately put on ice. The depletion column was equilibrated to room temperature, following the manufacturer’s instructions (Thermo Fisher Scientific, High-Select Top14 Abundant Protein Depletion Resin, A36371). The second stage was protein concentration, following the manufacturer’s instructions (Thermo Fisher Scientific, Pierce Protein Concentrators PES, 3K MWCO, 0.5 ml, 88512). Then, 500 μl of 8 M urea–100 mM ammonium bicarbonate was added to the concentrator, which was spun at 12,000g for 40 min. Finally, the remaining volume was adjusted to 50 µl. The third stage of the process was protein digestion, which was done as described in a previous study⁵⁵; however, the TCEP and IAA processes were both 40 min. Then, the protein extracts were digested with LysC and trypsin to the sample at 1:50. Digested peptides were cleaned using C18 (Thermo Fisher Scientific, 60209-001). The fourth stage was peptide fractionation, as described previously⁵⁶. In brief, peptides were separated into 122 fractions, which were consolidated into 10 fractions. The samples were separated at a flow rate of 1 ml per min using a gradient from 5% ACN in 10 mM ammonia (pH 10.0) to 8% in 7.5 min; the percentage was increased to 18% in 30 min and to 32% in 55 min, then to 95% in 61 min; this 95% composition was held for 6 min; and finally it was dropped to 5% in 68 min. The fractions were combined according to the following strategy: (1) combine the 1st to 6th, 77th to 82nd and 97th, 98th, 107th and 108th fractions; (2) combine the 7th to 10th, 73rd to 76th, 93rd, 94th, 111st and 112nd fractions; (3) combine the 11th to 14th, 69th to 72nd, 89th, 90th, 115th and 116th fractions; (4) combine the 15th to 18th, 65th to 68th, 83rd, 84th, 121st and 122nd fractions; (5) combine the 19th to 22nd, 61st to 64th, 99th, 100th, 105th and 106th fractions; (6) combine the 23rd to 26th, 57th to 60th, 95th, 96th, 109th and 110th fractions; (7) combine the 27th to 30th, 53rd to 56th, 91st, 92nd, 113rd and 114th fractions; (8) combine the 31st to 34th, 49th to 52nd, 85th, 86th, 119th and 120th fractions; (9) combine the 35th to 38th, 45th to 48th, 101st, 102nd, 103rd and 104th fractions; (10) combine the 39th to 44th, 87th, 88th, 117th and 118th fractions. The combined fractions were subsequently dried and redissolved in 2% ACN/0.1% FA, and then analysed by liquid chromatography–mass spectrometry (LC–MS) (Bruker, timsTOF Pro).

For the collection of saliva, we strictly controlled the conditions to minimize contamination from food debris and oral bleeding, according to the instructions from Salimetrics (https://salimetrics.com/saliva-collection-handbook/). In brief, we collected about 1 ml of saliva per person from four healthy participants before eating meals and at least 45 min after brushing teeth, keeping records of the start and end times. Participants had to avoid the following before saliva collection: strenuous exercise, oral issues, intake of a high-sugar diet, intake of a highly sour diet, coffee, alcohol, nicotine and drugs. We collected 1 ml of the middle part of the morning urine. The saliva and urine samples were stored in at −80 °C before use. The reduction, alkylation, digestion, and desalting steps were performed as described for plasma samples in a previous study⁵⁵.

LC–MS/MS analysis

The peptides redissolved in 98% water:2% ACN:0.1% FA (v/v/v) were separated and analysed by a coupled ultra-high-performance liquid chromatography (UHPLC)–trapped ion mobility spectrometry–tandem mass spectrometry (MS/MS) system. For each acquisition, the peptides were loaded onto a Thermo Fisher Scientific Trap Cartridge (100 Å 5.0 μm, 0.3 mm × 5 mm) and separated by an in house-made analytical column (120 Å 1.9 μm, 150 mm × 0.075 mm) with a Bruker Daltonics nanoElute UHPLC system using a 95 min liquid chromatography gradient at a flow rate of 0.3 μl per min. The composition of the gradient was as follows: linearly increased from 2% B to 22% B in the first 80 min, then from 22% B to 35% B in 10 min, and then from 35% B to 80% B for 2 min and maintained at 80% B for the last 3 min. The mobile phase A was 99.9% water:0.1% FA (v/v), whereas mobile phase B was 99.9% acetonitrile:0.1% FA (v/v). All reagents were MS grade.

A Bruker Daltonics trapped ion mobility quadrupole time-of-flight mass spectrometer (timsTOF Pro) was used to perform data acquisition. In both data-dependent acquisition (DDA) and DIA mode, the MS full scans were acquired over an m/z range of 100 to 1,700 and an ion mobility range of 0.60 to 1.60 V per s per cm². The ion mobility over m/z heat maps were filtered using an inclusion–exclusion polygon region with vertices (150, 0.6), (1,300, 1.6), (1,700, 0.6), and (1,700, 1.6). In DDA parallel accumulation–serial fragmentation (ddaPASEF) mode, the method consisted of an MS full scan followed by 10 PASEF MS/MS scans, which consumed 1.17 s of total cycle time. The accumulation time and ramp time were both 100 ms. The target intensity was 20,000, and the intensity threshold was 2,500. In DIA parallel accumulation–serial fragmentation (diaPASEF) mode, the PASEF MS/MS scans were collected from 64 windows in an m/z range of 400 to 1,200 and corresponding ion mobility range of 0.57 to 1.47 V per s per cm², adapted from the standard 16 diaPASEF scans scheme⁵⁷.

Database searching and library generation

For generating consensus PASEF spectral libraries, we used the FragPipe^58,59,60,61 computational platform (v.18) with MSFragger^60,61 (v.3.5), Philosopher⁵⁹ (v.4.2.2) components and the EasyPQP (https://github.com/grosenberger/easypqp/) (v.0.1.29) Python package. Our workflow was based on the FragPipe DIA_SpecLib_Quant workflow⁵⁸, modified accordingly.

The database searching of raw files was performed by the MSFragger search engine, referring to the UniProtKB/Swiss-Prot Homo sapiens proteome ID UP000005640 (containing 20,377 canonical protein sequences; downloaded on 7 May 2020), appended with reverse protein sequences as decoys generated by the built-in Philosopher decoys generation tool. Then, Philosopher was used for further evaluation, protein inference and two-dimensional (2D) FDR filtering with a threshold of 0.01; the picked FDR algorithm was used to scale FDRs for large datasets. The built-in library generation script gen_con_spec_lib.py coupled with EasyPQP was used to build consensus PASEF spectral libraries with 0.01 FDR control on both the peptide level and the protein level. Most of the parameters were maintained as reported previously⁶², except for the application of the semi-enzymatic tryptic digestion rule, a 0.05 Da fragment mass tolerance and retention time (RT) alignment based on spiked indexed RT (iRT) peptides. Thus, each peptide was identified either as a unique peptide to a specific protein (or protein group) or as a razor peptide to the protein (or protein group) with the most peptide evidence. Correspondingly, each protein (or protein group) was identified by either a unique peptide or a razor peptide detected for it. In downstream DIA analysis, protein groups containing indistinguishable proteins were excluded from the quantification reports.

Spectral library evaluation

The information on precursors, fragment ions and modifications in the spectral library was analysed to evaluate its quality according to an algorithm modified from the published DIALib-QC tool⁶³. The Perl code for visualization was re-implemented in R. The ratios of sequence length, ion charge, peptide modification, fragment ion type and missed cleavage were counted, respectively, and presented in bar charts. The distributions of precursor m/z and peptide sequence coverage were displayed as histograms, and the protein sequence database read-in was performed by the seqinR⁶⁴ package. Pearson’s correlation of iRT values between +2 and +3 ion charge states of the same peptide sequence was measured (R²), and the 2D kernel density estimation was performed using the function stat_density2d in the ggplot2 package.

We further compared the proteins and peptides in our spectral library with PeptideAtlas⁶⁵, which consisted of 16,702 proteins and 2,701,150 peptides (as of 6 July 2021), and ProteomicsDB⁶⁶, which covered 13,553 proteins and 913,326 peptides (as of 6 July 2021). The comparison results were presented as a Venn diagram generated by an R script.

To rigorously assess and control false discoveries in the DIA analysis, we applied a combined entrapment strategy⁶⁷. We constructed a spectral library that merged empirical human spectra from the project-specific DDA library with entrapment spectra generated from non-human species⁶⁸. This design produced the desired ratio of entrapment-to-original target entries at both levels: 16.714 for protein groups (256,259 entrapments versus 15,332 targets) and 1.072 for precursors (739,318 entrapments versus 689,568 targets).

Analysis of DIA data by DIA-NN

DIA searching was performed by DIA-NN (v.2.2.0 Academia) with the FragPipe-generated spectral library, with its existing protein inference results⁶². Both the MS1 and the MS2 mass accuracy were set as 15 ppm, and a robust liquid chromatography (high-precision) quantification strategy and RT-dependent cross-run normalization were chosen for quantitative results. Moreover, the different runs were treated as unrelated with match-between-runs disabled. For a shorter waiting time, the 2,856 sample runs, including replicates, and 149 pooled sample runs were randomly grouped and searched on two computers in parallel. The 3,005 runs were reanalysed with the generated .quant files in a single run to generate the final report.

By default, DIA-NN iteratively selects the best elution peak per precursor using target decoy-trained classifiers, quantifies each precursor by summing the top six fragment intensities and computes protein group intensities from both proteotypic and shared precursors passing the specified q value thresholds, with PG.Q.Values estimated using all precursors assigned to each Protein.Group and Protein.Q.Values computed using only proteotypic precursors assigned to the corresponding genes. To increase stringency, we retained only proteins supported by unique peptides (that is, protein groups without ‘indistinguishable’ members in the FragPipe-annotated DDA library), and only proteotypic peptides (proteotypic = 1 in the DIA-NN report).

For downstream quantification and statistical analyses, we filtered the DIA-NN main report using enforced stringent FDR thresholds based on a series of false discovery proportion (FDP) estimates. The FDP values among the targets and entrapment discoveries (T ⋃ E_T) were calculated using the equation below:

$$\hat\rmF\rmD\rmP_T\cup E_\rmT=\fracN_\rmE(1+1/r)N_\rmT+N_\rmE,$$

where N_T is the number of target discoveries, N_E is the number of entrapment discoveries and r is the entrapment-to-target ratio in the spectral library.

To select suitable q value thresholds, we evaluated precursor and global protein-group q value cut-offs across the range 1 × 10⁻⁴ to 0.01, while keeping the run-specific protein-group q value fixed at 0.01. To balance sensitivity and confidence, we finally adopted the following criteria: proteotypic = 1, Q.Value < 0.001, PG.Q.Value < 0.01, Global.Q.Value < 0.001 and Global.PG.Q.Value < 0.001. These settings ensured tight global control of the FDP across all reporting levels in our multi-batch DIA dataset.

Quality control and preprocessing

We used both biological replicates (independent processing of aliquots from the same tissue sample) and technical replicates (repeated injections of identical peptide samples), with random assignment to minimize bias. Samples were randomized by reshuffling each collection batch, and clinical metadata were blinded during processing. Quality control used a dual approach: a project-specific pool prepared by mixing equal amounts of peptides from early-collected samples, and a universal standard (mouse liver digest) monitored by specialized MS administrators. Both quality control samples were injected before each batch to monitor instrument stability. All samples, including project-specific pool samples, were analysed by DIA-NN and preprocessed together (3,005 raw files in total).

Before quality control analysis, proteins with 50% or greater missing values across all the tissue types were removed from the protein matrix. The remaining protein-level intensities were quantile-normalized across samples using the normalize.quantiles function in the preprocessCore⁶⁹ package (v.1.58.0), followed by log₂ transformation. Missing values were then imputed using 0.5× the minimal value in the whole matrix. Batch effects were estimated using principal variance component analysis and corrected using the removeBatchEffect function provided by the limma⁷⁰ package (v.3.52.4) in two steps. Unless stated otherwise, this corrected matrix is used in all subsequent analyses.

Consistent processes were implemented for the corresponding data from pooled peptide samples to calculate the coefficients of variation (CVs) of protein abundances and Pearson correlation coefficients. For both technical and biological replicates, CVs were calculated by the same method that was used for pooled peptide samples, and Pearson’s r was calculated in all pairs (combined number). Similar quality control analysis was subsequently performed with the corrected matrix.

Protein identification and quantification

The number of proteins in tissue-specific libraries and DIA-quantified peptides and proteins were visualized in circular bar charts using the canvasXpress R package (v.1.29.6) and retouched using Adobe Illustrator.

t-SNE visualization

The whole matrix, excluding pooled samples, was standardized and dimensionality-reduced using t-SNE with the Rtsne⁷¹ package (v.0.16) in R. The perplexity parameter was set to 10. The 2D result was visualized in a scatter plot. To visualize group-level spread, we overlaid data ellipses based on a multivariate t-distribution for each sample type using stat_ellipse() from ggplot2 (v.4.0.1) in R.

Trajectory analysis of sample types

Trajectory analysis was performed using Monocle3 (v.1.4.26)⁷² to reconstruct the pseudotemporal ordering of samples on the basis of their proteomic profiles. After removing hair, nail and body fluid samples, the data were first processed with principal component analysis and then embedded with uniform manifold approximation and projection (UMAP) for dimensionality reduction. The trajectory graph was generated without partitions and without loop closing, with fetal samples set as the starting point of the trajectory, and pseudotime values were obtained for all samples.

To identify proteins whose expression changed along the progression of sample types, we applied a one-way ANOVA with FDR adjustment (P_adj < 0.05) and removed background contaminants. For these significant proteins, we computed their median intensities across the four sample types (F, T, NT and N) and performed k-means clustering (k = 4–8) using the ClusterGVis⁷³ package (v.0.1.4). Each cluster was subsequently annotated through Gene Ontology: Biological Process (GOBP) enrichment analysis, and the top five enriched terms for each cluster were retained for display purposes.

Tissue heterogeneity analysis

The proteomic heterogeneity of normal samples was assessed using a preprocessed and quality-controlled protein expression matrix. Samples were hierarchically classified by anatomical classification (major category) and specific tissue name (subcategory). We calculated pairwise Pearson correlations (r) and Euclidean distances for all samples using the ‘stats’ package (v.4.4.1). We summarized the distribution of these metrics across combinations of two major categories and two subcategories (Supplementary Table 4). Tissue heterogeneity was further investigated by ranking median distances of major tissue types and subcategory pairs within the same major tissue type. Statistical differences in median distances and correlations across these hierarchical groups were evaluated using robust non-parametric comparisons in the ggstatsplot package (v.0.13.6). Subcategories with fewer than two samples were excluded to ensure statistical rigour. Global heterogeneity trends were visualized using multi-panel violin plots, and point-range distributions were generated with ggplot2 (v.4.0.1).

H&E staining

Sections of 7 μm thickness were cut using a HistoCore AUTOCUT microtome (Leica Biosystems) and mounted on positively charged glass slides. Slides were stained with H&E using the HistoCore SPECTRA Automated Slide Staining Workstation (Leica Biosystems) according to the manufacturer’s standard protocol. In brief, sections were deparaffinized, rehydrated through graded alcohols, stained with haematoxylin to visualize nuclei, differentiated, blued and counterstained with eosin to visualize cytoplasm and extracellular matrix. After staining, sections were dehydrated through ascending ethanol concentrations, cleared in xylene and coverslipped automatically. Stained sections were examined and photographed using a Zeiss Axio Observer 3. Images were captured at 40× magnification and processed by ZEN Blue (v.3.4).

Comparison with previous human transcriptome and proteome draft

For comparison with previous human proteome drafts, we used Ensembl⁷⁴ stable gene IDs (v.GRCh37.p13) as our primary mapping strategy, ensuring consistency across data versions and minimizing gene-symbol ambiguity. We exclusively used canonical reviewed proteins from UniProt⁷⁵, mapped to stable gene IDs by BioMart⁷⁶, thereby eliminating isoform mismatch issues. For transcriptomic integration, we retrieved the HPA-normalized transcripts per million (TPM) matrix and associated tissue specificity annotations.

To address quantification differences, we implemented two complementary strategies: (1) z-score normalization of overlapped proteins before Pearson correlation analysis for cross-study expression comparisons; and (2) comparison of tissue specificity labels derived from independent multi-tissue proteome analyses within each study, following established methodology.

To assess the consistency of tissue specificity annotations across independent datasets, we manually aligned the anatomical classification labels used in our data to the tissue nomenclatures used in previous studies^3,4,14. We specifically cross-referenced tissue-enriched and group-enriched protein categories for corresponding tissues between datasets. For correlation analyses, the HPA³ and Wang et al.¹⁴ expression matrices were z-score-normalized. Because the GTEx⁴ dataset uses a tandem mass tag-based ratio matrix, we performed median normalization before applying z-score normalization. From these standardized matrices, we calculated pairwise Pearson correlation coefficients to quantify inter-study proteomic and transcriptomic consistency.

Tissue specificity analysis

The tissue specificity analysis was only performed on healthy adult autopsy specimens, excluding pan-cancer specimens, body fluids, hair, nail, umbilical cord and aborted fetus, using the modified HPA classification method³. The protein abundances of each tissue type were aggregated by the median of all samples after imputing missing values to 0.5× the minimal value in the whole matrix. We changed the categories of tissue-enriched, group-enriched and tissue-enhanced proteins from fivefold higher abundance to threefold for our more varied tissue types.

Tissue-enriched proteins were extracted for each tissue type and used to compare functional profiles based on GOBP annotations. Enrichment analysis was performed using the clusterProfiler⁷⁷ package (v.4.12.6) in R. The gene sets with genes between 10 and 500 were selected, with a q value threshold of 0.05 and a minimum protein count of 3. In the bubble plot, the top five most significant tissue types were highlighted for their uniquely enriched pathways. Network relationships among enriched pathways were generated using the enrichmentNetwork function in the aPEAR⁷⁸ package (v.1.0), applying a Jaccard similarity matrix and hierarchical clustering to identify pathway clusters, and selecting the term with the minimal q value as the cluster name.

Differential expression analysis

Differential expression was analysed separately for each cancer type using only patients with paired tumour and adjacent non-tumour samples. Given the limited sample size per cancer type, inherent tumour heterogeneity and potential confounding from technical batch effects, we applied a linear mixed model for comparative analysis between paired T and NT samples for each cancer type. Batch effects and clinical factors (sex, age and subtype) were explicitly included as covariates in differential expression analyses to ensure robustness, with standardized effects calculated after Hedges’ small-sample correction⁷⁹. This dual strategy enhances the reliability of biological signal detection while minimizing false positives attributable to technical variation^80,81,82. We removed proteins with missing values in more than 50% of samples within each cancer type, and imputed the remaining missing values under a left-censoring (limit-of-detection) assumption. Missing values were replaced using a global intensity floor multiplied by a random noise term drawn from a narrow normal distribution. Specifically, the imputed value I_imp was calculated as:

$$I_\rmimp=I_\rmglobal,\min \times 0.5\times \epsilon ,$$

where I_global,min is the minimum intensity observed across the cancer dataset (the global floor), and ϵ represents a random noise term drawn from a narrow normal distribution with a mean of 1 and a s.d. of 0.001.

After log₂ transformation, we performed differential expression analysis for each protein by fitting a linear mixed model using the lmerTest⁸³ package (v.3.1.3), in which the log-scaled intensity was the outcome. The model included sample type (tumour versus non-tumour) as the primary fixed effect and a patient-specific random intercept to account for the paired sample structure. Additional fixed covariates (cancer subtype, sex, age and dataset) were included only if supported by the available data. The tumour–non-tumour contrast was obtained from the coefficient of the sample-type term ($\hat\beta _1$). To quantify the bias-corrected standardized effect size, the Hedges’ g statistic⁸⁴ was computed as:

$$g=\frac\hat\beta _1\hat\sigma \times J(\nu ),$$

where $\hat\sigma $ is the fitted model’s residual s.d., ν denotes the degrees of freedom and J(ν) is the small-sample bias correction factor defined by the Gamma function (Γ):

$$J(\nu )=\frac\Gamma (\nu /2)\sqrt\nu /2\,\Gamma ((\nu -1)/2).$$

The P values were adjusted for multiple testing within each cancer type using the Benjamini–Hochberg procedure, and proteins were considered significantly DEPs if they satisfied P_adj < 0.05 and |Hedges’ g| ≥ 0.5. Proteins exhibiting statistically significant up- or downregulation in a single tumour type were classified as tumour-specific DEPs. A single protein could be categorized as a tumour-specific DEP in more than one cancer type, exhibiting different regulation patterns (up- or downregulation) across these distinct tumour types. The DEPs corresponding to RTKs were standardized and shown in a heat map with hierarchical clustering. The DEPs were also mapped to TCGA-curated pathways⁸⁵, and the degree of pathway dysregulation was assessed by the protein count for each pathway in each cancer type. Here, we present the pseudocode for this analysis:

Input: Proteomics matrix X, sample metadata M
Output: Per-cancer results R = (c, j, q, g_adj)
FUNCTION EstimableCovariates (M_c)

V ← ∅

FOR EACH covariate v ∈ cancer_subtype, sex, age_centered, dataset:
IF |unique values of v in M_c | ≥ 2 THEN V ← V ∪ v ENDIF
RETURN V
ENDFUNCTION
FOR EACH cancer type c:
1. Subset X_c, M_c to samples where cancer_type = c
2. Retain only complete tumour–normal pairs (indexed by patient_ID);
let n = number of pairs
3. Center age: age_c ← age − mean (age)
4. V ← EstimableCovariates (M_c)
FOR EACH protein j ∈ 1, …, P:
5. Fit linear mixed-effects model via REML:
y_ij = β₀ + β × sample_type_i + γTV_i + u_j + ε_ij
where u_j ~ N(0, τ²) (patient random intercept),
ε_ij ~ N(0, σ²),
sample_type coded as ordered factor normal < tumour
6. If model fails to converge, skip protein j
7. Extract from the tumour coefficient:

β^, SE, t, df, p

8. Compute bias-corrected effect size:

$g_\mathrmadj=\rmJ(\mathrmdf)\times (\hat\beta /\hat\sigma )$ ▷ Hedges’ g

ENDFOR
9. Apply Benjamini–Hochberg FDR correction to p_j within c → q_j
10. Append c, j, g_adj, q_j to R
ENDFOR

Consensus clustering analysis

The pan-cancer data were extracted from the corrected protein matrix, and subsequently preprocessed by selecting proteins whose CV values were larger than the median across all samples. Consensus clustering was then performed using the ConsensusClusterPlus⁸⁶ package (v.1.67.0) with the k-means algorithm (k = 2–20) based on Euclidean distance. Silhouette analysis was applied to remove samples with silhouette widths less than 0. The remaining ‘core’ samples were subsequently used for differential protein expression analysis for each cluster versus all others using the Wilcoxon rank-sum test, which identified significantly upregulated and downregulated proteins (|log₂ fold change| ≥ 1.5, P_adj < 0.05). Cluster-enriched proteins were also computed using the tissue specificity classification algorithm. The final protein signature set was derived from either significant proteins or cluster-enriched proteins, and was used for single-sample gene set enrichment analysis (ssGSEA⁸⁷) against the MSigDB Hallmark gene sets⁸⁸ using the GSVA⁸⁹ package (v.1.51.1). The resulting pathway enrichment scores were statistically compared between the patient clusters using the limma framework.

Tumour-enriched proteins

Tumour-enriched proteins were defined for each tumour type as proteins significantly upregulated in the tumour (T) compared with both paired non-tumour (NT) and a comprehensive panel of normal (healthy) tissues (N) from diverse organs. The T–NT comparisons were taken from the ‘Differential expression analysis’ section in the Methods. The T–N differences were computed using a linear mixed model similar to the T–NT analysis, but including only the sample collection batch as an additional fixed covariate, because the corrected matrix was used here. To ensure stringent tumour enrichment, the maximum intensity across all of the N samples was required to be lower than the 25th percentile intensity in the T samples, and the highest median intensity across the N samples was required to be less than half of the median intensity in the T samples.

Locally enriched DEPs

The list of tumour-specific DEPs was mapped to tissue-enriched proteins from normal samples. Locally enriched DEPs were defined as the intersection of them.

External database cross-referencing

The list of cancer dysregulated proteins was mapped to the DrugBank online database⁹⁰ to find out potential targeted drugs, and the cancer–target–drug combinations were used to query clinical trials information from the ClinicalTrials.gov online database (https://clinicaltrials.gov). We downloaded 8,179 DrugBank records (6,653 targets; 7 February 2023), filtering for 528 biotech drugs targeting 573 proteins. Among these targets, 301 were quantified in our paired tumour dysregulation analysis, 260 of which were DEPs. By matching these drugs to ClinicalTrials.gov records for cancer types in which their targets were upregulated, we identified 36 drugs linked to 2,084 trials involving 77 upregulated DEPs across 10 cancer types (Supplementary Information). All matched clinical trials were visualized in a circular scatter plot.

We used the pan-cancer proteomic map (ProCan-DepMapSanger) drug response and CRISPR–Cas9 gene-essentiality screens dataset⁴⁴ as a validation. In the ProCan-DepMapSanger dataset, linear regression analysis was performed to test the associations between protein abundances and either drug sensitivity or CRISPR–Cas9 gene dependency. Calculated nc_fdr and nc_beta denote the FDR and the corresponding regression coefficient from models, respectively. We filtered the protein–drug or protein–knocked gene associations of nc_fdr <0.01 and nc_beta < −0.1. Meanwhile, we selected the association pairs where the protein functions as the target of the drug or corresponds to the CRISPR perturbed gene (PPI = TRUE). The gene names were transformed into Swiss-Prot protein IDs and mapped to our list of cancer dysregulated proteins. The beta without-transcriptomics covariate, Hedges’ g and cancer of each validated protein–drug relationship were visualized in a scatter plot.

The favourable prognostic, unfavourable prognostic and cancer-enriched genes, from 14 types of human cancer (head and neck cancer, thyroid cancer, lung cancer, liver cancer, testis cancer, prostate cancer, stomach cancer, colorectal cancer, breast cancer, endometrial cancer, ovarian cancer, cervical cancer, pancreatic cancer and renal cancer), were obtained from the Human Pathology Atlas⁴⁰ (https://www.proteinatlas.org/humanproteome/cancer). Our list of dysregulated proteins in cancer was mapped to the above public datasets for the same cancer types and visualized using a Sankey diagram.

The list of cancer dysregulated proteins was also mapped to TCGA and CPTAC datasets by proteins and corresponding cancer types, with the results displayed as scatter plots.

Parallel reaction monitoring validation

A total of 137 cancer-specific dysregulated proteins were selected from pan-cancer data acquired before the year 2022 for parallel reaction monitoring (PRM) validation. Each selected precursor has no modification and no missed cleavage, with the peptide length ranging from 8 to 20, identified using Skyline (v.21.1). In total, 56 peptide precursors from 47 proteins and 9 iRT peptide precursors were analysed. Among the 47 cancer-specific proteins, 10 were still classified as cancer-specific. Skyline⁹¹ was used to analyse and quantify the targeted PRM data before generating the peptide matrix. The peptide matrix was transformed into the protein matrix using the online ProteomeExpert Peptide2Protein tool⁹² by means of the top three precursor intensities and with log₂ transformation and quantile normalization. Differential expression analysis was performed after filling missing values with 0.5 times the minimum value in the whole matrix. DEPs were identified as proteins with a B-H adjusted P < 0.05 (paired Wilcoxon rank-sum test) and an absolute log₂ fold change of means ≥ 1. The cancer-enriched proteins and cancer-specific dysregulated proteins were selected using the methods described above. Validated proteins were visualized in box plots.

Validation of protein detection

Peptides were synthesized for PANX3 (SLAHTAAEYMLSDALLPDR, LVQHMLK and YFEFPLLER) by Jietai Biotechnology (http://www.synpeptide.cn/). These peptides were redissolved in 98% water:2% ACN:0.1% FA (v/v/v), mixed and analysed by the same LC–MS method that was used in the library construction process. Tandem MS spectra of three PANX3 peptides were manually reviewed in DDA files of cochlea in FragPipe.

The data analysis mentioned above was performed in R (v.4.4.0).

Development of a data resource website

The website was constructed using a standard three-tier web architecture comprising an HTML-based presentation layer, Java- and Python-based business logic layer and MySQL-based data persistence layer. The front-end interface communicates with back-end services via the HTTP protocol, with all structured data stored in optimized MySQL relational tables. User queries are processed through database matching operations followed by Python-based analytical processing before being returned to the presentation layer. For enhanced data visualization, DEP results are dynamically rendered using the ECharts JavaScript library (https://echarts.apache.org), enabling interactive graphical representations.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Spatial distribution of the proteome in the human body and in cancers

Collection of tissue samples

Sample preparation for LC–MS

LC–MS/MS analysis

Database searching and library generation

Spectral library evaluation

Analysis of DIA data by DIA-NN

Quality control and preprocessing

Protein identification and quantification

t-SNE visualization

Trajectory analysis of sample types

Tissue heterogeneity analysis

H&E staining

Comparison with previous human transcriptome and proteome draft

Tissue specificity analysis

Differential expression analysis

Consensus clustering analysis

Tumour-enriched proteins

Locally enriched DEPs

External database cross-referencing

Parallel reaction monitoring validation

Validation of protein detection

Development of a data resource website

Reporting summary

scientists inch closer to fertility breakthrough

NSF plans cuts to core science programmes to fund White House initiative

Think preprints are unreliable? Analysis of 70,000 studies might change your mind

Most Popular

‘Orion Medical Network,’ A Black-Owned Staffing Agency Opens In MD

EU Wants To Slow Down Fast Drivers Using Satellites In Orbit

Italian Grinder Chicken Salad | The Recipe Critic

Rush: Power Windows Album Review

Recent Comments

ABOUT US

POPULAR POSTS

‘Orion Medical Network,’ A Black-Owned Staffing Agency Opens In MD

EU Wants To Slow Down Fast Drivers Using Satellites In Orbit

Italian Grinder Chicken Salad | The Recipe Critic

POPULAR CATEGORY