Concurrent loss of the Y chromosome in cancer and T cells impacts outcome

TCGA data acquisition and processing

In this study, we used bulk RNA-seq, WES somatic mutation data, and clinical data (Supplementary Table 1) sourced from TCGA project (https://portal.gdc.cancer.gov/). The data matrices obtained from UCSC Xena (https://xena.ucsc.edu/) have been standardized, normalized and corrected for batch effects and platform differences. Additionally, mutation data generated by the PanCancer Atlas consortium (https://gdc.cancer.gov/about-data/publications/pancanatlas) were incorporated. A total of 29 tumour types and 4,127 male participants were included in our analysis. Survival outcome metrics, including OS, OS time, DSS and DSS time were calculated as in Liu et al.⁴⁶ (Supplementary Table 2).

Classification of LOY based on transcriptome data

We used the DESeq2 R package (v.1.42.1) to uncover Y chromosome gene expression differences between LOY_DNA and WTY_DNA⁴⁷. Differentially expressed genes were called on the basis of a log₂FC cut-off of −1 and a −log₁₀-adjusted P value cut-off of 200. This gene set resulted in the LOY prediction signature. Subsequently, we conducted a gene set intersection analysis with a gene set in the male-specific region of the Y chromosome seen expressed across 24 human tissues. This analysis required that genes exceeded 0.1 reads per kilobase of transcript per million reads mapped per tissue or that they had presence in IHC data from the Human Protein Atlas RNA-seq¹¹. This approach identified nine signature genes making up our Y chromosome signature (YchrS): DDX3Y, UTY, KDM5D, USP9Y, ZFY, RPS4Y1, TMSB4Y, EIF1AY and NLGN4Y. Based on single-sample GSEA (ssGSEA)⁴⁸ conducted with the GSVA⁴⁹ R package (v.1.44.5), we observed that patients with low levels of YchrS exhibited characteristics similar to those of people with LOY_DNA, whereas those with high YchrS levels resembled people with an intact Y chromosome (WTY_DNA). We partitioned all patients into low YchrS group (LOY_BR) and high YchrS group (WTY_BR) with a mean value cut-off. Additionally, a similar approach was applied to analyse the Ywhole signature (YwholeS), using a signature comprising all Y chromosome genes.

YchrS validation by using CCLE data

To validate the nine-gene Ychr Signature (YchrS), we downloaded the batch-corrected transcriptomic data and corresponding CNA data for 778 male cancer cell lines from the CCLE project (https://depmap.org/portal/ccle/). YchrS was calculated by the same method as for TCGA data. Cell lines with a YchrS lower than the mean value were categorized as LOY_BR, otherwise they were called WTY_BR. Average CNA for chromosome i in each cell line was calculated using the following formula to evaluate its integrity:

$${\rm{Average}}\;{{\rm{CNA}}}_{i}=\,\frac{{\sum }_{j=1}^{n}({\rm{estimated}}\;{\rm{segment}}\;{{\rm{CNA}}}_{j}\times {\rm{segment}}\;{{\rm{length}}}_{j})}{{\sum }_{j=1}^{n}{\rm{segment}}\;{{\rm{length}}}_{j}}$$

n is the total number of segments in chromosome i.

Genetic ancestry

Consensus ancestry for TCGA cases was obtained from ref. ¹⁴, determined by combining ancestry inference from five independent classification methods using SNP array and/or WES data. Only ancestries with more than 50 patients were included in the survival analysis, which spanned 3,893 patients: European (n = 3,319), East Asian (n = 286), African (n = 190) and African-admixed (n = 98).

Genomic instability and stemness features

Aneuploidy scores for TCGA cases were obtained from ref. ⁵⁰. Arm-level statistics were calculated using the GISTIC (v.2.0)⁵¹ copy number significance software. These scores were derived by tallying the total number of amplified or deleted arms, collectively termed ‘altered’. Samples were initially categorized by tumour type, alteration type (amplification or deletion) and chromosome arm. Subsequently, samples were clustered on the basis of specific arm attributes, and arms were classified as altered if part of a cluster had a mean fraction altered of at least 80%. Intratumour heterogeneity used to generate DNA damage scores was determined by ABSOLUTE⁵². ABSOLUTE analysed segmentation data from Affymetrix SNP6.0 arrays and variant calls from the MC3 variant file.

We used two measures to assess HRD. The first was derived by Knijnenburg et al.⁵³ and quantifies HRD by aggregating separate metrics of genomic scarring: large (more than 15 Mb) non-arm-level regions with LOH²⁰, large-scale state transitions (breaks between adjacent segments of greater than 10 Mb, LST²¹ and subtelomeric regions with allelic imbalance. The second measure, introduced by Telli et al.¹⁹, incorporates LOH and TAI²². The HRD score for samples analysed via custom hybridization sequencing assay were computed using reads covering SNP positions⁵⁴. The HRD score was determined as the unweighted sum of LOH, TAI and LST, represented as: HRD score = LOH + TAI + LST.

We screened samples using mRNA and DNA methylation profiles to compute four stemness scores: DNA methylation-based stemness score, epigenetically regulated DNA methylation-based stemness score, differentially methylated probe-based stemness score and enhancer elements/DNA methylation-based stemness score as outlined in previous studies²³.

Quantification of TNB and mutation data

Two methodologies were used to identify potential neoantigens arising from SNVs and Indels. For SNVs, somatic nonsynonymous coding SNVs were identified and minimal peptides encompassing mutation sites were extracted, followed by prediction of binding to autologous MHC using NetMHCpan v.3.0 (ref. ⁵⁵). On the other hand, Indel variants meeting specific criteria were extracted, and downstream protein sequences obtained to generate nine-mer peptides. These peptides were then evaluated for their ability to bind MHC molecules using the pVAC-Seq v.4.0.8 pipeline⁵⁶, also using NetMHCpan v.3.0. The mutation data, specifically encompassing missense mutations and nonsense mutations, were obtained from the PanCancer Atlas consortium and used in this analysis. The dataset (https://gdc.cancer.gov/about-data/publications/pancanatlas) underwent filtering, requiring mutation calls to be generated by two or more mutation callers (NCALLERS > 1).

Signature calculation for bulk-seq data

To study the consequences of LOY, a literature review was performed, and a variety of tumour-associated signatures gathered (Supplementary Table 3)^{57,58,59,60,61,62}. Each signature was assessed using ssGSEA implemented through the GSVA R package⁴⁹ (v.1.44.5). Scoring methods are in the figure legends. Where information is not provided in the figure legends, the methodologies were documented in the respective citations.

To validate the CRISPR–Cas9-mediated Y-KO MB49 (Y-KO)-derived gene expression signature in human BLCA data, we first performed a differential expression analysis between the Y-KO and Y-Scr groups, which identified a robust set of upregulated and downregulated genes. We then calculated a Y phenotype score by dividing the signature scores of upregulated genes by those of downregulated genes, with all values scaled to the [0,1] interval.

Pan-cancer scRNA-seq data collection

For pan-cancer scRNA-seq data, transcriptome data of 346 samples from 251 people across 20 scRNA-seq datasets were obtained from public studies^{63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82}, from which tumour samples were selected for later analysis. Accession numbers for each scRNA-seq dataset and detailed clinical information for patients and samples are summarized in Supplementary Table 4. To avoid issues related to platform heterogeneity, only datasets generated from 10x Genomics droplet based scRNA-seq datasets were included.

Quality control and preprocessing of pan-cancer scRNA-seq data

We performed quality control filtering and integration using the Scanpy package (v.1.9.5). Filtering was performed based on (1) confirmation that information was available for all nine Y signature genes, (2) cells had greater than 200 detected genes and (3) the mitochondrial gene counts were below 20%. Additional quality filters were applied to the data to remove barcodes that fell into any of the following categories: possible debris with too few genes expressed (less than 400) and too few unique molecular identifiers (UMIs) (less than 500), possibility of duplicate cells based on genes expressed (more than 5,500) or UMIs (more than 30,000). Count matrices and AnnData objects were then combined using a concatenate function, normalized to log transcripts per million units using the ‘sc.pp.normalize_total’ function, and log-transformed using the ‘sc.pp.log1p’ function. The normalized HNSC dataset from GSE150430 was then combined. Subsequently, non-tumour samples were removed, and we retained 1,030,968 high-quality cells and 14,689 genes for downstream analysis.

Combining and batch effect correction of pan-cancer scRNA-seq data

We used the scVI Python package (scvi-tools; v.1.0.4)⁸³ to integrate and batch correct scRNA-seq data. The scVI model was trained on the scRNA-seq data, considering samples as covariates. Following batch correction, the corrected data were integrated if multiple batches were present. The effectiveness of batch correction was evaluated by assessing the reduction in batch-specific variation while ensuring preservation of signal. Downstream analyses such as clustering, differential expression analysis or pathway enrichment analysis were performed on the batch-corrected data. Visualization of the results was achieved through two-dimensional UMAP plots, illustrating cell types, batches, datasets, gender, organs and cancer types.

Cell-type annotation of pan-cancer scRNA-seq data

To annotate cells, we used the scANVI algorithm from the scVI Python packages (scvi-tools; v.1.0.4) and the Luo et al.⁷¹ dataset, where cells were pre-labelled as epithelium, endothelium, fibroblast, lymphocyte, myeloid or plasma cell. Subsequently, we performed unsupervised clustering of the scANVI latent space, and then used Leiden clustering, followed by cluster assignment to specific cell types. The scANVI model was configured with max_epochs = 20 and cluster labels were transferred and predicted, guided by a sample size of n_samples_per_label = 100. The integrated latent embedding provided by scANVI served as the basis for downstream analysis, with the dataset segregated by cell type for further investigation. To delineate cell subtypes within myeloid cell, lymphocyte and plasma cell populations, we merged corresponding AnnData and mitigated batch effects and other sources of variation using scVI. Subsequently, we predicted subtypes using Celltypist⁸⁴ (v.1.6.2), using ‘majority voting’ with default parameters and the pre-trained ‘Immune_All_Low.pkl’ model.

Annotating LOY cells in pan-cancer scRNA-seq data via Random Forest

We collected scRNA-seq pan-cancer data from paired tumour and adjacent normal samples and, following preprocessing, categorized cells from adjacent normal samples as: male cells as having wild-type Y (WTY) chromosomes and female cells as LOY cells. Employing the train_test_split function from sklearn.model_selection, we divided data from the normal samples into training and test sets, with a split ratio of 70% for training and 30% for testing. Next, we trained a Random Forest classifier model (RandomForestClassifier from the sklearn.ensemble Python package (v.1.3.2)) to differentiate LOY and WTY cells based on the expression levels of the nine Y chromosome genes used for the bulk RNA-seq classification of LOY samples. The performance of the model was assessed using the test set, achieving an accuracy score of 0.83.

To further validate LOY prediction by a Random Forest model, we obtained 23 samples from Liu et al.²⁴, including sequencing data for single-cell RNA (GSE245552), bulk RNA (GSE255163) and WES (GSE255165). YchrS and average CNA of Y chromosome were calculated using the same method as CCLE data. LOY at the single-cell level were predicted by the same Random Forest model used for the pan-cancer single-cell datasets. Owing to absence of RPS4Y1 expression, it was set as 0 for all the cells when applying the Random Forest model.

Genomic DNA isolation and WES

Genomic DNA was isolated from CRISPR Y-KO cells and CRISPR–Cas9-mediated Scr MB49 Y⁺ control (CRISPR Y-Scr) cell lines using the Invitrogen kit (catalogue no. K1820) following the manufacturer’s instructions. DNA samples for WES were submitted to Novogene, where library preparation, sequencing and bioinformatics analysis were conducted. The genomic DNA was fragmented randomly into short pieces, end-repaired, A-tailed and ligated with Illumina adaptors. Following PCR amplification, size selection and purification, hybridization capture of libraries was performed. Captured libraries were further enriched by PCR amplification and assessed for quality using Qubit and bioanalyzer systems. The libraries were pooled and sequenced using Illumina platforms with the PE150 strategy. Sequencing data were processed using the GATK best practices workflow. Paired-end clean reads were aligned to the mouse reference genome (GRCm39/mm39) using the Burrows–Wheeler aligner. The resulting alignments were sorted with Sambamba and duplicate reads were marked using Picard. The coverage and sequencing depth were computed, and SNP and INDEL variants were identified.

scRNA-seq of mouse tumour tissues

A total of 1 × 10⁵ LOY MB49 cells (a LOY clonal line, MB49 clone 5 (C5)) were injected subcutaneously into C57Bl/6N mice (n = 4). Once the tumours reached 500 mm³, they were removed and processed for scRNA-seq. The tumours were cut and transferred immediately to MACS C-tubes along with chilled DMEM and tumour dissociation enzymes for mouse (Miltenyi Biotech, catalogue no. 130-096-730). The dissociated tumours were then processed using ACK Lysis buffer (Gibco, catalogue no. 2537772), dead cell removal kit (Stem Cell Technologies, catalogue no. 17899) and EasySep Mouse CD45 Positive Selection Kit (Stem Cell Technologies, catalogue no. 18945). The CD45-enriched cells were next stained with Hashtag antibodies (TotalSeq-B0301 anti-mouse Hashtag 1 Antibody; TotalSeq-B0302 anti-mouse Hashtag 2 Antibody) and stained sequentially for CD45, CD3, CD4, CD8, CD11b, F4/80 and B220 along with 4′,6-diamidino-2-phenylindole (DAPI) and sorted for CD4⁺, CD8⁺ and CD11b⁺ and mixed into equal ratios. This mixture of highly enriched CD45⁺ cells was combined in a 1:1 ratio with live CD45⁻ cells to make a final mixture that was sent for scRNA-seq.

The Cedars-Sinai Applied Genomics, Computation and Translational Core used 10x genomics 3′ scRNA-seq to sequence all samples to around 60% saturation. Samples were processed using Cell Ranger (10x genomics) based on a pre-mRNA GRCh38 reference. Since the samples were not hashed, potential doublet cells were identified using Scrublet applied to the filtered feature barcode matrices from Cell Ranger. Scrublet analysed the 10% most variable genes, identified by Scanpy package (v.1.9.5, scanpy.pp.highly_variable_genes function), predicting a 10% doublet rate, and then discarded doublet cells. Finally, nuclei with over 10% of their UMIs linked to mitochondrial genes or those in the top and bottom 5% based on the number of unique genes and UMI count were also removed.

One female and one male normal healthy C57BL/6N bladder sample from our previous study³¹ were also analysed. Filtering was performed based on: (1) cells had more than 200 detected genes and (2) the mitochondrial gene counts were below 20%. Additional quality filters were applied to the data to remove barcodes that fell into any of the following categories: possible debris with too few genes expressed (less than 400) and too few UMIs (less than 500), possibility of duplicate cells based on genes expressed (more than 30,000) or UMIs (more than 5,500). We normalized the data to 1 × 10⁴ counts per cell and calculated the base-10 logarithm. We used sc.pp.combat to remove the batch effect and applied subsequent downstream analyses on the batch-corrected data. To annotate cells, we used the scVI and scANVI algorithm from the scVI Python packages (scvi-tools).

Analysis for xenograft scRNA-seq datasets

To further investigate the ability of LOY malignant cells to induce LOY in benign cells in the TME, we downloaded public scRNA-seq datasets of human xenografts in immunocompromised mice from GSE254890 (ref. ³⁵) (SW480 cells from a male patient with CRC injected into male mice, 14 samples were incorporated as SW480 group), GSE110501 (ref. ⁸⁵; only eight samples from normal tissues were incorporated as male control) and GSE144236 (ref. ⁸⁶; A431, SCC and CAL27 injected into female mice, three samples incorporated as female control). Based on scRNA-seq and bulk RNA-seq data provided, SW480 cells used were LOY cells, which matched with the Y chromosome information obtained by our CCLE analysis on RNA-seq data. Mouse cells were selected either by tumour cell depletion using FITC conjugated antibodies, or by expression level of mouse genes compared with human genes. Potential debris (cells with fewer than 200 expressed genes or 400 UMIs) and possible doublets (cells with more than 8,500 expressed genes or 30,000 UMIs) were filtered out. After normalization, batch correction and cell type annotation were performed by scVI and scANVI. YchrS was calculated ‘scanpy.tl.score_genes’ using all Ychr gene expression.

Mouse HCC studies

Mice

MUP-uPA⁺ on C57BL/6 (ref. ⁸⁷) background 1 were bred and housed under specific pathogen-free conditions in an American Association for Accreditation of Laboratory Animal Care-approved barrier facility at Cedars-Sinai Medical Center. MUP-uPA⁺ mice were fed a Western diet (Teklad, catalogue no. TD.88137) for 8 months beginning at 8 weeks after birth. HCC development was analysed at 10 months of age.

Tissue preparation for mouse scRNA-seq and single-nuclei RNA-seq

Mice were killed by CO₂ inhalation and livers were perfused with PBS containing 2% of heparin (20 USP units ml⁻¹) to remove traces of blood. For single-nucleus preparation, livers were isolated, tumour tissues were dissected and cut into 50 mg tumour tissue pieces for single-nucleus isolation and sequencing. Tissue was frozen in dry ice (solid CO₂) and kept in liquid nitrogen for long-term storage. For single-cell preparation, livers were isolated, and tumour tissues were dissected and digested using a cocktail of digestion enzymes containing collagenase I (450 U ml⁻¹) (Sigma-Aldrich, catalogue no. C0130) and DNase I (120 U m⁻¹) (Sigma-Aldrich, catalogue no. D4263) in PBS (with Ca2⁺/Mg²⁺) for 30 min at 37°C with gentle shaking at 150 rpm for liver immune cell isolation. After incubation, cell suspensions were filtered through a 70 µm cell strainer. Immune cells were enriched by density-gradient centrifugation over Percoll (GE Healthcare, catalogue no. 17-0891-01) at 1,000g for 25 min without brake (40% Percoll in RPMI-1640 and 80% Percoll in 5% FBS/PBS). Leukocyte rings on a border of gradient were collected, washed and stained. Immune cell suspensions were stained with Zombie Aqua (BioLegend, catalogue no. 423101) on ice for 15 min to exclude dead cells, incubated with Fc Block TruStain FcX (Clone 93, BioLegend, catalogue no. 101320, RRID: AB_1574975) for 20 min in 2% FBS-PBS and then stained with fluorochrome labelled antibody for 30 min on ice (CD45-PerCP/Cyanine5.5 (QA17A26, BioLegend, catalogue no. 157612; RRID, catalogue no. AB_2832558, 1:100)). All the flow cytometry antibodies were validated by the manufacturer (BioLegend) and were validated in the laboratory in single channel controls. Live, CD45⁺ cells were sorted by BD sorter Aria III using a 100-µm nozzle.

Mouse scRNA-seq

The single-cell droplets were generated with a Chromium X controller using Chromium Next GEM Single Cell 3′ Reagent Kits v3.1 (Dual Index) (10X Genomics, catalogue no. 1000268). Approximately 8,000 to 10,000 cells were collected to make cDNA at the single-cell level. cDNA amplification and library construction were performed according to the manufacturer’s instructions. All cDNA and libraries were quantified via Agilent Technologies 2100 Bioanalyzer. Gene expression libraries were sequenced at a targeted depth of 50,000 reads per cell on the Illumina Novaseq X plus (Illumina) at Novogene. Fastq files were obtained and then processed with Cell Ranger v.8.0.1 aligning to the mouse (mm10) 2020-A reference genome on 10x Genomics Cloud Analysis.

Mouse single-nucleus RNA sequencing

Single nuclei were isolated from frozen tumour tissues using the Chromium Nuclei Isolation Kit (10x Genomics, catalogue no. 1000493) according to the manufacturer’s instructions. cDNA amplification, library construction, sequencing and genome mapping were performed in the same way as for mouse scRNA-seq.

Validation of LOY effect using independent scRNA-seq data

Processed scRNA-seq data and corresponding cell type information for 116 liver cancer samples from 94 male patients⁴⁰ were analysed using our Random Forest model to predict LOY at the single-cell level. Only primary tumour or metastasis samples were included in the survival analysis.

InferCNV analysis

For the results presented in Extended Data Fig. 8e,f, CNVs in the scRNA-seq data were predicted using the InferCNV tool (https://github.com/broadinstitute/inferCNV; v.1.13.0), so that differences in the frequencies between the LOY_SCR and WTY_SCR epithelial cells of gains or deletions of entire chromosomes or large chromosomal segments could be identified. The algorithm was run with default parameters, using all WTY_SCR stromal cells and immune cells as reference cells. For the results presented in Extended Data Fig. 6c,d, the analysis and figure were generated using Infercnvpy (https://github.com/icbi-lab/infercnvpy; v.0.4.5).

Functional signature calculation for scRNA-seq data

We used the ‘scanpy.tl.score_genes’ function from the Scanpy Python package (v.1.9.5) to compute gene set scores across cells. This function calculates gene scores for each gene listed^62,88,89 in ‘gene_list’ across all cells stored within the dataset.

Sorting immune and epithelial cells from tumours and PBMCs

We used three distinct models to examine LOY in vivo: (1) naturally occurring Y⁺ and LOY (Y⁻) cells⁷, (2) an established LOY clonal line (C5) and (3) CRISPR-engineered Y-KO and Y-Scr cells⁷. For the tumour challenge, 1 × 10⁵ cells from each line—MB49 clone 5 (C5), Y⁻ (LOY), Y⁺ (WTY), CRISPR Y-KO and CRISPR Y-Scr—were injected subcutaneously into the flanks of C57Bl/6N mice (n = 7 per group) obtained from Taconic Biosciences⁷.The DNA levels of Y chromosome genes in each engineered cell type were checked for abundance before injection. Subcutaneous tumours from each group were disrupted mechanically, filtered through a 70 µm cell strainer (Corning, catalogue no. 352350) with RPMI-1640 cell culture grade media (Gibco, catalogue no. 11875093). ACK lysis buffer (Gibco, catalogue no. A1049201) was used to disrupt the infiltrating red blood cells. The single cells were then centrifuged and resuspended in 1× HBSS buffer (Gibco, catalogue no. 14025092) and counted in a cell counter machine. Tumours with viability of greater than 60% were used for subsequent procedures. EasySep Dead Cell Removal (Annexin V) Kit (Stem Cell Technologies, catalogue no. 17899) was used to increase the viability of each tumour and remove the dead cells. Viable cells were next processed with EasySep Mouse CD45 Positive Selection Kit (Stem Cell Technologies, catalogue no. 18945). The CD45 cells with the positive magnetic beads were purified and resuspended in EasySep Buffer (Stem Cell Technologies, catalogue no. 20144). These cells were next stained with CD45-Alexa Fluor 488 (BioLegend, catalogue no. 103122, 1:40) and Viability ghost dye Red 710 (Cytek, catalogue no. SKU 13-0871-T100) and sorted for only CD45⁺ cells in BD AriaIII machine. These highly purified CD45⁺ immune cells were next used for isolating high-quality genomic DNA with Monarch Nucleic Acid Purification Kits (NEB) and quantified. The flow through obtained after the CD45 positive selection, containing stromal, endothelial or other blood cells, was next processed with the EasySep Mouse Epithelial Cell Enrichment Kit II (Stem Cell Technologies, catalogue no. 19868), to isolate only the epithelial cells. The purity of pre- and post-isolation populations was assessed by antibody staining of random samples from each group, using the CD45 AF488 (BioLegend, catalogue no. 103122, 1:40) antibody and the BD Symphony A5, with results being analysed using Flow Jo software v.10.9.0.

For isolating T cells from these tumours (n = 6 each group) and their respective PBMCs, the following panel was used: CD45 (BioLegend, catalogue no. 103116; 1:40); CD3 (BD, catalogue no. 749276, 1:40); TCR β chain (TCRb, BioLegend, catalogue no. 109205, 1:40); CD11b (BD, catalogue no. 563553, 1:40); CD45R (BD, catalogue no. 612950, 1:40); Ghost dye Red 710 viability dye (Tonbo Biosciences, catalogue no. 13-0871-T100). Three compartments were sorted with BD FACSymphony S6 machine: epithelial (Ghost Red 710 dye⁻CD45⁻ or CD45⁻TdTomato⁺); non-T immune cells (Ghost Red 710 dye⁻CD45⁺CD11b⁺B220⁺) and T cells (Ghost Red 710 dye⁻CD45⁺CD11b⁻B220⁻CD3⁺TCRb⁺). The sorting gates and FACS data are shown in Supplementary Fig. 1. Subsequently, the sorted cell compartments from tumours from the both groups were used for high-quality DNA extraction and subsequent qPCR. FC was calculated by comparing individual dCT values to respective average WTY(Y⁺) dCt or average Y SCR dCt values.

To get substantial amount of viable T cells for DNA isolation from subcutaneous tumours, the tumours of one group were mashed and pooled together and then proceeded for sorting. However, after sorting, the T cells were aliquoted randomly to four tubes and then processed for DNA isolation, to increase the efficacy of the result and decrease the chances of human error. Therefore, even if each group had six tumours pooled together, the graph in Extended Data Fig. 10c (right panel) shows four dots of one colour.

RNA-seq of CD45⁻ cells from CRISPR Y-KO and Y-Scr tumours

Tumours derived from subcutaneous injection of CRISPR Y_KO and CRISPR Y-Scr cells into the flanks of C57BL/6N mice were removed and processed. Tumour-derived single-cell suspensions were subjected to FACS to isolate CD45⁻ cells as described above. Total RNA was extracted from the isolated cells using the RNeasy Plus Mini Kit with gDNA Eliminator columns (Qiagen), following the manufacturer’s protocol. RNA sequencing library preparation and sequencing were performed by Novogene. Quality assessment of RNA-seq data, including sequence, alignment and quantification metrics, was conducted using FastQC v.0.12.1 and summarized with MultiQC v.1.13. Illumina Truseq adaptor, polyA and polyT sequences were trimmed using Trimmomatic v.0.39. The trimmed reads were aligned to the mouse genome (GRCm39/mm39) using STAR aligner v.2.5.2b, with parameters aligned to the ENCODE long RNA-seq pipeline recommendations (https://github.com/ENCODE-DCC/long-rna-seq-pipeline). Gene-level expression was quantified using featureCounts v.1.5.3, using Ensembl gene annotations (release v.113) for both alignment and quantification.

Genes with low expression were filtered out by applying a threshold of sum of estimated counts (from featureCounts) of at least ten. Differential gene expression analysis was performed on filtered estimated read counts using the R Bioconductor package DESeq2 v.1.42.1, using a generalized linear model with a negative binomial distribution. Differentially expressed genes were identified based on a Benjamini–Hochberg adjusted P value < 0.05 and FC cut-off (≥2 or ≤−2). To validate the Y-KO-derived gene phenotype signature in human cancer data, we first excluded all Y-linked genes to prevent bias in the differential expression results. We then calculated a LOY gene phenotype score by dividing the signature scores of up-regulated genes by those of down-regulated genes, with scores scaled to a [0,1] interval.

BBN treatment and PBMC isolation

For the BBN experiment, 8-week-old mice (n = 3–4 mice per timepoint) were administered with 0.5% BBN water for 12 weeks. After 12 weeks, BBN was replaced with standard tap water. Mice were killed at 2, 4, 12, 20 and 25 weeks using isoflurane. Subsequently, blood was drawn directly from the heart using an ethylenediaminetetraacetic acid (EDTA)-prewet insulin syringe and collected in an EDTA Microvette. From each mouse, a range of 700 to 1,000 µl of blood was collected. To enhance yield and DNA quality, mice were pooled in each group. Pooled samples were diluted 1:1 (v:v) with PBS-EDTA 0.1 M and then stratified on Histopaque-1077 Ficoll (Sigma, catalogue no. GE17-5446-02) at a ratio of 3:1 (v:v). Samples were centrifuged for 30 min at 400 rcf, with acceleration and break ramps set to 0 to allow gradual phase separation. The resulting PBMC ring was collected and washed once with PBS (1X). A subsequent centrifugation of 5 min at 400 rcf (with acceleration and break ramps set to maximum speed) resulted in a pellet that was processed for DNA extraction.

Tissue microarray

A human BLCA tissue microarray (TMA) with 33 unique cases comprised of triplicate cores from each patient tumour with an individual core size of 1 mm was used. The TMA was comprised of both male (n = 18) and female (n = 15) patients. Cores from female patients were used as controls for FISH.

XY FISH staining

The unstained TMA formalin-fixed paraffin-embedded sections (4 μm) were baked at 55–60 °C overnight before subjecting the slide to the following steps on the Abbott VP2000 FISH Instrument: (1) deparaffination of the slide using xylene, (2) pre-treatment of the slide using 0.2 N HCl and 1 M NaSCN, (3) protease treatment with pepsin, (4) fixation in 10% buffered formalin and, finally, (5) dehydration in series of increasing concentration (70%, 85% and 100%) ethanol. The slide was then subjected to a co-denaturation step using ThermoBrite (melting temperature 73 °C, 5 min; hybridization temperature 37 °C overnight). Post hybridization the slide was washed twice with SSC/0.3% NP-4 72 ± 1 °C for 2 min and twice with 2× SSC/0.3% NP-4 15–30 °C for 1 min. Finally, the slide was counterstained with nuclear DAPI before sealing with coverslip for visualization. The fluorescence tags were as follows: CEPX Xp11.1-q11.1—Spectrum Green (excitation, 497 nm; emission, 524 nm) CEPY Yp11.1-q11.3—Spectrum Orange (excitation, 559 nm; emission, 588 nm) and 18S RNA probe—Spectrum Aqua (excitation, 433 nm; emission, 480 nm).

IHC staining

Formalin-fixed paraffin-embedded samples were sectioned at 4-μm thickness onto Superfrost Plus slides (Fisher Scientific, catalogue no. 12-550-15). IHC staining was performed on the Ventana Discovery Ultra Instrument (Roche) as described⁹⁰. After applying antigen retrieval buffer (CC1 (Tris, pH 8.0) (Roche Ventana, catalogue no. 950-124), CD45 primary antibody (Cell Signaling, catalogue no. 13917S, rabbit monoclonal) was applied. Primary antibody was diluted antibody dilution buffer (Roche Ventana, catalogue no. ADB250) for 1 h at room temperature: anti-CD45 (1:500). DISC anti-Rabbit HQ (Roche Ventana, catalogue no. 760-4815) was then applied for chromagen staining. After DAPI nuclear counterstain, the tissue area was covered coverslipped and mounted with ProLong antifade medium (Invitrogen, catalogue no. P36984).

Whole-slide imaging

XY FISH immunofluorescence slides were scanned using the ZEISS Axio Scan.Z1 whole slide scanner at ×20 magnification (Plan-Apochromat lens (numerical aperture, 0.8; M27)). TMA tissue cores were outlined with permanent marker pen on the coverslip for tissue detection. Region of scan was generated by a polygon tool and the raw focus map was generated using the ‘every-2-tiles’ strategy (z-range 150 µm, 21.04-µm step size) under ×5 lens (Fluor ×5/0.25 M27), while a fine focus map was generated using onion skin (z-range 100 µm, 2.06-µm step size, 0.1 density, 24 maximum number of points). Both focus maps were generated in the DAPI channel at 2% LED intensity, 50-ms exposure time. Spectrum Green (X probe) was excited at 495-nm wavelength (5% LED intensity, 150-ms exposure time) and detected at 500–550-nm bandwidth. Spectrum Orange (Y probe) was excited at 548-nm wavelength (25% LED intensity, 150 ms exposure time), detected at 570–640-nm bandwidth. Nuclear DAPI fluorescence was excited at 420-nm wavelength (2.5% LED intensity, 50-ms exposure time), detected at 430–470-nm bandwidth. Spectrum Aqua (18S RNA probe) was excited at 434-nm wavelength (5% LED intensity, 150-ms exposure time) and detected at 460–500-nm bandwidth. All signals were detected by Hamamatsu Orca Flash camera and 16-bit depth format image setting was applied. Standard IHC slide was imaged using the Leica Aperio AT2 whole slide scanner at ×20 magnification.

Image quantitation and analysis: HALO AI module

The whole slide image obtained from Zeiss Axio scan system was imported into HALO AI Module (Indica Labs), v.4.0.5107.318, for analysis. Upon import, the TMA image underwent segmentation to identify individual tissue cores. Missing cores were identified and removed from the analysis. The remaining cores were processed using the Nuclei Segmentation AI module, following the manufacturer’s guidelines. For AI training, seven distinct regions of interest were selected, comprising a total of 43 nuclei, to refine the Nuclei Segmentation plugin. Following segmentation, FISH analysis was conducted using the HALO FISH module, v.3.2.3, to detect nuclear signals. The resulting data were exported as.csv files, containing object-level (cell) data for subsequent analysis.

Validation of LOY correlation via FACS-sorted scRNA-seq data

To further validate the results and minimize the impact of mis-annotating LOY epithelial cells as LOY immune cells, we analysed 21 CD45-based FACS-sorted samples from three independent public scRNA-seq datasets (HNSC, GSE182227 (ref. ³²); CHOL, GSE171899 (ref. ³³) and BLCA, GSE211388 (ref. ³⁴)). This collection included 12 matched CD45⁺ and CD45⁻ samples from six tumours. Detailed dataset and sample information are provided in Supplementary Table 7. The same quality control, normalization and batch correction procedures described for the pan-cancer human scRNA-seq datasets were applied. CD45 expression was validated to ensure the purity of the FACS-selected samples. LOY cells were predicted using a Random Forest model.

To further validate the accumulation of LOY immune cells in tumours, we analysed two additional datasets: RCC dataset³⁶ (accession number, EGAD0001008030), comprising 14 matched tumour (also included in the pan-cancer datasets) and blood samples. HNSC dataset (accession number, GSE139324 (ref. ³⁷)), containing 38 matched TIL and PBMC samples. Detailed dataset and sample information are presented in Supplementary Tables 4 and 7. The same quality control, normalization, batch correction and LOY cell prediction methods were applied. Additionally, cell types were identified using the scANVI algorithm and marker gene expression, with annotated pan-cancer scRNA-seq datasets serving as references.

Long-term in vitro T cell stimulation assay

Mouse CD8⁺ T cells were isolated from spleens of C57BL/6N mice using the Mouse CD8⁺ T Cell Isolation kit (Miltenyi Biotec, catalogue no. 130-104-075). CD8⁺ T cells were then activated by seeding onto six-well plates coated with 1 µg ml⁻¹ anti-CD3 (clone 2C11, BioLegend, catalogue no. 100302) and anti-CD8 (clone 37.51, BioLegend, catalogue no. 102102). T cells were cultured in RPMI supplemented with 10% fetal calf serum, 1% penicillin–streptomycin, 50 µM 2-mercaptoethanol, 1% insulin transferrin sodium selenite as well as 0.1 µg ml⁻¹ IL-7 (Peprotech, catalogue no. 217-17) and IL-2 (Peprotech, catalogue no. 212-12). T cells were kept at no higher than 1 × 10⁶ml⁻¹ and transferred to new coated plates every 2–3 days to maintain activation. DNA and RNA was isolated at the timepoints indicated using the Monarch Genomic DNA Purification Kit (New England Biolabs, catalogue no. T3010L) or RNeasy Plus Mini Kit with gDNA Eliminator (Qiagen, catalogue no. 74134), respectively. CDNA was generated with Maxima H Minus cDNA Synthesis Master Mix (Thermo Fisher, catalogue no. M1662) followed by qPCR. Data were normalized to day 1.

Quantitative PCR

Genomic DNA (10 ng per reaction) was used to detect and quantify Y‐chromosome-specific genes associated with LOY signatures—Kdm5d, Uty, Eif2s3y, Ddx3y, Ssty1, Ssty2 and Zfy1/2—as well as to detect the presence of Cas9 in immune and epithelial compartments. Housekeeping genes (B2m, Gapdh and Actb) served as endogenous controls. All qPCR reactions were carried out using SYBR Green Universal Master Mix (Applied Biosystems, catalogue no. 4309155) on a Quant Studio 6 Flex Real-Time PCR system (Applied Biosystems). To assess Y chromosome copy number in various TIL immune cell populations (T cells and non-T cells), we compared cycle threshold (Ct) values from each sorted population against Ct values from wild-type male tumour cells, using the ΔΔCt method to calculate FC. For Cas9 detection, Ct values in sorted immune populations were compared with DNA from wild-type C57Bl/6N mice, which have no Cas9 integration, using Gapdh and Actb. Absence (or near-absence) of Cas9 amplification in immune compartments confirmed that these cells were not contaminated by, or had phagocytosed, genome-edited epithelial cells (Supplementary Fig. 1).

Details of primer sequences used in this manuscript for DNA qPCR are as follows:

Uty forward, TCACCCTCTTCAGCCATTTC; reverse, GTTCTCATGCCCTTCTCCATTA

Kdm5d forward, CTGCAAGATGGCTGCATTTC; reverse, TCGCTCCTCCTGTACCATAA

Ddx3y forward, AGCAGATTCAGTGGAGGATTT; reverse, CCACTACTTCGGCTGCTATT

Eif2s3y forward, CGTTATGCCGAGCAGATAGAA; reverse, CCGTCTCAGTAGGAAGTAGGA

Sssty1 forward, TGAAGAAGAGGAGGAGGAAGT; reverse, TTGGGTGACAGGCTCATTAC

Ssty2 forward, GGTGCCATTCTTACAGGACTAT; reverse, GTGGAGGTTACCTTCCTTGTAG

Zfy1/2 forward, CACCAAGAAAGCAGAACACATC; reverse, GCCTTTGTGTGAACGGAAATTA

Gapdh forward, AACAGCAACTCCCACTCTTC; reverse, CCTGTTGCTGTAGCCGTATT

Actb forward, ACCCAGGCATTGCTGACAGG; reverse, GGACAGTGAGGCCAGGATGG

B2m forward, ACAGTTCCACCCGCCTCACATT; reverse, TAGAAAGACCAGTCCTTGCTGAAG

Cas9 forward, CCCAAGAGGAACAGCGATAAG; reverse, CCACCACCAGCACAGAATAG

RNA primers:

Pdcd1 forward, CGGTTTCAAGGCATGGTCATTGG; reverse, TCAGAGTGTCGTCCTTGCTTCC

Havcr2 (Tim3) forward, GTATCCTGCAGCAGTAGGTC; reverse, CCCTGCAGTTACACTCTACC

Ctla4 forward, GTACCTCTGCAAGGTGGAACTC; reverse, CCAAAGGAGGAAGTCAGAATCCG

Tcf7 forward, CCTGCGGATATAGACAGCACTTC; reverse, TGTCCAGGTACACCAGATCCCA

Gapdh forward, ATGCCTCCTGCACCACCAACT; reverse, ATGGCATGGACTGTGGTCATGAGT

Actb forward, ACCCAGGCATTGCTGACAGG; reverse, GGACAGTGAGGCCAGGATGG.

FC values were calculated using the ΔΔCt method relative to the appropriate wild-type controls. Data were used to quantify the relative copy number of Y chromosome genes in tumour‐derived immune subsets or to confirm the absence of Cas9 in cells not genetically engineered.

Validation of T cell exhaustion impact on LOY via scRNA-seq

The publicly available processed scRNA-seq dataset from Giles et al.³⁹ was analysed to investigate the impact of chronic stimulation on the stability of the Y chromosome in T cells. This dataset included gp33-specific CD8⁺ T cells from TCR-transgenic mice subjected to acute (LCMV Armstrong) and chronic (LCMV Clone 13) LCMV infections. Chronic stimulation of T cells was validated through the upregulation of canonical exhaustion markers, including Tox, Pdcd1 and Ctla4, and the downregulation of Tcf7. To evaluate LOY, the expression levels of Y-linked genes (Uty, Kdm5d, Ddx3y, Usp9y) were analysed.

Cell-type-specific gene signatures for deconvolution analysis

To generate cell-type-specific gene signatures for LOY_SCR/WTY_SCR epithelial cells, CD4⁺ T cells, and CD8⁺ T cells from scRNA-seq data, we conducted differential analysis using the ‘sc.tl.rank_genes_groups’ function of the Scanpy (v.1.9.5) package. This analysis used Wilcoxon rank sum (Mann–Whitney U) tests to identify significant differences across each LOY_SCR and WTY_SCR cell type. We first identified genes significantly up-regulated (log₂FC > 1, adjusted P value < 0.05) in the LOY_SCR versus WTY_SCR epithelial cells, CD4⁺ T cells and CD8⁺ T cells separately. To establish unique signatures for each cell type, we then excluded genes expressed in more than 15% of any other LOY_SCR or WTY_SCR cell type. We then performed deconvolution on normalized bulk expression data from TCGA cancer types using the ssGSEA algorithm, evaluating the relationship of these signatures with patient outcome.

Survival analysis

Time-to-event outcomes were presented by using Kaplan–Meier curves and compared by using log-rank test or univariate Cox proportional hazards model (survival R package; v.3.5.8) as noted in each figure. Two multivariable Cox proportional hazards models were fitted, each as a function of (1) YchrS with ancestry and race as known risk factors and confounders; (2) LOY signatures scRNA-seq signatures, including LOY_SCR CD4⁺ T cell, LOY_SCR CD8⁺ T cell, and LOY_SCR epithelial cell signatures, with age as known risk factors and confounders. Hazard ratio along with 95% CI based on multivariable Cox proportional hazards models were reported. The function surv_cutpoint from the survminer R package (v.0.4.9) was used to determine the optimal cut-off value for the LOY_SCR signatures in relation to the time-to-event outcome. This method uses maximally selected rank statistics from the maxstat⁹¹ R package (v.0.7-25) to classify two groups (low- versus high-risk) based on the optimal cut-point. Moreover, continuous variables included as covariates in the Cox proportional hazards model were evaluated⁹². Linearity was assessed to ensure model adequacy.

Development and validation of prognostic nomogram

According to clinical risk factors and risk scores of multivariate Cox regression coefficients for Extended Data Table 2, a prognostic nomogram was established using the ‘rms’ R software package (v.6.8-0), and the prediction accuracy of the nomogram was assessed using the calibration curve to evaluate the match between expected and observed events at 2, 5 and 8 years.

Ethics statement

Human samples

All human specimens and associated data were collected following protocols approved by the Institutional Review Board (IRB protocol 43021) at Cedars-Sinai Medical Center, adhering strictly to the Declaration of Helsinki guidelines. Written informed consent was obtained from each participant or their legal guardian. Detailed information regarding patient recruitment, sample collection (including TMA preparation), and data management can be found within IRB protocol 43021.

Animal studies

All animal procedures were reviewed and approved by the Institutional Animal Care and Use Committee (IACUC protocol 8253) at Cedars-Sinai Medical Center. Experiments were conducted in strict accordance with the guidelines specified in the National Institutes of Health (NIH) Guide for the Care and Use of Laboratory Animals. Protocol 8253 comprehensively describes animal housing conditions, care standards and experimental methodologies. All animal experiments were performed in accordance with institutional IACUC protocols. Mouse were housed under standard conditions with a 12-h light/12-h dark cycle, temperatures maintained between 68 °F and 79 °F (20–26 °C), and relative humidity between 30% and 70%.

Statistical analysis

All analyses were conducted using R (v.4.3.1) and Python (v.3.10.9). Before commencing tests, data were assessed for normality using the Kolmogorov–Smirnov test, followed by Bartlett or Levene tests to evaluate homogeneity of variances. For normally distributed variables, the unpaired Student’s t-test (Stats R package; v.4.3.1) was applied, whereas non-normally distributed variables were analysed using Wilcoxon rank sum tests. The correlation between paired variables was assessed using Spearman’s correlation coefficients. Data presentation and multiple comparison corrections are as stated in figure legends. Statistical significance was considered when P values were less than 0.05, including adjusted P values. Discovery analyses involving more than 20 comparisons underwent multiple testing correction using the p.adjust function in R or multipletests function in Python, applying the Benjamini–Hochberg method to control the false discovery rate at 0.05. To compare ROC curves, we used the roc.test function from the pROC R package (v.1.18.5). This allowed us to assess differences between AUC of YchrS and the AUC of YwholeS or LOY_DNA. Python packages such as Scanpy (v.1.9.5), Pandas (v.2.0.0), Statsmodels (v.0.14.0), NumPy (v.1.24.2), Scipy (v.1.10.1), Matplotlib (v.3.8.0), Seaborn (v.0.11.2) and Sklearn (v.1.3.2), were used for data analysis. The R package ComplexHeatmap (v.2.11.1) was used to generate heat maps, and visualization was facilitated using ggplot2 (v.3.3.5), ggpubr (v.0.6.0), ggrepel (v.0.9.2), Statannot (v.0.6.0), Circlize (v.0.4.16), GseaVis (v.0.0.5), Enrichplot (v.1.22.0), GridExtra (v.2.3.0), Pheatmap(v.1.0.12) and DEGreport (v.1.38.5) R packages. For data manipulation, Readr (v.2.1.5), Readxl (v.1.4.3), Dplyr (v.1.1.4), Plyr (v.8.9), Apeglm (v.1.24.0), Tidyr (v.1.3.1), Tidyverse (v.2.0.0), Tibble (v. 3.2.1), Iranges (v.2.36.0), Biobase (v.2.62.0), BiocGenerics (v.0.48.1), Lubridate (v.1.9.3), Stringr (v.1.5.1) and AnnotationDbi (v.1.64.1) were used for analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.