A disease model resource reveals core principles of tissue-specific cancer evolution

Cell line collection, characterization, maintenance and dissemination

MCCA lines were generated either in-house, provided by collaborators or obtained from public repositories. Details of the original source (laboratory or commercial vendor) of each MCCA line are provided in Supplementary Table 1. In case a cell line has been published previously, the original research article for each individual MCCA line is referenced in Supplementary Table 1. MCCA lines were maintained under cell-line-specific conditions. For each MCCA line, the medium composition (basal medium, growth factors and so on) and cell culture requirements (adherent/suspension, extracellular matrix/coating) are provided in Supplementary Table 1.

To ensure the quality, utility and long-term preservation of MCCA cell lines, a strategy for MCCA handling and maintenance was established that covers several aspects. First, rigorous quality controls are performed, ranging from regular mycoplasma testing and human DNA detection PCRs to regenotyping of mouse alleles to prevent contamination or misidentification of cell lines. Moreover, the recombination of genetically engineered alleles was tested in cell lines directly after their isolation from mice to detect potential fibroblast contaminations (which were removed through differential trypsinization). Second, standardized protocols for the culture and maintenance of cell lines were implemented, encompassing (1) the assessment of cell density before molecular characterization; (2) the propagation of cell lines using splitting ratios adapted to proliferation rate; and (3) the amplification of cell lines from the same pool of original cells to minimize variation between batches of cells. Third, comprehensive phenotypic and molecular characterization of cell lines were performed using standardized analytical approaches. To ensure a high quality of analyses, computational approaches optimized for the mouse were applied¹³. Fourth, to ensure secure long-term preservation of the MCCA resource, backup vials for each line have been archived at at least two independent locations. Finally, all information relevant for the request of MCCA lines can be found on the ‘Resource availability’ page at www.mcca.tum.de.

All characterization data generated as part of MCCA are publicly available through a mouse-specific cBioPortal instance (branched from main v.3.7.1) at www.mcca.tum.de.

Animal cohorts and experiments

Mice were housed under specific-pathogen-free conditions in groups of up to five animals per cage under a 12 h–12 h light–dark cycle at 21–22 °C temperature and 45–65% relative humidity, and supplied ad libitum with standard chow and water. Female and male mice were randomly submitted to respective tumour cohorts. The maximal tumour size/burden permitted by the IACUC and the local authorities (Regierung von Oberbayern) is 1.5 cm in diameter, which was not exceeded in our study. All animal studies were conducted in compliance with European guidelines for the care and use of laboratory animals and were approved by the Institutional Animal Care and Use Committees (IACUC) of the Technische Universität München, Regierung von Oberbayern and the UK Home Office. All genetically engineered mouse alleles included in this study are referenced in Supplementary Table 1.

Histopathological analyses

For histological characterization, 2-μm-thick specimens from formalin-fixed paraffin-embedded material were routinely stained with haematoxylin and eosin (H&E), scanned (Leica LAS X, v.3.7.5.24914; Leica Aperio ImageScope, v.12.4.3.5008) and submitted to at least two veterinary pathologists experienced in comparative cancer pathology in mouse models. Histomorphological evaluation and tumour grading were performed according to the guidelines of the MMHCC (Mouse Models of Human Cancers Consortium (NIH/National Cancer Institute)) and organ-specific INHAND classifications⁵⁷. If required, immunohistochemistry was performed to validate H&E-based histopathological classification.

gDNA and RNA isolation

Cells were cultured according to the conditions described in Supplementary Table 1. Isolation of genomic DNA (gDNA) and RNA was conducted according to the manufacturer’s instructions using the DNeasy Blood & Tissue Kit (Qiagen) and the RNeasy Kit (Qiagen), respectively. For gDNA isolation, frozen cell pellets were used. For the collection of RNA, cells were cultured in the corresponding culture medium and immediately lysed with RLT buffer (Qiagen) containing β-mercaptoethanol and homogenized with QIAshredder columns (Qiagen) before proceeding with RNA isolation. gDNA and RNA concentrations were determined using a Qubit fluorometer (Thermo Fisher Scientific).

Genomic sequencing of MCCA lines

Whole-exome sequencing (WES) was performed using 450 ng of gDNA from mouse cell lines and matched normal samples from mouse tail biopsies. Coding exons were enriched by whole-exome pull-down using the Agilent SureSelect XT Mouse All Exon Kit according to the manufacturer’s instructions and sequenced on the NovaSeq 6000 (Illumina) system.

Low-coverage whole-genome sequencing (lcWGS) was performed using 200 ng of gDNA from mouse cell lines and matched normal samples from mouse tail biopsies when available. Libraries were prepared using the TruSeq DNA Nanokit (Illumina) according to the manufacturer’s instructions. The resulting libraries were analysed on the 2100 Bioanalyzer instrument (Agilent Technologies) and sequenced on the NextSeq 550 (Illumina) or NovaSeq 6000 (Illumina) system.

Analysis of genomic sequencing data

The analysis of WES data from mouse tumour–normal sample pairs was performed according to the GATK best practice suggestions. The established MoCaSeq analysis pipeline (v.0.4.54)¹³ was used for processing all samples. Raw BCL files were converted into demultiplexed FASTQ files using bcl2fastq (v.2.20.0.422). Raw sequencing reads were trimmed using Trimmomatic (v.0.39)⁵⁸, removing leading and trailing bases with Phred scores below 25 and reads with less than 50 nucleotides. Moreover, an average base quality of 25 was enforced with a sliding window of 10 nucleotides for the reads. Passing reads were then aligned to the GRCm38.p6 reference genome using BWA-MEM (v.0.7.17)⁵⁹ with the default settings. The mapped reads were processed using samblaster (v.0.1.26)⁶⁰, sambamba (v.0.7.0)⁶¹ and Picard tools (v.2.20.0). Mutect2 from the GATK toolkit (v.4.2.0.0)⁶² was used to call indels and somatic mutations with the default settings. Variants were filtered for read orientation artefacts using GATK. For each tumour sample, the corresponding normal sample was used to filter germline variants. Moreover, candidate somatic mutations were filtered for SNPs by excluding variants listed in the Wellcome Trust Sanger Mouse Genome Project SNP database (v5) (ENA study PRJEB11471). Furthermore, somatic mutations were filtered if (1) the read coverage was below 5 in both the control and tumour; (2) the variant allele frequency (VAF) was below 5%; and (3) the number of reads carrying the variant was below 2 in the tumour sample and equal to 1 or 0 in the normal sample. Annotation of somatic variants was performed using SNPeff (v.4.3)⁶³. SNVs with a low predicted impact as well as variants at non-exonic sites were excluded from further analysis. DNA tumour/normal copy ratios were determined using CNVKit (v.0.9.9)⁶⁴. The copy-number calling was performed using the batch command of the CNVKit pipeline for read coverage estimation, normalization and segmentation. The probe regions of the Agilent SureSelect XT Mouse All Exon Kit were used as on-target regions.

The analysis of WES and WGS data from human tumour/normal sample pairs was performed based on a modified version of MoCaSeq adapted to the human genome. Raw sequencing data were obtained from TCGA (dbGAP study phs000178.v11.p8) and the ICGC PanCuRx study (EGA studies EGAD00001003585, EGAD00001004551, EGAD00001006081 and EGAD00001006152). Reads were aligned to the GRCh38.p12 reference genome. Variants were called using Mutect2 from the GATK toolkit (v.4.2.0.0)⁶². Genes were annotated using SNPeff (Ensembl 92) and Gencode (v.31)⁶⁵. Potential somatic variants were filtered for SNPs by excluding SNVs listed in the GnomAD (>1%)⁶⁶ and dbSNP (>5%)⁶⁷ database. The copy ratios were determined by using the batch command of CNVKit (v.0.9.9) in WGS mode and Agilent SureSelect Human All Exon V7 exon probes (S31285117) as on-target regions. For the detection of microsatellite instability (MSI), MSIsensor (v.0.5)⁶⁸ was run with the default parameters and using the GRCm38.p6 microsatellites data provided by the authors (but no evidence for MSI was found in the MCCA; Supplementary Table 6).

The lcWGS data were analysed analogously to the WES data regarding trimming, mapping and postprocessing. Copy-number calling was performed with CNVKit⁶⁴ using the whole-genome sequencing mode combined with the Agilent SureSelect XT Mouse All Exon Kit exon probe regions as on-target regions, according to the CNVKit best practice suggestions. Postprocessing and data visualization were performed in R (v.4.4.1) using data.table (v.1.14.8), ggplot2 (v.3.4.2), pheatmap (v.1.0.12) and ComplexHeatmap (v.2.16.0).

Purity correction and gene allele state analyses

To determine purity and ploidy values for cancer tissues, we reanalysed TCGA cancer tissue samples using ABSOLUTE⁶⁹ (v.1.0.6) with the default parameters and total copy number as well as mutation data as input. Purity and ploidy estimates from ABSOLUTE were reviewed and curated based on manual inspection of copy-number profiles and SNP/SNV frequencies for each sample and compared to the purity and ploidy values as provided by the PanCanAtlas (https://gdc.cancer.gov/about-data/publications/pancanatlas). The consensus genomic tumour purity value of each sample was then used to bioinformatically adjust the VAF of SNVs as well as the copy ratio of CNV segments as they would be detected in pure cancer cells. The purity and ploidy values for cancer tissues of the ICGC PanCuRx cohort were estimated similarly to as described above (including manual review of purity/ploidy solutions and bioinformatical purity adjustment of SNV VAFs and CNV copy ratios). Only human cancer tissues with KRAS exon 2 hotspot mutations (G12* or G13*; referred to as KRAS^MUT) were considered for further downstream analyses. Moreover, human cancers with a purity below 20% were excluded from the analyses of KRAS, CDKN2A and TP53 allele states to ensure robust detection of allelic imbalances and/or deletions. Estimating tumour purity and gene allele state based on genomics data from bulk tissue samples can be complicated by purity–ploidy ambiguity (multiple possible combinations matching a single genomic profile), intratumour heterogeneity (subclonal copy-number alterations appear with attenuated log₂ ratios) and low tumour content (increasing stroma and immune cell admixture masking tumour-derived signals). These potential confounders were accounted for in the data analysis by manual review of all candidate purity–ploidy solutions through two independent experts, by evaluation of allelic imbalance in the dominant tumour clone and by exclusion of samples with tumour purity below 20%, respectively. Given the typical low purity of pancreatic cancer tissue samples, analyses of gene allele states in this entity were primarily performed using the ICGC PanCuRx cohort, which encompasses genomics data generated from laser microdissected pancreatic cancer tissues (as opposed to the TCGA-PAAD dataset).

For microdissected MCCA tissues, purity correction of Kras^G12D VAFs was performed based on the quantification of non-recombined Kras^LSL-G12D alleles. Custom-designed TaqMan quantitative PCR (qPCR) assays were used to determine Kras^LSL-G12D allele and total Kras locus copy (Kras^Copy) quantities (Supplementary Table 19). Kras^LSL-G12D quantities were normalized to Kras^Copy to account for potential copy-number changes at the Kras locus in cancer cells. Normalized Kras^LSL-G12D values directly reflect stroma contamination and were used for bioinformatical purity adjustment of tissue-based Kras^G12D VAFs. For this, contaminating stroma reads were subtracted from tissue-based amplicon-based next-generation-sequencing data of the Kras locus to finally obtain pure Kras^G12D VAFs.

For the analysis of gene allele states, processed VAFs and copy number ratios were integrated. Purity-adjusted VAF and copy ratio (CR) values were used for cancer tissues, but not cell lines. For analyses of PAAD, NSCLC and COAD cohorts of the CCLE dataset⁶, processed VAF and CR data were downloaded from the cBioPortal study, Cancer Cell Line Encyclopedia⁶ (https://www.cbioportal.org/study/summary?id=cellline_ccle_broad). Kras^G12D (mouse) and KRAS^MUT (human) allelic states were classified using the following thresholds for VAF and CR: dGD (0.05 ≤ VAF < 0.4), HET (0.4 ≤ VAF < 0.61), gain (VAF ≥ 0.61 and 1.3 ≤ CR < 2.8), amp (VAF ≥ 0.61 and CR ≥ 2.8) and LOH (VAF ≥ 0.61 and CR < 1.3). Cdkn2a (mouse) and CDKN2A (human) allelic states were classified as follows: WT (VAF = 0 and CR ≥ 0.87), HET (0 < VAF < 0.85 or 0.19 < CR < 0.87) and HOM (VAF ≥ 0.85 or CR ≤ 0.19).

Analysis of TMB and CNV load

For the analysis of TMB in MCCA, cell lines with available WES data of the cancer and matched normal control were used, resulting in a final set of 190 samples from the intestine, liver, lung, pancreas and stomach. Somatic mutations were retained if they met the following criteria: (1) read coverage ≥10 at the variant site in both tumour and matched normal sample; (2) VAF ≥ 10%; (3) ≥3 variant-supporting reads in the tumour; and (4) no variant-supporting reads in the matched normal. Protein-coding exon coordinates were obtained from the GENCODE M25 annotation (Mus musculus), including all exons from protein-coding transcripts. Overlapping regions were collapsed to generate a non-redundant, non-overlapping set of exonic intervals representing the protein-coding exome. Mutations outside these regions were excluded. TMB was calculated as the number of all protein-coding exonic mutations divided by the total exonic length (in megabases). A detailed description for the analysis of the ‘effective’ pTMB in different transplantation scenarios is provided in the ‘Immunophenotyping’ section. For the TMB analyses in human cancer cell lines, pre-processed somatic mutation calls were retrieved from DepMap 24Q4 (https://doi.org/10.25452/figshare.plus.27993248.v1; file, OmicsSomaticMutations.csv)⁵⁰ for the same set of CCLE samples of ref. ⁷ that was also used for the cross-species comparison of MCCA and CCLE transcriptomes. Samples corresponding to the same tissue types as those selected in MCCA were retained, resulting in 336 CCLE samples. The exome was defined using GENCODE v38 (Homo sapiens) according to the same collapsing strategy as for MCCA. As matched normal samples are not available for CCLE, mutations were retained if they met the following criteria: (1) read coverage ≥ 10; (2) VAF ≥ 10%; and (3) ≥3 variant-supporting reads in the tumour. TMB was computed as above (and pTMB as described in the ‘Immunophenotyping’ section). For TCGA samples, WES data were obtained for tumours and matched normal controls as described above. Samples with annotated low quality or low tumour purity estimates (<40%) were excluded. To avoid patient-level redundancy, when multiple samples were available from the same individual, one sample was randomly selected. Samples matching the selected MCCA tissue types were retained, yielding 1,551 samples. For the selected samples, somatic mutation calls were obtained from the GDC data portal and VAFs were purity-corrected as described below. Mutations were filtered using the same criteria as for MCCA, and the exome was defined as for CCLE. TMB was calculated using the same approach.

For the analysis of CNV load in MCCA, copy-number profiles generated by CNVkit (see the ‘Analysis of genomic sequencing data’ section) were obtained for 590 MCCA cell lines as determined by WES (n = 200) or lcWGS (n = 390). Tissues represented by at least five MCCA lines were selected for downstream analyses, including soft tissue, bile duct, intestine, liver, lung, pancreas, stomach, lymphoid T, lymphoid B, myeloid, nervous system and oesophagus, resulting in a final set of 562 samples. CNV load was calculated as the percentage of the autosomal genome affected by copy-number alterations. In detail, genomic segments with an absolute log₂-transformed copy ratio of >0.2 were defined as altered, and their total length was divided by the total autosomal genome length and multiplied by 100. For the CCLE cell lines reported in ref. ⁷, preprocessed copy-number segment profiles were obtained from cBioPortal¹⁵. Samples corresponding to the same tissue types as those selected in MCCA were retained, resulting in 606 samples. CNV load was computed using the same approach as for MCCA using R (v.4.4.1) and with data.table (v.1.14.8). For TCGA, pre-processed copy-number segment data were obtained as described above and samples with annotated low quality or low tumour purity estimates (<40%) were excluded. When multiple samples were available from the same individual, one was randomly selected. Samples matching the selected MCCA tissue types were retained, yielding 1,962 samples. Copy-number ratios for the selected samples were retrieved from the GDC portal and purity-corrected as described below. CNV load was calculated as described for MCCA and CCLE.

The weighted genome instability index (wGII) was determined as described previously⁷⁰. wGII estimates copy-number instability based on the proportion of the genome with aberrant copy number compared with the median ploidy, weighted on a per chromosome basis; and correlates with copy-number instability in cancer cell lines.

Immunophenotyping

For the analysis of mouse genetic background (strain) from genomic sequencing data, a library of strain-specific signature SNPs was first established. For this, a total of 21,923,209 SNPs was extracted from the whole-genome sequencing catalogue of the Mouse Genomes Project, which comprises 29 widely used inbred mouse strains²⁰. Pearson correlation of SNP patterns was calculated for each pairwise inbred strain combination. Correlation values were used for hierarchical clustering to assign the 29 inbred mouse strains into 15 genealogically related strain groups. Strain-specific marker SNPs were required to be unique to a particular strain group but were allowed to be shared within the same group. In total, 1,097,314 genome-wide signature SNPs were determined. This library of signature SNPs was then used to infer the percentage strain composition of MCCA lines from genomic sequencing data. For this purpose, the genome was binned into consecutive 10 Mb bins. Strain-specific signature SNPs were assigned to each bin according to their genomic position. Genomic sequencing data were used for performing variant calling and SNP detection in MCCA lines. The list of identified SNP variants was matched against the library of strain-specific signature SNPs. An enrichment score was calculated for each bin per strain, based on the number of matching SNPs and normalized to the number of expected SNPs. The length of all bins with a sufficient enrichment score was summed to derive a genome-wide enrichment score for each of the 29 inbred strains. Genome-wide enrichment scores for the top three highest scoring strains were reported for each MCCA line (Supplementary Table 5).

The backcross status of MCCA lines was estimated from sequencing data (genomic strain composition; Supplementary Table 5) based on the following equation:

$${\mathrm{cBS}}_{{N}}={\log }_{2}\,\left(\frac{2\times 100 \% }{\sum {\mathrm{FS}}_{ \% }-\sum {\mathrm{AS}}_{ \% }}\right)$$

where cBS_N is the computational backcross status, $\sum {{\rm{FS}}}_{ \% }$ is the sum of foreign strain contribution and $\sum {{\rm{AS}}}_{ \% }$ is the sum of allele-clustered SNP contribution). Three specific considerations are accounted for in the equation. Related to ${\log }_{2}()$: each successive backcross generation reduces the genetic contribution of the original strain background by 50% in the genome of the backcrossed mouse. Backcross generation can therefore be inferred by calculating the log₂ from the fraction of the original strain background remaining in the backcrossed mouse genome. Related to $(2\times 100 \% )$: Both copies of a diploid genome (the maternal and paternal) need to be backcrossed and considered in the analyses. Related to $(\sum {{\rm{FS}}}_{ \% }-\sum {{\rm{AS}}}_{ \% })$: SNPs clustered around non-congenic alleles withstand backcrossing and thereby confound the calculation of backcross status. We therefore filter SNPs found in close genomic proximity to non-congenic mouse alleles. To this end, the genetic background (strain) of the ES cell, in which a mouse allele was originally engineered, was annotated for the mouse alleles present in each individual MCCA line (Supplementary Table 1). This information was then used to remove SNPs corresponding to the strain of a given allele within a 25 Mb window of its genomic integration site (for mouse alleles with unknown genomic integration sites, a value of 1% per allele was subtracted from the total enrichment score of the corresponding genetic background). This procedure was performed for all genetically engineered alleles of a given MCCA line to finally compute allele-adjusted enrichment scores for each detected strain contribution (Supplementary Table 5).

For MHC haplotype analysis from genomic sequencing data, a library of MHC-specific signature SNPs was generated first. For this, the MHC locus was divided into 6 gene clusters (H2-K, -A, -E, -D, -Q and -T) based on their MHC subclass assignment (class I or II, classical or non-classical). The identification of MHC-specific signature SNPs was performed in a way that is comparable to the identification of strain signature SNPs but for each H2 gene cluster individually. First, a total of 375,097 MHC SNPs was derived from the whole-genome sequencing catalogue of the Mouse Genomes Project, comprising 29 inbred mouse strains²⁰. In the second step, Pearson correlation of SNP patterns with hierarchical clustering was used to define MHC haplotypes for each MHC gene cluster individually. Identified H2 gene cluster haplotype groups were verified using existing MHC haplotype information, if available (https://www.imgt.org/IMGTrepertoireMH/Polymorphism/haplotypes/mouse/MHC/Mu_haplotypes.html). For each individual H2 gene cluster, signature SNPs were required to be exclusive to a particular haplotype group but were allowed to be shared within the same group. In total, 44,219 MHC gene cluster signature SNPs were identified. This panel of signature SNPs was finally used to determine H2 gene cluster-specific MHC haplotypes for MCCA lines based on genomic sequencing data. Such as for the genome-wide strain analysis, MHC gene clusters were binned (here into segments of 1 Mb size) and the enrichment score for each bin was calculated. On the basis of the enrichment scores, the MHC haplotype was assigned to each of the 6 H2 gene clusters, which, in combination, defines the full MHC haplotype of an individual MCCA line (Supplementary Table 5).

For sex analysis, the number of uniquely mapped reads was calculated for each chromosome using samtools coverage (v.1.17). Next, the ratio of Y-chromosome-specific read counts over the sum of all autosome-specific read counts was calculated. On the basis of samples with available sex annotation, a cut-off of 0.05 was selected for the Y chromosome mapping ratio to assign male (≥0.05) or female (<0.05) sex (Supplementary Table 5).

For the quantification of the ‘effective’ pTMB of MCCA lines in defined immunocompetent transplantation scenarios, information on genetic background (strain), MHC haplotyping and TMB were integrated. The TMB was analysed as described in the ‘Analysis of TMB and CNV load’ section, but with two modifications. First, variants detected in the tumour were not filtered for their presence or absence in the matched normal sample, as somatic and germline variants need to be considered for this type of analysis. Second, as only non-synonymous (protein-altering) variants can be potentially immunogenic in immunocompetent settings, the TMB output was filtered for variants with a predicted impact classified as MODERATE or HIGH based on the Ensembl Variant Effect Predictor (referred to as the TMB of protein-coding alterations, pTMB). These variants were then annotated as either somatic (present in the tumour, absent in the normal) or germline (present in both tumour and normal). Germline variants were further required (1) to be reported as a strain-specific germline variant in the SNP catalogue of the Mouse Genomes Project²⁰; (2) to match the genetic background annotation of the corresponding MCCA line as provided in Supplementary Table 5; and (3) to localize outside of SNP-dense genomic regions (which are indicative of errors in the genome assembly, as reported previously⁷¹). The ‘effective’ pTMB of a given MCCA line was then calculated from its protein-altering germline variants that do not match the genetic background of the corresponding recipient mouse (plus the protein-altering somatic mutations found in the same MCCA line). As the ‘effective’ pTMB of an MCCA line is recipient dependent, two distinct scenarios of immunocompetent transplantation were analysed. Scenario 1: MHC and the first (if possible, also the second) most dominant genetic background detected in an MCCA line match the recipient (reduced contribution of germline variants to the ‘effective’ pTMB). Scenario 2: MHC but only the most dominant genetic background detected in an MCCA line do match the recipient (elevated contribution of germline variants to the ‘effective’ pTMB).

3′ RNA-seq

Library preparation for bulk-sequencing of poly(A)-RNA was done as described previously⁷². In brief, barcoded cDNA of each sample was generated with Maxima RT polymerase (Thermo Fisher Scientific) using oligo-dT primer containing barcodes, unique molecular identifiers (UMIs) and an adaptor. The ends of the cDNAs were extended by a template switch oligo (TSO) and full-length cDNA was amplified with primers binding to the TSO site and the adaptor. The NEBNext Ultra II FS kit was used to fragment cDNA. After end-repair and A-tailing, a TruSeq adapter was ligated, and 3′-end fragments were finally amplified using primers with Illumina P5 and P7 overhangs. In comparison to a previous study⁷², the P5 and P7 sites were exchanged to allow sequencing of the cDNA in read1 and barcodes and UMIs in read2 to achieve a better cluster recognition. The library was sequenced on the NextSeq 550 (Illumina) system with 63 cycles for the cDNA in read1 and 16 cycles for the barcodes and UMIs in read2.

The 3′ RNA-seq data were processed using the published Drop-seq pipeline (v1.0) to generate sample- and gene-wise UMI tables⁷³. The reference genomes GRCm38 and GRCh38 were used for alignment of mouse and human samples, respectively. Transcript and gene definitions were used according to the Gencode (v.38)⁶⁵. The data were processed in R using the DESeq2 package (v.1.46.0) for read normalization and variance stabilizing transformation⁷⁴.

Analysis of transcriptome data

Batch correction was performed to obtain a single, full MCCA transcriptome dataset. Transcriptomes of MCCA lines were generated by 3′ RNA-seq in three independent batches (batch B1–B3), and each individual cell line was sequenced in technical replicates. To facilitate batch correction, the largest batch (B1) included 53 reference samples that were matched to B2 (27 reference samples) and B3 (26 reference samples). B1, B2 and B3 count matrices were transformed using the variance stabilizing transformation (vst()) function of the DESeq2 package (v.1.46.0)⁷⁴. Next, the batch effect was examined based on the clustering of MCCA batches and reference samples in dimensionality-reduction plots (PCA and UMAP). One technical replicate from B1 was removed owing to a clear separation from the other technical replicates of the same sample. Next, the duplicate correlation (dupcor()) function of limma (v.3.5.4)⁷⁵ was used to compute the correlation of technical replicates for matching reference samples across batches. The dupcor() function estimates the correlation between duplicates (here matching reference samples) by fitting a mixed linear model individually for each gene. Finally, to remove the batch effect, the removeBatchEffect function of limma was applied by using lineage information (as covariate), the batch information and the consensus duplicate correlation (returned by dupcor). The resulting batch corrected expression matrix was evaluated for the absence of batch effects using dimensionality-reduction plots (PCA and UMAP). Last, the gene coefficients obtained from removeBatchEffect were retained and the duplicated reference samples (from B2 and B3) were removed from the dataset.

GSVA was performed on rlog-normalized gene expression data using the GSVA R package (v.1.52.3)⁷⁶. Gene set libraries MSigDb-Hallmark-2020, KEGG-2021 and WikiPathway-2021 of enrichR (v.1.62.0)⁷⁷ as well as gene sets described previously^26,55,56 were used for analyses. Gaussian kernel was used for nonparametric estimation of the cumulative distribution function of (sorted) expression levels, and normalized GSVA scores were extracted.

To assess pathways enriched in hepatic cell lines of MCCA cultured in 2D versus 3D Matrigel conditions, RNA-seq raw counts were normalized using the median-of-ratios method and variance-stabilized with the rlog transformation in DESeq2 (v.1.46.0) under R (v.4.4.1). Gene set enrichment analysis was conducted in enrichR (v.1.62.0), using the MSigDb-Hallmark-2020 gene set library.

For KRAS^G12D overexpression experiments, count matrices were transformed using variance-stabilizing transformation as described in the ‘3′ RNA-seq’ section. Genes with a low sum of counts across all samples were removed (≥30 for CACO2, HBEC3KT and HPDE; ≥60 for HCEC1CT, MODEK and 266-6). The PCA was conducted on rlog-normalized data by selecting the top 10% of the most variable protein-coding genes based on their s.d. across samples. For a given principal component, the 250 genes with the highest positive loadings and the 250 genes with the most negative loadings were extracted. Loadings represent the coefficients that quantify how strongly each gene contributes to the variance captured by a principal component. Positive and negative loadings correspond to genes driving the separation of samples toward opposite directions along the principal component axis, reflecting, for example, the KRAS^G12D dosage-dependent induction of transcriptional changes on PC1. Gene set enrichment analysis of these gene sets was performed using enrichR (v.1.62.0)⁷⁷. The gene set libraries MSigDb-Hallmark-2020, KEGG-2021 and GO-Biological-Process2023 of enrichR (v.1.62.0)⁷⁷ were analysed.

For the comparison of transcriptomes of acinar cells 0 and 24 h after explantation from Ptf1a^cre/+;Kras^LSL-G12D/+ mice, RNA-seq raw counts were normalized using the median-of-ratios method and variance-stabilized with rlog transformation in DESeq2 (v.1.46.0) under R (v.4.4.1). Differential expression analysis was performed with DESeq2, and genes were considered to be differentially expressed at a FDR < 0.05. Significance values for Kras were extracted accordingly.

Raw count RNA-seq data from the TCGA-LUAD were downloaded through the GDC data portal (https://portal.gdc.cancer.gov/). Only cancer tissues with KRAS exon2 hotspot mutations (G12* or G13*) were considered for analysis. Differential expression analysis was performed in R (v.4.4.1) with Limma (v.3.5.4)⁷⁵. A gene was considered to be differentially expressed if its FDR-adjusted P value was below 0.05. Gene set enrichment analysis was performed using enrichR (v.1.62.0)⁷⁷ with the gene set libraries MSigDb-Hallmark-2020 and KEGG-2021. Postprocessing and data visualization were performed in R (v4.4.1) using data.table (v.1.14.8), ggplot2 (v.3.4.2), pheatmap (v.1.0.12) and ComplexHeatmap (v.2.16.0).

Genome and transcriptome stability of cell lines

To determine the stability of genomes and transcriptomes of mouse cancer cell lines, pancreatic cancer cell lines that have been cultured and characterized multiple times over a period of 10 years (up to 13 passages difference) was used (Supplementary Table 2). Cell lines that contained mixed populations of epithelial and mesenchymal cells were not included (to avoid confounding effects on the analysis of transcriptome stability), resulting in a set of 30 cell lines shared between both datasets.

The stability of mouse pancreatic cancer cell line transcriptomes was assessed by comparing cell lines cultured and profiled by RNA-seq in 2022 (data from MCCA) and 2016 (data from ref. ¹⁸). RNA-seq raw counts from both cohorts were normalized using the median-of-ratios method and variance-stabilized with rlog transformation using DESeq2 (v.1.46.0) in R (v.4.4.1). The top 10% most variable genes, ranked by s.d., were selected and used to cluster the samples by Euclidean distance and complete linkage. Cell lines were assigned into transcriptomic clusters in each dataset to assess the similarity of transcriptome clustering for each individual sample over time and passages.

Genome stability was investigated by comparing the log₂-transformed copy ratio values of copy-number profiles for the same cell line as detected by WES in 2012 (data from ref. ¹⁸) or lcWGS in 2022 (data from MCCA). Both datasets were analysed using MoCaSeq (described in the ‘Analysis of genomic sequencing data’ section). Copy-number segments generated by HMMCopy were binned to 1 Mb intervals for each chromosome. If a bin contained two different log₂-transformed values, a weighted mean (based on the size of each overlap) was calculated to assign a single log₂-transformed value to the bin. The bins were then smoothed using the median and a window size of 5 bins, excluding chromosomes associated with chromothripsis and other complex rearrangements. For a cell line with sufficient copy-number changes (median log₂ > 0), the median log₂-transformed value was subtracted from every bin to recentre the data. To quantify the stability of a cell line, the Euclidean distance between the log₂-transformed copy ratio value of WES and lcWGS data was calculated per bin and averaged across each chromosome.

To compare the genome stability of MCCA lines to human models, two separately generated copy-number datasets of large-scale human cancer cell line projects were used: CCLE (Broad Institute) and Genomics of Drug Sensitivity in Cancer (GDSC, Sanger Institute). Genome stability was investigated by comparing the log₂-transformed copy ratio values of copy-number profiles for the same cell line as detected by WES (CCLE, data from ref. ⁷) or SNP array (GDSC, data obtained from https://cellmodelpassports.sanger.ac.uk/downloads). The copy-number profiles of 625 cell lines matched between CCLE and GDSC were analysed using the same analytical workflow as described above for the comparison of mouse cancer cell line genomes. Notably, a direct comparison of genome stability in mouse and human cancer cell lines has certain limitations. For example, while the majority of cell lines in CCLE and GDSC (473 out of 625) were obtained from the same supplier (Supplementary Table 2), the exact number of passages separating individual cell line pairs in both projects is unclear. Furthermore, the increased genome instability in human cell line pairs might be linked to the increased copy-number load observed in human cancers.

Mouse–human cross-species comparison of cancer genomes and transcriptomes

For the mouse–human cross-species genome analyses, CNV profiles of MCCA lines were generated using CNVkit (v.0.9.9) with a bin size of 20 kb. Germline CNVs were manually filtered out before the analysis. For each cancer entity, consensus CNV profiles were obtained by binning the genome and calculating the average of normalized copy-number values at each genomic bin across all cancers of a given disease type. The resulting consensus plots provide entity-specific CNV landscapes and were used for the annotation of orthologues of recurrently amplified or deleted human cancer genes. These genes were identified for each disease type individually, by first assembling all genes affected by recurrent copy-number alterations in the corresponding human cancer type (CCLE and TCGA cohorts). In a second step, this list of genes was filtered for those genes reported in the Cancer Gene Census. Finally, a representative set of cancer genes was selected for each disease type, based on gene amplification/deletion frequency and literature search. These overlays enable direct comparison of mouse and human CNV patterns, highlighting conserved lineage- and entity-specific alterations.

For the mouse–human cross-species transcriptome analyses, gene-level RNA-seq count data for human cancer cell lines were obtained from the CCLE dataset provided by ref. ⁷ (file, CCLE_RNAseq_genes_counts_20180929.gct.gz). Counts were normalized using the median-of-ratios method and variance-stabilized by vst transformation using DESeq2 (v.1.46.0) in R (v.4.4.1). For mouse cancer cell lines from the MCCA dataset, previously generated batch-corrected variance-stabilized counts were used. Only tissues and tumour types represented in both MCCA and CCLE datasets were retained. Tissue nomenclature was manually curated to harmonize labels between datasets. Human–mouse orthologous gene names were obtained from Ensembl (v.103, https://doi.org/10.1093/nar/gkae1071) using biomaRt (v.2.60.1), retaining only 1:1 orthologues (n = 13,134). Genes with one-to-many or many-to-one relationships were excluded. For each tissue, the top 10% most variable orthologous genes were selected based on standard deviation, separately for MCCA and CCLE. The intersection of these sets was defined as the set of common top variable genes. Pearson correlation coefficients were computed between MCCA and CCLE samples using these genes. The resulting correlation matrices were clustered using the Ward.D2 method. The morphology of human pancreatic cancer cell lines was classified based on the expression of epithelial (CDH1 associated) and mesenchymal (VIM associated) gene signatures described previously⁷⁸. The upregulation/downregulation of each signature was determined by GSVA, performed on rlog-normalized gene expression data using the GSVA R package (v.1.52.3)⁷⁶.

Epigenetic analyses

ROADMAP RNA-seq and ChIP–seq data shown in Fig. 5c were obtained from the NIH ROADMAP epigenomics web portal (https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html)³⁵. For the pancreas (E098), lung (E096), small intestine (E109) and large intestine (E106) tissues, the consolidated and normalized bigWig files (GRCh37, hg19) were converted to the bedGraph format using the UCSC utility tool bigWigToBedGraph (v.1.04.00)⁷⁹. For quantification of CDKN2A and KRAS mRNA expression levels, the normalized reads per kilobase million counts were used. For analysing histone modifications at the CDKN2A or KRAS locus, the signal scores were calculated as negative log₁₀ of the Poisson P values.

For the chromHMM analysis provided by ROADMAP, the core 15-state model was selected, and the joint mnemonics bed files were obtained from the NIH ROADMAP epigenomics web portal³⁵ (https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html). This model was previously trained by integrating 5 chromatin marks (H3K4me3, H3K4me1, H3K36me3, H3K9me3, H3K27me3) for 127 reference epigenomes, whereby the chromatin state with the highest posterior probability given by the model was assigned to each genomic bin. The states were additionally stratified into eight active (TssA, TssAFlnk, TxFlnk, Tx, TxWk, EnhG, Enh, ZNF/Rpts) and seven inactive or repressed (Het, TssBiv, BivFlnk, EnhBiv, ReprPC, ReprPCWk, Quies) states. The chromHMM data were filtered to the pancreas (E098), lung (E096), small intestine (E109) and large intestine (E106) tissues and processed/harmonized using RNA-seq and ChIP–seq analyses.

For the quantitative comparison of ChIP–seq signals at CDKN2A and KRAS across healthy human tissues (Extended Data Fig. 10a,b,d,e), H3K4me3 and H3K27me3 ChIP–seq raw data of ENCODE and ROADMAP reference epigenomes were downloaded as FASTQ format from the ENCODE web portal^80,81 (Supplementary Table 15). All datasets were processed with nf-core’s ChIP–seq pipeline (v.2.0.0)⁸², run separately for each histone mark, with narrow settings for H3K4me3 and broad settings for H3K27me3. Normalized coverage tracks (BigWig files) generated by the pipeline were then used to compute the H3K4me3 or H3K27me3 signals, defined as the sum of continuous signal values across the genomic region of interest normalized to its kb length. For H3K27me3, the ChIP–seq signal was quantified by determining histone occupancy across all exonic regions of CDKN2A (ENST00000304494.10 and ENST00000579755.2) or KRAS (ENST00000256078.10 and ENST00000311936.8). For H3K4me3, the ChIP–seq signal only at the promoter containing exons of CDKN2A or KRAS was analysed. The CDKN2A Ex1β promoter region was excluded from the analyses, as the H3K4me3 peak at this promoter could not be discriminated from the promoter signal of CDKN2B-AS1 (300 bp upstream of CDKN2A Ex1β).

H3K4me3 and H3K27me3 ChIP–seq data of healthy mouse tissues were obtained from refs. ^{83,84,85,86,87,88,89,90} (Supplementary Table 15). The analytical workflow described for human tissues above was applied with identical parameter settings to ChIP–seq data from healthy mouse tissues. ENSMUST00000060501.4 and ENSMUST00000107131.1 transcript annotations were used for Cdkn2a. This approach provided a quantitative measure of H3K4me3 and H3K27me3 signals that could be compared across tissues and between species.

Single-cell sequencing and pseudobulk analyses

Single-cell RNA-seq datasets from multiple human tissues were processed for tissue-specific pseudobulk analysis using Python (v.3.9.12) with Scanpy (v.1.9.3), Pandas (v.1.5.3) and Numpy (v.1.24.4). Datasets included epithelial cells from the pancreas, lung, rectum, stomach, bladder and uterus as well as small and large intestine. For the pancreas, lung and intestinal datasets, raw count matrices were downloaded^38,39,40 and processed by filtering cells based on gene counts (min., 300; max., 2,500) and mitochondrial content (<50%), removing genes expressed in fewer than three cells, normalizing and log-transforming. Next the PCA, neighbourhood graphs (n_neighbours = 50), Leiden clustering and UMAP embeddings were computed. Only epithelial cells were retained, and clusters with fewer than 100 cells were excluded. For the bladder⁴⁷, uterus⁴⁹, stomach and rectum⁴⁸, epithelial cells were selected from publicly available datasets, filtering for healthy samples and relevant tissue annotations. For cell-of-origin profiling, specific epithelial subtypes were extracted from each tissue: acinar and ductal (pancreas), type I and type II pneumocytes (lung), stem cells (small intestine, large intestine, rectum), surface foveolar and cycling cells (stomach), bladder urothelial cells (bladder) and non-ciliated epithelial cells (uterus). CDKN2A and KRAS mRNA expression levels were determined through pseudobulk analysis. For each group of cells in a specific tissue, the counts of gene-specific reads detected per cell were summed and divided by the sum of total reads detected in all cells of an individual donor. These normalized gene mRNA expression levels were independent of cluster size and sequencing depth, and therefore comparable across cell types and datasets. Donor samples contributing <50 cells of a particular group of cells were aggregated into a single sample. Donor-specific pseudobulk CDKN2A and KRAS mRNA expression values were used for statistical comparison of gene expression between cell types.

A comparable analysis was performed for mouse single-cell RNA-seq data⁹¹, focusing on the lung, pancreas and intestine datasets. Similarly, samples were subset based on the cell-of-origin classifications defined for the human data, and pseudobulk aggregation was applied to these samples.

For the analysis of scRNA-seq data from ref. ²⁹, the dataset was subset to defined cell states of Kras^G12D-driven pancreatic cancer evolution: Kras^G12D-mutant acinar cells (Cpa1⁺tdTomato⁺Krt19⁻) and Kras^G12D-mutant metaplastic ADM/PanIN cells (Cpa1⁻tdTomato⁺Krt19⁺), as well as Kras^G12D-mutant cancer cells of an additional 15-month-old mouse. Kras^WT acinar cells (Cpa1⁺tdTomato⁻Krt19⁻) of the dataset were used as control. These four groups of cells were assigned based on the localization of marker gene expression in UMAPs, which were computed from a PCA (n_combs = 30), neighbourhood graphs (n_pcs = 30) and Leiden clustering. The normalized mRNA expression of Kras or Cdkn2a was determined by calculating the mean of normalized read counts in each cell state and were statistically compared across independent samples.

In vivo transposon mutagenesis screens, QiSeq and CIS analysis

In vivo transposon mutagenesis screens in the pancreas and intestine have been performed as described previously^42,43,52. Quantitative transposon insertion site sequencing (QiSeq) was performed as described previously^43,44. In brief, genomic DNA was sheared to around 250 bp fragments using a Covaris M220 ultrasonicator. Fragmented DNA was end-repaired, A-tailed and ligated to a Y-shaped splinkerette adapter consisting of a double-stranded region with a hairpin loop and a single-stranded overhang. Library preparation was carried out separately for the 5′ and 3′ transposon ends. In a first PCR, junction-containing fragments were selectively amplified using transposon-specific and splinkerette-specific primers, as the adapter design prevents priming from the splinkerette during the initial cycle. A nested PCR introduced Illumina P5 and P7 adapters together with sample-specific barcodes. Libraries were quantified by qPCR with P5/P7-specific primers, pooled in equimolar amounts and sequenced on the Illumina MiSeq platform (paired-end, 65 bp). For the analysis of sequencing raw data, reads containing transposon-genome junctions were extracted, aligned to the mm10 reference genome using SSAHA2 and collapsed to unique insertion sites per tumour, with read counts quantified per site. For the pancreas samples, sequencing data were obtained from a previous study⁴³ and analysed using the same computational workflow.

The subsequent common insertion site (CIS) analysis was conducted as follows. For each sample, insertion counts were normalized to library size and expressed as counts per hundred reads by dividing the counts per insertion by the total number of reads and multiplying by 100. Insertions with normalized counts of <0.02 were excluded. Insertion coordinates from all samples of the same cohort were merged into a single BED file. CISs were identified with MACS2 (v.2.2.7.1) (https://github.com/macs3-project/MACS) using four window sizes (5, 30, 60 and 100 kb) with genome size, shift and extension parameters adjusted accordingly, and with the nomodel and nolambda options enabled. Peaks were filtered at q < 0.05, and neighbouring peaks within 10 kb were collapsed. CISs were annotated with overlapping and flanking genomic features. Insertion sites with fewer than ten supporting reads and/or normalized counts of <0.02 were excluded from further analysis. Insertion sites within the Cdkn2a locus (including Cdkn2a and Ncruc/Gm12610) or in WNT pathway-related genes (including Apc, Ctnnd1, Rnf4 and Rspo2) were then ranked according to sequencing read coverage and normalized insertion counts relative to all other insertions in the respective cancer. Finally, Cdkn2a locus or WNT pathway gene transposon insertions were classified as either high-ranked (top 10) or low-ranked based on their rank in each cancer sample. Assuming that the stronger the selective advantage conferred by a gene perturbation, the more pronounced the expansion of the affected cell, such ranking can serve as a direct measure for the selective advantage conferred by a transposon insertion during cancer evolution.

Orthotopic transplantation

For mPACA transplantation experiments, 2,500–10,000 cancer cells were orthotopically grafted into the pancreas of syngeneic immunocompetent C57BL/6J or 129 WT mice. For orthotopic transplantation, mice were anaesthetized with a combination of medetomidine, midazolam and fentanyl. The left flank was carefully shaved, the eyes were protected with ointment and the abdomen was disinfected. When anaesthesia was complete, a 1.5 cm left-lateral incision caudal to the spleen was made, and the pancreas was located, and was then carefully pulled out of the abdomen to make it accessible for intraparenchymal injection. Cell suspension was administered slowly using a 27 G needle at a depth of 3–4 mm. The needle was left in this position for at least 30 s to avoid leakage of the bubble. The pancreas and spleen were carefully placed back in their anatomical position and covered with PBS to avoid organ adhesion. The peritoneum was closed with interrupted sutures (5-0 Ethilon) and the skin with wound clips. The mice were kept in a 37 °C heating chamber until they woke up.

For mCACO, transplantation experiments were performed as previously described⁹². In brief, organoids were dissociated into clusters of 5–10 cells and resuspended in a transplantation medium consisting of PBS, B27, N2, l-Glutamine (Thermo Fisher Scientific), 10% Matrigel (Corning) and 10 µM Y-27632 (Stem Cell Technologies). For each injection (2–3 per mouse), 50 dissociated organoids were prepared in a volume of 100 µl. The procedure involved anaesthetizing the mice, followed by gently rinsing the colon with PBS using a syringe and a straight oral gavage needle. Colonoscopy was performed using a rigid endoscope from Karl STORZ (1.9 mm in diameter) with linear Hopkins lens optics (ColoView System). Organoids were injected into the submucosa of the colon using a flexible fine needle (Hamilton; 33 gauge, custom length of 16 inches, custom point style of 4 at 45°). Correct submucosal injections were identified by the formation of a bubble that occluded the intestinal lumen. A scoring system was used to correlate the quality of injections with the experimental outcomes.

Note that MCCA lines might occasionally not engraft in fully immunophenotype-matched hosts due to non-immunological reasons, such as a lack of specific niche factors. These effects are especially relevant when transplanting low cell numbers (<10,000), which can, however, be rescued by increasing the number of injected cells.

scAAV8-based somatic mutagenesis in mice

scAAV8-based somatic mutagenesis in mice was performed as described in detail previously²⁸. In brief, scAAV8 particles were produced by transfecting HEK293T cells with scAAV and helper plasmid pDP8.ape. scAAV8-producing HEK293T cells were collected, resuspended and lysed through repeated freeze–thaw cycles. Free nucleic acids were digested with Benzonase nuclease (Sigma-Aldrich) and scAAV8 particle purified from the supernatant using iodixanol-based gradient ultracentrifugation (Backman Coulter). Vivaspin 20 centrifugal concentrator columns (Sigma-Aldrich) and Ringer lactate solution were used for buffer exchange of the extracted scAAV8-containing iodixanol solution. scAAV8 titres were subsequently determined by qPCR. For this, scAAV8 viral capsids were first disrupted using alkaline lysis. The sample was neutralized before qPCR-based quantification of scAAV8 viral genomes. Next, 1 × 10¹² scAAV8 viral genomes were diluted in PBS and intraperitoneally injected into 8-week-old Ptf1a^cre/+;Kras^LSL-G12D/+,Rosa26^{CAG-LSL-Cas9/CAG-LSL-Cas9} mice. Pancreata were dissected from mice 8 weeks after injection of scAAV8-Tgfbr2-sgRNA or scAAV8-Rosa26-sgRNA, or from age-matched non-injected mice. Pancreas tissue was formalin-fixed and paraffin-embedded to prepare H&E stains for the quantification of acini. The acini number was determined in H&E sections by counting acini per field of view in at least five images with ×40 magnification per animal. The averaged acini count per animal was finally used to compare pancreatic remodelling across conditions.

Amplicon-based deep sequencing

Amplicon-based deep sequencing of the mouse Kras^G12D mutation, human KRAS^G12D mutation or the mouse Ctnnb1 exon3 hotspot mutations was performed using either 50 ng of genomic DNA (gDNA) or 1.5 µl of reverse-transcribed mRNA (cDNA). In brief, the exon2 of Kras and KRAS or exon3 of Ctnnb1 was amplified using Kapa HiFi HotStart ReadyMix (Roche, 30 cycles) and primers with TruSeq adaptor overhangs (Supplementary Table 19). In a second PCR step (ten cycles), TruSeq index primer sequences (Illumina) were added. After each PCR step, solid-phase reversible immobilization clean-up (0.8×) was performed using an Agencourt AMPure XP kit (Beckman Coulter). The pooled library was quantified by SYBR Green qPCR (Sigma-Aldrich) and a Kapa Biosystems library quantification kit (Roche). The resulting library was sequenced on a NextSeq 550 (Illumina) system. Raw reads were mapped to the matching mouse (GRCm38) or human (GRCh38) reference genome assembly. G12D mutation-specific VAFs were calculated at the corresponding genomic position. Ctnnb1 exon3 hotspot mutations were determined using Mutect2 from the GATK toolkit (v.4.2.0.0)⁶².

For amplicon-based deep sequencing of all mouse Apc-coding exons, 50 ng of gDNA was used as input for PCR-based amplification with pools of primers listed in Supplementary Table 19. PCR products were enzymatically fragmented, and libraries were prepared using the TruSeq DNA Nanokit (Illumina) according to the manufacturer’s instructions. After read mapping to GRCm38, mutation calling was conducted with Mutect2 from the GATK toolkit (v.4.2.0.0)⁶².

To determine Kras^G12D VAFs from laser microdissected tissue, lysates were directly prepared in the sample collection tube by adding proteinase K to a final concentration of 0.4 mg ml⁻¹ and incubating overnight at 56 °C, followed by heat inactivation at 95 °C for 15 min. Kras exon 2 was amplified using a nested PCR strategy: an initial PCR with outer primers (KAPA HiFi HotStart, 25 cycles, annealing 59 °C) was performed, followed by purification with 0.8× AMPure XP beads (Beckman Coulter). A second PCR with inner primers and Illumina TruSeq overhangs (10 cycles, annealing 55 °C) was conducted. Finally, a third PCR was used to add sample-specific barcodes and P5/P7 adapters (10 cycles, annealing 65 °C). Cycling conditions for all PCRs were as follows: 98 °C for 20 s, annealing at the indicated temperature for 20 s, and extension at 72 °C for 45 s, with a final extension at 72 °C for 2 min. A list of all of the primers used for the three PCR steps is provided in Supplementary Table 19. Final libraries were purified (0.8× AMPure XP) and sequenced on the Illumina NextSeq 1000 system. Sequencing reads were aligned to the Kras reference (GRCm38), and G12D mutation-specific VAFs were calculated at the corresponding genomic position.

cDNA synthesis and TaqMan qPCR

cDNA synthesis was synthesized from 1 mg of RNA by using SuperScript II Reverse Transcriptase (Thermo Fisher Scientific) according to standard protocols. Notably, reverse transcription was performed using random hexamers to avoid biased reverse transcription of endogenous versus lentiviral transcripts (in case oligo(dT) primers are used). TaqMan qPCR was performed using TaqMan chemistry (Thermo Fisher Scientific) and a list of the primers and probes is provided in Supplementary Table 19. Quantification of Kras^LSL-G12D and KRAS^MUT mRNA was normalized to Kras or GAPDH, respectively. TaqMan qPCR was conducted on the StepOnePlus system (Applied Biosystems).

Flow cytometry

Mouse pancreatic ductal adenocarcinoma (PDAC) cell lines were cultured under standard conditions and treated with recombinant mouse IFNγ (BioLegend) at a final concentration of 100 ng ml⁻¹ for 3 days. Untreated cells served as controls. After treatment, surface expression of MHC class I was assessed by flow cytometry. Cells were collected and transferred into 96-well V-bottom plates for staining. After centrifugation (5 min at 1,500 rpm, 4 °C), cells were washed with FACS buffer (PBS supplemented with 1% BSA and 5 mM EDTA) and incubated with an extracellular staining mix containing Fc block and either anti-MHC class I antibody (H-2Kb, AF6-88.5.5.3, eFluor 450, eBioscience, 48-5958-82) or the corresponding isotype control (mouse IgG2a κ, eFluor 450, eBioscience, 48-4724-82). Staining was carried out for 30 min at 4 °C, protected from light. After surface staining, cells were washed and incubated for 15 min at 4 °C with a viability dye (iFluor 840 maleimide, AAT Bioquest). After a final wash, cells were resuspended in FACS buffer and acquired on the CytoFlex Flow Cytometer (Beckman Coulter). Flow cytometry data were analysed using FlowJo software (v.10.10.0, FlowJo, BD). Appropriate gating strategies were applied to exclude debris, dead cells and doublets, and to quantify MHC-I surface expression.

Doxycycline-titratable gene overexpression

The pINDUCER20 (ref. ⁹³) vector system was used for doxycycline-inducible KRAS^G12D and GFP overexpression. HEK293FT cells were used for lentivirus production and maintained in DMEM supplemented with 10% FCS and 1% penicillin–streptomycin. In brief, the puromycin resistance was first exchanged with a hygromycin cassette and the cDNAs of oncogenic KRAS^G12D (CCDS 8702.1, 35G>A) or GFP were cloned in a second step into the pINDUCER20 lentiviral vector. Stbl3 bacteria (Thermo Fisher Scientific) were chemically transformed, and the plasmid DNA sequence was verified. For lentivirus production, HEK293FT cells were transfected using TransIT-LT1 (Mirus Bioscience) with standard virus packaging plasmids and the respective pINDUCER20 vectors according to the manufacturer’s recommendations. Virus-containing supernatant was pooled 48 h and 72 h after transfection, briefly centrifuged to pellet detached HEK293FT cells and filtered through 0.45-mm filters (Filtropur, Sarstedt). Lentiviral particles were stored at −80 °C until use.

For lentiviral transduction, 100,000–200,000 HPDE⁹⁴, HBEC3KT⁹⁵, HCEC1CT⁹⁶, MODEK⁹⁷ or 266-6 (ref. ⁹⁸) cells were seeded per well of a six-well plate. Acinar WT cells are not viable in vitro; thus, the acinar carcinoma cell line 266-6 was selected as model system. The cells were transduced in the presence of 1 μg μl⁻¹ polybrene (Sigma-Aldrich). Then, 2 days after transduction, cells were selected with hygromycin (Sigma-Aldrich) for at least 7 days. HPDE and HBEC3KT cells were cultured in keratinocyte-SFM medium (Thermo Fisher Scientific), supplemented with bovine pituitary extract, EGF (Thermo Fisher Scientific) and 1% penicillin–streptomycin; HCEC1CT cells in a mixture of DMEM (80%) and MEM199 (20%) supplemented with 2% FCS, EGF, insulin-transferrin-selenium, hydrocortisone (Thermo Fisher Scientific) and 1% penicillin–streptomycin; and MODEK and 266-6 cells in DMEM supplemented with 10% FCS and 1% penicillin–streptomycin. After successful transduction, the inducibility of KRAS^G12D expression was tested using 1:10 doxycycline dilution series ranging from 0.1 to 1,000 ng ml⁻¹. HCEC1CT and MODEK showed a reduced sensitivity/response of KRAS^G12D induction to doxycycline treatment. To cover the dynamic induction of KRAS^G12D expression levels across cell lines, the doxycycline concentration range was therefore adapted for HCEC1CT and MODEK. For doxycycline-titratable induction of KRAS^G12D or GFP expression, cells were either seeded in 3D (HPDE, HBEC3KT, HCEC1CT, MODEK) or 2D conditions (266-6; gelatin type A coating). For 3D conditions, 150 cells were seeded per dome consisting of 50% Matrigel (Corning). After 7 days of initial growth, target gene expression was induced for 3 days by adding the indicated amounts of doxycycline (Sigma-Aldrich) to the corresponding penicillin–streptomycin-free culturing medium. At the end point, for each cell line and doxycycline concentration, at least 20 individual spheroids were imaged to assess phenotypes and RNA was isolated by pooling four domes from the identical condition to analyse transcriptomic changes. Bright-field images were used to classify spheroids into cohesive and discohesive phenotypes. Criteria included the extent of cell-to-cell adhesion, epithelial architecture of clusters, occurrence of detached single cells, and the emergence of cell membrane protrusions as compared to the growth pattern observed for spheroids of the respective untreated model. The expert biologist was blinded for phenotype grading of bright-field spheroid images. For 2D conditions, 250,000 cells were seeded per well of a six-well plate. Target gene expression was induced the next day for 3 days using the indicated doxycycline concentrations. At the end point, RNA was isolated from one well of a six-well plate per condition. TaqMan qPCR and 3′ RNA-seq library preparation were performed as described above. The 3′ RNA-seq data analysis was conducted as described below.

Microdissection

From formalin-fixed paraffin-embedded material, one 2-µm-thick and five 10-μm-thick consecutive tissue sections were prepared and air-dried overnight. The 2 µm section was stained with H&E according to standard procedures and submitted for histopathological grading and annotation of tumour areas for microdissection. The five consecutive 10 µm sections were used for tumour microdissection. Paraffin was removed through short incubation with xylene. The specimens were briefly stained with haematoxylin and kept wet for the microdissection procedure. Individually assessed and scored samples were then microdissected under a Primovert microscope (Zeiss). gDNA was extracted using the QIAamp DNA Mini Kit (Qiagen) according to the manufacturer’s instructions, which included the use of carrier RNA to increase DNA binding during purification and a 90 °C ATL buffer incubation step to reverse formaldehyde modifications. gDNA concentrations were measured using the Qubit fluorometer (Thermo Fisher Scientific). Depending on the total available gDNA, 20–80 ng of gDNA was used as input for lcWGS, 5–20 ng for each TaqMan qPCR reaction (Kras^LSL-G12D, Kras^Copy) and 4–20 ng for amplicon-based deep sequencing of the Kras^G12D mutation. TaqMan qPCR was performed in technical quadruplicates for each target. Ratios of Kras^LSL-G12D to Kras^Copy quantifications were calculated for purity estimation of microdissected tumour tissue samples. Finally, these purity values facilitated the computational subtraction of stroma contamination from lcWGS and Kras^G12D amplicon sequencing data.

Laser microdissection

Laser microdissection (LMD) was performed using the Leica LMD6 system (Leica Microsystems) to isolate defined lesions from either cryosections or paraffin-embedded sections. For cryosections, tissue was sectioned at a thickness of 7 µm and mounted onto FrameSlides with a 4.0 µm PEN membrane (Leica Microsystems). The slides were thawed at room temperature for 1 h, briefly fixed in freshly prepared 80% ethanol for 20 s and then air-dried for at least 20 min. The sections were stained with methylene blue (1:10 dilution in distilled H₂O) for 20 s and rinsed twice in 80% ethanol, followed by an additional drying step (≥20 min) to optimize the laser-cutting efficiency. Paraffin sections (7 µm thick) were mounted onto glass slides with a 2.0 µm PEN membrane (Leica Microsystems), deparaffinized externally (Institute of Pathology, CEP) and transported in distilled H₂O. Methylene blue staining and drying steps were performed as described for cryosections. LMD was carried out using Leica software (v.8.4). Annotated regions of interest were identified under ×10 magnification using the Leica LMD software and excised. Dissected tissue was collected directly into the lids of eight-well PCR strip tubes containing 10 µl of lysis buffer (1:1 dilution in distilled H₂O; DirectPCR, Viagen Biotech), and the samples were immediately sealed and stored at −80 °C until further downstream analysis (see the ‘Amplicon-based deep sequencing’ section (the last paragraph is related to laser microdissected tissues)).

ADM ex vivo assay

Pancreata of 8-week-old Ptf1a^cre/+;Kras^LSL-G12D/+ mice were collected, cut into pieces and digested twice in McCoy’s 5A Medium (Sigma-Aldrich), containing 0.5 mg ml⁻¹ collagenase P (Sigma-Aldrich), 0.002% trypsin inhibitor from soybean (Sigma-Aldrich) and 0.1% BSA, for 10 min at 37 °C. Cells were passed through a 100 µm mesh, washed with McCoy’s 5A Medium (Sigma-Aldrich) containing 0.02% trypsin inhibitor from soybean (Sigma-Aldrich) and 0.1% BSA, and spun down for 5 min at 100g. The cells were then recovered in culture medium (Waymouth’s MB752/1 medium (Gibco), supplemented with 0.1% FCS (Merck), 1× insulin–transferrin–selenium (Gibco), 50 µg ml⁻¹ bovine pituitary extract (Gibco), 10 mM HEPES (Gibco), 0.1% BSA, 0.01% trypsin inhibitor from soybean (Sigma-Aldrich), 2.6 mg ml⁻¹ NaHCO₃ and 30% FCS) and were incubated for 30 to 60 min at 37 °C. After recovery (which defines the 0 h timepoint), acinar cells were cultured under suspension conditions in ultra-low-attachment plates using culture medium for 24 h at 37 °C (which defines the 24 h timepoint). For the isolation of RNA, acinar cells were pelleted at the defined timepoints. The 3′ RNA-seq and analysis of transcriptome data were conducted as described in the corresponding methods sections.

SA-βGal and Ki-67 staining

For SA-βGal staining, mouse tissues of mice were fixed in 4% paraformaldehyde (methanol free, for 1 h at 4 °C) and subsequently cryoprotected in 15% and 30% sucrose (each at least for 2 h at 4 °C) before being embedded in Tissue-Tek O.C.T. compound. The tissue blocks were frozen in a dry-ice ethanol bath and stored at −80 °C. Cryosections at a thickness of 5 μm were performed using the Leica Biosystems Cryostat (CM3050, Leica). β-Galactosidase staining on cryosections was performed using the Senescence β-Galactosidase Staining Kit (Cell Signaling) according to the manufacturer’s protocol. Nuclear counterstaining was performed using Nuclear Fast Red (Certistain, Merck).

For Ki-67 staining, formalin-fixed paraffin-embedded duodenal sections from WT and Vil-cre;Kras^LSL-G12D/+ mice were deparaffinized and rehydrated through graded ethanol series. Antigen retrieval was performed according to the manufacturer’s protocol. The sections were incubated with anti-Ki-67 antibody (Abcam, ab16667) at the recommended dilution, followed by detection using an appropriate HRP-conjugated secondary antibody and chromogenic substrate (according to the manufacturer’s instructions). Nuclei were counterstained with haematoxylin.

PRC2 inhibition in organoids

Intestinal and pancreatic ductal organoids were treated with a combination of A-395 hydrochloride (Sigma-Aldrich) and UNC1999 (Sigma-Aldrich), as described previously⁹⁹. Treatment was initiated immediately after seeding into Matrigel and continued over the course of two passages (pancreas, 12 days; intestine, 10 days) to not only facilitate the block of de novo H3K27me3, but also to allow for the dilution of existing H3K27me3 through cell division. To ensure sufficient cell proliferation during the treatment period, organoids were split once between passages. The final concentrations were 4 µM A-395 and 2 µM UNC1999, added directly to the organoid culture medium. Control organoids were treated with the corresponding vehicle controls (distilled H₂O for A-395, DMSO for UNC1999). Organoids were cultured under standard conditions and monitored throughout the treatment period. At the end of treatment, organoids were counted, and cell pellets were collected for RNA isolation and western blot analysis.

Western blotting of histone marks

Cell pellets were lysed in RIPA buffer (Thermo Fisher Scientific) containing protease (Pierce Mini Tablets) and phosphatase inhibitor mixes I and II (SERVA), and sheared using a Covaris M220 (20 °C, 2 min, peak power 50, duty factor 20, 200 cycles per burst). The protein concentration was measured using Pierce BCA Protein Assay Kit (Thermo Fisher Scientific), and 40 µg per sample was denatured in 5× Lämmli buffer at 95 °C for 5 min. Proteins were separated on Mini-PROTEAN TGX gels (BioRad) at 65 V (stacking) and 90 V (resolving) and transferred to 0.45 µm nitrocellulose membranes (Thermo Fisher Scientific), soaked in Power Blotter 1-Step transfer buffer (Thermo Fisher Scientific) using the Power Blotter Station PB0010 (Thermo Fisher Scientific). Membranes were blocked in 5% BSA/TBS for 1 h, incubated overnight at 4 °C with anti-H3K27me3 (tri-methyl-histone H3 (Lys27) rabbit monoclonal antibody, Cell Signaling, 9733, 1:1,000) and anti-H4 (histone H4 (L64C1) mouse monoclonal antibody, Cell Signaling, 2935, 1:1,000) in 2.5% BSA/TBST, washed and incubated with anti-mouse Dylight 680 (Cell Signaling, 5470, 1:8,000) or anti-rabbit Dylight 800 (Cell Signaling, 5151P, 1:8,000) for 1 h. After the final washes, the blots were imaged on the LI-COR Odyssey Fc system and analysed using Image Studio.

Statistics and reproducibility

For each experiment, all statistics were performed as indicated in the respective Figure and Extended Data Figure legends. Statistical testing across all classes was performed to account for multiple testing. Continuous variables were tested for normal distribution. Non-parametric tests were used for non-normally distributed data. GraphPad Prism (v.8.0.1) was used for significance calculations. Complex statistical techniques are explained in detail in the relevant Methods section.

Materials availability

MCCA lines are available from the lead contact or the original contributor on request. All detailed information can be found at the ‘Resource availability’ and ‘Contacts’ pages on www.mcca.tum.de, or in Supplementary Table 1.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.