
Somatic evolution following cancer treatment in normal tissue

Tissue collection and sample preparation

Tissue was collected from patients enrolled in the PEACE study (ethics approval reference 11/LO/1996).

Samples were selected from organs where no metastasis was evident during autopsy. Patients were prioritized according to the number of different organ sites available for collection, and to enable balanced sex and mixed age representation. Samples were collected from anatomical regions, snap-frozen in liquid nitrogen and stored long-term at −80 °C. All samples (n = 168) used in the cohort lockdown were bioinformatically assessed for the presence of infiltrating cancer cells and large-scale allelic imbalance (see below) and deemed cancer-free. Additionally, for 118 cases, we performed pathology review to provide further evidence of cancer-free status. Pathology review involved analysis of tissue adjacent to the sequenced samples, which was fixed in formalin, embedded in paraffin and stained with haematoxylin and eosin before scanning with a NanoZoomer digital pathology system (Hamamatsu). Digital slides were then examined to evaluate the absence or presence of malignancy. This revealed 18 samples in which tissue adjacent to the sequenced tissue either contained tumour cells or could not be confidently classified as cancer-free. Removing these samples, which had been bioinformatically defined as cancer-free, did not qualitatively alter any of the results. Additionally, five metastatic samples from five different patients were collected. A 2 mm³ piece of tissue was processed for DNA extraction using the Qiagen AllPrep kit following the manufacturer’s instructions. DNA from blood was purified using the DNeasy Blood and Tissue kit (Qiagen). Purified nucleic acids were assessed for yield and purity using DNA Broad Range assay kits (Invitrogen).

Panel design

To investigate mutational processes representative of genome-wide trinucleotide content, we designed a targeted genomic panel spanning 82.5 kb focusing on 30 cancer and normal tissue driver gene regions, selected taking into account the mutation frequency in multiple cancer and normal tissue cohorts14. The regions included in our panel are detailed in Supplementary Table 2. In addition to the driver gene regions, our panel also encompasses several genomic regions with comparable representation of the genome-wide trinucleotide context that are under neutral selection, as defined by TwinStrand Biosciences. These regions are used to study mutational processes in a context that is not influenced by selective pressures.

Library preparation and sequencing

Libraries were prepared using 1,000 ng of extracted gDNA as input into the TwinStrand DuplexSeq Library Preparation Kit as per the manufacturer’s guidelines. In brief, 1,000 ng of gDNA underwent enzymatic fragmentation, end repair and A-tailing before being ligated with unique DuplexSeq adapters followed by 10 cycles of an indexing PCR reaction. Hybrid capture was then performed using a custom 82.5-kb capture panel from TwinStrand, followed by 16 cycles of PCR amplification. Libraries underwent a second round of hybridization using the same custom capture panel, followed by another five cycles of PCR amplification. The final libraries were then quantified and assessed using the Qubit fluorometer (Thermo Fisher Scientific) and TapeStation 4200 (Agilent) before being sequenced with 150 bp paired end reads on the Illumina NovaSeq 6000 system.

NanoSeq libraries

Libraries were prepared using the NanoSeq protocol as previously described16. In brief, 2 ng of extracted gDNA was purified using a 1:1 mixture of nuclease-free water and SPRIselect beads (Beckman Coulter, B23319). Samples were then fragmented on-bead using the HpyCH4V restriction enzyme (New England Biolabs, R0620S) at 37 °C for 15 min and purified with 2.5× SPRIselect beads. The fragmented and cleaned-up DNA was A-tailed and ligated with xGen CS Duplex Adapters (Integrated DNA Technologies, 1080799) and purified again with 1× SPRIselect beads, resuspending in a final volume of 20 μl nuclease-free water.

The adapter-ligated libraries were quantified by quantitative PCR (qPCR) using the KAPA Library Quantification Kit (Roche, KK4828) with custom primers, as previously described16. Using the qPCR concentrations, libraries were normalized to 0.6 fmol in 20 μl nuclease-free water. The normalized libraries were added to a PCR mastermix containing 25 μl NEBNext Ultra II Q5 Master Mix (New England Biolabs, M0544S) and 5 μl xGen UDI Primers (Integrated DNA Technologies, 10008052). Libraries were amplified for a total of 13 cycles and cleaned up twice using 0.7× SPRIselect beads. The final libraries were assessed using the Qubit fluorometer (Thermo Fisher, Q33231) and TapeStation 4200 D1000 assay (Agilent, 5067-5582). Libraries were then pooled and sequenced using 150 bp paired end reads on the Illumina NovaSeq 6000 platform, aiming for 30× coverage per sample.

Non-duplex NanoSeq libraries

Non-duplex NanoSeq libraries were prepared using the protocol described above, with the following modifications: 10 ng of DNA was used as input into the library preparation, and libraries were not quantified by qPCR; instead, the entire library volume was used for a 10-cycle indexing PCR. The remainder of the protocol was carried out as described above.

Duplex sequencing bioinformatic analyses

Duplex sequencing fastqs were processed using a bespoke Nextflow47 pipeline that uses the fgbio suite (v.2.2.1) and follows the best practices for duplex sequencing processing (https://github.com/oriolpich/normal_tissues_nature_2025/tree/main/src/duplex_nf/DuplexPipe). In brief, FastqToBam was used to extract unique molecular identifiers, followed by alignment using bwa-mem (v.0.7.17)48. The reference genome used was GRCh38 (RefSeq assembly GCA_000001405.15), using a version without alternative contigs (no_alt_analysis_set), with a patch applied to allow proper mapping of the U2AF1 gene.

The bams were reformatted and template-coordinate sorted using ZipperBams and samtools (v.1.19.2), respectively. We then used GroupReadsByUmi (--edits = 1, --min-map-q = 10, --strategy = paired), and called consensus reads using CallDuplexConsensusReads, requiring a minimum of four consensus reads (at least two for each strand) (--min-reads = 4 2 2, --error-rate-pre-umi = 45, --error-rate-post-umi = 40, --min-input-base-quality = 20). The consensus reads were then remapped to the reference genome, and the bam containing the aligned consensus reads was filtered using FilterConsensusReads (--max-read-error-rate = 0.025 --max-base-error-rate = 0.05 --min-mean-base-quality = 50 --min-base-quality = 60 --max-no-call-fraction = 0.2 --require-single-strand-agreement true). Reads were then hard-clipped (ClipBam, --read-one-five-prime 7 --read-two-five-prime 7) and, finally, reads were removed if the difference between the primary alignment score and the secondary alignments in forward and reverse16 was 50 or less (AS − XS ≤ 50).

Mosdepth (v.0.3.5)49 was used to identify base-pair coverage and in-target coverage. We included off-target regions (that is, regions that we did not cover originally with our panel), if the median coverage was higher than 15,000. We then called variants using VarDict50 in these regions.

Point mutations were excluded if any of the following applied: (1) the number of mismatches in reads supporting the variant was greater than 4; (2) the proportion of Ns at the mutated position (bases with no consensus) was greater than 0.05; or (3) the coverage was lower than the sample median depth minus three times the depth standard deviation. For indels, variants were removed when the proportion of Ns was greater than 0.10 or when they failed the same coverage filter.
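The three point-mutation exclusion criteria can be sketched as a single predicate; a minimal illustration, in which argument names such as n_mismatches are ours and not the pipeline's actual fields:

```python
def passes_point_filters(n_mismatches, n_fraction, depth, median_depth, depth_sd):
    """Return True if a point mutation survives the three filters above."""
    if n_mismatches > 4:                      # (1) mismatches in supporting reads
        return False
    if n_fraction > 0.05:                     # (2) proportion of no-consensus Ns
        return False
    if depth < median_depth - 3 * depth_sd:   # (3) low-coverage filter
        return False
    return True

# A variant with 2 mismatches, 1% Ns and depth 9,000 against a sample
# median of 10,000 (s.d. 500) passes all three filters.
print(passes_point_filters(2, 0.01, 9000, 10000, 500))
```

The indel variant of the filter differs only in using an N-fraction threshold of 0.10 and omitting the mismatch criterion.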

We applied a Kolmogorov–Smirnov test to remove recurrent artefacts based on the distribution of mutant base positions within reads51 (false discovery rate (FDR) < 0.05). We also applied a whole-genome single nucleotide polymorphism (SNP) mask that comprises common SNPs, and the NOISE mask that contains sites with elevated error rates, from Abascal et al.16, removed variants with mean mismatches greater than 4, and discarded indels with MSI >5 as annotated by VarDict. To filter putative cross-sample contaminants, we computed a P value for each variant observed more than 20 times in the cohort under a global binomial model parameterized by the cohort aggregate VAF; variants with FDR < 0.05 were removed. Bona fide driver hotspots defined by Chang et al.52 supplemented with 2 driver indel hotspots (chr17_60663262_CA_C [PPM1D] and chr1_26779439_TG_T [ARID1A]), were exempt from this contamination filter.
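The contamination filter can be sketched as an exact binomial upper tail combined with Benjamini-Hochberg adjustment. How trials and successes map onto cohort counts is our assumption (the text specifies only a global binomial model parameterized by the cohort aggregate VAF):

```python
import math

def binom_sf(k, n, p):
    """Exact upper-tail probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank, i in zip(range(m, 0, -1), reversed(order)):
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

# e.g. probability of seeing >= 2 supporting observations out of 10
# under an aggregate VAF of 0.5, then FDR-adjusting a set of p-values
p = binom_sf(2, 10, 0.5)
adj = bh_fdr([0.01, 0.04, 0.03, 0.005])
```

Variants with an adjusted value below 0.05 would be flagged, except at the exempted driver hotspots.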

All sites failing any of these filters, except those that only failed the Kolmogorov–Smirnov test, were aggregated into a site blacklist, which was subsequently used for analyses of positive selection.

Mutations were further classified as probably originating from blood using a conservative approach. For each patient, mutations were classified as blood-like if they were found both in blood and in more than 25% of non-blood samples (unless they were bona fide EGFR, PIK3CA or KRAS driver hotspots, in which case they were labelled as suspicious), or as suspicious if found in more than 50% of non-blood samples but not in blood. For patients without blood samples, mutations were classified as blood-like if they occurred in more than 50% of samples. Mutations deemed non-blood-like were used for positive selection analyses.
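The decision rules above can be sketched as a small classifier; the "somatic" label for unflagged mutations is our shorthand, not the paper's terminology:

```python
def classify_mutation(in_blood, frac_nonblood, is_driver_hotspot, blood_available=True):
    """Blood-origin classification sketch.

    frac_nonblood: fraction of the patient's non-blood samples carrying
    the mutation (fraction of all samples when no blood is available).
    """
    if not blood_available:
        return "blood-like" if frac_nonblood > 0.5 else "somatic"
    if in_blood and frac_nonblood > 0.25:
        # bona fide EGFR/PIK3CA/KRAS hotspots are downgraded to 'suspicious'
        return "suspicious" if is_driver_hotspot else "blood-like"
    if not in_blood and frac_nonblood > 0.5:
        return "suspicious"
    return "somatic"
```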

Mutations were then annotated using the Ensembl Variant Effect Predictor (v.109)53. Variants were also annotated with AlphaMissense54, and mutations with a score higher than 0.56 were deemed putative driver mutations, as described in the original publication.

NanoSeq samples were processed as described in the original publication (https://github.com/cancerit/NanoSeq, release 3.5.5). The normal control was either a whole-genome-sequenced sample or another NanoSeq-sequenced sample from the same patient.

Identification of SNPs

Homozygous SNPs were identified on the basis of having a VAF greater than 0.95 in all samples from an individual patient. Putative heterozygous SNPs were identified as any variant present in all samples with a VAF between 0.01 and 0.85.
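A sketch of these thresholds, assuming the VAF of a candidate variant is available for every sample of the patient:

```python
def classify_germline_variant(vafs):
    """Classify a variant observed in all samples of a patient (sketch)."""
    if all(v > 0.95 for v in vafs):
        return "homozygous SNP"
    if all(0.01 <= v <= 0.85 for v in vafs):
        return "putative heterozygous SNP"
    return "unclassified"
```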

Identification of somatic copy number alterations and potential tumour contamination in individual samples

To explore the presence of somatic copy number alterations in individual samples we considered whether heterozygous SNPs deviated from the expected 0.5 or the median value across samples (thus taking into account any alignment biases).

To test for deviation, we employed a two-sided binomial test in R using the prop.test function. For each sample, we calculated the proportion of SNPs that deviated from 0.5 and the proportion that deviated from the median proportion across samples. We observed that tumour samples exhibited clear deviation from expected values, with evidence of somatic copy number alterations associated with clonal expansions. Samples exhibiting any evidence of clonally expanded somatic copy number alterations were removed. Although these alterations may reflect non-malignant clonal expansions, we excluded such samples because we could not rule out potential malignant infiltration.

To further assess potential cancer cell infiltration into normal tissue we considered whether heterozygous SNPs deviated consistently between samples, potentially indicating shared clones or tumour contamination. In brief, for individuals where we sequenced a tumour sample (or a sample with a clonal expansion), we identified all heterozygous SNPs in the sample that exhibited evidence for allelic imbalance (P < 0.05, binomial test). We divided this set of SNPs into two groups, ‘high’ or ‘low’, reflecting whether the B-allele frequency was significantly greater than the median or not. Any normal sample with contamination would be expected to exhibit a significant bias whereby SNPs in the ‘high’ category should have a higher B-allele frequency than those in the ‘low’ category. We therefore performed a one-way Wilcoxon test to evaluate contamination. The following samples were removed as a result of potential contamination: CRUKP5732_2_BLOOD, R_CRUKP5732_N_AD_1, R_CRUKP5732_N_CA_1 and R_CRUKP0031_N_AD_1. Finally, we clustered samples from each individual based on the cosine similarity of their B allele frequencies. We removed any samples that clustered together with sequenced tumour samples. This led to the exclusion of R_CRUKP0031_N_LI_2, CRUKP0031_N_LI_2_1 and CRUKP0031_N_LI_2_2.

Mutational signature analysis

A hierarchical Dirichlet process (HDP) model, available at https://github.com/nicolaroberts/hdp, was run to extract mutational signatures across 100 independent chains with two layers: the first patient and the second tissue.

For SBS extraction, we used ‘SBS1’, ‘SBS5’, ‘SBS40’, ‘SBS4’, ‘SBS92’, ‘SBS25’, ‘SBS31’, ‘SBS35’ and ‘SBS17b’ from COSMIC v.3.4 and ‘Temozolomide’ from Kucab et al.27 as priors, with the following parameters: n_posterior = 100, n_space = 2000, nburnin = 500000, ninitial_clust = 25, prior_c = 1000. For indel extraction, we used all COSMIC indel signatures as priors, included the radiotherapy-related signature described in ref. 29, and ran HDP with the following parameters: n_posterior = 200, n_space = 2000, nburnin = 50000, ninitial_clust = 30, prior_c = 20000. For double-base substitution extraction, we used all COSMIC DBS signatures as priors, plus the DeGasperi DBS signature related to cisplatin23. HDP was then run with the following parameters: n_posterior = 200, n_space = 2000, nburnin = 50000, ninitial_clust = 30, prior_c = 20000.

All of the extracted signatures were matched to COSMIC if the cosine similarity was higher than 0.9. Otherwise, we named each of the non-COSMIC signatures alphabetically (SBS-A, SBS-B, and so on).
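The matching step can be sketched with a plain cosine similarity; the profiles below are toy three-channel vectors rather than real 96-channel signatures:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def name_signature(extracted, cosmic, threshold=0.9):
    """Return the best-matching COSMIC name, or None if below threshold
    (None would then be replaced by SBS-A, SBS-B, ... in order)."""
    best_name, best_sim = max(
        ((name, cosine_similarity(extracted, profile))
         for name, profile in cosmic.items()),
        key=lambda pair: pair[1],
    )
    return best_name if best_sim > threshold else None

cosmic = {"SBS1": [0.8, 0.1, 0.1], "SBS5": [0.2, 0.4, 0.4]}
```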

The signature attribution to individual mutations was done as previously described13,55. In brief, given a set of exposures to mutational signatures and their mutational profile, a probability per type of mutation per patient can be generated.
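Under this model, the per-mutation attribution is the exposure-weighted signature probability for the mutation's channel, renormalized across signatures; a sketch with illustrative two-signature inputs:

```python
def attribution_probabilities(exposures, signatures, channel):
    """P(signature | mutation of a given channel) =
    exposure_s * P(channel | s), normalized over signatures (sketch)."""
    weights = {s: exposures[s] * signatures[s][channel] for s in exposures}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# two signatures with equal exposure; SBS1 emits the channel 4x more often
probs = attribution_probabilities(
    {"SBS1": 0.5, "SBS5": 0.5},
    {"SBS1": {"A[C>T]G": 0.8}, "SBS5": {"A[C>T]G": 0.2}},
    "A[C>T]G",
)
```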

Mutations per genome per cell

As discussed5,11, the mutations per genome per cell (β) can be approximated as:

$$\beta \approx 2\times \sum _{j}{\mathrm{VAF}}_{j}/L_{\mathrm{Mb}}$$

where j indexes each observed mutation and L_Mb is the number of megabases sequenced with sufficient coverage in our panel. The attribution of mutations per genome per cell to each signature was performed by multiplying this value by each of the signature exposures.
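The formula transcribes directly; the factor of 2 converts allele fractions to a per-diploid-cell count, and the variable names below are ours:

```python
def mutations_per_genome_per_cell(vafs, covered_mb):
    """beta ≈ 2 × Σ_j VAF_j / L_Mb."""
    return 2 * sum(vafs) / covered_mb

def per_signature_burden(beta, exposures):
    """Attribute the per-cell burden to each signature by its exposure."""
    return {sig: beta * e for sig, e in exposures.items()}

# two mutations at VAFs 0.1 and 0.2 over 3 Mb of well-covered panel
beta = mutations_per_genome_per_cell([0.1, 0.2], covered_mb=3.0)
burden = per_signature_burden(beta, {"SBS1": 0.5, "SBS5": 0.5})
```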

Relative effect of mutational processes

The contribution of certain mutational processes relative to age was obtained by dividing their slopes. This value is scaled by 40 pack years for smoking (representing a heavy smoker), 6 cycles for platinum (the average number of cycles) and 50 drink years for alcohol (the average number of drinking years). In Fig. 4c, for treatments where we could not calculate the slope, we divided the number of treatment-related mutations by the time between the start and end dates of treatment.

TCGA and other cohorts

TCGA PanCanAtlas data were obtained from ref. 36. To calculate the number of mutations per genome of the most recent common ancestor, we applied an approach to derive clonal and subclonal mutations56. We then selected the clonal mutations and normalized by the length of the exome. Mutations per genome in the datasets from refs. 3,4 were calculated using the same formula as above, using the size of the covered region as defined in those papers.

Positive selection analyses

To evaluate positive and negative selection we implemented dNdScv35 using a bespoke reference coding sequence reflecting our sequencing panel. To account for varied coverage across our panel, for each individual sample we modified the mutation opportunity matrix L to reflect the sequence coverage. In brief, we obtained a coverage bed file for each sample, then for each gene obtained the median coverage at each of the 192 channels, and adjusted the L matrix accordingly. Blacklisted sites were given a coverage of 0. We did the same for each tissue and for the full cohort for specific analyses.
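The exact scaling applied to dNdScv's opportunity matrix is not fully specified in the text; one plausible sketch is to scale each of a gene's channels by its relative median coverage, with blacklisted sites entering at zero:

```python
def adjust_opportunity_row(L_gene, median_channel_coverage, reference_coverage):
    """Scale one gene's mutation-opportunity row (one value per channel)
    by observed/reference coverage. Illustrative only: the reference level
    and exact normalization are our assumptions, not the paper's recipe.
    Blacklisted sites arrive with coverage 0 and contribute no opportunity."""
    return [
        l * (cov / reference_coverage)
        for l, cov in zip(L_gene, median_channel_coverage)
    ]

# a two-channel toy row: full opportunity at 5,000x vs a blacklisted channel
row = adjust_opportunity_row([10, 10], [5000, 0], reference_coverage=10000)
```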

For each sample and each gene within our panel, we obtained dNdScv missense and nonsense values. To avoid depicting spurious dN/dS values resulting from genes and samples with few mutations, in Fig. 5b we additionally filtered genes according to the number of expected missense, nonsense and silent mutations.

Specifically, in the context of missense dNdScv values, we filtered gene:sample combinations that exhibited:

<4 expected and <3 observed missense variants, or <2 total observed variants.

Likewise, in the context of nonsense dNdScv values, we filtered gene:sample combinations that exhibited:

<2 expected and <2 observed nonsense variants, or if the combined expected number of synonymous and nonsense variants was <3 with <2 observed events.

Drivers were then identified using ‘qglobalpos_cv’ < 0.1 | ‘qsubpos_cv’ < 0.05 | ‘qindpos_cv’ < 0.05.
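The filtering and driver-calling thresholds above can be sketched as predicates; our reading of "<2 expected and observed nonsense variants" as both counts below 2 is an assumption:

```python
def keep_missense(expected_mis, observed_mis, observed_total):
    """Keep a gene:sample combination for missense dN/dS reporting."""
    return not ((expected_mis < 4 and observed_mis < 3) or observed_total < 2)

def keep_nonsense(expected_non, observed_non, expected_syn_plus_non, observed_events):
    """Keep a gene:sample combination for nonsense dN/dS reporting."""
    return not ((expected_non < 2 and observed_non < 2)
                or (expected_syn_plus_non < 3 and observed_events < 2))

def is_driver(qglobalpos_cv, qsubpos_cv, qindpos_cv):
    """Driver call from the dNdScv q values, using the thresholds above."""
    return qglobalpos_cv < 0.1 or qsubpos_cv < 0.05 or qindpos_cv < 0.05
```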

Radiotherapy exposure assessment

The likelihood of a given tissue being exposed to radiotherapy was assessed by a trained radiologist using the clinical data of each patient, blinded to the radiotherapy-indel signature exposures.

Coverage and variance explained

The relationship between coverage, VAF and the number of mutations was assessed through a linear mixed-effects model using the R package lme4, followed by an ANOVA.

Linear mixed-effects models

To evaluate the effect of both mutagenic and non-mutagenic treatment agents on different outcome variables, we performed mixed-effects multivariable regression analysis. We focused on a selection of key cancer genes (the five genes with the most excess mutations across samples), in addition to the total number of drivers, as well as the mutation burden per cell. To control for repeated samples from patients, the range of tissues and the distinct tumours that these patients harboured, we also included ‘patient’, ‘tissue’ and ‘tumour type’ as random effects in the model, using the lmerTest package in R.

Pack years and drink years

To estimate lifetime exposure to cigarettes we used pack years, calculated by multiplying the number of packs of cigarettes smoked per day (self-reported) by the number of years a person has smoked (self-reported). For example, smoking one pack (20 cigarettes) a day for 20 years equals 20 pack years, and smoking two packs (40 cigarettes) a day for 10 years also equals 20 pack years. We derived a similar metric to estimate lifetime alcohol consumption, ‘drink years’. We defined a baseline ‘drink year’ as drinking 14 units of alcohol a week for a year. For reference, this equates to drinking 6 pints (568 ml each) of standard-strength beer (5% ABV) a week for a year. Given that we did not have historical drinking information (for example, when the individual started drinking), we assumed that drinking was approximately consistent from the age of 18.
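Both metrics reduce to simple arithmetic; a sketch using the worked examples in the text (function names are ours):

```python
def pack_years(packs_per_day, years_smoked):
    """Lifetime cigarette exposure: packs/day × years (1 pack = 20 cigarettes)."""
    return packs_per_day * years_smoked

def drink_years(units_per_week, current_age, start_age=18):
    """Lifetime alcohol exposure: one drink year = 14 units/week for a year,
    assuming consistent drinking from age 18, as stated in the text."""
    return (units_per_week / 14) * max(0, current_age - start_age)

# one pack/day for 20 years and two packs/day for 10 years both give 20
# pack years; 14 units/week from age 18 to 38 gives 20 drink years
```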

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
