Ethics
This study was carried out under TwinsUK BioBank ethics, approved by the North West–Liverpool Central Research Ethics Committee (REC reference 19/NW/0187), IRAS ID 258513 and earlier approvals granted to TwinsUK by the St Thomas’ Hospital Research Ethics Committee, later the London–Westminster Research Ethics Committee (REC reference EC04/015).
Sample collection
Semen samples were collected or obtained from archival samples with informed consent from 75 research participants in the TwinsUK cohort25. Archival whole-blood samples were also obtained from 67 of those men from the TwinsUK BioBank. A total of 104 semen samples spanned an age range of 24–75 years and included 29 men with 2 time points separated by a mean of 12.1 years (range of 12–13 years) and the remaining 46 men with a single time point. A total of 133 blood samples were collected at an age range of 22–83 years from men with 1–4 time points. The mean interval between blood time points was 8.1 years (range of 1–13 years). In the cohort, there were a total of nine monozygotic twins and three dizygotic twin pairs. Counts of samples, time points and twin pairs that were successfully sequenced and passed analysis quality control thresholds are summarized in Supplementary Table 1.
Metadata
Self-reported age, height, weight, ethnicity, twin zygosity, smoking and alcohol consumption were obtained from questionnaires provided by TwinsUK and periodically taken. All individuals who provided ethnicity information indicated white. BMI was calculated as weight/height2. A smoking pack-years was defined as 365 packs of cigarettes and a total pack-year was calculated using the highest estimate across all questionnaires from cigarettes per day or week and total years smoked. Alcohol drink-years was calculated using average weekly alcohol consumption extrapolated to the duration of adult life before sampling (age – 18).
DNA extraction
DNA was extracted from sperm samples using a Qiagen QIAamp DNA Blood Mini kit. Isolation of genomic DNA from sperm protocol 1 (QA03 Jul-10) was followed, but with the exceptions of substituting DTT in place of β-mercaptoethanol for buffer 2 and substituting buffer EB in place of buffer AE for the elution of DNA.
DNA was extracted from whole blood using a Gentra Puregene Blood kit using the protocol for 10 ml of compromised whole blood from the Gentra Puregene Handbook (v.06/2011).
Targeted gene panels
Three separate Twist Bioscience gene panels were used for targeted NanoSeq sequencing in this study: (1) a custom pilot panel of 210 genes; (2) a similar but extended custom panel of 263 genes (Supplementary Table 4); (3) and a default exome-wide panel of 18,800 genes. The two custom gene panels were highly similar, with the extended panel being almost exclusively regions added to the pilot panel. From the 84 samples that underwent targeted sequencing, 13 were sequenced using a pilot panel of 210 canonical cancer or somatic driver genes, and all 84 were sequenced using the extended panel of 263 genes. Sequencing coverage and mutations were merged from samples sequenced on both targeted panels and they were treated as ‘targeted’ samples in the Article. The custom panels were designed by gathering sets of published lists of genes implicated as drivers in cancers61,62,63,64 and somatic tissues6,65 as previously described5.
Sequencing and library preparation
Restriction-enzyme whole-genome NanoSeq libraries were prepared as previously described4 and subjected to whole-genome sequencing (WGS) at target 20–30× coverage on NovaSeq (Illumina) platforms to generate 150-bp paired-end reads with 9–10 samples per lane. Standard WGS of blood (31.7× median coverage) was used to generate matched-normal libraries for both restriction-enzyme NanoSeq blood and sperm samples.
Targeted and exome NanoSeq libraries were prepared by sonication and one to two rounds of pull down of target sequences. They were then PCR amplified and sequenced with NovaSeq (Illumina) platforms to generate 150-bp paired-end reads with 7–8 samples per lane for the targeted panel and 2 lanes per sample for the exome panel. These steps are described in detail in ‘Sonication NanoSeq, Library amplification and sequencing, and Hybridization Capture’ of supplementary note 1 of ref. 5.
Base calling and filtering
All samples were processed using a Nextflow implementation of the NanoSeq calling pipeline (https://github.com/cancerit/NanoSeq). BWA-MEM66 was used to align all sequences to the human reference genome (NCBI build37). Whole-genome NanoSeq samples were called with their matched WGS normal and default parameters except for var_b (minimum matched normal reads per strand) of 5 as needed for WGS normals, cov_Q (minimum mapQ to include a duplex read) of 15 and var_n (maximum number of mismatches) of 2.
For targeted and exome samples, we leveraged the high sequencing depth and high polyclonality to exclude variants with VAF > 10% instead of using a matched normal. Default parameters of the calling pipeline except for cov_Q of 30, var_n of 2, var_z (minimum normal coverage) of 25, var_a (minimum AS-XS) of 10, var_v (maximum normal VAF of 0.1) and indel_v (maximum normal VAF) of 0.1. After variant calling, we further filtered variants to those below 1% VAF. The few variants observed between 1% and 10% duplex VAF as variants were highly enriched for mapping artefacts, particularly for indels. No excluded variant from the additional 1% VAF threshold was found to be either a likely driver or a ClinVar pathogenic variant.
A set of common germline variants from dbsnp67 and a custom set of known artefactual call sites in NanoSeq datasets were masked for coverage and variant calls as previously described4.
Assessing DNA contamination
The single-molecule accuracy of the duplex sequencing method NanoSeq allows sequencing of polyclonal cell types such as sperm, but also renders mutation calls sensitive to nontarget cell-type contamination and to contamination of foreign DNA. Nontarget cell-type contamination was evaluated using manual cell counting of semen samples, which resulted in the exclusion of six samples with a sperm count of <1 million sperm per ml. Sperm counting methods and analysis are detailed in Supplementary Note 1.
Foreign DNA contamination in whole-genome NanoSeq samples was assessed using verifyBamID68, which checks whether reads in a BAM file match previous genotypes for a specific sample, with higher values indicating more contamination. Three blood whole-genome NanoSeq samples were excluded on the basis of a verifyBamID alpha value above the suggested4 cut-off value of 0.005. In sperm, we found that several samples had outlier mutation burdens with verifyBamID values just below the 0.005 cut-off. This result is logical, as sperm has a much lower mutation rate compared with somatic tissues, for which the recommended cut-off was designed. Consequently, sperm samples are more sensitive to low levels of contamination. To account for this, we adjusted the verifyBamID alpha threshold for sperm to a more stringent level of 0.002, which resulted in the exclusion of three samples on the basis of this criterion.
When assessing foreign DNA contamination in targeted and exome samples, we found that 9 targeted and 3 exome samples had verifyBamID values above >0.002, slight outlier mutation burdens and a high ratio of SNP masked variants to passed variants (4-fold to 16-fold more masked variants). After further investigation, we found that all samples that exceeded verifyBamID thresholds were processed in the same sequencing batch and that this contamination could be explained by inherited germline variants of other samples in that same batch. This result suggested that a small amount of cross-contamination may have occurred during sample preparation or sequencing steps. To remove contaminant germline mutations, we performed an in silico decontamination step as previously described4. This step involved calling germline variants from all targeted and exome samples using bcftools mpileup69 at sites where there were >10 reads and a mutation call with VAF > 0.3. All such sites were subsequently masked across all samples for both mutation calls and coverage, which essentially extended the default common SNP mask to also include rare inherited variants across the cohort. This approach resulted in all samples previously identified as contaminated with mutation burdens consistent with their age and all with a ratio of masked to passed variants of <0.1, and were therefore retained for analysis.
Corrected mutation burdens
Given that mutation rates are strongly influenced by trinucleotide composition, it is important to consider differences in sequence composition when comparing mutation rates in datasets that target different regions of the genome70. The coding region target panels and the restriction enzyme used in whole-genome NanoSeq for instance, each will systematically bias sequencing coverage to specific genomic regions that may not reflect the full genome. To correct for these effects, in each sperm NanoSeq dataset, we generated a corrected mutation burden relative to the full genome trinucleotide frequencies as previously described4.
Comparison of burden estimates
To compare whole-genome NanoSeq mutation burdens to mutation burdens from standard WGS, we multiplied the corrected mutation burden estimates described in the previous section by the genome size per cell type. We assumed 2,861,326,455 mappable base pairs in a haploid cell for germline datasets and the diploid equivalent for blood.
External datasets for comparison to NanoSeq results were processed to achieve comparable burden estimates. For testis WGS samples71, we implemented a previously described method4 that restricts analysis to regions with high coverage (20+ reads) that overlap with NanoSeq covered regions and corrected for differences in trinucleotide background frequencies relative to the full genome. For trio paternally phased DNMs from standard sequencing, as a callable genome size per sample following thorough filtering was available, we generated the mutation per cell estimate by multiplying the paternally phased DNM count by the ratio of the total genome size to the callable genome size of the sample as follows: DNMs paternal × (total genome/callable genome). For comparison with cell types in blood, we directly compared results to published mutation burden regressions13.
Mutation burden regressions
To investigate the relationship between age and mutation burdens, we performed linear mixed-effects regression analyses. For each tissue and mutation type for which a regression was performed, the model was constructed using the lmer function from the lme4 package72 in R. Each model included age at sampling as a fixed effect and a random slope for each individual to account for multiple time point samples, with restricted maximum likelihood (REML) set to false, specified as follows:
$$\rml\rmm\rme\rmr(\rmb\rmu\rmr\rmd\rme\rmn\sim \rma\rmg\rme+(\rma\rmg\rme-1|\rmi\rmn\rmd\rmi\rmv\rmi\rmd\rmu\rma\rml),\,\rmR\rmE\rmM\rmL=\rmF\rmA\rmL\rmS\rmE)$$
The 95% CIs for regression lines were calculated through bootstrapping by simulating prediction intervals. For each model, we generated 1,000 bootstrap samples. Predictions and their associated standard errors were calculated for a sequence of ages from 14 to 84 years. The 95% CIs were then derived by determining the range within which 95% of the bootstrap sample predictions fell.
Mutational signature analysis
We extracted DNM signatures using hierarchical dirichlet process (HDP; https://github.com/nicolaroberts/hdp), which is based on the Bayesian hierarchical dirichlet process. HDP was run with double hierarchy: (1) individual identifier (ID) and (2) tissue types (either blood or sperm), without the COSMIC reference signatures73 (v.3.3) as priors, on the mutation matrices. The number of mutations were normalized for the trinucleotide context abundance specific for each sample relative to the full genome. Both clustering hyperparameters, beta and alpha, were set to one. The Gibbs samples ran for 30,000 burn-in iterations (parameter ‘burnin’), with a spacing of 200 iterations (parameter ‘space’), from which 100 iterations were collected (parameter ‘n’). After each Gibbs iteration, three iterations of concentration parameters were conducted (parameter ‘cpiter’). Two components were extracted as DNM signatures, which were further reconstructed and decomposed into known COSMIC (v.3.3) SBS signatures using SigProfilerAssignment (https://github.com/AlexandrovLab/SigProfilerAssignment).
Quantifying selection with dN/dS
To examine genes under positive selection and to quantify global selection, we used the dNdScv algorithm28. This algorithm was extended using the base-pair-level duplex coverage, the methylation level and the pentanucleotide context to capture more complex context-dependent mutational biases and to achieve higher accuracy for our selection analysis. Detailed methods for input mutations, model selection and evaluation, site dN/dS tests, driver mutation estimation and gene set enrichment are described in Supplementary Note 3.
Gene disease and mechanism annotation
Positively selected genes were annotated with monoallelic disease consequences using the 29 February 2024 release of the DDG2P database39 and the 21 June 2024 release of the OMIM database74. OMIM annotations related to somatic disease, complex disease or tentative disease associations were excluded.
Genes were also annotated for their potential mutation mechanism observed in sperm and cancer or developmental disorders when available. In sperm, genes were labelled as LOF if they had nominal enrichment of nonsense + splice variants and/or indel variants (ptrunc_cv <0.1 | pind_cv <0.1) and 2+ LOF mutations. There were two exceptions to this, whereby genes met these thresholds but were labelled as activating owing to having a restricted repertoire of LOF mutations that are known to be oncogenic in cancers: CBL (LOFs in and downstream of the RING zinc finger domain)75 and PPM1D (LOFs in final two exons)76. All other genes had missense enrichment only and were labelled as activating. The mechanism in cancer was defined by using the ‘role in cancer’ field of the COSMIC cancer gene census (v.99)38, where ‘oncogene’ was labelled as activating and ‘tumour suppressor gene’ as LOF. Annotations of a fusion mechanism were not displayed, except for genes that had neither an oncogene nor a tumour suppressor annotation, which were labelled as ‘fusion only’. The developmental disorder mechanism was defined by using the variation consequence field of DDG2P, for which ‘restricted repertoire of mutations;activating’ was labelled as activating and ‘loss_of_function_variant’ was labelled as LOF.
Gene mutation plots
The lollipop gene mutation plots were created using a coordinate system, whereby position 1 was defined as the first coding base of the GRCh37-GencodeV18+Appris77 transcript of the gene. The data sources included protein domains from UniProt78, somatic mutations from the exome and genome-wide screens of the COSMIC (v.99)38, ClinVar (release 2024.07.01)41 pathogenic annotation, per base pair cohort-wide (targeted + exome) NanoSeq coverage, sperm mutation count (number of independent individuals with a mutation) and mutation consequence and amino acid change annotated by the dNdScv algorithm28. These data were plotted using code modified from the lolliplot function in the trackViewer R package79.
Variant annotation
Variants were annotated using Ensembl’s variant effect predictor (VEP)80 with added custom annotations of mutation context, ClinVar (release 2024.07.01)41, CADD (v.GRCh37-v1.6)42 and the average methylation level in testis. Methylation data were obtained from whole-genome shotgun bisulfite sequencing methylation data from male testis samples from a 37-year-old (ENCFF638QVP) and 54-year-old (ENCFF715DMX) from the ENCODE project71. The average methylation level was calculated by selecting CpG sites with coverage of three or more and averaging the per cent of sites methylated between the two samples.
Variants were annotated as likely monoallelic disease-causing mutations if they met at least one of two criteria: (1) reported in ClinVar as pathogenic, likely pathogenic or if they were reported as ‘conflicting classifications of pathogenicity’, where the conflict was between reports of pathogenic/likely pathogenic and ‘uncertain significance’ with no reports of benign or likely benign and not specified as a recessive condition; or (2) were a highly damaging variant in high-confidence monoallelic developmental disorder genes from DDG2P39. Genes met the following criteria: (1) allelic requirement being monoallelic_autosomal, monoallelic_X_hem, monoallelic_X_het or mitochondrial; (2) confidence in strong, definitive or moderate; and (3) a mutation consequence of ‘absent gene product’. Highly damaging was defined as being a ‘high’ impact variant in the VEP annotation (frameshift splice_acceptor, splice_donor, start_lost, stop_gained or stop_lost) or a missense variant with CADD42 score >30 (top 0.1% damaging).
Variants were defined as a likely driver if they met the ‘highly damaging’ criteria defined above in a significant germline-selection gene with LOF mutation enrichment or if they were in 1 of the 24 significant mutation hotspots. This resulted in 320 variants being labelled as likely drivers in exome samples.
Cell fraction mutation estimates
To calculate the mean count of synonymous, missense or LOF or pathogenic mutations per sperm cell, we summed the duplex VAFs of all variants in that class. For example, if an individual had 3 synonymous mutations each observed once with a duplex coverage of 100 at each of those sites, each of those variants would have a duplex VAF of 1/100 = 0.01. The sum of VAFs in this example would then be 0.03 and this would then be reported as the estimate for the mean count of synonymous variants per sperm cell for that individual. At low fractions such as 0.03, the mean count per cell is approximately equivalent to the percentage of sperm with this mutation class (3%); therefore, the driver and disease mutations are reported as percentage estimates. At higher fractions (for example, mean count > 1), the fractions are not equivalent to percentage, as many cells will have multiple variants of that class; therefore, the estimates are reported as mean count.
Expected mean counts for SNVs were generated by annotating each possible substitution at each covered site with an expected number of mutations per sample as given by expCountSNV = context_mut_rate × duplex_coverage × age_correction. The context_mut_rate was given by the 208 basePairCov + cpgMeth trinucleotide mutation model estimates for that trinucleotide + methylation mutation context (Supplementary Note 3 and Supplementary Table 9). Duplex coverage is the exact duplex coverage at that site for that sample. The age_correction parameter was given to normalize the mutation model estimates (derived from all exome samples) to the mutation rate of that sample based on age. Specifically, we fit a linear model to the mutation burden versus age of the exome samples and used this to generate a predicted mutation rate for each sample based on age. The per sample correction parameter was calculated as the age predicted mutation burden divided by the mean mutation rate of all exome samples (3.89 × 10–8). The resulting corrections spanned from 0.50 (youngest sample) to 1.42 (oldest samples). The expected indel mutation rate was calculated in the same way, except with a single mutation rate parameter (indels/bp) expCountIndel = indel_mut_rate × duplex_coverage × age_correction. The expected mean count was then calculated for each category (for example, synonymous or likely disease) by summing the expected values for each SNV and/or indel base pair matching the relevant annotation. As background for possible ClinVar pathogenic variants, we only considered indels of size 21 bp or less, the largest detected indel in the dataset. Regressions were fit with either linear models or generalized linear models (glm in R) with family = quasibinomial.
Regression analysis
We tested for associations between mutation outcome variables from sperm genome, sperm exome, sperm targeted and blood genome NanoSeq data and the phenotype predictor variables of BMI, smoking pack-years and alcohol drink-years. These tests were performed using a Gaussian family generalized linear regression in R. For each mutation outcome variable the test took the following form: glm(mutationOutcome ~ age_at_sampling + BMI + pack_years + drinkYears, family = “gaussian”).
The mutation outcome variables examined were SNV and indel burden from all four sequencing datasets, SBS1 and SBS5 count from sperm genomes, SBS1, SBS5 and SBS19 from blood genomes, and likely disease cell fraction and likely driver cell fraction from sperm targeted and sperm exomes. The significance of each predictor was assessed from the summary output coefficients of the model, and P values were adjusted for 68 total tests using the false discovery rate method.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.