Cohort
The COPL study is a prospective cohort study, with the first patients recruited in November 2020 (described in detail previously20). Women with a confirmed or suspected pregnancy loss were referred by a physician to a gynaecological department at one of three hospitals in Denmark: Copenhagen University Hospital Hvidovre, Herlev or North Zealand.
Women were included in the study if older than 18 years, if the pregnancy loss was before gestational age 22 weeks + 0 days (154 days) and if an intrauterine pregnancy had been confirmed by ultrasound, including anembryonic pregnancy as per definition. The participants provided both oral and written informed consent. Excluded from the study were pregnancies of unknown location, extrauterine pregnancies and molar pregnancies. The study was approved by the Health Research Ethics Committees for the Capital Region of Denmark (H-18024745) and followed the General Data Protection Regulation governance.
Sampling
The woman and her male partner (if possible) had venous blood drawn in 4 ml EDTA tubes that were kept at 5 °C until transport in tubes with cushioning material. All of the blood samples from the woman were drawn before initiating treatment for the pregnancy loss while fetal material was in utero, or within 24 h of passage of the fetal material. In cases of spontaneous or medically induced (mifepristone plus misoprostol) emptying of the uterus (53%), the woman was instructed (both orally and in writing) to collect all pregnancy loss tissue at home and place it in the fridge at 5 °C before bringing it to the hospital within 1 week. If she had a surgical removal (47%), COPL staff collected the material at the hospital and placed it at 5 °C. In cases of second-trimester loss, induction and delivery took place at the hospital, and clinical staff collected biopsies from the fetal foot. COPL staff packed and shipped the tissue in padded envelopes to the University of Copenhagen (maximum transportation time 2 h), where it was stored at 4 °C.
We used a previously described method to extract DNA from fetal material20. In brief, fetal tissue samples were washed several times in sterile PBS to remove maternal blood before being examined under a stereomicroscope to identify fetal tissue and/or chorionic villi. In some cases, the tissue could not be identified as either and was designated unknown. Some of these unknown samples were probably blood clots and maternal decidua. The fetal tissue or villi were dissected into pieces of 0.5 cm2 and then subjected to three consecutive PBS washes before being placed into a 2 ml cryotube (Nunc, Thermo Fisher Scientific, 10577391), submerged in liquid nitrogen for snap-freezing and then stored at −80 °C. For DNA extraction, we used the DNeasy Blood and Tissue 96 with TissueLyser LT (Qiagen, 69504) according to the instructions provided by the supplier. In brief, the samples were put on dry ice and the frozen tissue was moved to a 2 ml Safe-Lock Eppendorf tube (Eppendorf, 0030 123.344), as recommended by the supplier. A single bead (Stainless Steel Beads, Qiagen, 69989) and appropriate buffers from the DNeasy kit were added to the tube containing the frozen tissue together with appropriate buffers, which was bead milled for 1 min at 30 Hz using a 5 mm steel bead on a TissueLyser LT (Qiagen, 85600). The remaining steps were performed according to the recommendations of the supplier. Parental blood samples and extracted fetal DNA were shipped to deCODE, Iceland, on dry ice.
All information on the participants and samples was stored in a secure cloud (REDCap) in the Capital Region of Denmark. Blood tubes were labelled with individual bar codes, a pseudonymized code, and whether it was a maternal or paternal sample. Information about the last menstruation (gestational age–last menstruation; GA–LM), pregnancy loss type (missed miscarriage or spontaneous ongoing miscarriage) and gestational age at the time of loss was gathered by interview and patient records on inclusion.
WGS analysis
Sequencing libraries were prepared using the NEBNext Ultra II PCR-free kit (New England Biolabs). In brief, 0.5–1 µg of genomic DNA was fragmented in 96-well TPX-AFA plates (Covaris) to a mean target size of 450–550 bp using the Covaris LE220plus instrument. End repair and A-tailing was performed in a single step followed by ligation of unique dual-indexed sequencing adaptors (IDT for Illumina) and two rounds of SPRI-bead purification (0.6×) using the Hamilton STAR NGS liquid handler. The quality (concentration and insert size) of the sequencing libraries was determined using the LabChip GX (96-samples) instrument (Perkin Elmer). Sequencing libraries were pooled appropriately using Hamilton STARlet liquid handlers and sequenced using the Illumina’s NovaSeq 6000 instrument. We performed paired-end sequencing on the S4 flowcell (v1.0 chemistry), resulting in read lengths of 2 × 150 cycles of incorporation and imaging, in addition to 2*8 index cycles. Real-time analysis involved conversion of image data to base calling in real time. All of the steps in the workflow were monitored using an in-house laboratory information management system with barcode tracking of all of the samples/plates and reagents.
Alignment and sequence variant calling
The WGS data were aligned against the human reference genome (hg38) as previously described51 and managed with an in-house laboratory information management system. In brief, reads were aligned with BWA mem52 (v.0.7.10) and marked for duplicates with Picard tools (v.1.117). The aligned reads in BAM format for fetal and parental samples were used as input to GraphTyper53 (v.2.7.5), the average coverage of the BAM files is shown in Supplementary Fig. 2.
Paternal fraction estimation
VCF files from GraphTyper were parsed to identify genotypes where the mother was homozygous, and where the father carried an allele absent from the mother. Sites where the genotype of the parents passed the following quality thresholds were considered for downstream analysis; a genotype quality greater than 20 and depth greater than 20. Furthermore, we required the AB to be within [30%, 95%] for heterozygous calls; and within a range of [0, 5%] or [95%, 100%] for homozygous reference or alternative calls, respectively.
For sites that passed these quality criteria, we counted the number of reads in the fetus supporting the paternal allele (Y), conditioned on the genotype of the father and the DP genotype field from the fetus (D). We then regressed the number of reads supporting the paternal allele in the fetus on allele counts of the paternal allele in the father (A). For an uncontaminated euploid fetus, the slope and intercept of this regression (Y = D × A) would be expected to be 1 and 0, respectively. The slope is expected to decrease with lower paternal contribution, the same as for maternal contamination or maternal triploidies.
Coverage of the sex chromosomes (allosomes) can be informative about the maternal contamination in cases of pregnancy loss with male fetuses (Supplementary Fig. 10). Consistent with expectation, we found that fetal samples with 20% of the allosome coverage on the Y chromosome were predominantly of low contamination (90.0% paternal fraction) compared with those without Y chromosome coverage (55.9% paternal fraction; P = 2.7 × 10−61).
Estimation of genetic ancestry
Genetic ancestry of parental samples was estimated by running ADMIXTURE54 v.1.3.0 in supervised mode using 1000 Genomes55 CEU (Utah Caucasian, European), CHB (Beijing Han, East Asian), ITU (Indian Telugu, South Asian), PEL (Peruvian, Native American) and YRI (Nigerian Yoruba, African) as training populations. Markers were pruned for pairs in strong linkage disequilibrium by excluding long-range high-linkage-disequilibrium regions56 and running PLINK57 v.1.90b6.15 –indep-pairwise 200 25 0.4. ADMIXTURE was run on batches of ten test individuals at a time.
Coverage-based CNV calling
The CNV caller used for the pregnancy loss set is based on sequence coverage. Its aim is to find CNVs that are long enough for coverage to give enough evidence for calling them. To guard against too high variance in determining the copy-number state at a locus, the sequencing read coverage is binned into 1 kb regions, which is further normalized to overall genomic coverage (using the autosomes).
Sequence coverage shows dependency on G/C content and AT repeat content per region and depends on the fragment length distribution in sequence libraries. Most of the dependency can be captured by the number of G/C base pairs and the number of repeated AT motifs in each read pair fragment58. To correct for G/C content, we considered 400 bp non-overlapping windows across the genome, resulting in a table of number of reads starting at each 400 bp window, classified according to the number of G/C base pairs and the number of AT repeats. After coverage normalization, this gives a correction factor for each 400 bp window. Each 1 kb bin consists of a sequence of overlapping 400 bp fragments, each with a corresponding correction factor from the before mentioned table. The average of those correction factors is used as the correction factor for the 1 kb bin. This correction factor is applied to correct each bin, giving a G/C–AT-corrected normalized binned coverage in 1 kb bins, which are then used in the downstream CNV analysis.
The input data to the CNV caller consist of a series of G/C–AT-corrected normalized 1 kb coverage values for each sample on each chromosome. Supplementary Fig. 11 shows four such series at a locus. Each datapoint has an underlying count of sequencing reads that are assumed to be approximately Poisson distributed. Normalization and G/C–AT correction apply a scaling factor resulting in a scaled Poisson variable. The mean of the Poisson variable depends on the copy-number state of the sample in the bin, giving a mixture distribution. A few parameters specify the mean of each component in the mixture:
$$\alpha \times y \sim \mathop\sum \limits_k-0^n\pi _k\times \rmPoisson(\alpha \times \beta +\alpha \times \lambda \times \mu _k)$$
where y is the normalized coverage per 1 kb bin, μk is a multiplier per copy-number state, α is the inverse scaling factor, β is a bias factor (usually giving the mean of the copy-number 0 state since μ0 is usually 0), λ controls the distance between copy number states and πk is the prior probability of copy-number state k. For Illumina sequencing data, we use 7 states, with μk = 0.5 × k.
Each bin gets their own α, β, λ and πk values, which are set to maximum-likelihood estimates for the data for all samples within the bin. A further shift parameter is sometimes used that specifies a shifting of the data on the Poisson scale (when going to the Poisson scale from the y scale, the shift is applied). The main purpose of that parameter is to fit a higher variance in the copy-number 0 cluster than a Poisson variable with mean close to 0 can explain. Supplementary Fig. 12 shows histograms for the CNV in Supplementary Fig. 11. The copy-number clustering for copy-number states 0 and 1 (as well as 2 of course) are very clear.
The actual CNV calling is then run when the model above has been fitted. Each sample is run with a hidden Markov model (HMM) with states corresponding to the copy-number states59. Within each state, the model above is used for the emission probability given the copy-number state. The transition probabilities have a parameter specifying the probability of transitioning to a different state. The Viterbi HMM algorithm is used to find the most probable path through the HMM for each sample and, where the Viterbi path diverges from the normal copy number (2 for autosomal chromosomes), a CNV is called. The algorithm is run twice, first with a stringent transition probability (1 × 10−6) to find longer variants without splitting them into a series of shorter ones, and then with a more relaxed transition probability (0.001) to allow finding shorter variants. CNVs with similar boundaries are grouped together and the HMM is used to estimate which CNVs correspond to the same event and what is the most likely boundary for each such group. The probability model can be used to calculate the posterior likelihood for each copy-number state for each CNV, which can be used in downstream phasing and imputation.
Note that informative structural variants not captured by our copy-number analysis may have been missed because additional rearrangement analysis was not performed.
Sensitivity of the coverage-based CNV detection
To assess our sensitivity for detecting CNVs in the presence of contamination, we simulated duplications by switching between a pair of uncontaminated fetal sample with trisomy 16 and maternal sample along the chromosome (Supplementary Fig. 13). To simulate deletions, we switched between a male fetal sample and maternal sample along chromosome X. Contamination was simulated by taking the weighted average of the maternal and the fetal corrected coverage value for simulated CNV segments. We considered the following combinations of CNV sizes (5 kb, 10 kb, 20 kb, 50 kb 100 kb, 200 kb and 500 kb) and contamination values (0, 10%, 20%, 30%, 40%, 50%, 60% and 70%). We determined that we found a simulated CNV if 50% or more of the windows within the CNV were called in deletion state for deletions and duplication state for duplications in the output from the HMM program. We found 89.4% (95% CI = 79.4–95.6%) of the simulated duplications and 96.3% (95% CI = 94.5–97.6%) of the simulated deletions of size 10 kb at 30% contamination; other combinations are shown in Supplementary Fig. 13.
Combining segments and phasing
The calls from the HMM algorithm in the fetal samples were in the form of a series of small segments. Similar to the Battenberg algorithm60, we joined these disjoint segments through phase-informative sequence variants. For the joining, we assessed the allele-specific depth in the fetal sample limited to variants where parents were homozygous for different alleles.
Letting CMM be the measured maternal coverage, CM the actual maternal coverage and CCM the coverage that is due to maternal contamination, then
$$C_\rmMM=C_\rmM+C_\rmCM$$
Letting CP be the paternal coverage, then:
$$C_\rmT=C_\rmMM+C_\rmP$$
$$C_\rmT=C_\rmM+C_\rmCM+C_\rmP$$
CT represents the total coverage of the phase-informative variants (limiting to homozygous variants in either parent) in the fetal sample.
Assuming a diploid fetal sample, that is,
$$C_\rmM=C_\rmP$$
Then, we can get an estimate of the global maternal contamination coefficient CGLOB
$$C_\rmMM=C_\rmP+C_\rmCM$$
$$C_\rmGLOB=\widehatC_\rmCM=C_zrmMM-C_\rmP$$
For each segment, the maternal contribution was then corrected for CGLOB. If the maternal and paternal contributions deviated from the expected 50:50 ratio, within the predefined segments, those segments were assigned either a duplication or deletion status and assigned to the corresponding parent (phased). All calculations for the refinement of aneuploidies and CNVs were performed in R v.4.4.3, refined segments were combined into CNVs using Python v.3.9.21.
The pathogenicity of identified CNVs was estimated based on criteria from the ACMG61.
CNV breakend annotation with GraphTyper2
We discovered structural variants with Manta (v.1.6.0), merged the variant sites over all individuals with svimmer (git hash 2af9ccf1ac056b9516c25ec4a0121657a9f9ab8e), and finally genotyped them jointly across the entire set with GraphTyper2 (v.2.7.7). We considered all types of structural variants within 5 kb of CNV breakends. We searched for structural variant DNM candidates using the same criteria as described in the ‘DNM detection’ section with two exemptions: a more permissive allelic balance cut-off (10%) and a lower minimum number of reads supporting the DNM allele (6). If we found a DNM candidate among the structural variants of the CNV carrier within this region, then we set that as the position of the CNV breakpoint Supplementary Table 2. To search for balanced translocations present in the parents, we manually curated the translocation calls by GraphTyper with an allele count less than 10. We found exact breakpoints of the two balanced translocations that are present in parents (Supplementary Figs. 6 and 7).
Kinship and oversharing
We estimated kinship between fetal samples and parental samples using KING62 (v.2.1.5) on SNVs in subsampled 50 kb regions of the genome (1%). We considered only biallelic SNVs, with allele counts greater than four and an AAScore from GraphTyper of greater than 0.5.
We searched for indication of triploidy based on oversharing between the fetal sample and either of the parental samples, that is, a higher kinship coefficient than expected for parent–offspring relationship (k = 0.25). If kmat represents the kinship coefficient between the fetal sample and the maternal sample and kpat represents the kinship coefficient between the fetal sample and the paternal sample, we considered the following scenarios:
$$k_\rmpat > 0.3\,\rmand\,k_\rmmat > \,0.2$$
$$k_\rmpat > 0.2\,\rmand\,k_\rmmat > \,0.3$$
as potential triploidies. We then used the change in variance of allelic balance, as described in the next section, across the genome to further refine these potential triploidies.
Detecting crossovers between chromosomal states
We defined phase-informative variants as variants present in heterozygous state in one parent and homozygous in the other. In the following method, we restricted to phase-informative heterozygous variants in the parent transmitting the extra chromosome, where the other parent had a homozygous genotype.
If two sister chromatids from the transmitting parent were present at a given locus, we expected the AB of HS alleles at that locus to be either 0 or 2/3, depending on whether or not the sisters carried the HS allele (Fig. 2b). If both homologous chromosomes were present at a locus, we expected the AB to always be 1/3, considering the presence of an HS-allele in the transmitting parent. The state of homologous chromosomes has much lower variance than the state of sister chromatids, as there is no variability in sampling the homologous chromosomes. The sampling variance of the homologous chromosome state should be due to only sequencing variability, that is, reads supporting the alternative allele in the fetal sample.
We let Y be the observed coverage of the phase-informative variant in the fetal sample. T is a Bernoulli variable denoting the transmission of the phase-informative variant from the parent to the fetus. Then, by the law of total variance, we can partition the variance based on transmission T:
$$V[Y]=E[V[Y| T]]+V[E[Y| T]]$$
If we assume the read coverage in the fetus given transmission is determined with a Poisson variable:
$$Y| T=1 \sim \rmPoisson(\mu )$$
Then:
$$\beginarraycE[V[Y| T]]=P_T\times V[Y| T=1]+(1-P_T)\times V[Y| T=0]\\ \,\,\,\,\,\,=\,P_T\times V[Y| T=1]+(1-P_T)\times 0\\ \,\,\,\,\,\,=\,P_T\times \mu \endarray$$
The other part of the sum will be:
$$\beginarraycV[E[Y| T]]=P_T\times (\mu -P_T\cdot \mu )^2+(1-P_T)\times (0-P_T\cdot \mu )^2\\ \,\,\,\,\,=\,P_T\mu ^2(1-P_T)^2+P_T^2\mu ^2(1-P_T)\\ \,\,\,\,\,=\,P_T\mu ^2-2\times P_T^2\mu ^2+P_T^3\mu ^2+P_T^2\mu ^2(1-P_T)\\ \,\,\,\,\,=\,P_T\mu ^2-P_T^2\mu ^2\\ \,\,\,\,\,=\,\mu ^2P_T(1-P_T)\endarray$$
Taken together:
$$V[Y]=P_T\cdot \mu +\mu ^2P_T(1-P_T)$$
If both homologous chromosomes are present (Pi = 1) at 30× coverage in the fetus (μ = 10) then we expect:
$$V[Y]=1\times 10+10^2\times 0=10$$
By contrast, in the sister state (Pr = 0.5) the variance is much higher:
$$V[Y]=0.5\times 20+20^2\times 0.25=110$$
We used this change in variance of the allelic balance along the genome to find crossovers between the homologous chromosome states and sister chromatid states. We binned the number of phase-informative variants in 1 kb windows, and calculated the average allelic balance. We excluded 1 kb windows that overlapped larger 100 kb windows with sequence variants with low average quality metrics (AAScore < 0.8 or Carrier_regression_beta < 0.7; see the ‘DNM detection’ section for details of AAScore and Carrier_regression_beta). We then used a changepoint technique on the 1 kb windows that passed the exclusion criteria to detect changes in variance in the 1 kb windows (cpt.var from the changepoint package in R). We used the following criteria:
-
A minimum segment length of 1000 1 kb windows (1 Mb)
-
A maximum of eight changepoints per chromosome
-
The PELT method for detecting multiple changepoints
-
Our own penalty function (penalty = “manual”) with 10 × log(n) (pen.value = “10⋅log(n)”) where n is the number of windows per chromosome.
DNM detection
We searched sequence variants and extracted DNM candidates in a similar manner as before6,8 for 9,651 Icelandic trios by comparing the genotypes of the parents and offspring. The DNM candidates included both SNVs and indels (Supplementary Fig. 14). In brief, we defined a DNM candidate with permissive cut-offs for the genotype of the proband requiring an allele balance greater than 0.25 and depth of 12 reads at the position (supporting either the reference or alternative allele). For the genotypes of the parents, we required at least 12 reads, a maximum of one read supporting the alternative allele and the allelic balance to be less than 5%. Likely (NLIK) and possible (NPOSS) carriers of the DNM allele outside the descendants of the parent pair were defined as previously6. We restricted to DNM candidates with less than 50 likely carriers and either less than 10 possible carriers or a NLIK/NPOSS ratio greater than 80%. We tuned the DNM candidate filtering by using segregation of DNM candidates in three generation families (2,042 probands) and the following quality covariates in a generalized additive model with a logistic link:
-
AAScore: prediction probability from GraphTyper that the variant is a true positive.
-
Carrier_regression_beta: slope from the alternative allele depth regression for the sequence variant. To estimate the quality of the sequence variants, we regressed the alternative allele counts (AD) on the depth (DP) conditioned on the genotypes (GT). For a well-behaving sequence variant, the mean alternative allele count for a homozygous reference genotype should be 0; for a heterozygous genotype, it should be DP/2; and, for a homozygous alternative genotype, it should be DP. Under the assumption of no sequencing or genotyping error, the expected value of AD should be DP conditioned on the genotype; in other words, an identity line (slope 1 and intercept 0). Deviations from the identity line indicate that the sequence variant is spurious or somatic.
-
Carrier_regression_alpha: intercept from the alternative allele depth regression for the sequence variant.
-
Proband_het_AB: the allelic balance of the proband.
-
MaxAAS: the maximum read support for the sequence variant across all individuals.
-
Alignment_Alt_Reads: the number of reads supporting the alternative allele. This covariate and the following covariates were derived by identifying the reads in the BAM files supporting DNM allele.
-
Alignment_Alt_Unique_Positions: the unique number of starting positions for the reads supporting the alternative allele.
-
Alignment_Alt_Soft_clipped: the number of soft clipped bases (S in CIGAR string).
-
Alignment_Alt_Matched_bases: the number of matched bases (M in CIGAR string).
-
Alignment_Alt_Score_diff: the difference of the alignment score of the best and the second best hit as reported by BWA mem.
-
Alignment_Alt_Pair_sw_nm: the pairwise mismatches between reads supporting the alternative allele using Smith Waterman implementation in SeqAn63.
-
Alignment_Alt_Pair_align: the number of bases in the pairwise alignments.
With the following formula for the gam function from the mgcv R package64:
threegen_Consistent_hs~
I(cut(alignment_Alt_Unique_Positions,c(-1,2,4,8,10,Inf)))+
s(I(AAScore))+
s(Carrier_regression_beta)+
s(Carrier_regression_alpha)+
I(ifelse(alignment_Alt_Reads>0,
(alignment_Alt_Score_diff/alignment_Alt_Reads)>10,
FALSE))+
I(ifelse(alignment_Alt_Pair_align>0,
(alignment_Alt_Pair_sw_nm/alignment_Alt_Pair_align)>0.05,
TRUE))+
I(ifelse(alignment_Alt_Matched_bases>0,
alignment_Alt_Soft_clipped/alignment_Alt_Matched_bases>0.5,
TRUE))+
s(Proband_het_AB)+
s(ifelse(MaxAAS>15, 16, MaxAAS))+
I(NPOSS == 0)
We restricted to instances of the DNM candidates in cases in which we saw both of the proband’s haplotypes at a locus transmitted to the offspring of the proband (see figure 1c in ref. 6). In brief, if the DNM is a true germline variant, then the allele of the DNM candidate should be present in one of the proband’s offspring. On the other hand, if it is absent from the children, then this assay suggests that the DNM candidate is a false-positive DNM call (a more detailed description was reported previously6). Like before, we fitted the generalized additive model using the mgcv R package64.
We used the cut-off of 0.5 to call high-quality DNM candidates for the predicted probability of correct segregation in three generation families. To validate the false-positive detection rate of the DNMs, we also used the genotype consistency between pairs of monozygotic twins. We found that 3.8% of DNMs were unobserved in the monozygotic twin of the proband. Note that this approach overestimates the false-positive rate as there are instances of high frequency post-zygotic mutations that differ between pairs of monozygotic twins65.
Identifying DNMs in COPL trios
The AB of DNM candidates is a crucial quality-control metric for discerning between false positive DNMs (that is, somatic and genotyping artifacts) and DNMs that transmit to the next generation6. However, in the case of the triploidies, trisomies and the maternally contaminated samples, the AB is expected to be lower than 50% for genuine DNMs. This is not compatible with the generalized additive model applied on the deCODE trios that uses the AB; we therefore decided on a simpler threshold strategy to define high-quality DNMs in the COPL data. We defined a high-quality DNM in the COPL set as one that satisfies AB of proband > 0.25; NPOSS < 10 and AAScore > 0.5. To compare the subsets in COPL, we calculated the detection power to detect DNMs per fetal sample using the ratio of the phase-informative markers and a binomial distribution (Supplementary Fig. 9). We considered only samples with paternal fraction > 25% and excluded a single sample with more than 800 DNMs after quality filtering.
Phasing of mutations in Icelandic trios
Like before, we used two complementary approaches to phase DNMs6,8: one by tracking of the transmission of the DNMs to the offspring of the proband (142,069 DNMs; three-generation phasing) and another one using read tracing of the DNM allele to phase-informative alleles (246,877 DNMs; read phasing). We combined the phase of the DNMs from the different approaches, resulting in 333,365 phased DNMs; if the phase of DNMs differed between the approaches, they were set to unphased DNMs in the downstream analysis.
Phasing of mutations in COPL trios
We read-traced the DNM allele as with deCODE data, except for one crucial difference—we considered only reads linking the DNM allele to nearby phase-informative sequence variants, if the reads were linked to an allele private to one of the parents. This was done to avoid phase ambiguity due maternal contamination or extra chromosomes.
Annotation of pathogenic variants
Sequence variants were annotated on the basis of release 100 of the Variant Effect Predictor (VEP)66 using RefSeq gene annotations67. We included SNVs and small indels (<50 bp) at coding and splicing regions in the autosome. We required that the variants were supported by at least six sequence reads, with at least three reads supporting the alternative allele. For indels, we required an AB of at least 0.4. For heterozygous variants, we considered pLoF (stop-gained, frameshift indels, splice essential variants) and missense DNMs in known autosomal-dominant disease genes (human phenotype ontology code HP:0000006 for autosomal-dominant disease genes). For missense DNMs, we also required either that the gene was highly constrained for missense variants (95th percentile over all genes, corresponding to a gnomAD68 v.4.1.0 missense z score of over 3) or that the variant was nearby (within <10 bp) a previously reported pathogenic/likely pathogenic missense variant on ClinVar (with at least one ClinVar submission supporting the variant as pathogenic/likely pathogenic and no benign/likely benign submissions).
We also included homozygous and compound heterozygous pLoF variants in known autosomal recessive disease genes (based on the code HP:0000007). For homozygous variants, we required the position to be covered by at least eight sequence reads, and for the alternative allele to be supported by at least eight reads. For compound heterozygous variants, we required the positions to be covered by at least four reads, and each variant to be supported by at least two sequence reads. To detect compound heterozygous genotypes, we performed phasing of the variants to paternal and maternal alleles. We also included genotypes where we could not phase both variants, allowing for the possibility of a DNM on one allele. We used minor allele frequencies from 56,969 Icelandic genomes69 and 730,947 exomes and 76,215 genomes from gnomAD11 to restrict to variants with frequencies <2%, and considered only genotypes present in fewer than three genomes. All pathogenic genotypes detected in the trios affected by pregnancy loss were confirmed by manual inspection of BAM files.
The same approach was used for defining pathogenic genotypes in a set of adult control trios. For the purpose of pathogenic SSV analysis, individuals with neurodevelopmental disorders, including epilepsy, autism and intellectual disability, were removed from the set of 9,651 Icelandic trios. Those patients, who were whole-genome sequenced as part of ongoing projects at deCODE genetics and have been described previously70, were ascertained through the State Diagnostic Counselling Center and the Department of Child and Adolescent Psychiatry and diagnosed based on ICD-10 criteria. Individuals with a clinical diagnosis of a rare genetic disease, as described previously69, were also removed from downstream analyses. Removal of these patients from the set of Icelandic trios resulted in a set of 7,760 adult control trios.
Constrained genes
To investigate DNMs in genes that have not been linked to any postnatal phenotypes but that might be essential to early developmental processes, we looked at genes that are highly constrained for pLoF variants. We used constraint data from gnomAD68 v.4.1.0, ranking genes (n = 17,481 protein-coding genes) based on their pLI scores. We defined highly constrained genes as the genes in the 90th percentile of all ranked genes.
Age regression
To compare the accumulation of mutations with advancing parental age at conception between pregnancy loss and the Icelandic trios, we regressed the number of DNMs against the parental ages at conception with the following formula:
Nr_of_DNMs~
fathers_age+
mothers_age+
tissue_villi
Note that, we are using a shared intercept model. This is due the fact DNMs detected in the fetal samples are only partially phased through the read-tracing, which results in unidentifiable sex-specific intercepts. We restricted to the subset of the pregnancy loss cohort that is more comparable to the Icelandic set, in terms of quality, that is, we restricted to non-triploid samples with 95% power of detecting DNMs.
We ran separate regressions for the different mutational classes (C>A, C>G, C>T, CpG>TpG, Indel, T>A, T>C and T>G) along with the total number of DNMs per fetal sample/proband. The estimates of the age effects, villus contribution and intercepts are provided in Extended Data Fig. 7 and Supplementary Table 3. The CIs and P values are derived from a block jackknife procedure.
Mutational signatures
To estimate the contribution of COSMIC mutational signatures to the 96-class mutation spectra of the pregnancy loss samples, we used the Fit() function provided by the signature.tools.lib (v.2.4.4) package71 in R (https://github.com/Nik-Zainal-Group/signature.tools.lib). Note that the mutation burden of each pregnancy loss sample is modest, which limits power for signature detection. The purpose of this exercise was not to discover mutational signatures de novo but, rather, to specifically test the hypothesis that SBS18 was elevated in some villus samples as reported previously34. The mutation spectrum of spermatogonia is dominated by SBS1 and SBS5/40 (ref. 72) and this is mirrored in the mutational spectrum of germline variants (see figure 5 of ref. 73). With this in mind, we fitted only SBS1, SBS5 and SBS18 onto the mutational catalogues. Signatures were fitted using 100 bootstraps.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.