Ethics declarations
Human participants
Informed consent was obtained from the CEPH/Utah individuals, and the University of Utah Institutional Review Board approved the study (University of Utah IRB reference IRB_00065564). This includes informed consent for publication of research data for 23 family members; the remaining 5 provided informed consent for biobanking with controlled access (Data availability).
Cell lines
Cell lines for 14 members of the CEPH 1463 family (G1-GM12889, G1-GM12890, G1-GM12891, G1-GM12892, G2-GM12877, G2-GM12878, G3-GM12879, G3-GM12881, G3-GM12882, G3-GM12883, G3-GM12884, G3-GM12885, G3-GM12886 and G3-GM12887) were obtained from Coriell Institute for Medical Research (CEPH collection). Cell lines for G3 spouses and G4 family members (n = 13) were generated in-house as EBV transformed lymphoblastoid cell lines and include: G3-200080-spouse, G4-200081, G4-200082, G4-200084, G4-200085, G4-200086, G4-200087, G3-200100-spouse, G4-200101, G4-200102, G4-200103, G4-200104 and G4-200106.
All cell lines were authenticated by WGS of the DNA and subsequent variant calling to match the expected sex of the individual and sequencing results from blood-derived DNA from the same individual. Furthermore, we explored whether the obtained sequencing data match the expected inheritance patterns of parents and offspring. To our knowledge, none of the cell lines mentioned above were tested for mycoplasma contamination.
Sample and DNA preparation
Family members from G2 and G3 were re-engaged for the purpose of updating informed consent and health history, and for enrolling their children (G4) and the marry-in parent (G3). Archived DNA from G2 and G3 was extracted from whole blood. Newly enrolled family members underwent informed consent, and blood was obtained for DNA and cell lines. DNA was extracted from whole blood using the Flexigene system (Qiagen 51206). All samples are broadly consented for scientific purposes, which makes this dataset ideal for future tool development and benchmarking studies.
Sequence data generation
Sequencing data from orthogonal short- and long-read platforms were generated as follows:
Illumina data generation
Illumina WGS data for G1–G3 were generated as previously described14. Illumina WGS data for G4 and marry-in spouses for G3 were generated by the Northwest Genomics Center using the PCR-free TruSeq library prep kit and sequenced to approximately 30× on the NovaSeq 6000 with paired-end 150 bp reads.
PacBio HiFi sequencing
PacBio HiFi data were generated according to the manufacturer’s recommendations. In brief, DNA was extracted from blood samples as described or cultured lymphoblasts using the Monarch HMW DNA Extraction Kit for Cells & Blood (New England Biolabs, T3050L). At all steps, quantification was performed with Qubit dsDNA HS (Thermo Fisher Scientific, Q32854) measured on DS-11 FX (Denovix) and the size distribution checked using FEMTO Pulse (Agilent, M5330AA and FP-1002-0275.) HMW DNA was sheared with the Megaruptor 3 (Diagenode, B06010003 & E07010003) system using the settings 28/30, 28/31 or 27/29 based on the initial quality check to target a peak size of ~22 kb. After shearing, the DNA was used to generate PacBio HiFi libraries using the SMRTbell prep kit 3.0 (PacBio, 102-182-700). Size selection was performed either with diluted AMPure PB beads according to the protocol, or with Pippin HT using a high-pass cut-off between 10–17 kb based on shear size (Sage Science, HTP0001 and HPE7510). Libraries were sequenced either on the Sequel II platform on SMRT Cells 8M (PacBio, 101-389-001) using Sequel II sequencing chemistry 3.2 (PacBio,102-333-300) with 2 h pre-extension and 30 h movies on SMRT Link v.11.0 or 11.1, or on the Revio platform on Revio SMRT Cells (PacBio, 102-202-200) and Revio polymerase kit v1 (PacBio, 102-817-600) with 2 h pre-extension and 24 h movies on SMRT Link v.12.0.
ONT sequencing
To generate UL sequencing reads >100 kb, we used ONT sequencing. Ultra-high molecular mass gDNA was extracted from the lymphoblastoid cell lines according to a previously published protocol64. In brief, 3–5 × 107 cells were lysed in a buffer containing 10 mM Tris-Cl (pH 8.0), 0.1 M EDTA (pH 8.0), 0.5% (w/v) SDS, and 20 mg ml−1 RNase A for 1 h at 37 °C. Then, 200 μg ml−1 proteinase K was added, and the solution was incubated at 50 °C for 2 h. DNA was purified through two rounds of 25:24:1 phenol–chloroform–isoamyl alcohol extraction followed by ethanol precipitation. Precipitated DNA was solubilized in 10 mM Tris (pH 8.0) containing 0.02% Triton X-100 at 4 °C for 2 days.
Libraries were constructed using the Ultra-Long DNA Sequencing Kit (ONT, SQK-ULK001) with modifications to the manufacturer’s protocol: ~40 μg of DNA was mixed with FRA enzyme and FDB buffer as described in the protocol and incubated for 5 min at room temperature, followed by heat inactivation for 5 min at 75 °C. RAP enzyme was mixed with the DNA solution and incubated at room temperature for 1 h before the clean-up step. Clean-up was performed using the Nanobind UL Library Prep Kit (Circulomics, NB-900-601-01) and eluted in 450 μl EB. Then, 75 μl of library was loaded onto a primed FLO-PRO002 R9.4.1 flow cell for sequencing on the PromethION (using MinKNOW software v.21.02.17–23.04.5), with two nuclease washes and reloads after 24 and 48 h of sequencing. All G1–G3 ONT base calling was done with Guppy (v.6.3.7).
Element (AVITI) sequencing
Element WGS data were generated according to the manufacturer’s recommendations. In brief, DNA was extracted from whole blood as described above. PCR-free libraries were prepared using mechanical shearing, yielding ~350 bp fragments, and the Element Elevate library preparation kit (Element Biosciences, 830-00008). Linear libraries were quantified by quantitative PCR and sequenced on AVITI 2 × 150 bp flow cells (Element Biosciences, not yet commercially available). Bases2Fastq Software (Element Biosciences) was used to generate demultiplexed FASTQ files.
Strand-seq library preparation
Single-cell Strand-seq libraries were prepared using a streamlined version of the established OP-Strand-seq protocol65 with the following modifications. In brief, EBV cells from G1–3 were cultured for 24 h in the presence of BrdU and nuclei with BrdU in the G1 phase of the cell cycle were sorted using fluorescence-activated cell sorting as described previously65. Next, single nuclei were dispensed into individual wells of an open 72 × 72 well nanowell array and treated with heat-labile protease, followed by digestion of DNA with the restriction enzymes AluI and HpyCH4V (NEB) instead of micrococcal nuclease (MNase). Next, fragments were A-tailed, ligated to forked adapters, UV-treated and PCR-amplified with index primers. The use of restriction enzymes results in short, reproducible, blunt-ended DNA fragments (>90% smaller than 1 kb) that do not require end-repair before adapter ligation, in contrast to the ends of DNA generated by MNase. Omitting end-repair enzymes allows dispensing of index primers in advance of dispensing individual nuclei. The pre-spotted, dried primers survive and do not interfere with the library preparation steps before PCR amplification. Pre-spotting of index primers is more reliable than the transfer of index primers between arrays during library preparation as described previously65. Strand-seq libraries were pooled and cleaned with AMPure XP beads, and library fragments between 300 and 700 bp were gel purified before PE75 sequencing on either the NextSeq 550 or the AVITI (Element Biosciences) system. Supplementary Fig. 40 shows examples of Strand-seq libraries made with restriction enzymes.
Strand-seq data post-processing
The demultiplexed FASTQ files were aligned to both GRCh38 and T2T-CHM13 reference assemblies (Supplementary Table 14) using BWA66 (v.0.7.17-r1188) for standard library selection. Aligned reads were sorted by genomic position using SAMtools67 (v.1.10) and duplicate reads were marked using sambamba68 (v.1.0). Libraries passing quality filters were pre-selected using ASHLEYS69 (v.0.2.0). We also evaluated such selected Strand-seq libraries manually and further excluded libraries with an uneven coverage, or an excess of ‘background reads’ (reads mapped in opposing orientation for chromosomes expected to inherit only Crick or Watson strands) as previously described70. This is done to ensure accurate inversion detection and phasing.
Strand-seq inversion detection
Polymorphic inversions for G1–G3 were detected by mapping Strand-seq read orientation with respect to the reference genome as previously described71,72. For each sample, we selected 60+ Strand-seq libraries (range, 62–90) with a median of around 274,000 reads with mapping quality ≥10 per library, translating to about 0.67% genome (T2T-CHM13) being covered per library (Supplementary Fig. 41). Then we ran breakpointR73 (v.1.15.1) across selected Strand-seq libraries to detect points of strand-state changes73. We used these results to generate sample-specific composite files using breakpointR function ‘synchronizeReadDir’ as described previously71. Again, we ran breakpointR on such composite files to detect regions where Strand-seq reads map in reverse orientation and are indicative of an inversion. Lastly, we manually evaluated each reported inverted region by inspection of Strand-seq read mapping in UCSC Genome Browser74 and removed any low-confidence calls. We phased all inversions using Strand-seq data as well and then synchronized the phase with phased genome assemblies based on haplotype concordance. Lastly, we evaluated the Mendelian concordance of detected and fully phased inversions. We mark sites where at least half of the G3 samples were fully phased by Strand-seq and concordant with possible inherited G2 parental alleles as being Mendelian concordant (Supplementary Table 7).
Generation of phased genome assemblies
Phased genome assemblies were generated using two different algorithms, namely Verkko (v.1.3.1 and v.1.4.1)16 and hifiasm (UL) with ONT support (v.0.19.5)17. Owing to active development of the Verkko and hifiasm algorithms, assemblies were generated with two different versions. Phased assemblies for G2–G3 were generated using a combination of HiFi and ONT reads using parental Illumina k-mers for phasing. To generate phased genome assemblies of G1, we still used a combination of HiFi and ONT reads with the Verkko pipeline and used Strand-seq to phase assembly graphs75. Lastly, G4 samples were assembled using HiFi reads only with hifiasm (v.0.19.5).
Note that trio-based phasing with Verkko assigns maternal to haplotype 1 and paternal to haplotype 2. By contrast, for hifiasm assemblies, we report switched haplotype labelling such that haplotype 1 is paternal and haplotype 2 is maternal to match HPRC standard for hifiasm assemblies.
Evaluation of phased genome assemblies
To evaluate the base pair and structural accuracy of each phased assembly, we used a multitude of assembly evaluation tools as well as orthogonal datasets such as PacBio HiFi, ONT, Strand-seq, Illumina and Element data. Known assembly issues are listed in Supplementary Table 4. We note that we fixed four haplotype switch errors in our assembly-based variant callsets to avoid biases in subsequent analysis. The assembly-quality terminology used in this Article is described in Supplementary Note 8.
Strand-seq validation
We used Strand-seq data to evaluate directional and structural accuracy of each phased assembly. First, we aligned selected Strand-seq libraries for each sample to the phased de novo assembly using BWA66 (v.0.7.17-r1188). We next ran breakpointR73 (v.1.15.1) using aligned BAM files as the input. We then created directional composite files using the breakpointR function createCompositeFiles followed by running breakpointR on such composite files using the runBreakpointR function. This provided us, for any given sample, with regions where strand-state changes across all single-cell Strand-seq libraries. Many such regions point to real heterozygous inversions. However, regions where Strand-seq reads mapped in opposite orientation with respect to surrounding regions are probably caused by misorientation. Moreover, positions where the strand state of Strand-seq reads changes repeatedly in multiple libraries might be a sign of an assembly misjoin and such regions were investigated more closely to rule out any such large structural assembly inconsistencies.
Read to assembly alignment
To evaluate de novo assembly accuracy, we aligned sample-specific PacBio HiFi reads to their corresponding phased genome assemblies using Winnowmap76 (v.2.03) with the following parameters:
-I 10G -Y -ax map-pb –MD –cs -L –eqx
Flagger validation
Flagger9 was used to detect misassemblies using HiFi read alignments to the assemblies and the assemblies aligned to the reference genome. Regions were flagged on the basis of read alignment divergence and specific reference-biased regions. A reference-specific BED file (chm13v2.0.sd.bed) was used, setting a maximum read divergence of 2% and specifying reference-biased blocks. These flagged regions were analysed to identify collapses, false duplications, erroneous regions and correctly assembled haploid blocks with the expected read coverage.
We used Flagger v.0.3.3 (https://github.com/mobinasri/flagger) to run the flagger_end_to_end WDL.
Required inputs include following:
-
(1)
Read-to-contig alignments—Winnowmap alignments of all HiFi reads to the assembly (hap1, hap2 and unassigned.fasta)
-
(2)
A combined assembly fasta file with hap1, hap2 and unassigned contigs
-
(3)
BAM alignments of assembly to the CHM13v2.0 reference
hap1, hap2 and unassigned fasta files of the assembly were aligned to CHM13v2.0 using a pipeline available at GitHub (https://github.com/mrvollger/asm-to-reference-alignment).
NucFreq validation
NucFreq77 (v.0.1) was used to calculate nucleotide frequencies for HiFi reads aligned using Winnowmap76 (v.2.03). This was used to identify regions of collapses, where the second-highest nucleotide count exceeded 5; and misassembly, where all nucleotide counts were zero.
The NucFreq analysis pipeline is available at GitHub (https://github.com/mrvollger/NucFreq).
Assembly base-pair quality
To evaluate the accuracy of the genome assembly, we used a pipeline that uses Meryl78 (v.1.0) to count the k-mers of length 21 from Illumina reads using the following command:
meryl k=21 count input.fastq output output.meryl
We then used Merqury78 (v.1.1), which compares the k-mers from the sequencing reads against those in the assembled genome and flags discrepancies where k-mers are uniquely found only in the assembly. These unique k-mers indicate potential base-pair errors. Merqury then calculates the quality value based on the k-mer survival rate, estimated from Meryl’s k-mer counts, providing a quantitative measure to assess the completeness and correctness of the genome assembly.
Gene completeness validation
To evaluate the completeness of single-copy genes in our assemblies, we used compleasm79 (v.0.2.4). Further details are available at GitHub (https://github.com/huangnengCSU/compleasm).
We ran compleasm with the following parameters:
compleasm.py run -a assembly.fasta -o results/sample.id -t threads -l params.lineage -m params.mode -L params.mb_downloads
-l primates
-m busco
-L params.mb_downloads
and downloaded using: compleasm_kit/compleasm.py download primates.
Assembly to reference alignment
All de novo assemblies were aligned to both GRCh38 as well as to the complete version of the human reference genome T2T-CHM13 (v2) using minimap2 (ref. 80) (v.2.24) with the following command:
minimap2 -K 8G -t threads -ax asm20 \
–secondary=no —eqx -s 25000 \
input.ref input.query \
| samtools view -F 4 -b – > output.bam
A complete pipeline for this reference alignment is available at GitHub: (https://github.com/mrvollger/asm-to-reference-alignment).
We also generated a trimmed version of these alignments using the rustybam (v.0.1.33) (https://github.com/mrvollger/rustybam) function trim-paf to trim redundant alignments that mostly appear at highly identical SDs. With this, we aim to reduce the effect of multiple alignments of a single contig over these duplicated regions.
Definition of stable diploid regions
For this analysis, we use assembly to reference alignments (see the ‘Assembly to reference alignment’ section), reported as PAF files. We used trimmed PAF files reported by the rustybam trim-paf function. Stable diploid regions were defined as regions where phased genome assemblies report exactly one contig alignment for haplotype 1 as well as haplotype 2 and are assigned as ‘2n’ regions. Any region with two or more alignments per haplotype is assigned as ‘multi’ alignment. Lastly, regions with only single-contig alignment in a single haplotype are assigned as ‘1n’ regions. These reports were generated using the ‘getPloidy’ R function (Code availability).
Detection and analysis of meiotic recombination breakpoints
We constructed a high-resolution recombination map of G2 and G3 individuals using three orthogonal approaches that differ either on the basis of the underlying sequencing technology or on the detection algorithm applied to the data. The first approach is based on chromosome-length haplotypes extracted from Strand-seq data using R package StrandPhaseR81 (v.0.99). The second approach uses inheritance vectors derived from Mendelian consistency of small variants across the family pedigree13. Our final approach uses trio-based phased genome assemblies followed by small variant calling using PAV and Dipcall to more precisely define the meiotic breakpoints.
By contrast, a recombination map of G4 individuals was constructed using a combination of Strand-seq data for G3 spouses and an assembly-based variant callset (Dipcall) of G4 samples. Owing to the use of a low-variant density of Strand-seq-based haplotypes of G4 spouses, reported recombination breakpoints are of lower resolution in comparison to G2 and G3 samples (Supplementary Table 8).
Detection of recombination breakpoints using circular binary segmentation
To map meiotic recombination breakpoints using circular binary segmentation, we used two different datasets. The first dataset represents phased small variants (SNVs and indels) as reported by Strand-seq-based (SSQ) phasing22,81. The other is based on small variants reported in trio-based phased assemblies either by PAV8 (v.2.3.4) or Dipcall82 (v.0.3). With this approach, we set to detect recombination breakpoints as positions where a child’s haplotype switches from matching H1 to H2 of a given parent or vice versa. To detect these positions, we first established which homologue in a child was inherited from either parent by calculating the level of agreement between child’s alleles and homozygous variants in each parent. Next, we compared each child’s homologue to both homologues of the corresponding parent and encoded them as 0 or 1 if they match H1 or H2, respectively. We applied a circular binary segmentation algorithm on such binary vectors by using the R function fastseg implemented in the R package fastseg83 (v.1.46.0) with the following parameters: fastseg (binary.vector, minSeg=, segMedianT=c (0.8, 0.2)). In the case of sparse Strand-seq haplotypes, we set the fastseg parameter minSeg to 20 and, in the case of dense assembly-based haplotypes, we used a larger window of 400 and 500 for Dipcall- and PAV-based variant calls to achieve comparable sensitivity in detecting recombination breakpoints. The regions with a segmentation mean of ≤0.25 are then marked as H1 while regions with a segmentation mean of ≥0.75 are assigned as H2. Regions with a segmentation mean in between these values were deemed to be ambiguous and were excluded. Moreover, we filtered out regions shorter than 500 kb and merged consecutive regions assigned the same haplotype (Code availability).
Detection of meiotic recombination breakpoints using inheritance vectors
DeepVariant calls (see the ‘Read-based variant calling’ section) from HiFi sequencing data from G1, G2 and G3 pedigree members allow us to identify the haplotype of origin for heterozygous loci in G3 and infer the occurrence of a recombination along the chromosome when the haplotype of origin changes between loci. An initial outline of the inheritance vectors was identified by first applying a depth filter to remove variants outside the expected coverage distribution per sample; inheritance was then sketched out using a custom script, requiring a minimum of 10 SNVs supporting a particular haplotype, and manually refined to remove biologically unlikely haplotype blocks, or add additional haplotype blocks, where support existed, and refine haplotype coordinates. Missing recombinations were identified from the occurrence of blocks of pedigree-violating variants, matching the location of assembly-based recombination calls. We developed a hidden Markov model framework to identify the most probable sequence of inheritance vectors from SNV sites using the Viterbi algorithm. For details including the transition/emission probabilities see ref. 18 and the associated GitHub repository (https://github.com/Platinum-Pedigree-Consortium/Platinum-Pedigree-Inheritance).
The transition matrix defines the probability of a given inheritance state transition (recombination). The emission matrix defines the probability that a variant call at a particular locus accurately describes the inheritance state. The values contained within transition and emission matrices were refined to recapitulate the previously identified inheritance vectors, while correctly identifying missing vectors. The Viterbi algorithm identified 539 recombinations, a maternal recombination rate of 1.29 cM per Mb, and a paternal recombination rate of 0.99 cM per Mb. Maternal bias was observed in the pedigree, with 57% of recombinations identified in G3 of maternal origin.
Merging of meiotic recombination maps
Meiotic recombination breakpoints reported by different orthogonal technologies and algorithms (see the sections ‘Detection of meiotic recombination breakpoints using circular binary segmentation’ and ‘Detection of meiotic recombination breakpoints using inheritance vectors’) were merged separately for G2 and G3 samples. We started with the G3 recombination map where we used an inheritance-based map as a reference and then looked for support of each reference breakpoint in recombination maps reported based on PAV, Dipcall and Strand-seq (SSQ) phased variants. A recombination breakpoint was supported if for a given sample and homologue an orthogonal technology reported a breakpoint no further than 1 Mb from the reference breakpoint. Any recombination breakpoint that is further apart is reported as unique. We repeated this for the G2 recombination map as well. However, in the case of the G2 recombination map, we used a PAV-based map as a reference. This is because inheritance-based approaches need three generations to map recombination breakpoints in G3. We also report a column called ‘best.range’, which is the narrowest breakpoint across all orthogonal recombination maps that directly overlaps with a given reference breakpoint. Lastly, we report a ‘min.range’ column that represents for any given breakpoint a range with the highest coverage across all orthogonal datasets. Merged recombination breakpoints are reported in Supplementary Table 8.
Meiotic recombination breakpoint enrichment
We tested enrichment of all (n = 1,503) recombination breakpoints detected in G2–G4 with respect to T2T-CHM13 if they cluster towards the ends of the chromosomes depending on parental homologue origin. For this, we counted the number of recombination breakpoints in the last 5% of each chromosome end specifically for maternal and paternal breakpoints. We then shuffle detected recombination breakpoints along each chromosome 1,000 times and redo the counts. For the permutation analysis, we used the R package regioneR84 (v.1.32.0) and its function permTest with the following parameters:
permTest(
A=breakpoints, B=chrEnds.regions,
randomize.function=circularRandomizeRegions,
evaluate.function=numOverlaps,
genome=genome, ntimes=1000,
allow.overlaps=FALSE, per.chromosome=TRUE,
mask=region.mask, count.once=FALSE)
Refinement of meiotic recombination breakpoints using MSA
Up to this point, all meiotic recombination breakpoints were called using variation detected with respect to a single linear reference (GRCh38 or T2T-CHM13). To alleviate any possible biases introduced by comparison to a single reference genome, we set out to refine detected recombination breakpoints for each inherited homologue (in child) directly in comparison to parental haplotypes from whom the homologue was inherited from. We start with a set of merged T2T-CHM13 reference breakpoints for G3 only by selecting the ‘best.range’ column (Supplementary Table 8). Then, for each breakpoint, we set a ‘lookup’ region to 750 kb on each side from the breakpoint boundaries and used the SVbyEye85 (v.0.99.0) function subsetPafAlignments to subset PAF alignments of a phased assembly to the reference (T2T-CHM13) to a given region. We next extract the FASTA sequence for a given region from the phased assembly. We did this separately for inherited child homologues (recombined) and the corresponding parental haplotypes that belong to a parent from whom the child homologue was inherited from.
Next, we created a multiple sequence alignment (MSA) for three sequences (child-inherited homologue, parental homologue 1 and parental homologue 2) using the R package DECIPHER86 (v.2.28.0; with the function AlignSeqs). Fasta sequences of which the size differ by more than 100 kb or their nucleotide frequencies differ by more than 10,000 bases are skipped due to increased computational time needed to align such different sequences optimally using DECIPHER. After MSA construction, we selected positions with at least one mismatch and also removed sites where both parental haplotypes carry the same allele. A recombination breakpoint is a region where the inherited child homologue is partly matching alleles coming from parental homologues 1 and 2. We therefore skipped analysis of MSAs in which a child’s alleles are more than 99% identical to a single parental homologue. If this filter is passed, we use the custom R function getAlleleChangepoints (Code availability) to detect changepoints where the child’s inherited haplotype switches from matching alleles coming from parental haplotype 1 to alleles coming from parental haplotype 2. Such MSA-specific changepoints are then reported as a new range where a recombination breakpoint probably occurred. Lastly, we attempt to report reference coordinates of such MSA-specific breakpoints by extracting 1 kb long k-mers from the breakpoint boundaries and matching such k-mers against reference sequence (per chromosome) using R package Biostrings (v.2.70.2) with its function ‘matchPattern’ and allowing for up to 10 mismatches. A list of refined recombination breakpoints is reported in Supplementary Table 8.
Detection of allelic gene conversion using phased genome assemblies
We set out to detect smaller localized changes in parental allele inheritance using a previously defined recombination map of this family. We did this analysis for all G3 samples (n = 8) in comparison to G2 parents. For this, we iterated over each child’s homologue (in each sample) and compared it to both parental homologues from which the child’s homologue was inherited from. We did this by comparing SNV and indel calls obtained from phased genome assemblies between the child and corresponding parent. To consider only reliable variants, we retained only those supported by at least two read-based callers (either DeepVariant-HiFi, Clair3-ONT or dragen-Illumina callset). We further retained only variable sites that are heterozygous in the parent and were also called in the child. After such strict variant filtering, we slide by two consecutive child’s variants at a time and compare them to both haplotype 1 and haplotype 2 of the respective parent of origin. For this similarity calculation, we use the custom R function getHaplotypeSimilarity (Code Availability). Then, for each haplotype segment, defined by recombination breakpoints, we report regions where at least two consecutive variants match the opposing parental haplotype in contrast to the expected parental homologue defined by recombination map. We further merge consecutive regions that are ≤5 kb apart. For the list of putative gene conversion events, we retained only regions that have not been reported as problematic by Flagger. We also removed regions that are ≤100 kb from previously defined recombination events and events that overlap centromeric satellite regions and highly identical SDs (≥99% identical). Lastly, we evaluated the list of putative allelic gene conversion events by visual inspection of phased HiFi reads.
Read-based variant calling
PacBio HiFi data were processed with the human-WGS-WDL available at GitHub (https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/tag/v1.0.3). The pipeline aligns, phases and calls small variants (using DeepVariant87 v.1.6.0) and SVs (using PBSV v.2.9.0; https://github.com/PacificBiosciences/pbsv). We used the aligned haplotype-tagged HiFi BAMs for all downstream PacBio analysis.
Clair3
Clair3 (ref. 88) (v.1.0.7) variant calls were made based on the alignments with default models for PacBio HiFi and ONT (ont_guppy5) data, respectively, with phasing and gVCF generation enabled. Variant calling was conducted on each chromosome individually and concatenated into one VCF. gVCFs were then fed into GLNexus89 with a custom configuration file.
PacBio HiFi
run_clair3.sh –bam_fn=input.bam –sample_name=sample –ref_fn=input.ref –threads=8 –platform=hifi –model_path=/path/to/models/hifi –output=output.dir –ctg_name=contig –enable_phasing –gvcf
ONT
run_clair3.sh –bam_fn=input.bam –sample_name=sample –ref_fn=input.ref –threads=8 –platform=ont –model_path=/path/to/models/ont_guppy5 –output=output.dir –ctg_name=contig –enable_phasing –gvcf
ONT reads for Clair3 calling were aligned with minimap2 (v.2.21) with the following parameters: -L –MD –secondary=no –eqx -x map-ont.
Generation of truth set of genetic variation using inheritance vectors
We used a previously established framework to define ground truth genetic variation13. Our analysis, in contrast to trio-based filtering, uses all four alleles to detect genotyping errors, whereas, in a trio, only two alleles are transmitted and observed. By testing the genotype patterns in the third generation against the phased haplotypes of the first generation (A,B,C,D), we can test for the correct transmission of alleles from the second to third generations. We establish a map of the haplotypes across the third generation (inheritance vector) from which we can adjudicate variant calls against. To test for pedigree consistency, we implemented code that uses the inheritance vector as the expected haplotypes and test the possible genotype configurations within the query VCF file. Using the haplotype structure, we phase the pedigree consistent variants. These functions are implemented as a single binary tool that requires the inheritance vectors and a standard formatted VCF file, for example:
concordance -i ceph.grch38.hifi.g3.csv –father NA12877 –mother NA12878 –vcf input.vcf –prefix pedigree_filtered > info.stdout
The pedigree filtering and additional steps to build a small variant truth set are available at GitHub (https://github.com/Platinum-Pedigree-Consortium/Platinum-Pedigree-Inheritance).
Detection of small de novo variants
Following the parameters outlined previously10, we called variants in HiFi data aligned to T2T-CHM13 using GATK HaplotypeCaller90 (v.4.3.0.0) and DeepVariant87 (v.1.4.0) and naively identified variants unique to each G2 and G3 sample. We separated out SNV and indel calls and applied basic quality filters, such as removing clusters of three or more SNVs in a 1 kb window. We combined this set of variant calls generated by a secondary calling method (https://github.com/Platinum-Pedigree-Consortium/Platinum-Pedigree-Inheritance/blob/main/analyses/Denovo.md) and subjected all calls to the following validation process.
We validated both SNVs and indels by examining them in HiFi, ONT and Illumina read data, excluding reads that failed to reach the mapping quality (59 for long reads, 0 for short reads) thresholds. Reads with high base quality (>20) and low base quality (<20) at the variant site were counted separately. We retained variants that were present in at least two types of sequencing data for the child, and absent from high-base-quality parental reads. For SNV calls, we next examined HiFi data for every sample in the pedigree. We determined an SNV was truly de novo if it was absent from every family member that was not a direct descendant of the de novo sample. Finally, we examined the allele balance of every variant, determined which variants were in TRs and re-evaluated parental read data across all sequencing platforms, removing variants with noisy sequencing data or more than two low-quality parental reads supporting the alternative allele (Supplementary Note 9).
DNM phasing and postzygotic assignment
To determine the parent of origin for the de novo SNVs, we re-examined the long reads containing the de novo allele. First, we used our initial GATK variant calls to identify informative sites in an 80 kb window around the DNM, selecting any single-nucleotide polymorphisms (SNPs) where one allele could be uniquely assigned to one parent (for example, a site that is homozygous reference in a father and heterozygous in a mother). For every DNM, we evaluated every ONT and HiFi read that aligned to the site of the de novo allele and assigned it to either a paternal or maternal haplotype (if informative SNPs were available) by calculating an inheritance score as outlined previously10. DNMs that were exclusively assigned to maternal or paternal haplotypes were successfully phased, whereas DNMs on conflicting haplotypes were excluded from our final callset. Unphased variants were determined to be postzygotic in origin (n = 7) if their allele balance was not significantly different across platforms (by a χ2 test) and if their combined allele balance was significantly different from 0.5.
Once we assigned every read to a parental haplotype, we counted the number of maternal and paternal reads that had either the reference or alternative allele. We determined that a DNM was germline in origin if it was present on every read from a given parent’s haplotype. Conversely, if a DNM was present on only a fraction of reads from a parental haplotype, we determined that it was postzygotic in origin.
Sex chromosome DNM calling and validation
To identify DNMs on the X chromosome, we applied the same strategy as autosomal variants, with one exception: we used only variant calls generated by GATK. For male individuals, we reran GATK in haploid mode, such that it would only identify one genotype on the X chromosome.
To identify DNMs on the Y chromosome, we aligned male HiFi, ONT and Illumina data to the G1-NA12889 chromosome Y assembly and then called variants using GATK in haploid mode on the aligned HiFi data. We directly compared each male to his father, selecting variants unique to the son. We validated SNVs and indels by examining the father’s HiFi, ONT and Illumina data and excluded any variants that were present in the parental reads, applying the same logic that we used for autosomal variants.
Callable genome and mutation rate calculations
To determine where we were able to identify de novo variation in the genome, we assessed HiFi data for every trio. We first used GATK HaplotypeCaller90 (v.4.3.0.0) with the option ‘ERC BP_RESOLUTION’ to generate a genotype call at every site in the genome. Only sites where both parents were genotyped as homozygous reference (0/0) were considered callable, as sites with a parental alternative allele were excluded from our de novo discovery pipeline. We then examined the HiFi reads from a sample and its parents, restricting to only primary alignments with mapping quality of at least 59. For children, we only considered HiFi reads derived from blood, but we considered blood and cell line data for parents. We counted the number of reads with a minimum base quality score of 20 at every site in the genome and then combined this information with our variant calls. A site was deemed to be callable if both parents and the child each had at least one high-quality read with a high-quality base call. We observed an average of 2.67 Gb of accessible sequence across the autosomes (out of 2.90 Gb total, s.d. = 24.9 Mb). For female children, callable X chromosome was determined in the same way, whereas, for the male children, we only considered the mother’s HiFi data when examining the X chromosome and the father’s HiFi data when examining the Y chromosome. Moreover, male sex chromosomes were not restricted to sites where both parents were genotyped as reference—each parent was allowed to carry an alternative allele.
We calculated the germline autosomal mutation rate for every sample by dividing the number of germline autosomal DNMs by twice the number of base pairs we determined to be callable. For PZMs, we used the same denominator. In female individuals, the amount of callable sex chromosomes was defined as twice the number of callable bases on the X chromosome, and in males it was defined as the sum of the callable bases on the X and Y chromosomes. For each feature-specific mutation rate (such as SDs), we intersected both a sample’s de novo SNVs and the sample’s callable regions with coordinates of the relevant feature. We then calculated the mutation rate by dividing the number of SNVs in the region by the amount of callable genomic sequence where alignments could be reliably made.
Analysis of STRs and VNTRs
Given the challenges associated with assaying mutations in STRs (1–6 bp motifs) and VNTRs (≥7 bp motifs), we applied a targeted HiFi genotyping strategy coupled with validation by transmission and orthogonal sequencing.
Defining the TR catalogues
The command trf-mod -s 20 -l 160 reference.fasta was used, resulting in a minimum reference locus size of 10 bp and motif sizes of 1 to 2,000 bp (https://github.com/lh3/TRF-mod)91. Loci within 50 bp were merged, and then any loci >10,000 bp were discarded. The remaining loci were annotated with tr-solve (https://github.com/trgt-paper/tr-solve) to resolve locus structure in compound loci. Only TRs annotated on Chromosomes 1–22, X and Y were considered (Data availability).
TR genotyping with TRGT
TRGT32 is a software tool for genotyping TR alleles using PacBio HiFi sequencing reads (https://github.com/PacificBiosciences/trgt). Provided with aligned HiFi sequencing reads (in BAM format) and a file that enumerates the genomic locations and motif structures of a collection of TR loci, TRGT will return a VCF file with inferred genotypes at each TR locus. In this analysis, we ran TRGT (v.0.7.0-493ef25) on each member of the CEPH 1463 pedigree using the TR catalogue defined above. TRGT was run using the default parameters:
trgt –threads 32 –genome in_reference –repeats in_bed –reads in_bam –output-prefix out_prefix –karyotype karyotype`
bcftools sort -m 3072M -Ob -o out_prefix.sorted.vcf.gz out_prefix.vcf.gz
bcftools index –threads 4 out_prefix.sorted.vcf.gz
samtools sort -@ 8 -o out_prefix.spanning.sorted.bam out_prefix.spanning.bam
samtools index -@ 8 out_prefix.spanning.sorted.bam
Measuring concordant inheritance of TRs
To determine the concordant inheritance of TRs, we calculated the possible Manhattan distances derived from all possible combinations of a proband’s allele length (AL) from TRGT with both the maternal and paternal AL values. We considered a locus to be concordant if the minimum Manhattan distance from all computed distances was found to be 0, suggesting that a combination of the proband’s AL values matched the parental AL values perfectly. By contrast, if the minimum Manhattan distance was greater than 0, suggesting that all combinations of the proband’s AL values exhibited some deviation from the parental AL values, we regarded the locus as discordant and recorded it as a potential Mendelian inheritance error. For each TR locus, we calculated the number of concordant trios, the number of MIE trios and the number of trios that had missing values and could not be fully genotyped. Loci with any missing genotypes were excluded when calculating the percent concordance; however, individual complete trios were considered for de novo variant calling below.
Calling de novo TRs
We focused de novo TR calling on G3 for several reasons. First, their G2 parents (NA12877 and NA12878) were sequenced to 99 and 109 HiFi sequencing depths, resulting in a far lower chance of parental allelic dropout than samples with more modest sequencing depths. Second, G1 DNA was derived from cell lines, increasing the risk of artefacts when calling DNMs in G2. And finally, DNMs in the two individuals in G3 with sequenced children in our study can be further assessed by transmission.
We used TRGT-denovo33 (v.0.1.3), a companion tool to TRGT, to enable in-depth analysis of TR DNMs in family trios using HiFi sequencing data (https://github.com/PacificBiosciences/trgt-denovo). TRGT-denovo uses consensus allele sequences and genotyping data generated by TRGT and also incorporates additional evidence from spanning HiFi reads used to predict these allele sequences. In brief, TRGT-denovo extracts and partitions spanning reads from each family member (mother, father and child) to their most likely alleles. Parental spanning reads are realigned to each of the two consensus allele sequences in the child, and alignment scores (which summarize the difference between a parental read and a consensus allele sequence) are computed for each read. At every TR locus, each of the two child alleles is independently considered as a putative de novo candidate. For each child allele, TRGT-denovo reports the presence or absence of evidence for a de novo event, which includes the following: denovo_coverage (the number of reads supporting a unique AL in the child that is absent from the parent’s reads); overlap_coverage (the number of reads in the parents supporting an AL that is highly similar to the putative de novo allele); and magnitude of the putative de novo event (expressed as the absolute mean difference of the read alignment scores with de novo coverage relative to the closest parental allele).
Calculating the size of a de novo TR expansion or contraction
We measured the sizes of de novo TR alleles with respect to the parental TR allele that most likely experienced a contraction or expansion event. If TRGT-denovo reported a de novo expansion or contraction at a particular locus, we did the following to calculate the size of the event.
Given the ALs reported by TRGT for each member of the trio, we computed the difference in size (which we call a ‘diff’) between the de novo TR allele in the child and all four TR alleles in the child’s parents. For example, if TRGT reported ALs of 100,100 in the father, 50,150 in the mother, and 200,100 in the child, and the allele of length 200 was reported to be de novo in the child, the diffs would be 100,100 in the father and 150,50 in the mother. If we were able to phase the de novo TR allele to a parent of origin, we simply identify the minimum diff among that parent’s ALs and treat it as the likely expansion/contraction size. Otherwise, we assume that the smallest diff across all parental ALs represents the likely de novo size.
De novo filtering
We applied a series of filters to the candidate TR DNMs (identified by TRGT-denovo) to remove likely false positives. For each de novo allele observed in a child, we required the following (Supplementary Notes 9 and 10):
-
HiFi sequencing depth in the child, mother, and father ≥10 reads.
-
The candidate de novo AL in the child must be unique: as in ref. 37, we removed candidate de novo TR alleles if (1) the child’s de novo AL matched one of the father’s ALs and the child’s non-de novo AL matched one of the mother’s ALs or (2) the child’s de novo AL matched one of the mother’s ALs and the child’s non-de novo AL matched one of the father’s ALs.
-
The candidate de novo allele must represent an expansion or contraction with respect to the parental allele.
-
At least two HiFi reads supporting the candidate de novo allele (denovo_coverage ≥ 2) in the child, and at least 20% of total reads supporting the candidate de novo allele (child_ratio ≥ 0.2).
-
Fewer than 5% of parental reads likely supporting the candidate de novo AL in the child.
To calculate TR DNM rates in a given individual, we first calculated the total number of TR loci (among the ~7.8 million loci genotyped using TRGT) that were covered by at least 10 HiFi sequencing reads in each member of the focal individual’s trio (that is, the focal individual and both of their parents). We then divided the total count of de novo TR alleles by the total number of callable loci to obtain an overall DNM rate, expressed per locus per generation. Finally, we divided that rate by 2 to produce a mutation rate expressed per locus, per haplotype, per generation. As shown in Fig. 3a, we also estimated DNM rates as a function of the minimum motif size observed within a locus. For example, a locus with motif structure AT(n)AGA(n)T(n) would have a minimum motif size of 1. We counted the number of TR DNMs that occurred at loci with a minimum motif size of N and divided that count by the total number of TR loci with a minimum motif size of N that passed filtering thresholds. We then divided that rate by 2 to produce a mutation rate per locus, per haplotype, per generation. When calculating STR, VNTR and complex mutation rates, we defined STR loci as loci at which all constituent motifs were between 1 and 6 bp; we defined VNTR loci as loci at which all motifs were larger than 6 bp; and we defined complex loci as loci at which there were both STR (1–6 bp) and VNTR (≥7 bp) motifs. For example, both an A(n) locus and an AT(n)AGA(n)T(n) locus would be classified as STRs, as they both purely contain STR motifs.
Previous studies usually measured STR mutation rates at loci that are polymorphic within the cohort of interest. To generate mutation rate estimates that are more consistent with these previous studies, we also calculated the number of STR loci that were polymorphic within the CEPH 1463 pedigree. Loci were defined as polymorphic if at least two unique ALs were observed among the CEPH 1463 individuals at a given TR locus. We note that this definition of polymorphic STRs is sensitive to both the size of the cohort and the sequencing technology used to genotype STRs. As discussed in previous studies37, the number of polymorphic loci is proportional to the size of the cohort. Moreover, by defining loci as polymorphic if we observed more than one unique AL across the cohort, we may erroneously classify loci as polymorphic if HiFi sequencing reads exhibited a substantial amount of stutter at those loci, producing variable estimates of STR ALs across individuals. In total, 1,096,430 STRs were polymorphic within the cohort. To calculate mutation rates in each G3 individual, we applied the same coverage quality thresholds as described above.
Phasing of TRs
The STRs genotyped by TRGT were phased using HiPhase92 (v.1.0.0-f1bc7a8). We followed HiPhase’s guidelines for jointly phasing small variants, SVs and TRs by inputting the relevant VCF files from DeepVariant, PBSV and TRGT into HiPhase, resulting in three phased VCF files for each analysed sample. We also activated global realignment through the –global-realignment-cputime parameter to improve allele assignment accuracy. Note that HiPhase specifically excludes variants that fall entirely within genotyped STRs from the phasing process. This is motivated because STRs often encompass numerous smaller variants.
hiphase –threads 32 –io-threads 4 –sample-name sample_id –vcf in_vcf_deepvariant –vcf in_vcf_pbsv –vcf in_vcf_trgt –output-vcf out_vcf_deepvariant –output-vcf out_vcf_pbsv –output-vcfout_vcf_trgt –bam in_bam –reference in_reference –summary-file out_summary –blocks-file out_blocks –global-realignment-cputime 300
Parent-of-origin determination
We used the phased genotypes inferred by HiPhase to determine the likely parent of origin for de novo TR expansions and contractions. For each phased de novo allele that we observed in a child, we examined all informative SNVs in that child’s parents ±500 kb from the de novo allele. We defined informative sites using the following criteria: sites must be biallelic SNVs; total read depth in the mother, father and child must be at least 10 reads; Phred-scaled genotype quality in the mother, father and child must be at least 20; the child’s genotype must be heterozygous; and the parents’ genotypes must not be identical-by-state. Using the child’s phased SNV VCF, we then determine whether the child’s REF or ALT allele at the informative site was inherited from either the mother or father. For example, if the mother’s genotype is 0/0, the father’s genotype is 0/1 (note that the parental genotypes need not be phased), and the child’s genotype is 1|0, we know that the child’s first haplotype was inherited from the father and the second haplotype was inherited from the mother. We repeat this process for all informative sites within the ±500 kb interval. We then find the N informative sites that are (1) closest to the de novo TR allele (either upstream or downstream) while (2) supporting a consistent inheritance pattern in the child (that is, all support the same parent of origin for the child’s two haplotypes) and (3) all reside within the same HiPhase phase block (defined using the PS tag in the HiPhase output VCF). Finally, we use the phased TR VCF produced by HiPhase to check whether the de novo allele was phased to either the first or second haplotype in the child. We then confirm that the de novo allele shares the same PS tag as the informative sites identified above and use the N informative sites to determine whether the haplotype to which the de novo allele was phased was probably inherited from either the mother or the father.
Measuring concordance with orthogonal sequencing technology
At each candidate de novo TR allele, we calculated concordance between the de novo ALs estimated by TRGT and the ALs supported by Element, ONT or HiFi reads. We restricted our concordance analyses to autosomal TR loci with a single expansion or contraction (that is, we did not analyse ‘complex’ TR loci containing multiple unique expansions and/or contractions).
TRGT reports two AL estimates for every member of a trio at an autosomal TR locus, and TRGT-denovo assigns one of these two ALs to be the de novo AL in the child. At each TR locus, we calculated the difference between the length of the locus in the reference genome (in base pairs) and each of the two ALs in a given individual. We refer to the difference between the TRGT AL and the reference locus size as the relative AL. We then queried BAM files containing Element, Illumina, ONT or PacBio HiFi reads at each TR locus. Using the pysam library (https://github.com/pysam-developers/pysam), we iterated over all reads that completely spanned the TR locus and had a mapping quality of 60. To estimate the AL of a TR expansion/contraction in a read with respect to the reference genome, we counted the number of nucleotides associated with every CIGAR operation that overlapped the TR locus. For example, an Element read might have the following CIGAR string: 100M2D10M6I32M. For each of the CIGAR operations that overlap the TR locus, we increment a counter by OP * BP, where OP equals 0 for ‘match’ CIGAR operations, 1 for ‘insertion’ operations, and -1 for ‘deletion’ operations, and BP equals the number of base pairs associated with the given CIGAR operation. Thus, at each TR locus, we generated a distribution of net CIGAR operations in each member of the trio.
We used these net CIGAR operations to validate candidate de novo TR alleles in each child. For each de novo TR allele, we calculated the number of Element reads in the child that supported the de novo AL estimated by TRGT (allowing the Element reads to support the de novo AL ± 1 bp). We then calculated the number of Element reads in that child’s parents supporting the de novo AL (also allowing for off-by-one errors). If at least one Element read supported the de novo TR AL in the child, and zero Element reads supported the de novo TR AL in both parents, we considered the de novo TR to be validated.
Validating recurrent TR DNMs
To assemble a confident list of candidate recurrent de novo TR alleles, we first assembled a list of TR loci where two or more CEPH 1463 individuals (in either G2, G3 or G4) harboured evidence for a de novo TR allele. For each candidate locus, we then required that all members of the CEPH 1463 pedigree were genotyped for a TR allele at the locus and had at least 10 aligned HiFi reads at the locus. These filters produced a list of 49 candidate loci where we observed evidence of either intragenerational or intergenerational recurrence. We visually inspected HiFi read evidence using the Integrated Genomics Viewer (IGV)93, as well as bespoke plots of HiFi CIGAR operations, at each locus to determine whether the candidate de novo TR alleles seemed plausible.
Detection and filtering of de novo SVs
We attempted to obtain putative de novo SVs from three different sources. The first one is based on reporting de novo SVs from read-based callsets (PBSV (v.2.9.0), Sniffles94 (v.0.12.0), Sawfish95 (v.2.2)). The second reports putative de novo SVs from variants called in phased genome assemblies. The last used pangenome graphs constructed from phased genome assemblies to report de novo SVs.
Assembly-based detection of de novo SVs
-
(1)
SVPOP8 (v.3.4.0) (https://github.com/EichlerLab/svpop) was used to produce a merged PAV callset across all samples. It merges a single source (single SV caller) across multiple samples. The merge definition used was: nr::ro:szro:exact:match. The samples were provided in this order (G1–G2–G3): NA12889, NA12890, NA12891, NA12892, NA12877, NA12878, NA12879, NA12881, NA12882, NA12883, NA12884, NA12885, NA12886, NA12887.
-
(2)
For each sample in G3, we selected variants unique to that sample alone.
-
(3)
To compare variant calls against the previous generation, SVPOP was used again to do a PBSV/PAV intersection. This involved intersecting the PAV calls for G3 with the PBSV calls for G2, comparing each sample in G3 against each sample in G2.
-
(4)
The callable BED files from PAV, intersections with G2’s PBSV calls, and the list of putative de novo calls went into our validation pipeline.
-
(5)
The pipeline (1) checks if the putative de novo variant was called by PBSV in either parent. (2) Checks if the putative de novo variant is seen in HiFi reads in either parent by running subseq (https://github.com/EichlerLab/subseq). (3) Checks if the variant was in a callable region in either parent. (4) Performs an MSA using DECIPHER of the two haplotypes of the sample, and both parents, in the location of the SV with 1,000 bp flank on either side.
Pangenome graph detection of de novo SVs
Verkko assemblies were partitioned by chromosome by mapping them against the GRCh38, T2T-CHM13 and HG002 (v.1.0.1) human reference genomes using WFMASH (v.0.13.1-251f4e1) pangenome aligner. On each set of contigs, we applied PGGB (v0.6.0-87510bc) to build chromosome-level unbiased pangenome variation graphs96 with the following parameters: -s 20k -p 95 -k 47 -V chm13:100000, grch38:100000. We used the Variation graph toolkit97 (v.1.40.0) to call variants from the graphs with respect to both the T2T-CHM13 and GRCh38 reference genomes. Variants were then decomposed by applying VCFBUB (v.0.1.0-26a1f0c) to retain those found in top-level bubbles that are anchored on the genome used as reference, and VCFWAVE (v.1.0.3) to homogenize SV representation across samples. Subsequently, raw VCF files were used as an input for pedigree-based filtering of putative de novo SVs.
De novo SV filtering in SV callsets (PGGB, PAV, PBSV, Sniffles, Sawfish)
Filtering of de novo SVs was done using BCFtools (v.1.17) +fill-tags followed by filtering the joint-called VCF for singleton-derived alleles at sites where all samples had a genotype call. By considering all G2/G3 family members (not just trios), we increased de novo SV specificity. We used the command line:
bcftools view -i ‘INFO/AC = 1′ VCF FILE | bcftools +fill-tags — -t ‘all,F_MISSING’ | bcftools view -i ‘F_MISSING = 0.0′ –max-alleles 2 | bcftools view –samples SAMPLE | bcftools +fill-tags | bcftools view -i ‘INFO/AC = 1′ | bcftools view -i ‘(ILEN < -49 || ILEN > 49)’ | bcftools view -i ‘QUAL > 49′ | vcf2tsv
All candidate de novo SVs collected across all regions of the genomes were further evaluated using phased genome assemblies and long-read alignments. Further details are provided in Supplementary Note 10.
Extracting donor site of de novo SVA insertion
We first extracted an inserted SVA element in the de novo Verkko assembly of NA12887 (maternal haplotype, haplotype 1). Next, we used minimap2 (ref. 80) (v.2.24) to align this ~3.4-kb-long piece of DNA to both maternal and paternal Verkko assemblies using the parameters reported below:
minimap2 -x asm20 -c –eqx –secondary=yes assembly.fasta sva.fasta > output.paf
With these parameters we reported all locations of this DNA segment. We defined a putative donor site as an alignment position in maternal haplotype that has nearly perfect match with SVA de novo insertion.
Analysis of centromeric regions
To identify completely and accurately assembled centromeres from each genome assembly, we first aligned the genome assemblies generated via Verkko16 or hifiasm (UL)17 to the T2T-CHM13 reference genome1 using minimap2 (ref. 80) and the following parameters: -a –eqx -x asm20 -s 5000 -I 10G -t threads. We then filtered the whole-genome alignments to only those contigs that aligned to the centromeres in the T2T-CHM13 reference genome. We checked whether these centromeric contigs spanned the centromeres by checking to see whether they contained sequence from the p- and the q-arms in the regions directly adjacent to the centromere. We then validated the assembly of the centromeric regions by aligning native PacBio HiFi data from the same source genome to each whole-genome assembly using pbmm2 (v.1.1.0; https://github.com/PacificBiosciences/pbmm2) and the following command: align –log-level DEBUG –preset SUBREAD –min-length 5000 -j threads, and next assessed the assemblies for uniform read depth across the centromeric regions via NucFreq77 (v.0.1). We also aligned native ONT data >30 kb in length from the same source genome to each whole-genome assembly using minimap2 (v.2.28) and assessed the assemblies for uniform read depth across the centromeric regions using IGV browser93.
To identify de novo SVs and SNVs within each centromeric region, we first aligned each child’s genome assembly to the relevant parent’s genome assembly using minimap2 and the following parameters: -a –eqx -x asm20 -s 5000 -I 10G -t threads. We then used the resulting PAF file to identify de novo SVs and SNVs using SVbyEye85 (v.0.99.0), filtering our results to only those centromeres that were completely and accurately assembled. We checked each SV and SNV call with NucFreq, Flagger9 and native ONT data to ensure that the underlying data supported each call. Further details are provided in Supplementary Notes 9 and 10.
Analysis of telomeric regions
We processed all G1, G2 and G3 assemblies with Tandem Repeats Finder (TRF)91 to determine the existence of the canonical telomeric repeat (p-arm, CCCTAA; q-arm, TTAGGG) within the distal regions of each assembled contig; TRF (v.4.09.1) was run with parameters: ‘2 7 7 80 10 50 10 -d -h-ngs’, recommended for young (in this context, non-deteriorated) repeats as implemented in RepeatMasker (v.4.1.6). The assembled contigs, in turn, were aligned to the T2T-CHM13 reference with minimap2 (ref. 80) (v.2.24) using the asm20 preset to establish the identities of each sequence (that is, whether a given contig represented the whole reference chromosome or a part of it, and whether it should be reverse-complemented to represent it canonically). With identities established, TRF annotations were crawled from the outside in (from the 5′ end on p-arms and from the 3′ end on q-arms, with respect to reverse complementarity as reported by minimap2) until the canonical repeat was encountered; incidences of non-canonical interspersed repeats were also retained.
Moreover, PacBio HiFi reads were mapped to the contigs to assess by how many HiFi reads each region of each assembly was supported (coverage depth); distal regions supported by fewer than five HiFi reads were masked. Of the non-acrocentric chromosome ends across all G1, G2 and G3 samples, 74.2% of the Verkko assemblies (893 out of the possible 1,204 across all participants and haplotypes) were found to terminate in a canonical telomeric repeat (either spanning from the very start or end of the contig, or immediately adjacent to the region masked due to low coverage) with the median length of such repeats being 5,608 bp (Supplementary Table 3). Moreover, out of the T2T-CHM13 chromosomes for which both p and q telomeric ends were recovered, 64.6% (221 out of 342) were represented each by a single assembled contig spanning from the p telomere to the q telomere.
The G4 hifiasm assemblies were processed in the same fashion; however, only 56.8% of the telomeric regions (342 out of the possible 602) were recovered (Supplementary Fig. 3) with a median length of the canonical repeat being 4,674 bp (Supplementary Table 3; same as for G1–G3), and the contiguity was markedly worse: only one chromosome (chromosome 9 in haplotype 1 of individual G4-200101) was verifiably spanned by a single contig (h1tg000017l).
CpG methylation analysis
To determine the CpG methylation status of each centromere, we first base called raw ONT data with Guppy (https://community.nanoporetech.com; v.6.5.7) using the sup-prom model and the dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_sup_prom.cfg config file. Next, we aligned the ONT data from each sample to the respective genome assembly using minimap2 (ref. 80) (v.2.28) with the following parameters: -ax lr:hq -y -t 4 -I 8 g. We converted the resulting BAM file to a bedMethyl file using modbam2bed (https://github.com/epi2me-labs/modbam2bed) and the following parameters: -e -m 5mC –cpg -t threads input.bam > output.bed. Next, we converted the bedMethyl file into a bedGraph using the following command: awk ‘BEGIN OFS=“\t”; print $1, $2, $3, $11’ input.bed | grep -v “nan” | sort -k1,1 -k2,2n > output.bedgraph and subsequently converted the bedGraph into a bigwig using bedGraphToBigWig (https://www.encodeproject.org/software/bedgraphtobigwig/) and then visualized the bigwig file using Integrative Genomics Viewer93,98 (v.2.16.0). To determine the size of a hypomethylated region (termed the CDR2,39) in each centromere, we used CDR-Finder (https://github.com/arozanski97/CDR-Finder), which first bins the bedGraph into 5 kb windows, computes the median CpG methylation frequency within windows containing α-satellite (as determined by RepeatMasker99 (v.4.1.0)), selects bins that have a lower CpG methylation frequency than the median frequency in the region, merges consecutive bins into a larger bin, filters for merged bins >50 kb and reports the location of these bins.
Y-chromosomal analysis
Construction and dating of Y phylogeny
The construction and dating of Y-chromosomal phylogeny for 58 total samples, combining the 14 pedigree males from the current study with 44 individuals, for which long-read-based Y assemblies have previously been published, was done as described previously in detail52. In brief, all sites were called from the Illumina high-coverage data14 of the 14 pedigree males using the approximately 10.4 Mb of Y-chromosomal sequence previously defined as accessible to SRS100. BCFtools101,102 (v.1.16) was used with a minimum base quality 20, mapping quality 20 and ploidy 1. SNVs within 5 bp of an indel call (SnpGap) and all indels were removed, followed by filtering all calls for a minimum read depth of 3 and a requirement of ≥85% of reads covering the position to support the called genotype. The VCF was merged with a similarly filtered VCF from ref. 52 for the 44 individuals using BCFtools, and then sites with ≥5% of missing calls, that is, missing in more than 3 out of 58 samples, were removed using VCFtools103 (v.0.1.16). After filtering, a total of 10,404,104 sites remained, including 13,443 variant sites.
The Y haplogroups of each sample were predicted as previously described104 and correspond to the International Society of Genetic Genealogy nomenclature (ISOGG; https://isogg.org; v.15.73). A coalescence-based method implemented in BEAST105 (v.1.10.4) was used to estimate the ages of internal nodes. RAxML106 (v.8.2.10) with the GTRGAMMA substitution model was used to construct a starting maximum-likelihood phylogenetic tree for BEAST. Markov-chain Monte Carlo samples were based on 200 million iterations, logging every 1,000 iterations, with the first 10% of iterations discarded as a burn-in. A constant-sized coalescent tree prior, the GTR substitution model, accounting for site heterogeneity (gamma), and a strict clock with a normal distribution based on the 95% CI of the substitution rate (0.76 × 10−9 (95% CI = 0.67 × 10−9–0.86 × 10−9) single-nucleotide mutations per base pair per year) was used107. A summary tree was produced using Tree-Annotator (v.1.10.4) and visualized using the FigTree software (v.1.4.4).
Identification of sex-chromosome contigs
Detailed analysis of Y-chromosomal DNMs focused on seven male individuals (R1b1a-Z302 Y haplogroup, G1-NA12889, G2-NA12877, G3-NA12882, G3-NA12883, G3-NA12884 and G3-NA12886) for whom phased Verkko assemblies were generated. Contigs containing X- and Y-chromosomal sequences were identified and extracted from the whole-genome assemblies as previously described52. Moreover, the pseudoautosomal regions from the G1 grandmother NA12890 and G2 mother NA12878 genome assemblies were identified by aligning the respective sequences from the T2T-CHM13 reference genome to these assemblies using minimap2 (ref. 80) (v.2.26).
Annotation of Y-chromosomal subregions
The annotation of Y-chromosomal subregions of the Verkko assemblies was performed using both the GRCh38 and T2T-CHM13 Y reference sequences as previously described52. The centromeric α-satellite repeats for the purpose of Y subregion annotation were identified using RepeatMasker99 (v.4.1.2-p1) with the default parameters. The Yq12 repeat annotations were generated using HMMER108 (v.3.3.2dev) with published DYZ1, DYZ2, DYZ18, 2k7bp and 3k1bp sequences52, followed by manual checking of repeat unit orientation and distance from each other. Dot plots to compare Y-chromosomal sequences were generated using Gepard109 (v.2.0).
Detection and validation of DNMs
Human Y chromosomes vary extensively in the size and composition of repetitive regions52, including the T2T-CHM13 Y (haplogroup J1a-L816) and the R1b1a-Z302 haplogroup Y chromosomes carried by the seven pedigree males (Supplementary Note 6). For this reason, the Y assembly of the G1 grandfather NA12889 was used as a reference for DNM detection. The DNMs were called from the Y assemblies of five G2 (NA12877) and G3 (NA12882, NA12883, NA12884 and NA12886) males using Dipcall82 (v.0.3) with the default parameters recommended for male samples. Variants were identified from the MSY only, that is, the pseudoautosomal regions were excluded from this analysis. All identified variants were filtered as follows: any variant calls overlapping with regions flagged by Flagger or NucFreq in either reference or query assembly were filtered out.
For SNVs, the final filtered calls were supported by 100% of HiFi reads (that is, no reads supported the reference allele in offspring or alternative allele in the father) and ONT reads mapped to both the reference and each individual assembly were checked for support.
For indels (≤50 bp), homopolymer tracts were excluded from the analysis, while the rest of the calls were validated using the read data (HiFi, ONT, Illumina) as follows. Individual reads mapped to the reference (G1 NA12889 Y assembly) and covering the indel call plus 150 bp of flanking sequence were extracted from all samples using subseq (https://github.com/EichlerLab/subseq), followed by alignment using MAFFT110,111 (v.7.508) with the default parameters. All alignments were manually checked and any calls where the HiFi data had two or more reads supporting a reference allele and one or more reads supporting an alternative allele were removed. All final SNV and indel calls were additionally supported (if unique mapping to the region was possible) by both Illumina and Element read data mapped to the reference.
For all SV calls, HiFi read depth for reference and alternative alleles were visualized and SVs in regions showing high levels of read depth variation coinciding with clusters of SNVs with >10% of reads supporting an alternative allele removed. HiFi and ONT reads mapped to both the reference and individual assemblies were checked for support.
For all variants, concordance with the expected transmission through generations was confirmed. Moreover, the HiFi data available for three G4 male individuals (200101, 200102 and 200105) were checked for support of the identified variants.
Y-chromosomal DNM rate calculation
The assembly-based DNM rates were calculated for each of the five male individuals based on the accessible regions of each individual Y assembly (that is, any regions flagged by Flagger and/or NucFreq were removed).
Mobile element analysis
Mobile element analysis was performed on PacBio HiFi reads using xTea112 (v.0.1.9). Potential non-reference mobile element insertions (MEI) identified with xTea were visualized using IGV to ensure that the insertions were identifiable in the sequencing reads and to determine whether any of these events were de novo. Using BEDTools113, we intersected the non-reference insertions with introns, exons, 5′-UTRs and 3′-UTRs from T2T-CHM13. To identify potential source elements of the non-reference LINE-1 insertions, we used BLAT114 to find the best matching insertion in the T2T-CHM13 reference genome. If there were multiple matches in the reference genome that had the same score, a source element was not called. MEI sequences representing known Alu, L1 and SVA subclasses were obtained from previous work115, Dfam116 and UCSC Genome Browser74. Reference and novel sequences for each MEI class were combined into class-specific files. Sequences were oriented to plus-strand. Highly truncated sequences were removed. MEI sequences were aligned using the MUSCLE117 (v.3.8.31) aligner. Pairwise distances among MEI sequences were calculated using a Kimera two-parameter method and then converted to correlations. Principal components were obtained by eigenvalue decomposition of the pairwise correlation matrix. The first three principal components were plotted to visualize the relationships among the non-reference MEIs and the known MEI subfamily sequences.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.