Cell line and reagents
HAP1 cells (Horizon Discovery) were maintained in IMDM with 10% FBS and 1% penicillin–streptomycin. For haploidy sorting, 1 × 10^7 HAP1 cells were resuspended in 5 μg ml–1 Hoechst 34580 (BD, 565877) and sorted at 4 °C. HAP1 cells were transfected using Turbofectin 8.0 (Origene). All oligonucleotides and primers were synthesized by Integrated DNA Technologies.
Generation of site-saturation mutagenesis libraries and Cas9–sgRNA plasmids
Exons 15–26 encoding the BRCA2 DBD, and adjacent upstream and downstream 10-bp intronic regions flanking each exon, were selected for SGE. Exons 18 and 25 were split into amino-terminal-targeted and carboxy-terminal-targeted regions because of their large exon size, which resulted in a total of 14 SGE target regions. Multiple sgRNAs were designed using the Benchling design tool. sgRNA-annealed oligonucleotides were ligated into pSpCas9(BB)-2A-Puro (PX459 v.2.0) (Addgene, 62988) following BbsI (New England Biolabs, R0539L) digestion to create a Cas9–sgRNA co-expression construct for each individual SGE. For each SGE, 600−1,000 bp homologous arms upstream and downstream of the target region were amplified from wild-type HAP1 gDNA and cloned into a BamHI-HF-digested pUC19 vector using a NEBuilder HiFi DNA assembly Cloning kit. Cloned plasmid backbones were subjected to site-saturation mutagenesis by inverse PCR34 using mutagenized codon NNN primers for all possible nucleotide changes at each amino-acid position. A protospacer protection edit encoding a silent mutation was introduced by site-directed mutagenesis into the protospacer adjacent motif site or the sgRNA recognition site of each target region to prevent re-cutting by the Cas9–sgRNA after successful editing. Furthermore, a single 3-nucleotide mutation was introduced into the introns of each homologous arm to facilitate specific reamplification of the targeted DNA.
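The NNN codon design described above can be illustrated with a minimal Python sketch. It enumerates every non-reference codon at each amino-acid position of an in-frame sequence and then picks out the subset that differs from the reference by a single nucleotide. The function names and the two-codon toy sequence are hypothetical and are not taken from the study; the sketch only shows the combinatorics (63 alternative codons per position, 9 of which are SNVs), not the primer design itself.

```python
from itertools import product

BASES = "ACGT"


def saturate_codons(coding_seq):
    """Yield (codon_index, ref_codon, alt_codon) for every non-reference codon.

    coding_seq is assumed to be in frame and to contain only A, C, G and T.
    """
    if len(coding_seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    for start in range(0, len(coding_seq), 3):
        ref = coding_seq[start:start + 3]
        for alt in map("".join, product(BASES, repeat=3)):
            if alt != ref:
                yield start // 3, ref, alt


def n_differences(a, b):
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical two-codon toy sequence (not a BRCA2 sequence).
variants = list(saturate_codons("ATGGCC"))
snv_only = [v for v in variants if n_differences(v[1], v[2]) == 1]
```

In this toy case `variants` holds all alternative codons for both positions and `snv_only` holds only the single-nucleotide changes, which is the class of variant scored later in the analysis.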
CRISPR–Cas9 SGE
Multiple sgRNAs with predicted high editing efficiencies in HAP1 cells were evaluated in SGE experiments for each target region, and the optimal sgRNAs were selected (Supplementary Table 1). In each SGE experiment, 5 million haploid-sorted HAP1 cells were co-transfected with 4 μg of the target-specific variant library and 16 μg of the Cas9–sgRNA targeting construct. Cells were selected in puromycin (1 μg ml–1) for 3 days. Cells were collected at D0, D5 (24 h after puromycin selection) and D14 after transfection, and gDNA was extracted using a Monarch Genomic DNA Purification kit (New England Biolabs, T3010L). Target regions were amplified by PCR to add barcodes for multiplexing. All PCRs were performed in 50 μl volumes using Q5 High-Fidelity 2× master mix (New England Biolabs, M0492L). Primers for gDNA amplification are provided in Supplementary Table 2. All reactions were cleaned and concentrated using Ampure XP beads before sequencing for 150 cycles on an Illumina MiSeq (approximately 5 million reads per run) or NextSeq (approximately 30 million reads per run) instrument. Base calling was performed using the instrument control software, and reads were further processed using a customized algorithm.
Sequencing data processing
FASTQ files from Illumina MiSeq or NextSeq runs were trimmed for adapter sequences using cutadapt (v.3.5). SeqPrep (v.1.2) was used to merge paired-end reads into single reads, which were aligned to the human reference genome (GRCh38) using bwa-mem (v.0.7.17). Following alignment, the custom-developed tool CountReads was used to analyse the DNA-sequencing data, with a particular focus on the identification and characterization of mutations. CountReads prepared reference amino-acid and DNA sequences, validated sequencing data integrity and trimmed reads precisely to the relevant regions. It also differentiated between variant types, confirmed the presence of specific variants, and aggregated and reported variant data. CountReads produced a variant call format (VCF) file, which was annotated using CAVA35. The SpliceAI tool (v.1.3.1)36 was used to evaluate splicing effects associated with all observed SNVs.
Functional read count process
The log2 ratio between the frequency of D14 and D0 read counts was used to measure the depletion or enrichment effect for each variant. The comparison between experimental D0 and D5 was used for positional adjustment using a Loess transformation6. Variants with under-represented read counts (<10) at D0 and D5 were excluded from further analysis. log2 ratios of variants were linearly scaled within each exon across replicate experiments relative to median silent and median nonsense SNV values. For each variant, the average score was calculated from all non-missing values among replicates. Linear scaling was used to normalize scores across exons using median synonymous and nonsense values, similar to the within exon normalization. After completion of all data cleaning and quality control, a raw functional score was available for 6,959 SNVs (Supplementary Table 3).
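The scoring and scaling steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: it omits the Loess positional adjustment and the read-count filter, and the target values (median synonymous mapped to 0, median nonsense mapped to −1) are an assumption chosen for the example, because the text does not state the anchor values. All function names and toy numbers are hypothetical.

```python
import math
from statistics import median


def log2_ratio(count_d14, total_d14, count_d0, total_d0):
    """log2 of the D14 variant frequency over the D0 variant frequency."""
    freq_d14 = count_d14 / total_d14
    freq_d0 = count_d0 / total_d0
    return math.log2(freq_d14 / freq_d0)


def scale_scores(raw, consequence, syn_target=0.0, non_target=-1.0):
    """Linearly rescale raw scores using the synonymous and nonsense medians.

    After scaling, the median synonymous score equals syn_target and the
    median nonsense score equals non_target (both targets are assumptions).
    """
    syn = median(s for s, c in zip(raw, consequence) if c == "synonymous")
    non = median(s for s, c in zip(raw, consequence) if c == "nonsense")
    if syn == non:
        raise ValueError("synonymous and nonsense medians coincide")
    slope = (syn_target - non_target) / (syn - non)
    return [non_target + slope * (s - non) for s in raw]


# Hypothetical toy data for a single exon and replicate.
raw = [0.1, -0.1, -3.9, -4.1, -2.0]
consequence = ["synonymous", "synonymous", "nonsense", "nonsense", "missense"]
scaled = scale_scores(raw, consequence)
```

Applying the same function a second time, with medians taken across all exons, would correspond to the across-exon normalization described in the text.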
VarCall model for assessment of evidence of pathogenicity
Replicate-level variant frequencies were computed at each assay time point (D0, D5 and D14) by dividing the variant read count by the replicate total for each exon. To remove positional bias, the positional effect was estimated using the ratio between D0 and D5 read counts, using replicate-level generalized additive models with exon-specific adaptive splines21. The VarCall model37 was applied to the positionally adjusted log ratio of the D14 and D0 read counts. VarCall is a class of Bayesian hierarchical model with context-specific measurement models that embed a Gaussian two-component mixture model for the variant effects. The formulation used here is based on a previous analysis of BRCA2 variants8. Variants were each assigned a binary indicator of pathogenicity status: deterministically if assumed known and probabilistically if not. Silent variants were assumed benign and nonsense variants pathogenic. The measurement model adjusted for batching by including replicate by exon-level location and scale random effects and included t-distributed error terms to allow for outliers. The JAGS language38 was used to specify and fit the VarCall model using a MCMC algorithm. All related computations were carried out in the R programming language22. A prior probability of pathogenicity of 0.2 for variants in the DNA-binding region was used based on a predicted frequency of 0.23 for pathogenic variants in this region by AlphaMissense. Using the MCMC output, the Bayes factor in favour of pathogenicity for each variant was computed. The thresholds for the Bayes factor based on strength of evidence of pathogenicity or benign level (PStrong, PModerate or PSupporting, VUS, BStrong, BModerate or BSupporting) were derived from the Bayesian interpretation of the ACMG–AMP guidelines23. Full details of the analysis are available in the Supplementary Methods.
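As an illustration of the final step, a Bayes factor can be obtained from a posterior probability of pathogenicity and the prior of 0.2 as the ratio of posterior odds to prior odds. The Python sketch below shows that arithmetic and a mapping to evidence categories. The cut-offs used (2.08, 4.33 and 18.7, and their reciprocals) are the odds values commonly quoted for the Bayesian interpretation of the ACMG–AMP guidelines and are an assumption here; the thresholds and boundary conventions actually applied are those given in the Supplementary Methods. Function names are hypothetical.

```python
def bayes_factor(posterior, prior=0.2):
    """Bayes factor in favour of pathogenicity: posterior odds / prior odds."""
    if not 0.0 < posterior < 1.0 or not 0.0 < prior < 1.0:
        raise ValueError("probabilities must lie strictly between 0 and 1")
    posterior_odds = posterior / (1.0 - posterior)
    prior_odds = prior / (1.0 - prior)
    return posterior_odds / prior_odds


# Assumed odds thresholds (supporting, moderate, strong); not from this paper.
SUPPORTING, MODERATE, STRONG = 2.08, 4.33, 18.7


def evidence_category(bf):
    """Map a Bayes factor to an evidence label using the assumed cut-offs."""
    if bf >= STRONG:
        return "PStrong"
    if bf >= MODERATE:
        return "PModerate"
    if bf >= SUPPORTING:
        return "PSupporting"
    if bf > 1.0 / SUPPORTING:
        return "VUS"
    if bf > 1.0 / MODERATE:
        return "BSupporting"
    if bf > 1.0 / STRONG:
        return "BModerate"
    return "BStrong"
```

For example, a posterior probability of 0.5 with a prior of 0.2 gives a Bayes factor of 4, which falls in the supporting range under these assumed cut-offs.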
Three-dimensional structural modelling
BRCA2 functionally PStrong missense alterations were mapped in the DBD using PyMol software. The Protein Data Bank source file (identifier 1MJE) was downloaded from the NCBI Molecular Modeling Database. Three-dimensional structural modelling was based on the crystal structure of a BRCA2–DSS1–ssDNA complex39.
Multi-species amino-acid sequence conservation and in silico pathogenicity prediction
BRCA2 amino-acid sequences were obtained from Align-GVGD (http://agvgd.hci.utah.edu/). Sequence alignments were performed using ten species: Homo sapiens, Pan troglodytes, Macaca mulatta, Rattus norvegicus, Canis familiaris, Bos taurus, Monodelphis domestica, Gallus gallus, Xenopus laevis and Tetraodon nigroviridis. Sequence conservation analyses were performed on amino-acid residues that contained BRCA2 DBD functionally pathogenic variants. Align-GVGD26, AlphaMissense27 and Bayes-Del40 were used for in silico pathogenicity prediction.
Study populations
Breast cancer and ovarian cancer cases and associated clinical phenotypes were collected from individuals receiving cancer genetic testing by Ambry Genetics. Publicly available reference controls were women from gnomAD (v.2.1, v.3.1 and v.4 excluding the UK Biobank). Matched case–control data for breast cancer were also available from the CARRIERS and BRIDGES population-based breast cancer studies2,29, and breast cancer case–control data were available from the UK Biobank (www.ukbiobank.ac.uk). Variants with an allele frequency of >0.001 were excluded from the analyses.
Comparison with other BRCA2 functional assays
SGE functional results were compared with those from other studies, including a BRCA2-deficient cell-based HDR assay7, a BRCA2-deficient cell line–based drug assay24, a prime-editing-based SGE study16 and a mouse embryonic-stem-cell-based functional analysis25.
ACMG–AMP framework for classification of BRCA2 DBD variants
The ACMG–AMP rule-based framework combines evidence from population, computational and predictive, segregation, functional, and other data, with each contributing source weighted as very strong (PVS1), strong (PS1, PS2, PS3 and PS4), moderate (PM1, PM2, PM3, PM4, PM5 and PM6) or supporting (PP1, PP2, PP3, PP4 and PP5) evidence for pathogenic effects, or stand-alone (BA1), strong (BS1, BS2, BS3 and BS4) or supporting (BP1, BP2, BP3, BP4, BP5, BP6 and BP7) for benign effects. The combined data produce variant classifications of benign, LB, pathogenic, LP and VUS9. In this study, ACMG–AMP scoring rules established by the ClinGen BRCA1/2 VCEP were used for clinical classification of BRCA2 DBD SNVs. The BRCA2 functional data were integrated into the ClinGen–ACMG–AMP BRCA1/2 VCEP classification model under the PS3/BS3 rule. The values for functional evidence were capped at +4 and –4 on the log scale to avoid LP or LB classification with functional evidence alone. The study was approved by the Western Institutional Review Board, which exempted review of the clinical testing cohort, and by the Mayo Clinic Institutional Review Board (21-008216). Detailed ACMG–AMP criteria used in this study are provided in the Supplementary Methods.
Tumour LOH analysis
LOH status for breast, ovarian, pancreatic, and prostate cancer tumours carrying germline BRCA2 DBD variants was acquired from tumour–normal paired sequencing using the IMPACT dataset32. The FACETS algorithm41 was used to determine LOH from matched tumour–normal pairs. Only tumour samples with >40% tumour content were included in the analysis.
Statistical analysis
Associations between variant classification groups in BRCA2 and the risk of breast cancer or ovarian cancer were assessed for women who received genetic testing from Ambry Genetics and for women without cancer in gnomAD (v.2.1, v.3.1 and v.4 (excluding UK Biobank, from v.4)) using weighted logistic regression, with the control populations weighted for the relative frequencies of different races and ethnicities in the cases. Associations in the population-based CARRIERS and BRIDGES matched breast cancer cases and unaffected women (as controls) and in UK Biobank breast cancer cases and controls were assessed using Fisher’s exact test. Phenotypic comparisons between cases with functionally pathogenic and benign variants were conducted using Student’s t-test for quantitative variables and a Chi-squared test for qualitative variables. Lifetime absolute risks of breast cancer or ovarian cancer (malignant epithelial tumours of the ovary or fallopian tube) up to age 80 years were estimated for different classification groups by combining OR estimates with age-specific breast cancer or ovarian cancer incidence rates (restricted to individuals who identified as non-Hispanic white) from the SEER Program of the National Cancer Institute, accounting for all-cause mortality rates2. One-way analysis of variance tests were conducted to compare functional scores between the functional categories defined by other BRCA2 functional assays. Fisher’s exact tests were used in the tumour LOH analysis. All analyses were performed with R software (v.4.2.2) and all tests were two-sided. SGE data in bar graphs or scatter plots are presented as means from replicate experiments.
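The lifetime absolute risk calculation can be sketched in Python as below. This is one common discrete-time formulation under explicit assumptions, not necessarily the procedure used in the study: the OR is treated as a rate ratio applied to the population incidence in each one-year age band, all-cause mortality acts as a competing risk, and rates are constant within a band. The function name and the rates used in the example are hypothetical and are not SEER values.

```python
import math


def lifetime_risk(odds_ratio, incidence, mortality):
    """Cumulative cancer risk over consecutive one-year age bands.

    incidence and mortality are per-person-year rates, one value per band.
    """
    if len(incidence) != len(mortality):
        raise ValueError("incidence and mortality must have equal length")
    at_risk = 1.0  # probability of being alive and cancer-free so far
    risk = 0.0
    for inc, mort in zip(incidence, mortality):
        hazard = odds_ratio * inc  # assumed carrier-specific cancer hazard
        total = hazard + mort      # any exit from the at-risk state
        if total > 0.0:
            p_exit = 1.0 - math.exp(-total)
            # Share of exits in this band that are due to cancer.
            risk += at_risk * p_exit * hazard / total
            at_risk *= 1.0 - p_exit
    return risk


# Hypothetical flat rates from birth to age 80 (illustration only).
toy_incidence = [0.001] * 80
toy_mortality = [0.005] * 80
risk_reference = lifetime_risk(1.0, toy_incidence, toy_mortality)
risk_carrier = lifetime_risk(5.0, toy_incidence, toy_mortality)
```

With mortality set to zero the function reduces to one minus the exponential of the negative cumulative hazard, which is a convenient sanity check.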
Ethics statement
All data shown in this paper are provided with the explicit written consent of the study participants following approval from the institutional review boards.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.