Thursday, June 18, 2026
No menu items!
HomeNatureAnalysis of 173,303 exomes and genomes in the Pakistan Genome Resource

Analysis of 173,303 exomes and genomes in the Pakistan Genome Resource

Ethics statement

The institutional review board (IRB) at the Center for Non-Communicable Diseases (CNCD) (IRB: 00007048, IORG0005843 and FWAS00014490) approved the study. This study has also been approved by the National Bioethics Committee for Research Pakistan (reference number NBC-756). All participants gave written informed consent, which includes publication of relevant indirect identifiers.

Recruitment

Cases and control individuals were recruited in both public and private hospitals across Pakistan. A breakdown of participants is outlined in Supplementary Table 1. Phenotypes were defined based on confirmed clinical diagnoses established by primary physicians, supported by relevant laboratory investigations and standardized diagnostic criteria. Participants were identified at the time of hospital presentation and enrolled following diagnostic confirmation and written and oral informed consent was obtained. Cases were identified according to predefined inclusion and exclusion criteria. Control individuals were selected from the same hospitals, typically unrelated or distantly related case attendants who met control criteria. To minimize selection bias, enrolment was carried out across both urban and suburban hospitals that also serve surrounding rural populations.

PGR sequencing

Exome sequencing was performed in multiple batches at two different locations: the Broad Institute as described previously4 and the Regeneron Genetics Center (RGC) as described previously using either IDT probes or the Twist human comprehensive exome panel75,76,77. Whole-genome sequencing (WGS) was performed as part of the TOPMed consortium78.

PGR sample quality control, variant quality control and variant annotation

Samples were removed if there was discordance between their reported and computationally predicted sex, or if they were predicted to be genetic duplicates within or across sequencing batches. In the RGC datasets, samples were additionally removed if fewer than 80% of bases were covered at 20× or >5% human contamination was observed79. For a subset of the RGC datasets, whole plates were removed if self-reported gender was discordant with computationally predicted sex for >10% of samples. After filtering, a total of 166,625 WES and 6,678 WGS samples were analysed. Variant-level quality control was applied separately to each sequencing batch. Before applying quality control filters, multiallelic sites were decomposed and all variants were left-aligned and normalized. In the WES dataset sequenced at the Broad Institute, individual variant calls were set to missing if they did not meet all of the following criteria: Quality by Depth (QD) ≥ 2 for insertion–deletion mutations (indels), or QD ≥ 3 for single nucleotide variants (SNVs); Variant Quality Score Recalibration (VQSR) = “PASS” or (VQSR = “VQSRTrancheSNP99.60to99.80” and allele count = 1); 0.25 ≤ allele balance ≤ 0.75 for heterozygous calls, or allele balance ≥ 0.9 for homozygous calls; Depth (DP) ≥ 10. For a batch of WGS data sequenced at the Broad Institute, calls with DP < 10 were removed. In the datasets sequenced at the RGC, individual variant calls were set to missing if they did not meet all of the following criteria: 0.25 ≤ allele balance ≤ 0.75 for heterozygous calls, or allele balance ≥ 0.9 for homozygous calls; DP ≥ 7 for SNVs, or DP ≥ 10 for indels. After applying the above variant quality control steps, variants were removed entirely if they had a call rate <0.9. All variant quality control steps were performed using Hail. Variants that were filtered in gnomAD v4.1 were removed from PGR. Variants were annotated using Variant Effect Predictor (VEP) v107 and LOFTEE13,80. All variant annotations are reported relative to the Ensembl canonical transcript v107.

Admixture analysis

A reference panel was created using the individuals of specific subgroups from the 1KG dataset: SAS, n = 492; EAS, n = 515; AFR, n = 671; and EUR, n = 521 (ref. 15); 752 Central or South Asian, European, Middle Eastern, or African participants from the HGDP16,17; and 200 unrelated individuals from each self-reported ethnicity within PGR. Data from 1KG were downloaded from https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ and https://www.internationalgenome.org/data-portal/sample. HGDP VCF files were downloaded from https://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/. The dataset was merged after filtering to variants with MAF > 1% in each of 1KG, HGDP, and each PGR sequencing tranche, and then linkage disequilibrium-pruned using the PLINK (v2.00a3.7) flag ‘–indep-pairwise’ with a window size of 200 kb and an r2 threshold of 0.5 (ref. 81). Unsupervised clustering analysis using ADMIXTURE82 was run on this set of reference samples for k = 3 to k = 25 using fivefold cross-validation. k = 17 was selected as it minimized cross-validation error. The remaining PGR samples were then projected into the k = 17 results using ‘-P’.

Pairwise population F
ST calculations

The Weir and Cockerham weighted FST value14 was calculated using VCFtools (v0.1.16) with the ‘–weir-fst-pop’ flag for all pairs of populations in HGDP, 1KG and PGR. PGR samples were grouped into subpopulations based on self-reported ethnicity. The same variant filtering as was applied for the admixture analysis was applied prior to calculating FST. All samples were used.

Principal component analysis

PCA was performed for PGR (n = 173,303) jointly with individuals from 1KG that were included in the admixture and FST analyses. The PGR and 1KG data were merged after filtering to a set of variants with MAF > 1% in the 1KG dataset as well as all PGR data batches and then linkage disequilibrium-pruned using the PLINK (v2.00a3.7) flag ‘–indep-pairwise’ with a window size of 200 kb and an r2 threshold of 0.5 (ref. 81). PCA was then performed on this pruned dataset using PC-AiR as implemented in the GENESIS package (v2.36.0)83. A second PCA to assess variation within South Asia was restricted to 1KG South Asian individuals and PGR participants with <2.5% predicted African admixture.

Runs of homozygosity

ROH were called using the BCFtools (v1.18) roh plugin with genotype likelihoods set to 30 (‘-G30’) and a genetic recombination map (‘–genetic-map’) based on the 1000 Genomes Project downloaded from https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz, for each data-sequencing tranche with tranche-specific allele frequencies84. For each sample, the overall level of autozygosity (FROH) was calculated as the overall fraction of the autosomal genome falling within a run-of-homozygosity. Regions upstream or downstream of the first or last covered bases respectively, and large intergenic regions >3 Mb, were excluded from the calculation of FROH. 1KG variant calls were subset to coding regions plus or minus 50 bases, and then ROH calling was performed with the same parameters as were used for PGR WES samples.

LoF variant filtering

To identify a set of high-confidence pLoF variants, we applied additional filters to frameshift, stop gain, splice acceptor, and splice donor variants in both PGR and gnomAD, which includes UK Biobank. As rare variants are more likely to have large impacts, we removed any variant with an allele frequency >1% in any PGR sequencing platform or with a gnomAD v4.1.0 group max allele frequency >1%. The group max allele frequency measures the highest allele frequency across non-bottlenecked populations in gnomAD. An important category of variants that may be misannotated as pLoF variants are frameshift and stop gain variants near the ends of transcripts, as these variants may escape nonsense-mediated decay while resulting in minimal impact on the overall protein sequence. We replaced the LOFTEE END_TRUNC filter with a filter that removes variants in the last 5% of transcripts, consistent with earlier implementations of LOFTEE (see Supplementary Fig. 2 and  Supplementary Information, ‘End-truncating pLoF variants’ for details).

Variants were then filtered if they met any of the following criteria:

  1. 1.

    Low complexity regions: in a low complexity region as defined by gnomAD.

  2. 2.

    NAGNAG splice sites: flagged by LOFTEE as a splice acceptor site that may be rescued by a nearby in-frame acceptor site.

  3. 3.

    Non-canonical splice sites: flagged by LOFTEE as falling in an intron with a non-canonical splice site.

  4. 4.

    First codon: falls in the first codon and may be rescued by a nearby downstream alternative methionine start codon.

  5. 5.

    Pangolin low scoring: the variant is a splice acceptor or donor loss and has a Pangolin22 score <0.6.

  6. 6.

    Pangolin rescue: the variant is a splice acceptor or donor loss and is predicted to result in a nearby compensatory in-frame acceptor–donor gain by Pangolin. Splice variants were considered rescued if the delta between the Pangolin score for the splice site loss and an in-frame alternative splice site gain was less than 0.4.

  7. 7.

    G>GT plus strand donor: the variant is a splice donor variant on the plus strand and inserts a T, retaining the GT donor site.

  8. 8.

    C>CT minus strand acceptor: the variant is a splice acceptor variant on the minus strand and inserts a T, retaining the CT (on the minus strand) acceptor site.

  9. 9.

    Plus strand exon–intron junction duplication: the variant is a duplication and is called a frameshift by VEP due to left alignment of variants. When the variant is right aligned, the actual effect of the variant is revealed to be a duplication of the intronic sequence.

  10. 10.

    Multi-nucleotide stop-gain rescue: the variant is called a stop-gain. When combined with a separate variant call in the same codon, the effect is either a missense or synonymous variant.

  11. 11.

    Frame-restoring insertion/deletion: the variant is called a frameshift, but a second frameshift called in the same exon restores the frame.

Filters 10 and 11 were only applied to PGR samples as these filters require individual-level data. As a result, per-sample and per-gene homLoF estimates in gnomAD may be inflated relative to PGR counts. Additionally, as the PGR WES data are unphased, filters 10 and 11 can only be applied to homozygous stop-gain and frameshift variants. Filters 10 and 11 were not applied for the recessive constraint calculations (described below), to avoid any bias that may result from applying a filter exclusively to homozygous variants.

PGR subsampling homLoF calculations

We assessed the homLoF gene discovery potential of PGR relative to each genetic group in gnomAD v4.1.0 when controlling for sample size. Specifically, for each gnomAD population, we randomly selected without replacement an equal number of PGR participants 10,000 times, and the mean and standard deviation of the number of distinct genes with at least one homLoF variant were calculated and compared to the gnomAD population. We similarly compared the homLoF gene discovery potential of each ethnic group relative to all other individuals in PGR. For these comparisons, samples were drawn from all PGR individuals not belonging to the ethnic group assessed. To assess whether the potential to discover new homLoF-carrying genes in PGR has saturated, we randomly subsampled PGR 25 times for sample sizes in increments of 100 from 0 samples to the current sample size of PGR (n = 173,303). The mean number of genes with at least 1, 2 or 10 homLoF variants was plotted.

Gene set homLoF and recessive constraint calculations

A set of genes determined to be essential in vitro was downloaded from table S2 of ref. 33. Mouse essential genes were downloaded from https://ftp.ebi.ac.uk/biostudies/fire/S-BSST/S-BSSTxxx293/S-BSST293/Files/SourceDataFile2_Embryo_windows.txt (ref. 34). A set of genes that are the targets of approved drugs was downloaded from https://github.com/ericminikel/genetic_support/blob/main/data/pp_launched_alltime.tsv (ref. 36). Grouping of drug targets by disease area was done by manual curation. ClinGen disease-associated genes were downloaded from https://search.clinicalgenome.org/kb/downloads on 14 June 2024 (ref. 35). ClinGen disease associations were filtered to those classified as ‘definitive’ or ‘strong’. Genes with disease annotations having autosomal dominant (AD) mode of inheritance (MOI) and no autosomal recessive (AR) MOI annotations were included in the ClinGen AD gene set. Likewise, genes with AR MOI annotations and no AD MOI annotations were included in the ClinGen AR gene set. Genes with both AR and AD MOIs were included in the ‘ClinGen Both’ set. The GSEA Hallmark gene sets were downloaded from https://www.gsea-msigdb.org/gsea/msigdb/human/genesets.jsp. Human Protein Atlas (HPA) v24.0 annotations were downloaded from https://www.proteinatlas.org/download/proteinatlas.tsv.zip, and the following RNA expression annotations were extracted: tissue specificity, tissue distribution, single cell-type specificity, and single cell-type distribution37. For each gene set, the homLoF OR was calculated as the odds of a gene in the gene set having at least one homLoF variant in PGR relative to a gene from the complement gene set. ORs and P values were calculated using methods from statsmodels stats.contingency_tables.Table2x2 (v0.13.0). The results include tests for 100 unique gene sets resulting in a Bonferroni multiple-testing P value threshold of 5 × 104. An additional analysis was performed using logistic regression with gene length as a covariate. Gene length was measured as the number of bps in the canonical transcript, excluding low complexity regions.

Two methods were used to assess exome-wide and gene set patterns of recessive constraint. We calculated depletion/enrichment of rare pLoF variants with at least one homozygous individual relative to frequency-matched synonymous variants. Specifically, all pLoF and synonymous variants were binned by allele count (AC) into the following bins: AC = 1, 2 ≤ AC ≤ 5, bins of width 5 for AC in [6, 500], bins of width 10 for AC in [501, 1,500], and bins of width 50 for AC > 1,500. Each pLoF variant was then randomly matched to a synonymous variant (drawn with replacement) in the same AC bin. When no synonymous variants were found in the same bin, a variant from the nearest bin(s) was randomly selected. We calculated the number of pLoF variants with at least one homozygote individual and measured a depletion/enrichment relative to the number of frequency-matched synonymous variants with at least one homozygote individual. This procedure was performed 100,000 times to establish a distribution of relative depletion/enrichment. Two one-sided P values for depletion and enrichment of pLoF variants were calculated as the fraction of times in which the number of frequency-matched synonymous variants with at least one homozygote was greater than or equal to the number of pLoF variants with at least one homozygote, or less than or equal to the number of pLoF variants with at least one homozygote. Conservative, two-sided P values were calculated as two times the minimum one-sided P value. The same procedure was used to define recessive constraint of other variant categories relative to synonymous variants. The primary results include tests for 100 unique gene sets from three different variant sets (high-confidence pLoF, damaging missense, all missense) resulting in a Bonferroni multiple-testing P value threshold of 1.67 × 104.

As a second test of recessive constraint, we calculated the ratio of pLoF alleles per autozygous base to pLoF alleles per non-autozygous base. Specifically, for gene \(i\) the following were calculated:

$${a}_{i}=\mathrm{autozygous}\,\mathrm{allele}\,\mathrm{count}$$

$${b}_{i}=\mathrm{nonautozygous}\,\mathrm{allele}\,\mathrm{count}$$

$${c}_{i}=\mathrm{autozygous}\,\mathrm{base}\,\mathrm{count}$$

$${d}_{i}=\mathrm{nonautozygous}\,\mathrm{base}\,\mathrm{count}$$

The AAR, Ri, and allele count, Ci, for gene i were then calculated as:

$${R}_{i}=({a}_{i}/{c}_{i})/({b}_{i}/{d}_{i})$$

$${C}_{i}={a}_{i}+{b}_{i}$$

Finally, for a gene set with \(n\) genes, the weighted mean AAR was calculated as:

$$\frac{{\sum }_{1}^{n}{R}_{i}{C}_{i}}{{\sum }_{1}^{n}{C}_{i}}$$

Permutation tests were used to identify gene sets that are depleted or enriched for pLoF alleles at autozygous positions relative to a typical gene set. Specifically, for a gene set of size n, 10 million sets of size n were constructed by sampling genes without replacement. AAR was calculated for each randomly sampled gene set, the absolute value of the difference between the sampled gene set AAR and the overall exome-wide AAR was calculated, and permutation test P values were established from this distribution. The same procedure was performed for missense variants, missense variants predicted to be damaging (REVEL40 score > 0.7), and synonymous variants. We note that a sum of ratios, even when weighting by counts, can be unstable when very small counts contribute to the individual ratios, and advise against performing these calculations on smaller datasets with very sparse allele counts. All calculations of recessive constraint were limited to autosomal chromosomes, with pLoF variants filtered as described above.

Associations with F
ROH

All analyses of quantitative traits or binary outcomes were performed with age, age-squared, gender, sequencing platform and the first ten genetic principal components as covariates. Analyses were further adjusted for self-reported years of formal education and self-reported tobacco usage (‘ever’ versus ‘never’) where specified. Cochran’s Q test was used to calculate P values assessing heterogeneity of effect size by ethnicity, sex and age bin. Sensitivity analyses restricted to highly consanguineous individuals included individuals with FROH > 0.03 and self-reporting first-cousin parents. Quantitative phenotypes were transformed using a rank-based inverse standard normal transformation. Effect sizes (β) for quantitative traits are reported in standard deviations of the normalized trait value per 0.0625 FROH (the theoretical expectation of FROH for the offspring of first-cousin parents). ORs for binary outcomes are shown per 0.0625 FROH. All analysis was performed using the statsmodels Python module (v0.13.0).

Phenotype definitions

Diabetes cases were defined as individuals self-reporting diabetes, having an HbA1c ≥ 6.5%, or using oral hypoglycaemics. Diabetes controls were ascertained as individuals with no self-reported history of diabetes, HbA1c < 6.5% when available, and random blood glucose concentrations <150 mg dl−1 when available. Individuals not reporting diabetes but having random blood glucose concentrations ≥150 mg dl−1 were excluded from the diabetes analysis. Myocardial infarction cases were enrolled at the time of event as described4. Stroke cases were also enrolled at the time of event. Angina and atrial fibrillation or irregular heartbeats were both self-reported. Hypertension cases were defined as individuals self-reporting with hypertension. Hypertension controls were defined as individuals not reporting hypertension and not having diastolic blood pressure ≥90 mmHg or systolic blood pressure ≥140 mmHg. LDL-C was calculated using the Friedwald equation85. eGFR was calculated using the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation86. The LDL-C, total cholesterol, and triglycerides analyses were subset to individuals who were not on cholesterol lowering drugs. Glucose concentrations were analysed for participants who were not on oral antidiabetic drugs. Diastolic and systolic blood pressure analyses were limited to individuals not on antihypertensives.

Burden analyses

All burden analyses were run using regenie. Batches sequenced at the Broad Institute were analysed using Regenie87 v3.1.3. Batches sequenced at the RGC were analysed using regenie v3.1.1 on the DNAnexus Apollo platform. Results from each sequencing platform were meta-analysed using inverse variance weighted meta-analysis as implemented in METAL (2011-03-25 release)88. All quantitative traits were transformed by the rank-based inverse standard normal function, applied within each sequencing batch by Regenie. Quantitative traits were analysed using both additive and recessive linear regression models. Binary traits were analysed with both additive and recessive logistic regression models with Firth fallback. All analyses were adjusted for age, sex, age × sex, age-squared and the first ten genetic principal components. Genetic principal components were calculated within each sequencing platform using PC-AiR in the same manner as described above. We performed two sets of burden analyses for each gene for a set of 46 biomarkers. Mask1 included pLoF variants. Mask2 included pLoF variants plus variants predicted to be damaging (REVEL score > 0.7). For each mask, only genes with >10 qualifying variants were considered. A Bonferroni multiple-testing correction was calculated based on the total number of unique tests across all genes and phenotypes, only counting once any gene for which the mask1 and mask2 variant sets were identical. A total of 506,644 tests were performed, giving a Bonferroni multiple test correction threshold of P < 9.9 × 10−8.

RXFP1 expression characterization

cDNA for the reference allele (‘wild-type’) RXFP1 (NM_021634.4; NP_067647) and variants of interest were synthesized and subcloned into the mammalian expression vector pcDNA3.1(+) with an N-terminal Flag tag at GeneArt (ThermoFisher). For each construct tested, 2E6 HEK293 cells were transfected using 2 μg DNA and 4 μl Lipofectamine 2000 (11668019, ThermoFisher). Cells were cultured in complete medium (DMEM + 10% fetal bovine serum) for 48 h before collection for expression profiling. Before collection, cells were washed with ice cold PBS and processed using a membrane extraction kit as per manufacturer’s protocol (78840, Thermo Fisher). Total protein was quantified using the bicinchonic acid assay (23225, Thermo Fisher) and 20 μg of total protein was separated on an 4–12% SDS-PAGE gel. Gels were transferred to PVDF membrane by overnight transfer and RXFP1 expression was detected by a rabbit anti-RXFP1 polyclonal antibody (H00059350, Abnova) and a goat anti-rabbit horseradish peroxidase secondary antibody (611-1302, Rockland Immunochemicals) via chemiluminescent detection using KPL lumi-GLO substrate (5430-0042, SeraCare). FreeStyle 293-F cells were obtained from Thermofisher (https://www.thermofisher.com/order/catalog/product/R79007?SID=srch-srp-R79007). Expi293F cells were obtained from ThermoFisher1081 (https://www.thermofisher.com/order/catalog/product/A14527?SID=srch-srp-A14527). ThermoFisher has generated information on the characterization and authentication of the cells (FreeStyle 293-F cells, https://www.thermofisher.com/order/catalog/product/R79007?SID=srch-srp-R79007; Expi23F cells, https://www.thermofisher.com/order/catalog/product/A14527?SID=srch-srp-A14527). These cells were used in this study because they are commonly used for receptor characterization. Cells were negative for mycoplasma contamination.

cAMP functional assay for RXFP1

Constructs were transfected as outlined above. After 48 h, cells were dissociated with TriplE (12604013, Thermo Fisher) and seeded into 384-well plates in serum-free FreeStyle media (12338018, Thermo Fisher) followed by 3 min centrifugation at 500g. After a 4 h incubation at 37 °C, FreeStyle medium containing 3-isobutyl-1-methylxanthine (IBMX; I5879, Sigma) was added for a final concentration of 1 mM IBMX and cells incubated at 37 °C for 30 min. Recombinant relaxin-2 (custom synthesis request, Phoenix Pharmaceuticals) was then added at different concentrations to establish a full-dose response curve and cells were incubated for 1 h at 37 °C. cAMP was then measured using a Homogeneous Time-resolved Fluorescence kit as per the manufacturer’s instructions (62AM4PEB, CisBio).

RBG studies

Probands were recontacted by the Center of Non-Communicable Diseases in Karachi Pakistan under the IRB approval of IRB committee of Center for Non-Communicable Diseases (NIH-registered IRB 00007048). After obtaining consent from the proband and from the family members, questionnaires regarding past medical and family history were administered by trained research staff, in the local language. Recruitment proceeded until all potentially recontactable individuals either consented or declined to participate. Sex was self-reported. Physical measurements such as height, weight, waist and hip circumference were taken in standing position by using height and weight scales and a spring gauge tape measure. Blood pressure and heart rate were recorded by using OMRON healthcare M2 blood pressure monitors. Echocardiograms (ECHO) were obtained using the Vivid-i cardiovascular ultrasound system (GE HealthCare). Cold pressor tests were conducted as previously described89. The participant’s non-dominant hand was immersed in a bucket of ice water (4 °C) for up to 150 s. The time at which the participant first starts to experience pain (cold pain threshold) was recorded as was the time at which the participant pulls their arm out of the water (cold tolerance). For individuals who did not report experiencing pain, their cold pain threshold was set to 150 s for statistical analysis.

Non-fasting blood samples were collected from each participant in EDTA and Gel Tubes. Serum and plasma were separated within 45 min of venipuncture. A random urine sample was also collected from each participant. The samples were stored temporarily in dry ice in the field and transported to a central laboratory based at CNCD and stored at −80 °C. Measurements for total cholesterol, HDL-C, LDL-C, triglycerides, very-low-density lipoprotein, AST, ALT and creatinine were made in serum samples using enzymatic assays, whereas concentrations of HbA1c were measured using a turbidimetric assay in whole-blood samples (Roche Diagnostics). Urinary microalbumin was also measured through the urine sample.

Isolation of genomic DNA

DNA was extracted from leukocytes of peripheral whole blood using a reference salting out method. DNA concentrations were determined by UV based quantification by using Nanodrop 2000 Spectrophotometer (Thermo Scientific).

Sanger sequencing from RBG studies

Whole-blood-derived DNA from recalled participants was used to assess carrier status and zygosity for variants of interest of the probands and family members via Sanger sequencing. Sanger sequencing was conducted at Macrogen or in-house at the CNCD. For Macrogen-processed samples, DNA samples were shipped directly to Macrogen for PCR amplification and Sanger sequencing. PCR primers were designed covering a region of approximately 200 to 300 bases around the variant. For in-house Sanger sequencing, specific primers were designed to amplify the region of interest using Platinum Master Mix (Thermo Scientific). This amplified DNA product was cleaned up using ExoSAP-IT Express PCR Product Cleanup (Thermo Scientific). The DNA was then used in BigDye Terminator v3.1 cycle sequencing following addition of BigDye XTerminator (Thermo Scientific) for cleanup and run on Applied Biosystems SeqStudio Genetic Analyzer (Thermo Scientific). Manufacturers’ protocols were followed for all kits.

RBG PheWAS

Continuous traits were analysed using linear mixed effects model with age and sex as fixed effects, and family ID as random effect. Binary traits were analysed using logistic regression with age and sex as exogenous variables. Multinomial traits were binarized prior to logistic regression analysis. Only additive linear mixed effects model was considered due to the limited sample size of individuals with homLoF. Continuous and categorical traits were computed from the ECHO recordings by the cardiac ultrasound device and used to perform PheWAS in recalled individuals. All associations were performed using the additive linear mixed effects model. Regression models were implemented using Python statsmodels v0.14.4.

Data sources

1KG variant calls and sample manifest were downloaded from https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ and https://www.internationalgenome.org/data-portal/sample, respectively. HGDP variant calls were downloaded from https://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/. A recombination map used for ROH calling was downloaded from https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz. Mouse essential genes were downloaded from https://ftp.ebi.ac.uk/biostudies/fire/S-BSST/S-BSSTxxx293/S-BSST293/Files/SourceDataFile2_Embryo_windows.txt. A set of genes that are the targets of approved drugs was downloaded from https://github.com/ericminikel/genetic_support/blob/main/data/pp_launched_alltime.tsv. ClinGen disease-associated genes were downloaded from https://search.clinicalgenome.org/kb/downloads on 14 June 2024. The GSEA Hallmark gene sets were downloaded from https://www.gsea-msigdb.org/gsea/msigdb/human/genesets.jsp. Human Protein Atlas (HPA) v24.0 annotations were downloaded from https://www.proteinatlas.org/download/proteinatlas.tsv.zip.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

RELATED ARTICLES

Most Popular

Recent Comments