Thursday, February 13, 2025
No menu items!
HomeNatureGenetic architecture in Greenland is shaped by demography, structure and selection

Genetic architecture in Greenland is shaped by demography, structure and selection

Inclusion and ethics statement

The study was approved by the Scientific Ethics Committee in Greenland (project 505-42, 505-95, project 2011–13 (ref. no. 2011–056978), project 2014-08 (ref. no. 2014-098017), project 2017-5582, project 2015–22 (ref. no. 2015-16426) and project 2021-09) and was conducted in accordance with the Declaration of Helsinki, second revision. All participants gave written informed consent.

User studies were conducted as public meetings, and in shared circles in relation to the population studies in Greenland and in relation to genotype-based intervention studies in CSID and among carriers of the TBC1D4 variant45,46. In brief, the conclusions from these user studies indicate that research on chronic diseases is considered important and necessary. This includes understanding how heredity and lifestyle interact. There is a demand for research on the significance of traditional lifestyle for health, including the importance of traditional practices in treatment. Additionally, it is crucial to explore areas where no treatment currently exists. Research serves as a prerequisite for discovering new methods of prevention and treatment, and explanations can help reduce anxiety associated with diseases and symptoms. Finally, there is broad support for investigating hereditary causes of health and disease. Users express that identifying known genetic causes of illness can enable early prevention and treatment, especially for family members. Lack of representation is considered a significant issue.

Participants and representative sampling

We included Greenlanders living in Greenland from the population health surveys B99 (ref. 47) (recruited 1999–2001; n = 1,317), Inuit Health in Transition48 (IHIT; recruited 2005–2010; n = 3,108), and B2018 (refs. 49,50) (recruited 2017–2019; n = 2,539). We also included a cohort of Greenlanders living in Denmark, collected as part of the B99 survey (BBH; n = 739). Some participants were part of several surveys, resulting in a total of 5,996 unique participants across the different cohorts. As previously described47,48,49, participants were recruited in representative towns and settlements of each region in Greenland where random samples were drawn from the central population registers.

Genetic data

The WGS data were handled as described in ref. 14, where it was used for screening of new variants in 14 genes related to maturity-onset diabetes in the young. In brief, the 448 WGS samples were sequenced using Illumina 150 paired-end sequencing and had an average sequencing depth of 35×. After adaptor trimming, reads were mapped with BWA-MEM to GRCh38, and genotype calling was carried out with GATK haplotype caller. Only variants in the T98 tranche from variant quality score recalibration were used. Moreover, we used genotype data on an additional 5,548 people generated from the Multi-Ethnic Global Array (MEGA-chip, Illumina); of these, 4,182 were described previously in ref. 20 (deposited with European Genome-phenome Archive, accession no. EGAD00010002057) and the remaining 1,366 were part of the newer B2018 study. The two datasets were merged and, after variant quality control, the data consisted of 1.6 million autosomal variants.

Phasing and imputation

To prepare a good reference panel for the downstream imputation analyses, we phased the WGS data using Shapeit2 (ref. 51) with trio and duo information. We call this the Greenlandic WGS panel and it includes all sites found to be polymorphic in the Greenlandic WGS data. Moreover, we prepared another reference panel consisting of all participants in the 1KG Project, which we called the 1KG panel. For this panel we included all overlapping sites with the Greenlandic panel and all sites with a MAF > 1% in the populations with East Asian and European ancestry (CDX, CEU, CHB, CHS, GBR, TSI and IBS52). We then imputed the MEGA-chip data with IMPUTE2 (refs. 51,53) with option ‘-merge_ref_panels’, which can leverage both prepared reference panels to improve imputation performance. Finally, we created two datasets with and without phasing information by merging the imputed (and phased) MEGA data with the phased WGS data, which resulted in 5,996 participants. For downstream analyses, we did quality control on variants with respect to genotype missingness and allele frequency, as noted in each analysis. Additionally, to investigate the improvement of imputation accuracy using the Greenlandic WGS panel, we also ran IMPUTE v.2 with only the 1KG reference panel for comparison and showed only imputation performance on the overlapping sites.

Variant annotation

All variants were annotated using VEP54 with the following additional custom annotations: dbSNP (build 155), gnomAD v.3.0.0 allele frequencies and genome coverage (gnomADover15: proportion of gnomAD participants with depth > 15, gnomADover50: proportion of gnomAD participants with depth > 50, gnomAD filter: the gnomAD filter column being either PASS, AC0, AS_VQSR, InbreedingCoeff or NA if not polymorphic in gnomAD), 1KG frequencies (v.20201028_3202_phased)52, ancestral allele from the Ensembl Homo sapiens ancestral sequence, and Loss-Of-Function Transcript Effect Estimator55. We did not use GnomAD v.4 because it did not have information about depth and filters for non-called sites.

To find new variants, we kept only variants with good coverage in gnomAD (gnomADover15 > 80%, gnomADover50 < 10% and gnomAD filter = PASS or NA). Moreover, we did not allow for spanning deletions (variant call format (VCF) asterisk allele), variants with missingness greater than 20% or several variants 1 bp apart in the Greenlandic WGS data. All these filters were used to minimize the number of false-positive variants. For plotting, variants were polarized according to the minor allele in gnomAD African populations.

For comparing the number of SNPs not found in other populations we used only SNPs with good coverage in gnomAD (as defined above) and found the maximum allele frequency in any of the following gnomAD populations: AFR, NFE, FIN, SAS, EAS and ASJ, covering African, European and Asian populations. SNPs with a maximum gnomAD frequency more than 0.01%, were excluded and the remaining SNPs were counted as SNPs not in Africa, Europe or Asia for both the 448 admixed Greenlandic WGS samples and a random sample of 448 samples from the Americas in the 1KG PEL, CLM, MXL and PUR populations.

Measurements and assays

Height, weight, systolic blood pressure, diastolic blood pressure, and hip and waist circumference were measured, and body mass index and waist-hip ratio were calculated. All IHIT participants above 18 years, B99 participants above 35 years and a subset of B2018 participants underwent an oral glucose tolerance test, where blood samples were drawn after an overnight fast of at least 8 h, and at 30 min (only for B2018) and 2 h after receiving 75 g glucose. Plasma glucose was measured at fasting, 30 min and 2 h, and haemoglobin 1Ac at fasting, as previously described56. Concentrations of serum cholesterol, high-density lipoprotein cholesterol and triglycerides were measured, and LDL-cholesterol calculated. Type 2 diabetes was defined based on the World Health Organization 1999 criteria57 and controls were defined as normal glucose tolerant based on the oral glucose tolerance test data.

The Olink protein data for the Greenlandic participants used for protein quantitative trait loci analysis are from ref. 31. Using the Olink Target 96 Inflammation and Cardiovascular II panels, relative plasma levels of 184 proteins were measured in 3,732 participants across the population surveys. The 2 batches were bridged and normalized based on 16 control samples using the OlinkAnalyze R package (https://cran.r-project.org/web/packages/OlinkAnalyze/index.html). Normalized protein expression values on a log2 scale were inverse-rank normalized, including normalized protein expression data below the limit of detection. Samples with a quality control warning were excluded.

The Danish data used in assessment of polygenic scores consist of 6,182 people from the Danish population-based Inter99 cohort, where metabolic phenotypes were measured as described previously58, and genotyping was done using the Infinium OmniExpress-24 v.1.3 Chip (Illumina Inc.). Genotypes were called using GenCall in GenomeStudio (v.2011.1; Illumina) and quality control was performed according to a standard procedure as previously described59. Data were imputed using the Haplotype Reference Consortium reference panel on the Michigan imputation server.

Admixture, haplotype masking and fine structure

Inuit and European admixture proportions were calculated using the software ADMIXTURE60 on a subset of variants with MAF > 5%, missingness less than 1% and LD-pruned within 1 Mb removing variants with R2 > 0.8 using Plink v.1.9.0 (ref. 61).

For fine structure analysis of the Inuit ancestry, we used the neural network framework, HaploNet33, on the phased and imputed data of all 5,996 Greenlandic participants on SNPs with MAF > 5%. First, we performed window-based haplotype clustering using a Gaussian mixture variational autoencoder. A window size of 1,024 SNPs was used to generate haplotype cluster likelihoods for all samples, which we leveraged to infer fine-scale population structure through both ancestry estimation and principal component analysis. We performed unsupervised ancestry estimation allowing for two ancestral sources (K = 2) and ran it with several seeds to ensure that the expectation maximization algorithm of HaploNet had converged. The convergence criterion was defined as having two runs within five log-likelihood units of the best seed. The two ancestry sources were assumed to reflect Inuit and European ancestry.

Next, we used HaploNet Fatash to infer the local ancestry of the haplotypes. Fatash obtains posterior probabilities of the local ancestry per genomic window per haplotype based on haplotype likelihoods and genome-wide admixture proportion estimates obtained from HaploNet train and HaploNet admix. The model is based on a hidden Markov model with instantaneous rate change. Fatash is implemented in Python/Cython as a submodule in the HaploNet software suite. Local ancestry tracts were inferred using three different window sizes (1,024; 512 and 256 variants) to increase accuracy proximate to recombination events. We used only the smaller windows (512 or 256 variants) if the fit was more than 50 log-likelihood units better to balance overfitting and detection of true recombination events.

Finally, we masked the haplotype clusters in genomic windows for each haplotype with posterior probability less than 0.95 for Inuit ancestry based on the local ancestry inferred by Fatash. After genomic masking, we excluded people with less than 0.90 missingness and genomic windows with haplotype frequencies less than 0.01 after masking from downstream analyses. The masked dataset was used for fine structure analyses of the Inuit ancestry using HaploNet PCA and HaploNet Admix. For HaploNet Admix, we used the same convergence criterion as described above and modelled from two to eight ancestry sources. The admixture model with nine ancestry sources did not fulfil our convergence criteria.

Local ancestry and homozygous ancestry masking of WGS data

From the imputed and phased data, we first excluded variants with missingness less than 1% and MAF < 5%. Then, we inferred local ancestry for the admixed participants using RFmix v.2 with unadmixed Greenlandic Inuit (N = 99) and participants with European genetic ancestry (N = 313, CEU, IBS and TSI) from the 1KG project as reference populations. This gives us inference of local ancestry tracts for each person. To analyse the genetic architecture of the Greenlandic Inuit, we constructed a masked WGS dataset, using vcfppR62, where regions with any European local ancestry were excluded resulting in an average of 240 (minimum, 201; maximum, 285) Greenlandic Inuit participants at each site. To create the masked dataset, for each person we only keep sites in a local ancestry region inferred to be of Inuit ancestry on both alleles. In this way we can also assess rare variants without relying on correct phasing. We manually excluded the HLA region (Chr. 9: 28510120–33480577), local ancestry tracts with fewer than 200 variants per megabase and local ancestry tracts with extreme inferred mean-Inuit ancestry (less than 63% or more than 71%) as they are all potentially problematic regions in terms of inferring local ancestry accurately.

Site frequency spectra

Reference and alternative allele counts were counted using Plink v.1.9.0 (ref. 61) keeping allele order and projected to the wanted number of participants using the formula binom(m,j) × binom(n − m, k − j)/binom(n,k), where k is the observed number of alternative alleles, n is the number of total alleles, m is the number of alleles to project to (that is, two times the number of participants) and j is the site frequency spectra (SFS)-bin. For each site we get the probability that we would observe j alternative alleles in a subsample of m alleles. The probabilities were summed across variants and folded to get the folded SFS.

The derived sum was calculated based on a SFS polarized by ancestral or derived allele only using SNPs with a high-confidence ancestral allele match. Derived alleles were counted and projected to the needed number of participants as above SFS, but not folded.

To measure the number of segregating SNPs as a function of the number of participants sequenced, we projected the SFS to the wanted number of participants, folded the SFS and summed across all the non-zero SFS-bins. In this way, we get the number of segregating SNPs for all possible subsamples of participants from our data.

Clinical screening

Only variants with good coverage in gnomAD v.3.0.0 as described above and predicted to be LoF on the canonical transcript were used. For each participant, variants with a MAF in any gnomAD population above 0.1% were excluded. For Greenlandic people, an additional count was made where the variants could be excluded based on either MAF from any gnomAD population or MAF from the Greenlanders excluding itself. Figure 2b shows the results for different MAF thresholds. Numbers of people in genetic ancestry groups were as follows: Greenlandic, 448; Greenlandic (unadmixed), 31; Nigerian (YRI + ESN), 207; Han Chinese (CHB + CHS), 208 and British (+CEU) (GBR + CEU), 190.

Gene burden

A total of 190 participants with the least European ancestry were sampled from the Greenlandic WGS data (on average 12% and at most 20% European ancestry) and 190 participants were sampled randomly from the East Asian populations, CHB and CHS. The European GBR + CEU group had a total of 190 unrelated samples. All variants were annotated with an allele frequency based on the African populations from 1KG without European admixture (LWK, ESN, YRI, MSL and GWD)52. Only SNPs with good coverage in gnomAD as defined above, predicted to be missense or LoF variants in the canonical transcript of a constrained gene, and with an African MAF < 0.01% were kept. Constrained genes were defined to be genes with expected number of pLoF and LOEUF score estimated by Karczewski et al.55 being less than 10 and 1, respectively. This resulted in 9,533 constrained genes from canonical transcripts. In each gene, two burdens were calculated per person: (1) the gene burden, which is 0 if no SNPs were present and 1 if at least one SNP was present and (2) the most common burden, which is 1 if the person carries the most common variant in the gene and 0 otherwise. If the most common burden was less than 50% of the gene burden, we categorize the gene burden as informative and if the most common burden was at least 90% of the gene burden, we categorize the gene as one variant dominates. All calculations were done independently in each of the three populations.

LD decay and tag-SNPs

We randomly sampled 190 unadmixed (inferred European ancestry less than 1%) and unrelated Greenlanders. From these, we calculated pairwise LD using Plink v.1.9.0 in a 10-Mb region. The average number of tag-SNPs was calculated by counting the number variants that were in high LD (R2 > 0.8) in regions of varying sizes.

Genome-wide association studies

Association tests were run using a linear mixed model, with the estimated genetic relationship matrix (GRM) as a random effect, taking population structure and relatedness into account, using GEMMA v.0.98.5 (ref. 63). The GRM was estimated from a set of ‘good’ variants with MAF > 5%, missing less than 1%, INFO > 0.95, and in Hardy–Weinberg equilibrium taking population structure into account (PCAngsd, excluding sites with |SiteF| less than 0.05 and P < 1 × 10−6). We used a leave one chromosome out scheme where the GRM used for testing associations on a given chromosome was estimated using all the other chromosomes. All phenotypes were tested using a score-test after sex-stratified rank-based inverse normal transformation and including age, sex and cohort as covariates. As in previous studies14, the odds ratio for diabetes was calculated using the logistic mixed model implemented in GMMAT64. For each phenotype, the GRMs were estimated only for people with no missing phenotype or covariate information. To identify independent association signals in the Greenlandic cohorts, we first calculated the LD-adjust65 correlation, R2 between pairwise SNPs accounting for the population structure. Then, we performed LD clumping on the above association signals using PCAone v.0.4.4 (ref. 66) adjusted for five principal components with the options ‘–clump-p1 1e-6 –clump-p2 1e-6 –clump-r2 0.001 –clump-bp 10000000 –pc 5’.

European independent genome-wide significant (P < 5 × 10−8) signals within 10 Mb were extracted from summary statistics for the 13 metabolic traits using the extract_instruments function with default parameters from the R package TwoSampleMR v.0.5.10 (ref. 67) (Supplementary Table 10). The independent signals of type 2 diabetes were extracted from the curated list of Mahajan et al.68.

For the sample-size-matched GWAS with UK Biobank, we randomly sampled 5,996 people with no missingness on the phenotypes. For associations on protein abundances, we randomly sampled 3,707 of the 5,996, matching the number of people in the Greenlandic cohorts with protein abundances. The GRMs on the UK Biobank were estimated on variants with MAF > 5% and missing less than 1%, resulting in a total of 4.5 million variants. For all protein abundance associations, we extracted the variant with the lowest P value, removed all variants within ±2.5 Mb of that variant and repeated until no variants were left with a P value < 5 × 10−8. This was done independently for the additive and recessive model, but assigned to the model with the lowest P value. To avoid duplicated signals of strong associations that were found in both models, we excluded associations within ±1 Mb of an association with a lower P value.

For continuous traits under the additive model, the variance explained was calculated as PVE = β2add 2AF(1 − AF), where βadd is the additive effect size and AF is the allele frequency of the effect allele69. For the recessive model variance explained was calculated as PVE = β2rec Fhom(1 − Fhom), where βrec is the recessive effect size and Fhom is the expected homozygous frequency Fhom = AF2. This method was used for plotting as we could calculate variance explained CIs using the same formulas but with the lower and upper CI for the effect size. For choosing the variants with more than 1% variance explained we used the formula PVE = β2/(β2 + SE(β)2 × N), where SE(β) is the standard error of β and N is the number of participants70. For binary traits we calculated the liability-scale variance explained using the R package Mangrove (v.1.21) as previously described14.

Polygenic scores

PGS files from Weissbrod et al.71 were downloaded from the PGS catalogue (Supplementary Table 11). All variants were lifted over from Hg19 to Hg38 using the UCSC Liftover command line interface while keeping track of strand-flipping. For both the Greenlandic, Danish and UK Biobank (non-British Europeans) datasets, overlapping variants were identified by matching on chromosome, position and effect, and other allele on both alternative and reference allele. The PGS was calculated on the genotype dosages using the score function in Plink2 v.2.0.0 (ref. 61). Each phenotype was rank-based inverse normal transformed separately for males and females. Next, two linear models were fitted: (1) Null model, phenotype is approximately age × sex + PC1–10 and (2) PGS model, phenotype is approximately age × sex + PC1–10 + PGS. Then, we calculated the incr. R2 = adjusted R2 (PGS model) − Aajusted R2 (Null model). For the Greenlandic cohorts, an additional covariate for cohort was included and for the UK Biobank an additional covariate for assessment centre was included. Body mass index was included as a covariate for waist:hip ratio. The CIs for the incr. R2 were calculated using the Olkin and Finn’s approximation for s.e. To summarize across all phenotypes, we calculated the relative incr. R2 using the UK Biobank as baseline. The increased performance on incr. R2 on LDL-cholesterol for the Danish cohort was probably due to differences in age distributions (age range, UK Biobank, 46–80 years; Danish, 30–60 years). The population-based Greenlandic cohort design followed the population-based design of the Danish Inter99. The UK Biobank had a completely different cohort design and is not completely compatible; for example, mean age in UK Biobank is 64 years, whereas mean age in the Danish and Greenlandic cohort is 46 and 45 years, respectively.

Relatedness analysis and relation graph

As described previously72, we estimated relatedness using a filtered set of genetic variants with MAF > 5%, missingness < 5% and LD-pruned (Plink v.1.9.0 indep-pairwise 1,000 kb 1 0.8) along with the inferred admixture proportions as input to NGSremix73. For each pair, NGSremix calculates pairwise relatedness as the fraction of loci sharing zero, one or two alleles identical by descent (represented by k0, k1 and k2, respectively). Parent–offspring pairs were defined as relationships with k1 + k2 > 0.95 and k1 > 0.75 and the parent was inferred using the age of the participants. Full sibling pairs were defined as relationships with 0.3 < k1 < 0.7 and k2 > 0.1. Out of 5,828 people with age and location, we identified 1,727 parent–offspring and 1,841 full sibling relationships. Relationships were normalized to the number of possible pairs. For relationships between regions, for example, region 1 and 2, the number of possible pairs was calculated as npossible(1,2) = nregion1 × nregion2, where nregion1 and nregion2 is the number of participants in region 1 and 2, respectively. Within region, the number of possible pairs of parent–offspring relationships was calculated as npossible(1,1) = nregion1 × (nregion1 − 1). Within region, the number of possible pairs of full sibling relationships was calculated as npossible(1,1) = (nregion1 × (nregion1 − 1))/2.

Expected homozygous frequency

To calculate the expected frequency of homozygous carriers with the current fine structure, fhom(structure), we estimated the regional allele frequencies, AFregion, based on sample location, calculated the expected number of homozygous carriers in each region as nhom(region) = nregion × AF2region, calculated the total sum of homozygous carriers, nhom = nhom(region1) + nhom(region2) + ··· + nhom(region8) and divided with the total number of participants fhom(structure) = nhom/ntotal. The expected frequency of homozygous carriers as a panmictic population, fhom(panmictic), was estimated as fhom(panmictic) = AF2. CIs of the homozygous frequencies were estimated as the s.d. of the frequency estimates from 10,000 bootstrap samples.

Estimation of allele frequencies in Greenland and in other datasets

We estimated frequencies for different variants in the Greenlandic Inuit population using the masked participants from this study. Moreover, we surveyed the allele frequencies for those variants in a range of available datasets12,17,26,29,30,74,75,76,77,78,79,80,81 from across the world (Supplementary Tables 7 and 8). We used SAMtools82 and BGT83 to extract the allele counts of the relevant variants from those datasets.

Ancestral recombination graph

The ARG was estimated in two versions: (1) using all 448 WGS Greenlanders (used for analysis of FADS2 and CPT1A) and (2) dividing the genome into 6-Mb chunks, finding people with only Inuit ancestry in that chunk and randomly subsampling 150 of those people, resulting in a Greenlandic Inuit masked ARG from 150 mosaic participants (used for all other analyses). For the masked version, only chunks with more than one variant per 5,000 bp were used. Both ARGs were estimated using the default Relate v.1.2.1 pipeline; for each chromosome (or chunk, in the masked version), convert phased VCF to haps/sample file format, prepare input using the provided PrepareInputFiles script, with the high-confidence Ensembl H.sapiens ancestral sequence as ancestral reference and estimate the ARG for the chromosome/chunk using the provided RelateParallel.sh script with Effective population size of haplotypes set to 2,000 and mutation rate per generation to 2.5 × 10−8. The overall population size was then estimated using the provided EstimatePopulationSize script on all chromosomes/chunks, resulting in the estimated population size history.

To estimate variant age, we made 10,000 branch length samples of the local tree from the estimated ARG using the provided SampleBranchLengths script from Relate. Note that Relate does not allow for changes in the tree topology when resampling. From each sample, we extracted the time to the most recent common ancestor for the variant as well as the preceding ancestor. This yields a minimum and maximum age of the variant measured in generations for each branch length sample. From the intervals, we estimated the age of the variants and the credibility interval. We calculated a probability density as the weighted average of the intervals where the weight is the reciprocal of each interval length. By doing so we assumed that the age is equally likely to lie anywhere within each interval and we gave equal weight to each of the 10,000 sampled branch lengths. The variant estimate was estimated to be the median of the probability density and the 95% credible interval was calculated as 2.5% and 97.5% quantiles. The age in generations was converted to years by multiplying with the assumed generation time of 28 years per generation.

Selection

As with variant age estimates above, we tested for selection using branch length samples of the local tree from the estimated ARG. These branch length samples were used as input to CLUES to infer allele frequency trajectories and test for selection44. To obtain empirical P values, we tested for selection for 999 additional variants matched to have within ±10% derived allele frequency of each variant. The empirical P value was then calculated as log-likelihood rank divided by the total number of variants tested. Random variants were not sampled within ±5 Mb of any of the tested variants.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

RELATED ARTICLES

Most Popular

Recent Comments