EBV strain interacts with host HLA to drive nasopharyngeal carcinoma risk

Study participants

Participants from southern China were enrolled through two independent recruitments (sample sets 1 and 2). Participants in sample set 1 were participants in a population-based case–control study conducted in NPC-endemic regions of southern China (Guangdong and Guangxi provinces) between 2010 and 2014. The study design was previously described in detail⁵⁹. In brief, 2,554 treatment-naive patients histologically confirmed with NPC were identified through a rapid case ascertainment system involving a network of local physicians. For the population-based control recruitment, 2,648 healthy control individuals were frequency-matched to cases by sex, 5-year age group and residential area, and were randomly selected from local population registries. Saliva DNA samples were available for 1,202 cases and 1,780 controls. In sample set 1, 747 patients with NPC and 1,251 healthy controls with age, sex, and available EBV genotype and host genome-wide genotyping were included in the discovery phase of the host genome-wide scan (Extended Data Fig. 1, Supplementary Table 1 and Supplementary Fig. 2). In sample set 2, an independent NPC case–control study was recruited from the same Chinese population in southern China. This study consisted of 883 cases and 1,537 controls of self-reported Chinese ancestry recruited between 2013 and 2022. For sample set 2, 644 patients with NPC and 880 healthy participants with age, sex, and available EBV genotype and host genome-wide genotyping were included in the validation phase of host genome-wide interaction study (Extended Data Fig. 1, Supplementary Table 1 and Supplementary Fig. 2). All study participants were recruited irrespective of EBV strain status, and no selection was performed on the basis of infection with any specific EBV subtypes.

The Singapore dataset was completely independently recruited in Singapore, comprising 226 cases with NPC and 209 controls. Among these, 223 cases with NPC and 204 healthy controls with available EBV genotype and HLA allele data were included in the replication of EBV subtype–host interaction analysis (Extended Data Fig. 1 and Supplementary Fig. 4).

Furthermore, to capture EBV genomic diversity across China for viral population genetic and phylogenetic analysis, we enrolled 163 healthy participants from different regions of China, including northern China (Shandong, Shanxi, Inner Mongolia, Xinjiang, Hebei and Liaoning), eastern China (Fujian) and southwestern China (Sichuan). Among them, 113 EBV whole-genome sequences passed quality control (see details below) and were included in subsequent EBV phylogenetic analyses (Extended Data Fig. 1, Supplementary Table 16 and Supplementary Fig. 7).

This study was approved by the institutional ethics committees of the Sun Yat-sen University Cancer Center, Guangzhou, China and the Genome Institute of Singapore, A*STAR, Singapore. Written informed consent was obtained from all participants.

Sample collection and processing

Saliva samples were collected into vials containing an equal volume of prepared lysis buffer (50 mM Tris, pH 8.0, 50 mM EDTA, 50 mM sucrose, 100 mM NaCl and 1% SDS) and stored at −80 °C. DNA was extracted from saliva using either the Chemagic STAR (Hamilton Robotics) or the QIAamp DNA Blood Midi Kit (Qiagen) according to the manufacturer’s instructions. The extracted DNA was subsequently used for human and EBV genotyping, as well as EBV whole-genome sequencing.

Peripheral blood samples were collected and processed within 24 h. Human peripheral blood mononuclear cells (PBMCs) were isolated by Ficoll density gradient centrifugation, cryopreserved in liquid nitrogen and subsequently used for T cell functional assays.

Data acquisition and processing

The majority of the paired data used for the host–EBV interaction analyses were newly generated in this study (Extended Data Fig. 1). For host genome genotyping, all data used in the genome-wide interaction scan (n = 3,522) were generated for this study, comprising case–control datasets for the host genome-wide interaction analysis with high-risk EBV status (step 1: 747 cases and 1,251 controls in discovery; and 644 cases and 880 controls in validation). For EBV whole-genome sequences, this study in total included 732 newly sequenced EBV genomes, together with 1,354 EBV genomes from NCBI databases. The EBV genome-wide interaction fine-mapping was conducted in the paired host–EBV genome dataset (n = 734), within which 154 out of 734 EBV genomes were previously deposited in the NCBI in our earlier work^6,44.

Genotyping and quality control of human genetic variants

Extracted DNA was used for human genotyping using the Asian Screening Array (ASA) Chip (Illumina) or OmniZhongHua-8 Chip (Illumina). The genotype data from participants with complete demographic information and EBV genotyping information (BALF2-CCT SNPs 162215A>C, 162476T>C and 163364C>T) were subjected to sample and genotype quality control, as detailed below (see workflow and detailed quality control metrics in Supplementary Table 3 and Supplementary Fig. 2). Genotype quality control was performed using PLINK (v1.9)⁶⁰. Samples were excluded based on the following criteria: (1) overall variant call rate ≤ 92%, (2) heterozygosity rate deviating beyond ± 3 standard deviations from the mean, or (3) relatedness > 20%. Variants were removed if they demonstrated: (1) call rate ≤ 95%, (2) minor allele frequency (MAF) ≤ 5%, or (3) significant deviation from Hardy–Weinberg equilibrium with thresholds of P < 1 × 10⁻⁶ in controls and P < 1 × 10⁻¹⁰ in cases. To detect population outliers and stratification, a MDS approach was employed using either the samples from this study alone or in combination with reference samples from the 1000 Genomes Project (http://www.1000genomes.org/)⁶¹, as illustrated in Supplementary Fig. 3. Following quality control, 1,998 participants from sample set 1, and 1,524 participants from sample set 2 were used for further analysis (Extended Data Fig. 1, Supplementary Table 3 and Supplementary Fig. 2).

HLA imputation and HLA typing

Following quality control of the human genotype data, variants within the HLA region (chromosome 6: 25000000–35000000, GRCh37/hg19) were extracted for imputation. HLA region imputation was conducted using SNP2HLA²³ with the Pan-Asian reference panel^24,25. Post-imputation filtering retained variants with INFO > 0.5 (INFO is an imputation quality metric indicating the reliability of imputed genotypes) and MAF > 5% (Supplementary Table 4 and Supplementary Fig. 2). HLA fine-mapping results were visualized using the code proposed by Raychaudhuri et al.⁶².

To validate the accuracy of HLA imputation derived from our human genotype data, we performed high-resolution HLA typing on a randomly selected subset of 732 successfully imputed samples from sample set 2. In brief, genomic DNA was sheared to approximately 150–200 bp, followed by end-repair and adaptor ligation to construct sequencing libraries. Targeted enrichment of HLA loci was performed using the commercially available TargetSeq Human HLA Panel (iGeneTech). The enriched libraries were sequenced on the Illumina NovaSeq 6000 platform with 150-bp paired-end reads. Sequence data were processed using HLA-HD (v1.2.0)⁶³ for high-resolution HLA typing. Comparative analysis against HLA typing data in 732 samples demonstrated allele-level concordance rates of 92.9% at HLA-A, 88.5% at HLA-B, 95.9% at HLA-C, 88.9% at HLA-DQB1, 87.5% at HLA-DPB1 and 83.8% at HLA-DRB1 loci for the imputation approach.

EBV whole-genome sequencing, variant calling and principal component analysis

We first quantified EBV DNA in each sample by real-time PCR, targeting a fragment of the BALF5 gene (5′ primer: GGTCACAATCTCCACGCTGA; 3′ primer: CAACGAGGCTGACCTGATCC). Samples with an EBV cycle threshold (Ct) value below 31 were selected for subsequent EBV whole-genome sequencing (WGS), as EBV DNA with Ct values above 31 was insufficient for successful WGS (Supplementary Figs. 4 and 7).

Saliva DNA was prepared following the NadPrep EZ DNA Library Preparation Protocol (v2.2). In brief, DNA was fragmented, purified, end blunted, adaptor ligated and then amplified. The DNA libraries were subjected to hybrid capture using the EBV-targeting single-stranded DNA probes developed by Integrated DNA Technologies. After capture enrichment using the Nanodigmbio Hybridization Capture of DNA Libraries Protocol (v1.0), libraries were sequenced (paired-end 150 bp) using the Illumina NextSeq 500 platform.

Raw reads were trimmed and filtered using fastp⁶⁴. The processed paired-end reads were aligned to the EBV B95-8 reference genome (RefSeq accession no. NC_007605.1) using the Burrows–Wheeler Aligner (v0.7.17)⁶⁵. Duplicated reads were removed by Picard (v2.18.14)⁶⁶. Samples with insufficient sequencing coverage (less than 90% coverage at 30× depth) or multiple EBV infections were excluded. Multiple infections were defined as samples with more than 10% biallelic or multi-allelic variants. In addition, VerifyBamID (v1.1.3)⁶⁷ was used to detect mixed infections, using a cut-off (Chipmix > 0.01681) determined from the average Chipmix values for simulated 1.5% mixed infections. A total of 734 samples (324 patients with NPC and 410 healthy donors) with available host genome information in sample sets 1 and 2 passed the EBV genome quality control criteria and were included in genome-wide EBV–HLA interaction analyses.

Following the GATK best practice workflows (v4.1.8.1), variants were identified after base quality score recalibration⁶⁸. To avoid inaccurate calling, we removed low-quality variants using GATK recommended hard-filtering thresholds (threshold for SNPs: ‘QD < 2.0’, ‘QUAL < 30.0’, ‘SOR > 3.0’, ‘FS > 60.0’, ‘MQ < 40.0’, ‘MQRankSum < −12.5’ and ‘ReadPosRankSum < −8.0’; threshold for indels: ‘QD < 2.0’, ‘QUAL < 30.0’, ‘FS > 200.0’ and ‘ReadPosRankSum < −20.0’). After filtering for missingness < 10% and MAF > 10%, 1,942 variants were retained for further analysis (Supplementary Table 9 and Supplementary Fig. 4). Moreover, in the genome-wide EBV–HLA interaction analysis, stratification was performed based on disease status and the investigated HLA-A allele, and only EBV variants with MAF > 5% in each stratum were included in the analysis. The functional annotation of the EBV variants was performed using the SNPEff package (v5.0e) according to the reference genome (NC_007605.1, NCBI annotation, November 2013)⁶⁹.

In the genome-wide EBV–HLA interaction analysis, EBV principal components were incorporated as covariates to adjust for viral population stratification. For principal component analysis (PCA), variants were filtered based on MAF > 10% and linkage disequilibrium pruning with a pairwise correlation r² > 0.5 within a 1,000-SNP sliding window with a 5-SNP step size. PCA was performed using PLINK (v1.9)⁶⁰.

Genotyping of EBV variants by MassArray iPLEX

EBV genotypes, including the BALF2-CCT subtype SNPs 162215A>C, 162476T>C and 163364C>T, were determined using customized primers and the Agena Bioscience MassArray iPLEX platform, following the manufacturer’s protocol, for sample sets 1 and 2 (refs. ^6,70). Using Agena Bioscience MassArray iPLEX platform, 37 EBV markers were genotyped for viral lineage inference, including BALF2-CCT SNPs 162215A>C, 162476T>C and 163364C>T, EBER2 SNP 7048A>C and EBNA3B SNP 84414G>T, in 406 cases and 597 controls from sample set 1 (see Supplementary Table 19). Imputation of SNP 85841 was performed using Beagle (v5.4)⁷¹, with a reference panel of 667 southern China EBV genomes (734 total, excluding 67 used for imputation accuracy evaluation). Among 67 samples with both EBV genotyping and WGS data, the imputation accuracy for SNP 85841 reached 94.03%.

Stepwise human–EBV genome-wide interaction analysis

The near-universal prevalence and large genome size of EBV pose major challenges for genome-to-genome interaction analyses, as adequately powered case–control studies would require prohibitively large samples with paired host–viral genomes. To address these limitations, we applied a stepwise analytical framework. In step 1, we scanned the human genome to identify variants showing interaction signals with the high-risk EBV subtype. In step 2, we fine-mapped the EBV genome to pinpoint viral variants driving the observed human–EBV interactions.

We adopted an approach for genome-wide interaction analysis recently developed by Zhu et al.²¹. This method uses two logistic regression models: one to estimate the overall genetic effects, and the other to estimate the interaction-independent genetic effects. By testing the difference between the overall genetic effect and the genetic effect excluding interaction, it allows robust detection of genome × EBV interaction effects. This stepwise genome-wide interaction analysis is detailed as follows.

Step 1: host genome-wide scan for interaction with the high-risk EBV

In this step, we tested whether the high-risk EBV subtype interacts with host genetic variants (additive coding) to influence NPC susceptibility. To evaluate this interaction, we calculated the difference between the overall and interaction-independent genetic effects, which is theoretically equivalent to testing the combined effect of genome × EBV interaction and mediation²¹. To disentangle interaction from mediation, we evaluated the association between genome and EBV and found no evidence of genetic mediation at a genome-wide significance threshold of 5 × 10⁻⁷ (Supplementary Fig. 10). Thus, statistically significant loci identified in this analysis were interpreted as showing interaction effects.

In a human genome-wide scan for the interaction effects with high-risk EBV subtype, a GWAS on NPC was conducted to examine the genetic (G) contribution to NPC (Y) using a logistic regression model, adjusting for covariates (C) such as sex, age and the first four host MDS components to account for population structure:

$$\rml\rmo\rmg\rmi\rmt\P(Y=1=\alpha _+\alpha _1G+\alpha _2^\prime C$$

(1)

Here, $\alpha _1$ represents the overall effect of G. Then, a genome-wide G × EBV interaction analysis on NPC risk was modelled through the following logistic regression:

$$\textlogit\G,\textEBV,C)\=\beta _+\beta _1G+\beta _2\textEBV+\beta _3G\times \textEBV+\beta _4^\prime C$$

(2)

In this model, $\beta _1$ and $\beta _3$ correspond to the interaction-independent effect of G to differentiate from the overall genetic effect $\alpha _1$ in equation (1) and the interaction effect of G × EBV, respectively. The terms G, EBV and G × EBV represent the host genotype, the infection of EBV subtype and their interaction, respectively.

Then, we tested the G × EBV interaction by the method proposed by Zhu et al.²¹:

$$T_\textinteraction=\frac(\hat\alpha _1-\hat\theta \times \hat\beta _1)^2\rmv\rma\rmr(\hat\alpha _1-\hat\theta \times \hat\beta _1)\sim X_1^2$$

(3)

where $\theta $ reflects the contribution of interaction-independent effect to overall effect. We estimated the causal effect $\theta $ by conducting a linear regression model that assessed the normalized interaction-independent effect’s contribution to the normalized overall effect, denoting as $\widehat\theta $:

$$\frac\hat\alpha _1\textse(\hat\alpha _1)\times \sqrtn=\theta _+\theta \frac\hat\beta _1\textse(\hat\beta _1)\times \sqrtn+\epsilon $$

(4)

where $n$ is the sample size, and $\epsilon $ is the random noise. In the host genome study, $\theta $ in equation (4) was estimated by all host genetic variants. The genetic variants without G × EBV interaction will fall on the regression line but the variants with G × EBV interaction will depart from this line. Therefore, we searched the host genetic variants that deviate from this regression line to test the effect of G × EBV interaction.

Host genome-wide interaction analysis was conducted using the iterative Mendelian randomization and pleiotropy (IMRP) approach⁷². The statistic $T_\mathrminteraction$ in equation (3) provides a statistical assessment of the significance of G × EBV interaction effects. To account for the effect of sample overlapping between the GWAS and interaction analysis, the IMRP method incorporated a correlation coefficient of standardized $\alpha _1$ and $\beta _1$, calculated from genome-wide variants lacking significant associations (P > 0.05). Following causal effect $\theta $ estimation, a genome-wide interaction test, conceptually analogous to pleiotropy testing in IMRP, was applied. The quantile–quantile plot indicates effective control of type 1 error in human genome-wide interaction analysis (Extended Data Fig. 2a–c). For conditional analyses, the GWAS and interaction analyses were performed by incorporating the top variant and the EBV–interaction term as covariates in two logistic regression models.

Step 2: EBV genome-wide scan for interaction with HLA-A*11:01

In this step, we tested whether EBV variants (binary coding) interact with HLA-A*11:01 (presence or absence) to influence NPC risk. To evaluate this interaction, an EBV GWAS was conducted to examine the EBV variant (G_EBV) contribution to NPC (Y) using a generalized linear mixed model. Specifically, to control for viral population structure, we constructed a viral genetic relatedness matrix (vGRM) from EBV variants after linkage disequilibrium-based pruning (sliding window of 1,000 SNPs, step of 1 SNP, r² < 0.4) using the GEMMA software (v0.98.5)⁷³, following the recommended practice and previous work on highly linked genomes^6,74,75. We also performed sensitivity analyses using alternative linkage disequilibrium-pruning threshold r² < 0.3 or 0.6, and the top-ranked interaction signal for EBV 85841G remained unchanged. We then fit a logistic mixed model in the GMMAT software (v1.4.2)⁷⁶, using this vGRM as a random effect (Z) and including age, sex, the top four viral principal components and the top four host MDS components as fixed effects (C) to account for viral and host population structure (equation (5)):

$$\textlogit\G_\textEBV,C,Z)\=\alpha _^\textEBV+\alpha _1^\textEBVG_\textEBV+\alpha _2^\textEBV^\prime C+Z$$

(5)

Here $\alpha _1^\mathrmEBV$ represents the overall effect of G_EBV. The random effect Z was assumed to follow a multivariate normal distribution with mean zero and covariance matrix proportional to the viral genetic relatedness matrix, that is, $Z \sim N(0,\sigma _\mathrmEBV^2\mathrmvGRM)$, to account for genome-wide EBV genetic similarity among samples. Then, a viral genome-wide G_EBV × HLA interaction analysis on NPC risk was modelled through the following logistic mixed model:

$$\beginarrayl\mathrmlogit\P(Y\,=\,1=\beta _^\mathrmEBV+\beta _1^\mathrmEBVG_\mathrmEBV+\beta _2^\mathrmEBV\mathrmHLA\\ \,+\,\beta _3^\mathrmEBVG_\mathrmEBV\times \mathrmHLA+\beta _4^\mathrmEBV^\prime C+Z\endarray$$

(6)

In equation (6), $\beta _1^\mathrmEBV$ and $\beta _3^\mathrmEBV$ correspond to the interaction-independent effect of G_EBV to differentiate from the overall EBV effect $\alpha _1^\mathrmEBV$ in equation (5) and the interaction effect of G_EBV × HLA, respectively. The terms $G_\mathrmEBV$, $\mathrmHLA$ and $G_\mathrmEBV\times \mathrmHLA$ represent the EBV genotype, the HLA variant and their interaction, respectively.

Then, we tested the G_EBV × HLA interaction by the statistic proposed by Zhu et al.²¹:

$$T_\rmi\rmn\rmt\rme\rmr\rma\rmc\rmt\rmi\rmo\rmn=\frac{{\left(\hat\alpha _1^\textEBV-\hat\theta ^\textEBV\times \hat\beta _1^\rmE\rmB\rmV\right)}^2}{\rmv\rma\rmr\left(\hat\alpha _1^\textEBV-\hat\theta ^\textEBV\times \hat\beta _1^\rmE\rmB\rmV\right)}\sim X_1^2$$

(7)

where $\theta ^\mathrmEBV$ reflects the contribution of interaction-independent effect to overall effect. We estimated the causal effect $\theta ^\mathrmEBV$ by conducting a linear regression model that assesses the normalized interaction-independent effect’s contribution to the normalized overall effect, denoting as $\hat\theta ^\mathrmEBV$:

$$\frac\hat\alpha _1^\textEBV\textse\left(\hat\alpha _1^\textEBV\right)\times \sqrtn=\theta _^\textEBV+\theta ^\textEBV\frac\hat\beta _1^\textEBV\textse\left(\hat\beta _1^\textEBV\right)\times \sqrtn+\epsilon $$

(8)

where $n$ is the sample size and $\epsilon $ is the random noise. In the EBV genome study, $\theta ^\mathrmEBV$ in equation (8) was estimated using independent EBV variants, which were selected based on a linkage disequilibrium (r²) threshold of 0.3 across the whole EBV genome. We searched the EBV variants that deviate from this regression line to test the effect of G_EBV × HLA interaction.

The EBV genome interaction test was performed by the IMRP⁷². The statistic $T_\mathrminteraction$ in equation (7) is a test for evaluating the effect of G_EBV × HLA interaction. After estimating the causal effect $\theta ^\mathrmEBV$, we performed the interaction test for all EBV variants. In the conditional analysis of the interaction test, the top EBV variant was included as covariates in two logistic mixed models.

To control for multiple testing and type 1 error, we used the method previously presented²⁷. BRASS first fit a generalized linear mixed model that included host covariates (age, sex and host MDS components), viral principal components and vGRM to control the viral population structure and sample relatedness, excluding the viral SNP, host allele and their interaction (the null model). We computed the residuals and decorrelated them to obtain approximately exchangeable residuals. We then permuted these decorrelated residuals 10,000 times, and re-fit the interaction model on each replicate. This approach preserves the host–viral relatedness structure and accounts for clonality and relatedness to the viral genome while breaking only the association of interest. Using the minimum P value across the EBV genome in each permutation, we set the empirical genome-wide threshold at α = 0.05, yielding an empirical genome-wide threshold of 5.92 × 10⁻⁴ for the EBV interaction scan.

Power simulation

To quantify power under this stepwise design, we performed simulations following a previous method⁷⁷. In step 1, we resampled the discovery dataset 1,000 times and simulated NPC case–control status using the estimated main and interaction effects of the top interaction signal rs417162 (tagging HLA-A*11:01; r² = 0.73), high-risk EBV (BALF2-CCT) and their interaction. The power to rediscover the top interaction at the genome-wide significance threshold (P < 5 × 10⁻⁸) was 83%. In step 2, we randomly resampled individuals and their EBV genomes from the paired host–EBV genome dataset and simulated NPC case–control status using the estimated effects of HLA-A allele, EBV SNP 85841 and their interaction. Across 1,000 simulations, the power to detect the HLA-A allele × 85841G interaction at the Bonferroni-corrected significant threshold for the EBV genome scan was estimated. The power to detect the A*11:01 × 85841G and A*02:07 × 85841G interactions was 91% and 72%, respectively. The results of power calculations are consistent with previous human G × G power analyses and support that the stepwise design is statistically efficient under current sample-size realities^77,78,79,80.

Relative excess risk and attributable fraction due to interaction

RERI

RERI quantifies the magnitude to which the joint effect of two factors (here HLA and EBV) exceeds the sum of their individual effects, providing population-level metrics essential for evaluating absolute risk differences in preventive interventions^28,81. The OR for NPC associated with the joint status of the presence of HLA risk variant (HLA = 1) and the high-risk EBV subtype (EBV = 1) was defined as $\mathrmOR_11$. We estimated the interaction effect between the HLA variant and the EBV subtype infection on NPC risk as the relative excess RERI²⁸. We first fitted the following logistic regression model for NPC:

$$\textlogit\P(Y=1=\beta _+\beta _1\textHLA+\beta _2\textEBV+\beta _3\textHLA\times \textEBV+\beta _4^\prime C$$

(9)

Here, $Y=\mathrm1,0$ represents the NPC case or control status, $\mathrmHLA=\mathrm1,0$ indicates the presence or absence of HLA variant, and $\mathrmEBV=\mathrm1,0$ denotes the presence or absence of EBV subtype, and $C$ represents the set of covariates, as indicated in each table or figure legends. Because NPC is a rare disease, with an incidence of approximately 20 per 100,000 person-years in southern China^82,83, we used OR as an approximation of the relative risk in the interaction analyses. Under the rare outcome assumption, then

$$\textRERI\,\approx \,\textOR_11-\textOR_10-\textOR_01+1=\exp (\beta _1+\beta _2+\beta _3)-\exp (\beta _1)-\exp (\beta _2)+1$$

(10)

The 95% CI for the relative excess RERI was analysed using bootstrap resampling (5,000 iterations) implemented in the R package boot (v1.3.30) or delta method. The P value for the relative excess RERI was calculated using the method proposed by VanderWeele et al.⁸¹, which uses the low-risk genotype and low-risk exposure as the reference group and applies a one-tailed test.

Attributable fraction due to interaction

The attributable fraction due to interaction quantifies the proportion of disease risk, associated with a given exposure, that can be explained by its interaction with another factor. We estimated the proportion of HLA-associated risk that is attributable to its interaction with high-risk EBV strains (Extended Data Fig. 2h). The OR of NPC (Y = 1) associated with the joint status of HLA risk genotype (HLA = 1) and the high-risk EBV subtype (EBV = 1) was defined as $\mathrmOR_11$, which was calculated with a logistic regression model of NPC status (Y) on the HLA variants (HLA), the EBV subtypes (EBV) and their interactions (HLA × EBV), adjusting for age, sex and the top four host MDS components. The NPC risk due to an HLA risk genotype was calculated with $\mathrmOR_10-1$, whereas the NPC risk due to a high-risk EBV subtype was calculated with $\mathrmOR_01-1$. The increased NPC risk due to the coexistence of both HLA risk genotype and high-risk EBV subtype was $\mathrmOR_11-1$. The frequency of the high-risk EBV subtype in healthy controls is denoted as $P(\mathrmEBV=1)$, whereas the frequency of HLA risk genotype in healthy individuals is represented as $P(\mathrmHLA=1)$. Accordingly, the proportion of HLA-associated risk attributable to interaction was calculated as $\frac(\mathrmOR_11-\mathrmOR_10-\mathrmOR_01+1)P(\mathrmEBV=1)(\mathrmOR_10-1)+(\mathrmOR_11-\mathrmOR_10-\mathrmOR_01+1)P(\mathrmEBV=1)$⁸⁴. The 95% CI for the attributable fraction due to interaction was estimated via bootstrap resampling (5,000 iterations).

Population attributable fraction and NPC incidence estimation

PAF

PAF estimates the attributable fraction for a binary outcome (here NPC Y = 1) under the hypothetical scenario where the risk factor (X) is eliminated from the population and corresponds to the attributable fraction of disease outcome explained by the risk factor. The PAFs explained by the effects of HLA-A alleles (presence or absence), EBV subtypes (presence or absence) or their combinations (Fig. 4b,c, Extended data Fig. 2h and Supplementary Tables 15 and 18) were estimated using logistic regression models adjusting for covariates (C), including age, sex and the top four human MDS components, with the R package AF (v0.1.5)⁸⁵.

$$\textlogit\P(Y\,=\,1=\gamma _+\gamma _1X+\gamma _2^\prime C$$

(11)

In equation (11), $\gamma _1$ corresponds to the effect of the risk factor (X; HLA-A alleles or EBV subtypes). Because NPC is a rare disease^82,83, the relative risk can be approximated by the OR. The attributable fraction of NPC explained by the risk factor can be approximated by

$$\textAF\,\approx \,1-E_C\\textOR^-X(C)$$

(12)

NPC incidence estimation across EBV and HLA strata

Age-standardized incidence rates (ASRs; per 100,000) for NPC in Asia were sourced from: (1) the 2020 China Cancer Registry Annual Report⁸⁶, (2) the 2021 Hong Kong Cancer Registry⁸⁷, (3) the WHO International Agency for Research on Cancer databases³, and (4) peer-reviewed literature⁸⁸. HLA-A allele frequencies were obtained from three resources: our dataset, the NyuWa Genome Resource and published studies^89,90,91,92. To evaluate population burden across strata defined by EBV strains and HLA-A background in southern China, we estimated stratum-specific NPC incidence and preventable cases. NPC incidence rates for each stratum were derived based on stratum-specific odds ratio to the overall regional NPC incidence of approximate 20 per 100,000 person-years^82,83, under the rare-disease approximation. We further translated the PAF into the estimated number of attributable cases under the counterfactual assumption^46,47 that exposure to high-risk EBV strains could be eliminated. For each stratum $i$, preventable cases were computed as Preventable cases_i = PAF_i × C_South, where C_South = 51,000 × 0.80 reflects an annual national NPC burden of approximately 51,000 cases⁹³ and approximately 80% occurring in southern China.

Peptide–HLA-binding assay

Prediction of HLA-A allele-binding affinity

Peptide–HLA-A*11:01-binding or A*02:07-binding affinities were predicted by NetMHCpan 4.1 (ref. ³⁵). HLA–peptide-binding predictions were determined based on percentile ranks for 9–11-mer peptides spanning the EBV SNP 85841A>G-encoded Q900R variant in EBNA3B from high-risk 85841G and non-high-risk EBV strains. Peptides with normalized percentile ranks of less than 2% were defined as binders to the respective HLA-A allele (Supplementary Tables 20 and 21).

AlphaFold3-based structure prediction

The structures of peptide–HLA-A*11:01 complexes for the low-risk peptide VVILENVGQ (85841A encoded) and the high-risk peptide VVILENVSR (85841G encoded) were predicted using the AlphaFold3 public web server⁹⁴. Inputs comprised the HLA-A*11:01 heavy chain, β₂M and the EBNA3B peptide sequence, using the HLA-A*11:01–β₂M scaffold derived from Protein Data Bank 6JOZ as a structural ref. ⁹⁵. Molecular graphics were generated in PyMOL (v3.1.6.1)⁹⁶.

Biolayer interferometry assay

Peptides with a purity of more than 95% were synthesized by GenScript Biotech. Biotinylated UV-exchangeable HLA-A*11:01–RVFA-J-SFIK and HLA-A*02:07–LLDSD-J-ERL peptide–MHC monomers were purchased from BetterGen. Biolayer interferometry was performed using an Octet R8 (Sartorius), and data were analysed with Octet Analysis Studio (v12.2) software. After UV irradiation for 20 min on ice to degrade the original A*11:01-bound peptide RVFA-J-SFIK, biotinylated HLA-A*11:01 was loaded onto SSA Biosensors (Sartorius 18-5057) at a concentration of 10 μg ml⁻¹ for 480 s. Coated sensor tips were dipped into dilution buffer for a 60-s baseline measurement, then into peptide solutions at serially diluted concentrations for a 90-s association phase, and finally into dilution buffer for a 120-s dissociation phase. Peptide concentrations are indicated in each figure. The equilibrium K_d was determined by steady-state analysis of data at equilibrium for each peptide concentration, a method suitable for protein–small-molecule interactions exhibiting fast on and off rates and low signal intensity. The HLA-A*11:01-restricted epitope IVTDFSVIK served as the positive control (K_d = 54.7 μM), whereas the peptide LLWTLVVLL, which cannot bind to A*11:01, was used as the negative control (K_d = 242.7 μM).

HLA-A peptide exchange ELISA

First, 20 µl diluted peptide (400 µM in PBS) and 20 µl HLA-A monomer (0.2 mg ml⁻¹) were added to a microcentrifuge tube. UV-mediated peptide exchange was performed on ice for 30 min under 365-nm UV irradiation. The Flex-T Human Class I Peptide Exchange ELISA Kit (BioLegend) was used to evaluate the efficiency of UV-activated peptide exchange according to the manufacturer’s instructions. DMSO was used as the blank control. For the HLA-A*11:01 monomer, IVTDFSVIK was used as the positive control and CLGGLLTMV as the negative control. For the HLA-A*02:07 monomer, CLGGLLTMV was used as the positive control and IVTDFSVIK as the negative control.

T cell functional assay

EBV production

EBV strains B95-8, Akata and M81 were produced by the B95-8 cells, CNE2-Akata cells and CNE2-M81 cells, respectively^{50,97,98,99,100}. In brief, B95-8 cells were resuspended at a density of 2 × 10⁵ cells per millilitre in RPMI 1640 medium with low-serum condition (2% fetal bovine serum (FBS)), 100 U ml⁻¹ penicillin and 100 µg ml⁻¹ streptomycin to trigger the EBV lytic cycle for 2 weeks. To induce CNE2-Akata and CNE2-M81 cells, 20 ng ml⁻¹ 12-O-tetradecanoylphorbol 13-acetate (TPA; Beyotime) and 2.5 mM sodium butyrate (NaB; Sigma-Aldrich) were added in RPMI 1640 medium with 10% FBS, 100 U ml⁻¹ penicillin and 100 µg ml⁻¹ streptomycin for 12 h. Then, the medium was replaced with fresh complete medium, and cells were cultured for 3 days. Cell supernatants were collected and filtered by 0.45-µm membrane filter (Millipore). EBV was concentrated by centrifugation at 50,000g for 2.5 h and resuspended in RPMI 1640 medium. Virus stocks were stored at −80 °C until use.

Cell line construction

COS-7 cell lines, each overexpressing a single HLA–β₂M complex (A*11:01, A*02:07 or A*02:01) and luciferase, were gifted by the Li laboratory³⁶. In addition, expression plasmids encoding EBV EBNA3B haplotypes from the high-risk 85841G, high-risk 85841A and low-risk strains were individually transiently transfected into the HLA-A*11:01 and luciferase co-expressing COS-7 cell line using polyethylenimine (PEI; Polysciences) at a DNA:PEI ratio of 1:3, followed by incubation for 24 h.

To establish EBV-infected LCLs, PBMCs were resuspended in B95-8–EBV-containing, Akata–EBV-containing or M81–EBV-containing solutions at a density of 5 × 10⁵ cells per millilitre, respectively¹⁰¹. After 12 h, cells were centrifuged and cultured at 48-well plates in RPMI 1640 medium with 10% FBS, 1% (v/v) GlutaMax (Gibco), 1 mM sodium pyruvate (Gibco), 2 µg ml⁻¹ cyclosporin A (CsA; Yeasen), 100 U ml⁻¹ penicillin and 100 µg ml⁻¹ streptomycin. Cell culture medium was replaced weekly (50% volume) with fresh complete medium containing 2 µg ml⁻¹ CsA. After approximately 4–6 weeks, the LCLs were stable and proliferated continuously, and were used for subsequent experiments. Successful establishment of LCLs was confirmed by EBER fluorescence in situ hybridization (EBER-FISH) and flow cytometric analysis of B cell markers.

Peptide-specific T cell expansion

Peptide-specific T cells were generated as previously described^102,103. In brief, PBMCs (1 × 10⁶ cells) were cultured in 24-well plates containing complete RPMI 1640 medium, consisting of RPMI 1640 medium supplemented with 10% FBS, 100 U ml⁻¹ penicillin, 100 µg ml⁻¹ streptomycin, 10 mM HEPES (Gibco), 1 mM sodium pyruvate (Gibco), 1% (v/v) non-essential amino acids (Gibco), 1% (v/v) GlutaMax (Gibco) and 50 µM β-mercaptoethanol (Sigma-Aldrich). On days 1 and 5, PBMCs (5 × 10⁵ cells) were incubated with single peptide (10 µM) or a peptide pool (2 µM per peptide) for 2 h at 37 °C in 5% CO₂. Peptide-pulsed PBMCs were washed once and then mixed with the autologous PBMC cultures. After 3 days of cell culture, cultures with 10 U ml⁻¹ recombinant human IL-2 (PeproTech), 10 ng ml⁻¹ recombinant human IL-7 (Bio-Techne) and 10 ng ml⁻¹ recombinant human IL-15 (PeproTech) were added, and half of the medium was replaced every 2 days thereafter. On day 13, expanded T cells were washed and rested overnight in cytokine-free medium.

T cell activation assay (intracellular cytokine staining)

HLA-A-expressing COS-7 cell lines (used as APCs) were pulsed with peptides in 96-well U-bottom plates and incubated for 2 h at 37 °C. After incubation, cells were washed twice and then resuspended in complete RPMI 1640 medium for subsequent activation and cytotoxicity assays. Peptide-expanded T cells were co-cultured with peptide-pulsed APCs, EBV EBNA3B-transduced APCs or LCLs in complete RPMI 1640 medium supplemented with anti-human CD28 (1 µg ml⁻¹; CD28.2, BD Biosciences) and anti-human CD49d (1 µg ml⁻¹; 9F10, BD Biosciences). Co-cultures were set up in 96-well U-bottom plates at an effector-to-target ratio of 1:5 (5 × 10⁴ effector cells and 2.5 × 10⁵ target cells per well) and incubated at 37 °C for 2 h. Brefeldin A (10 µg ml⁻¹) and monensin (1 µM) were then added, followed by incubation at 37 °C for 6 h. Cells were subsequently kept overnight at 4 °C. Surface staining was performed for 1 h at 4 °C using the following reagents diluted at 1:100, including Fixable Viability Dye eFluor 506 (eBioscience), APC/cyanine7 anti-human CD45 (HI30, BioLegend), PE-Cy7 anti-human CD3 (SP34-2, BD Biosciences) and PerCP-Cy5.5 anti-human CD8 (RPA-T8, BD Biosciences). Cells were then washed, fixed and permeabilized for 1–5 h, followed by intracellular staining with Alexa Fluor 647 anti-human IFNγ (BD Biosciences; Supplementary Fig. 11a). Intracellular cytokine staining was performed using a CytoFLEX LX flow cytometer (Beckman Coulter). Data were analysed using CytExpert (Beckman Coulter) and FlowJo (v10; TreeStar).

T cell cytotoxicity assay

Peptide-expanded T cells were co-cultured with peptide-pulsed APCs, EBV EBNA3B-transduced APCs or LCLs in complete RPMI 1640 medium in 96-well U-bottom plates at an effector-to-target ratio of 5:1 (2.5 × 10⁴ effector cells and 5 × 10³ target cells per well). Co-cultures were incubated for 24 or 48 h at 37 °C in 5% CO₂. For luciferase-expressing COS-7 APCs, T cell-mediated cytotoxicity was assessed by measuring the luciferase activity of the remaining viable target cells. After co-incubation, plates were centrifuged at 1,000 rpm for 5 min and washed twice to remove dead cells and cellular debris. Cells were then resuspended in culture medium, mixed with an equal volume of Bio-Lite Plus detection reagent (Bio-Lite Plus Luciferase Assay System, Vazyme) and incubated at room temperature for at least 3 min. Luminescence was measured using a multimode microplate reader (Spark 10 M, Tecan). Percent specific lysis was calculated using the following formula: Specific lysis (%) = 100 × (1 – test luminescence/maximum luminescence).

T cell-mediated cytotoxicity against LCL targets was assessed by flow cytometric analysis of target cell death. After co-incubation of LCLs and peptide-expanded T cells, cells were harvested and washed twice with PBS and then incubated with BV421 anti-human CD19 (HIB19, BioLegend) antibody for 30 min at 4 °C. Then, cells were washed with binding buffer (BioLegend) three times and stained with APC Annexin V Apoptosis Detection Kit (BioLegend) together with SYTOX AADvanced (Invitrogen) according to the manufacturers’ instructions (Supplementary Fig. 11b). LCLs co-cultured without peptide-expanded T cells served as the baseline control. Samples were analysed on a CytoFLEX LX flow cytometer (Beckman Coulter). Percent specific lysis was calculated as Specific lysis (%) = 100 × (1 – cell survival in test group/cell survival in control group).

IFNγ ELISPOT assay

The IFNγ enzyme-linked immunospot (ELISPOT) assay was conducted following the manufacturer’s instructions (Human IFNγ ELISPOT Set, BD Biosciences). In brief, ELISPOT plates were coated with the capture antibody (NA/LE purified anti-human IFNγ) overnight at 4 °C. After washing once, plates were blocked with complete RPMI 1640 medium for 2 h at room temperature. Peptide-expanded T cells (1 × 10⁶ cells per millilitre) and single peptide (2 µM) or peptide pools (1 µM per peptide) in complete RPMI 1640 medium were plated and incubated at 37 °C with 5% CO₂ for 18–22 h. DMSO was used as a negative control. After incubation, the detection antibody (biotinylated anti-human IFNγ) was added and incubated at room temperature for 2 h, followed by incubation with streptavidin–horseradish peroxidase for 1 h at room temperature. The 3-amino-9-ethylcarbazole substrate (AEC Substrate Set, BD Biosciences) was then added to each well for 15–45 min, and deionized water was used to stop the substrate reaction. Spots were enumerated using an AID ELISPOT Reader.

Tetramer staining

PBMCs from patients with NPC or healthy controls were stained for 1 h at 4 °C with a PE-conjugated HLA-A*11:01 tetramer loaded with VVILENVSR (MBL), AVFDRKSDAK (BetterGen) or IVTDFSVIK (BetterGen), and with an APC-conjugated HLA-A*11:01 tetramer loaded with ATAAAAAAK (MBL) as a negative control. Cells were subsequently surface-stained with Fixable Viability Dye eFluor 506 (eBioscience), FITC anti-human CD3 (HIT3a, BioLegend) and PerCP-Cy5.5 anti-human CD8 (RPA-T8, BD Biosciences; Supplementary Fig. 11c). Flow cytometry was performed using a CytoFLEX LX flow cytometer (Beckman Coulter). Data were analysed using CytExpert (Beckman Coulter) and FlowJo (v10; TreeStar).

EBV population genetic, phylogenetic and selection analysis

EBV genome alignment

High-quality EBV WGS data were obtained from 388 cases and 453 controls from southern China, and 113 healthy controls from other areas in China, after EBV genome quality control. In addition, we retrieved raw EBV WGS fastq data for 176 individuals from published studies in the NCBI databases. Applying the same quality control criteria, EBV WGS data from 118 of these individuals were retained and included in subsequent analyses (for details, see Supplementary Table 16 and Supplementary Fig. 7). Variant calling was performed following GATK best practices workflow (see the section ‘EBV variant calling’ for details). Subsequently, consensus genome sequences were generated by integrating sample-specific variants into the reference genome, with ‘N’ indicating a missing value. The proportion of reads supporting the major allele was assessed for each site. If this proportion exceeded 60%, the major allele was assigned as the consensus base. If the supporting read proportion for the major allele was below 60%, this ambiguous site was masked with N to mitigate potential artefacts. A total of 1,072 consensus genome sequences were generated for the following multiple sequence alignment.

In addition, we retrieved 1,130 assembled EBV genomes (more than 100 kb) from the NCBI database (taxon ID: 10376; accessed July 2022), excluding any sequences that overlapped with the 1,072 consensus genomes generated from raw fastq data in this study. As raw sequencing data were unavailable for these public genomes, only the FASTA-format assemblies were used. Metadata, including geographical origin, EBV type and diseases of carriers, were extracted for all sequences (Supplementary Table 16). Together with the 1,072 consensus sequences generated in this study, all 2,202 genomes were included in the subsequent multiple sequence alignment.

Multiple sequence alignment was performed using MAFFT (v7.490)¹⁰⁴ with the ‘–keeplength’ parameter to preserve the original length of the reference genome (171,823 bp) and establish homologous alignment. To ensure high alignment quality, we first masked repetitive regions (annotated in NC_007605.1, 20.7% of the EBV genome) and low-coverage regions defined as more than 50% N characters across all alignments in a 100-bp window. Subsequently, sequences with missing values exceeding 25% were excluded from further analysis. Following these quality control steps, a total of 2,086 EBV genomes were retained for downstream analyses.

Tree construction

EBV has been classified into two major types, type 1 and type 2, which diverged early in evolution and exhibit distinct genomic features, particularly in the EBNA2, EBNA3A, EBNA3B and EBNA3C gene regions, with type 1 representing the predominant lineage globally. To account for this deep evolutionary split, we first distinguished type 1 and type 2 strains before downstream analyses. We performed PCA based on genome-wide variation from 2,086 EBV strains. Variants were identified based on the multiple sequence alignment, and variants with high levels of missing data (more than 10%) or a MAF below 5% were removed using VCFtools (v0.1.13)¹⁰⁵. PCA was subsequently performed with PLINK (v1.9)⁶⁰ and identified two genetically distinct EBV clusters. These clusters were primarily differentiated by variation within the EBNA2 and EBNA3 genes, corresponding to the EBV type 1 and type 2 lineages (Supplementary Fig. 12). Given the clear separation and the global predominance of type 1, subsequent analyses were restricted to this group (n = 1,852).

The maximum likelihood of the phylogenetic relationship was inferred using IQ-TREE 2 (v2.2.0)¹⁰⁶. The optimal model of nucleotide substitution was selected using the ModelFinder function within IQ-TREE 2, based on the Bayesian Information Criterion¹⁰⁷. The model TVM + F + I + R7 was identified as the best fitting and was used for final tree construction, with statistical support assessed using 1,000 non-parametric bootstrap pseudoreplicates. To evaluate the robustness of phylogenetic inference to recombination, we performed a recombination-masked analysis by excluding the inferred N1-derived recombinant region (approximately 54–130 kb) from the multiple sequence alignment and reconstructing the maximum likelihood tree using the same IQ-TREE 2 workflow.

The phylogeny was rooted using an African-derived EBV strain (KP968262.1), as African lineages are known to occupy basal positions in global EBV trees, consistent with the co-evolutionary history of EBV and its human host originating in Africa. Rooting was performed using the root function in R package ape (v5.8.1)¹⁰⁸. All phylogenetic trees were subsequently visualized and annotated using R packages ape (v5.8.1), treeio (v1.30.0), ggtree (v3.14.0) and ggtreeExtra (v1.16.0)^{108,109,110,111}.

Chromosome painting analysis

To investigate the ancestry composition and detect potential admixture signals in EBV strains, we applied a chromosome painting approach using ChromoPainter, embedded within the fineSTRUCTURE (v4.1.0)¹¹². ChromoPainter uses a hidden Markov model to reconstruct each recipient haplotype as a mosaic of genomic segments copied from the donor haplotypes. A total of 132 EBV genomes from northern China (N1) and 116 from southern China (S1) were designated as donors to paint 643 recipient genomes (C1 and X1).

As ChromoPainter requires phased haplotypes, we first converted filtered VCF files into ChromoPainter-compatible.phase format using PLINK (v1.9)⁶⁰ and a companion script supplied by ChromoPainter (plink2chromopainter). ChromoPainter was first run in estimation mode (-a 0 0) using all samples to infer the effective population size (Ne) and mutation rate (μ). In the subsequent painting step, these estimated parameters were applied, with the number of expectation-maximization iterations set to 10, and the -k parameter set to 20, defining the expected number of chunks copied from donors.

Genomic similarity and recombination analysis

Genetic similarity between consensus genome sequence of 85841G-carrying high-risk strains and other strains on the phylogeny was accessed using a sliding window approach (1,000-bp window size and 100-bp step size). To detect recombination, we generated consensus genome sequences for each focal clade (for example, S1, N1, X1 and C1) using EMBOSS (v6.6.0.0)¹¹³, and conducted recombination analysis on these consensus sequences using RDP5 (ref. ¹¹⁴). Genetic differentiation between clades was evaluated by calculating the fixation index (FST, Weir and Cockerham’s method), implemented in VCFtools (v0.1.13)¹⁰⁵, with the same window parameters.

Extended haplotype homozygosity test

In healthy individuals from southern China, extended haplotype homozygosity spanning EBV variants (that is, SNP 85841, SNP 84414, SNP 84462, SNP 7048 and SNP 163364) was estimated using the R package rehh (v3.2.2)¹¹⁵.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.