Plant material and sequencing
We selected a total of 149 rice accessions according to the phylogenetic relationships of 1,529 accessions reported previously19, which were preserved at the China National Rice Research Institute in Hangzhou, China, and the National Institute of Genetics in Mishima, Japan. For sampling, all accessions were cultivated in Lingshui County, Hainan.
We extracted genomic DNA from fresh leaves, which were then immediately flash-frozen in liquid nitrogen and stored at −80 °C. HiFi sequencing was performed on the Pacific Biosciences Sequel II platform for 133 accessions, and 16 accessions were subjected to nanopore sequencing on the PromethION platform, following their respective standard protocols. After removing low-quality reads, the average N50 lengths of pass reads were 16 kb for the HiFi-sequenced accessions and 26 kb for the ONT-sequenced accessions. The sequencing yielded 6.8–17.4 Gb of HiFi reads for the 133 rice accessions and 28.4–55.3 Gb of nanopore reads for the 16 rice accessions. The leaves of 142 rice accessions were also used to extract genomic DNA and construct Illumina libraries with a 350-bp insert size, which were then sequenced on the HiSeq 4000 platform (Illumina). Detailed information on the geographical origins and sequencing coverage of these accessions can be found in Supplementary Tables 1 and 2.
We selected leaves at the seedling stage from 30 representative varieties for Hi-C sequencing. Following the standard protocol, Hi-C libraries were constructed from these samples, which were subsequently sequenced on the NovaSeq platform (Illumina).
Tissues were collected from various plant parts, including leaves during vegetative growth from 149 accessions, roots from 146 accessions, seedlings (about 15 days old) from 137 accessions and panicles (3–5 cm) from 122 accessions. RNA was then extracted using TRIzol (Invitrogen). RNA sequencing (RNA-seq) libraries with an insert size of 350 bp were prepared and sequenced on the HiSeq 4000 platform (Illumina) according to the instructions provided.
Genome assemblies
We assembled the genomes of 133 accessions sequenced with HiFi technology using Hifiasm61 (v.0.16.0), with default parameters. This process yielded both p-contigs and a-contigs. For seven accessions sequenced with nanopore technology, we used NECAT62 (v.20200803) with default parameters for assembly. The remaining nine accessions were assembled using NextDenovo63 (v.2.1). To improve the consistency and single-base accuracy of nanopore genomes, we used Racon64 (v.1.0.0) for one round of polishing using nanopore reads, followed by two rounds of polishing with NextPolish65 (v.1.0.2) using Illumina reads. To exclude organelle genomes, we aligned the chloroplast (NC 001320.1) and mitochondrion (NC 011033.1 and NC 001751.1) reference sequences from IRGSP-1.0 against the assembly contigs of each accession using MUMmer66 (v.4.0.0beta2). Contigs showing more than 50% coverage and measuring less than 500 kb were specifically targeted and removed. The remaining contigs, attributed to the nuclear genome, were subsequently evaluated and anchored to pseudo-chromosomes using ALLMAPS67.
To achieve chromosome-level genome assembly, the obtained Hi-C reads were aligned to contigs using Chromap68 (v.0.2.6). These alignments were then converted into bed or bam format by SAMtools69 (v.1.20), and processed with YaHS70 (v.1.1). Subsequently, Hi-C contact maps for visualization were generated by juice_tools (v.1.19.02), followed by manual optimization with JuiceBox Assembly Tools71 (v.1.9.8).
Evaluation of genome assemblies
To assess the quality of genome assemblies, we implemented several indexes. First, we evaluated gene completeness using the embryophyta_odb10 database, using BUSCO72 (v.5.2.2), and repeat completeness on the basis of the LTR assembly index (LAI)25, using LTR_retriever (v.2.9.0) with parameters ‘-maxlenltr 7000’. Furthermore, the overall assembly quality (QV) was measured using Inspector26 (v.1.0.1), a reference-free assembly evaluator. To round off our evaluation, the number of mismatches between the Nipponbare genome assembled in this study and the reference genomes IRGSP-1.0 and T2T-NIP was assessed using QUAST73 (v.5.0.1).
For the syntenic analysis at the genome level between the Nipponbare (IRGSP-1.0) genome and four representative genomes in O. sativa (Gla4), O. rufipogon (W1943), O. longistaminata (OL3101) and O. meridionalis (OM1952), we used the nucmer program from MUMmer66 (v.4.0.0beta2). Syntenic blocks identified were filtered using the delta-filter program with parameters ‘-i 85 -l 5000 -o 85’ and then visualized with the mumplot program.
We performed a comparative genomic analysis between chromosomes constructed using Hi-C technology and pseudo-chromosomes obtained with the ALLMAPS method. At first, we aligned the genomes using the nucmer program from MUMmer66. Syntenic blocks identified were filtered using the delta-filter program with parameters ‘-m’. Subsequently, we compared alignments between two chromosome-level assemblies and identified synteny blocks and structural rearrangements using SyRI74 (v.1.4).
Structure and functional annotation of protein-coding genes
Three distinct strategies, comprising ab initio, homology-based and transcriptome-based predictions, were integrated to generate the predicted gene models. For ab initio prediction, four different programs were used: FGENESH+75 (v.3.1.1), SNAP76 (v.2006-07-28), GeneMark-ES77 (v.4.68_lic) and AUGUSTUS78 (v.3.3.2). In the homology-based prediction, homologous protein sequences, sourced from the Rice Genome Annotation Project database (v.7.0, http://rice.plantbiology.msu.edu), were aligned to the assembled genomes using GenomeThreader79 (v.1.7.1). RNA-seq reads from four different tissues of each accession were mapped to their respective assembled genomes using HISAT2 (ref. 80) (v.2.0.5) and then assembled into transcripts with StringTie81 (v.2.0). We performed both de novo and genome-guided RNA-seq assemblies using Trinity82 (v.2.12.0), subsequently aligning them to the genome assemblies with PASA83 (v.2.0.1). By assigning appropriate weights to every predicting method, we synthesized all predicted gene structures into consensus gene models using EVidenceModeler84 (v.1.1.1). The protein-coding annotations were assessed using BUSCO72 (v.5.2.2).
For functional annotation of genes, InterProScan85 (v.5.56-89.0) was used to predict potential protein domains. For calculating gene-expression levels, low-quality RNA-seq reads were first removed using fastp86 (v.0.23.0) with parameters ‘-l 30’. Then, the filtered paired-end reads were mapped against the index of decoy sequences, which concatenated the genome to the end of the annotated transcripts, using Salmon87 (v.1.6.0) in mapping-based mode with parameters ‘-l A –validateMappings –gcBias’. Finally, gene-expression levels were quantified by counting the number of reads mapping to each transcript and calculating the transcripts per million (TPM) values.
Identification of allelic and differentially expressed allelic genes
We developed a pipeline to discern allelic genes between p-contigs and a-contigs within HiFi genome assemblies (Supplementary Fig. 31). First, allelic gene pairs were determined to be reciprocal best hits using GeneTribe88 (v.1.2.0) with the default parameters. Subsequently, the full-length sequences of genes from a-contigs underwent a BLASTN (v.2.9.0+) search against those from p-contigs. Genes with both identity and coverage values equating to 100% were classified as homozygous alleles, and all others as heterozygous alleles. In addition, genes present in a-contigs but missing in p-contigs (MIP genes) were isolated on the basis of two stringent conditions: (1) no hit when gene sequences from a-contigs were aligned to p-contigs using BLASTN (v.2.9.0+) and (2) no path found when the coding sequences (CDSs) from a-contigs were aligned to p-contigs using GMAP89 (v.2021-05-27). The other genes on a-contigs that are not classified were defined as ‘others’.
The identification of differentially expressed allelic genes was based on three standards that must be simultaneously satisfied:
-
(1)
TPMp-gene/TPMa-gene > 2 or TPMp-gene/TPMa-gene < 1/2;
-
(2)
NumReadsp-gene/(NumReadsp-gene + NumReadsa-gene) > 0.75 or NumReadsp-gene/(NumReadsp-gene + NumReadsa-gene) < 0.25;
-
(3)
NumReadsp-gene + NumReadsa-gene ≥ 10.
Identification of RGAs
The RGAugury90 pipeline was used to predict the RGAs in 133 wild rice accessions (including 129 O. rufipogon, 3 O. longistaminata and 1 O. meridionalis) and 129 cultivated rice accessions (Supplementary Table 8). RGA candidates were identified and classified into four major families on the basis of the presence of combinations of these RGA domains and motifs: RLKs, NBSs, TM-CC proteins and RLPs. Considering that RGAs tend to cluster together in the genome, tandem duplications were identified using the ‘jcvi.para.catalog’ module of the MCscan91 (Python version) pipeline, with the default parameters based on their location on the chromosomes. RGAs in each locus were categorized as singletons, pairs and clusters if they numbered 1, 2 and more than 2, respectively.
Collinear blocks for each accession relative to all others were constructed using the ‘ortholog’ tool of the ‘jcvi.compara.catalog’ module in the MCscan (Python version) pipeline. Tandem genes were integrated using the ‘mcscan’ tool of the ‘jcvi.compara.synteny’ module with the parameter ‘–mergetandem’. Then, all of the collinear blocks for each accession with all others were joined to a matrix using the ‘join’ tool of the ‘jcvi.formats.base’ module. Finally, a comprehensive RGA matrix was created by merging, sorting and deduplicating all collinear matrices using a custom script.
Identification of centromeres and telomeres
The sequences of CentO satellite repeats in rice, which had been reported previously92, were aligned against nuclear genomes using ClustalW93 (v.2.1). The generated alignment file was converted to Stockholm format using an online sequence conversion tool (http://sequenceconversion.bugaco.com/converter/biology/sequences/), which then served as the input for constructing an HMM file using the ‘hmmbuild’ function in HMMER94 (v.3.1b2). Next, a homology search was performed to identify the CentO repeats for each chromosome across all accessions using nhmmer, with an E-value threshold set to 1 × 10−5. We adopted the strategy of extracting one CentO repeat unit at every fiftieth interval on each chromosome to select some CentO repeats for a subsequent similarity comparison across genomes using MAFFT95 (v.7.490). Subsequent phylogenetic analysis was performed using IQ-TREE96 (v.1.6.12), incorporating a bootstrap value of 1,000 for robustness.
To identify the telomere sequences in each chromosome, the telomere sequence 5′-CCCTAAA-3′ and the reverse complement of the seven bases were searched directly by the custom script.
TE annotation
TEs of 149 chromosome-level genomes were identified by the EDTA97 (v.2.1.0) pipeline, using both a manually curated rice TE library (rice6.9.5.liban) and the annotated CDSs of each genome. Then, individual non-redundant TE libraries, generated from each genome, were combined with the curated TE library (rice6.9.5.liban) by panEDTA37, leading to the formation of a comprehensive pangenome TE library. This pangenome TE library was then used to reannotate whole-genome TEs in our study’s 149 assemblies, as well as in 28 rice assemblies from a previously published pangenome of 33 cultivated rice accessions16 (excluding Oryza barthii, Oryza glaberrima, aus, basmati and WSSM) using RepeatMasker (http://repeatmasker.org) (v.4.1.2). The insertion time of each intact Gypsy and Copia element was estimated using LTR_retriever98 (v.2.9.0) with default parameters.
In a previous study37, the dynamics of the LTR family were determined by comparing family size and the ratio of solo-to-intact LTRs within each family between two groups. Families showing significant differences in both sizes and the solo-to-intact LTR ratio were categorized as ‘removal families’. We classified cases in which only the family size was significantly different as ‘amplification families’. On the other hand, if there was a notable change in the solo-to-intact LTR ratio without a corresponding shift in family size, these were classified as ‘balanced families’. Families that did not exhibit a significant difference in either dimension were termed ‘drifting families’. To classify the dynamics of Or-IIIa-larger Gypsy families, we first identified solo Gypsy elements using the ‘solo_finder.pl’ script from the LTR_retriever package, and obtained family information for the intact Gypsy elements from each genome’s final annotation results. Then, the solo-to-intact ratio was calculated by dividing the number of solo elements by the number of intact elements within the Gypsy families. Finally, we applied Student’s t-tests to compare the family size and solo-to-intact ratio between Or-IIIa and japonica groups, with P < 0.01 as the cut-off for significance.
We mapped the sequences of identified insertions and deletions to the comprehensive pangenome TE library using BLASTN (v.2.12.0+). If both the identity and the coverage reached 80%, the PAV was defined as a TE insertion polymorphism (TIP)99. To identify the genes adjacent to the Or-IIIa-larger Gypsy families, we mapped the gene sequences against the TIP sequences containing the Or-IIIa-larger Gypsy families in each Or-IIIa genome. Genes with an identity of at least 95% and a coverage of at least 50% were classified as adjacent to these families. Gene Ontology (GO) enrichment analysis of these genes was performed in the R package clusterProfiler (v.4.6.2), with P ≤ 0.05 as the threshold for significance.
SV calling
We adopted three strategies to detect PAVs (large insertions and deletions) in the 133 Hifi genomes. (1) We mapped HiFi reads to the IRGSP-1.0 reference genome using pbmm2 (https://github.com/PacificBiosciences/pbmm2/) (v.1.4.0) with the ‘–preset CCS’ parameters. Then, pbsv (https://github.com/PacificBiosciences/pbsv) (v.2.6.2) was used for variant calling of each accession with the parameters ‘–min-sv-length 30 –max-ins-length 100K –max-dup-length 100K’. (2) We mapped HiFi reads to the IRGSP-1.0 reference genome using minimap2 (ref. 100) (v.2.21-r1071) with the ‘-x map-hifi’ parameters. CuteSV101 (v.1.0.11) was then used for variant calling of each accession with the parameters ‘–min_support 3 –min_size 30 –max_size 100000 –max_cluster_bias_INS 1000 –diff_ratio_merging_INS 0.9 –max_cluster_bias_DEL 1000 –diff_ratio_merging_DEL 0.5’. (3) We mapped assembled contigs to the IRGSP-1.0 reference genome using minimap2 (ref. 100) (version 2.22-r1071) with the parameters ‘-x asm5 –cs -r 2k’. For variant calling, SVIM-asm102 (v.1.0.7) was operated in ‘haploid’ mode, with the parameters ‘–min_sv_size 30 –max_sv_size 100000’. For obtaining high-confidence variations, we merged all insertions within a 50-bp range and deletions within a range of 50% of their length using SURVIVOR103 (v.1.0.6). We reported only those variants that were corroborated by at least two of the calling methods and for which there was a consensus on the variant type. To detect PAVs in 16 nanopore genomes, we tailored our approach by using the second and third strategies of the above method, with some modifications to suit the characteristics of nanopore data: (1) the minimap2 (ref. 100) (v.2.21-r1071) parameters for mapping nanopore reads were adjusted to ‘-x map-ont’; (2) the parameters for CuteSV101 (v.1.0.11) were modified to increase the minimum support threshold to 10; and (3) the minimum supporting caller number in SURVIVOR103 (v.1.0.6) was adjusted to one.
To detect translocations and inversions, we first aligned each pseudo-chromosome to the reference genome across 149 genomes and then used the SyRI74 (v.1.4) pipeline for variation calling. Per the classifications provided by SyRI, INV variants were categorized as inversions, in comparison with the Nipponbare reference. Both TRANS and INVTR variants were categorized as translocations. To detect small indels, we extracted INS and DEL variants consisting of fewer than 30 bp.
SNP calling and annotation
We implemented both read-mapping-based and assembly-based approaches to identify SNPs using Nipponbare as the reference genome. For read mapping, we called SNPs through Longshot104 (v.0.4.1), using the alignment results of reads by minimap2 (ref. 100) during the process of PAV calling. Parameters were set to ‘-c 3 -D 3:10:50’ for HiFi genomes and ‘-c 10 -D 3:10:50’ for nanopore genomes. For assembly-based calling, we first aligned each contig to the reference genome using the ‘nucmer’ program and refined the alignments to one-to-one matches using the ‘delta-filter’ program. SNPs were then identified using the ‘show-snps’ program with the ‘-C -I’ parameters, all from the MUMMER package66 (v.4.0.0beta2). To minimize false positives, we only considered SNPs detected by both methodologies.
From a population of 280 accessions, including 132 long-read sequencing genomes sourced from published studies (Supplementary Table 16), SNP calling was performed by mapping the assembled contigs using MUMMER66 alone. We merged the resulting SNP datasets for each sample using a custom Perl script. To compile a high-confidence SNP dataset, we used the ‘VariantFiltration’ function in the Genome Analysis Toolkit105 (v.4.1.4.0) with the ‘–cluster-window-size 10 –cluster-size 3’ parameters. This dataset served as the basis for further evolutionary analysis. Finally, we annotated and predicted the effects of our identified SNPs using SnpEff106 (v.55.0), to ensure a comprehensive understanding of their potential impact. The same method was also applied to the population of 510 samples (Supplementary Table 17).
Pangenome construction
We performed all-versus-all CDS alignment in the pangenomes for 16 O. sativa accessions, 129 O. rufipogon accessions and a combined set of 145 wild–cultivated rice accessions (16 O. sativa and 129 O. rufipogon) using BLASTN (v.2.2.18). If a gene was aligned with at least 95% identity and at least 50% coverage, it was considered present in the corresponding genome. On the basis of their frequency, we classified genes into the following four categories: core (those present in all individuals), soft-core (those present in more than 90% of samples but not all), dispensable (those present in more than one but less than 90% of samples) and private (present in only one accession). To achieve a balanced comparison, we incorporated 113 non-redundant cultivated rice genomes from 3 previously published pangenomic datasets to match the size of the wild rice population (Supplementary Table 10). Gene annotation was performed uniformly across all samples using a consistent methodology. In the dataset comprising 129 O. rufipogon and 129 O. sativa accessions, genes present exclusively in wild rice and absent in all cultivated rice were defined as wild-rice-specific genes.
To construct three distinct pangenome graphs for our study, we applied the Minigraph-Cactus pipeline107 to the assembled genomes of 16 O. sativa, 129 O. rufipogon and a combined set of 145 accessions. The first step involved using minigraph108 (v.0.19-r551) to develop a primary pangenome graph, capturing the SVs within the input assemblies. Subsequently, these assemblies were remapped onto the primary graph using minigraph108. The mapping results were then used as the input for Cactus109 (v.2.2.1), which facilitated the generation of the final graphs. We defined the graph size as the total length of all nodes, and nodes that were not included in the reference genome (non-ref) were defined as novel sequences. To call variants for 142 accessions from our study (the remaining 6 samples lacked next-generation data) and 407 newly sequenced samples (33 O. sativa and 374 O. rufipogon or Oryza nivara) from another study24 (Supplementary Table 14), the Illumina short paired-end reads from each accession were mapped against the graph-based cultivated–wild pangenome using vg giraffe34 (v.1.43.0). The variations were then called using DeepVariants110 (v.1.6.1) with the NGS model, and all individual variants were merged using GLnexus111 (v. 1.4.1-0-g68e25e5).
Phylogenetic tree construction
Chloroplast genomes are very conserved across different species and are frequently used to construct phylogenetic evolutionary trees, which can be instrumental in studying species classification and understanding their evolutionary relationships112. To assemble chloroplast genomes of wild rice in our study, the HiFi reads were first aligned to the reference chloroplast genome of Nipponbare (Gene Bank ID: GU592207.1) using minimap2 (ref. 100) (v.2.21-r1071) with the ‘-x map-hifi’ parameters. Chloroplast-derived reads with higher than 70% coverage were then extracted using a custom Perl script. The final assembly of the chloroplast genome was then performed using hifiasm61 (v.0.16.0) with default parameters. Locally collinear blocks among 72 assembled chloroplast genomes, along with published chloroplast genomes of O. barthii (KF359904.1), Oryza glumipatula (NC_027461.1), O. longistaminata (NC_027462.1), O. meridionalis (OV049999.1), O. rufipogon (NC_017835.1) and Nipponbare, were identified for constructing multi-sequence alignments using HomBlocks113. The phylogenetic tree was then constructed using IQ-TREE96 (v.2.2.0.3) with 1,000 bootstraps. On the basis of the results, we re-identified three accessions of O. longistaminata and one of O. meridionalis.
We performed an all-versus-all comparison of the amino acid sequences of protein-coding genes using DIAMOND114 (v.2.0.15). These genes were from 149 genome assemblies and 31 assemblies (excluding the same species NIP and WSSM) from a cultivated rice pangenome16. The alignment results were then input into OrthoFinder115 (v.2.5.4) to find orthogroups and orthologues. Using 844 identified single-copy orthologues, we constructed a gene-based maximum-likelihood phylogenetic tree using IQ-TREE96 (v.2.2.0.3) with 1,000 bootstraps.
To determine the phylogenetic relationships of three populations, including wild rice and cultivated rice (Supplementary Tables 14, 16 and 17), we first converted the SNP VCF files into tfam format using PLINK116 (v.1.90b6.9 64-bit). After this, a kinship matrix was generated using EMMAX117 (v.beta-07Mar2010) with the ‘-v -h -s -d 10’ parameters. The neighbour-joining phylogenetic tree was then constructed using the PHYLIP package (https://phylipweb.github.io/phylip/) (v.3.66). For visualizing the resulting phylogenetic trees, the interactive tool iTOL118 was used.
Archetypal analysis
We first identified core SNP subsets of three populations, including wild rice and cultivated rice (Supplementary Tables 14, 16 and 17), each exhibiting a minor allele frequency of more than 0.05 and a missing rate of less than 0.8, using VCFTools119 (v.0.1.16). Further refinement was done using PLINK116 (v.1.07) to exclude SNPs with substantial LD (r² ≥ 0.5) in each sliding window (in windows of 100 SNPs within steps of 10 SNPs). Archetypal analysis120 of these SNP sets was performed with the parameters ‘-tolerance 0.0001 –max_iter 400’.
Calculations of π, F
ST, LD and DST
The nucleotide diversity (π) of each group and the fixation index (FST) between different groups were both estimated using VCFTools119 (v.0.1.16) with a window size of 100 kb and a step size of 10 kb. The genome-wide LD decay pattern for each group was calculated using PopLDdecay121 (v.3.42) and plotted using the Plot_MultiPop.pl script in the PopLDdecay package with parameters ‘-bin1 500 -bin2 7000 -break 5000’. DST was calculated using PLINK116 (v.1.07) with the ‘–genome’ and ‘–genome-full’ options. Heat plots of 1-DST matrices were made with the ggplot2 package in R (v.4.1.3).
Demographic history inference
We used MSMC2 (ref. 122) (v.2.1.4) to infer the population separation history. Our analysis began with the preparation of a negative mask file for the coding region of IRGSP-1.0 (MSU7.0) and a mappability mask file using seqbility (http://lh3lh3.users.sourceforge.net/snpable.shtml) (v.20091110) and makeMappabilityMask.py. The phased SNP sites with uniquely mapped reads and mean coverage depths greater than threefold were acquired using Longshot104 (v.0.4.1) and the high-quality regions of each genome were acquired using the filtered results of show-snps from MUMmer66 (v.4.0.0beta2). The MSMC2 input files were constructed by merging VCF and mask files using the ‘generate_multihetsep.py’ script. Because O. rufipogon naturally uses both cross-pollination and self-pollination, we followed an established approach of constructing pseudodiploids, which has been widely used in similar studies of inbreeding species such as Caenorhabditis123, Arabidopsis thaliana124, soybean125 and African wild rice126,127. We randomly selected four samples from each population and treated each sample as a single haplotype. We then paired chromosomes from haplotypes within the same population to construct pseudodiploids. The population split inference focused on 2 individuals (4 haplotypes) per group, calculating median population split times based on 50 random combinations for each comparative analysis. A mutation rate of 8.09 × 10−9 per site per generation128 and a generation time of one year were applied to estimate demographic history.
Introgression analysis
We used TreeMix129 (v.1.13) to infer population admixture graphs for major groups of O. rufipogon (Or-IIIa, Or-Ia and Or-Ib) and O. sativa (japonica, basmati, indica, intro-indica and aus) from East Asia and South Asia. Oryza officinalis (CC genome), O. longistaminata and O. meridionalis were set as outgroups to construct the phylogenetic tree. We systematically varied the number of migrations from 0 to 10, performing 10 iterations. For each migration event, TreeMix was executed by randomly sampling approximately 80% of the SNP loci using a random seed, applying the ‘-global -k 500’ parameters for global allele frequency estimation. The optimal number of migration edges (m = 4) was determined using the R package OptM130 (v.0.1.6).
To detect potential admixture events of the form (target; source 1, source 2), we performed an F3-admixture test using the qp3Pop program in ADMIXTOOLS131 (v.7.0.2). Under the null hypothesis that the target population is not a mixture of populations related to source 1 and source 2, the expected F3 statistic would yield a non-negative mean. A negative mean of the F3 statistic, on the other hand, would suggest admixture in the target population, with genetic contributions from groups related to source 1 and source 2. A z-score below −3 was considered indicative of significant admixture in population C.
Using a four-taxon model (((P1, P2), P3), PO), we calculated the D-statistic to perform the ABBA–BABA test, using the script calculate_abba_baba.r (https://github.com/palc/tutorials-1/tree/master/analysis_of_introgression_with_snp_data/src). With O. longistaminata designated as the outgroup, our analysis revealed a significantly positive D-statistic (P < 0.01), suggesting introgression between P3 and P2. To delve deeper into introgression segments between indica and aus from japonica, we computed the fd statistic across the genome in 100-kb sliding windows with a step size of 10 kb, using the script ABBABABAwindows.py from genomics_general toolkit (https://github.com/simonhmartin/genomics_general). The minimum number of SNPs per window was set to 250, and the minimum proportion of samples genotyped per site was set to 0.4. The fd < 0 values are converted to zero, and fd > 1 values are converted to 1. Finally, to assess the congruence of introgression regions between indica and aus from japonica, we catalogued the putative introgression segments within the top 10, 30 and 50 100-kb windows.
Pairwise differentiation comparison
To quantify the genetic similarities between the two groups, we performed a comprehensive analysis of all possible pairwise combinations of varieties. We focused on identifying ‘same windows’, defined as those with a similarity exceeding 99.99%. The similarity index for each 10-kb window was calculated using the following formula:
$${\rm{S}}{\rm{imilarity}}=(\mathrm{10,000}-{{\rm{Num}}}_{{\rm{diff}}}-{{\rm{Num}}}_{{\rm{nan}}})/\mathrm{10,000}.$$
Here, Numdiff is the number of differing SNPs, and Numnan is the number of sites with missing data within each window. This methodology was also applied to predict potential introgression fragments between two taxa. For this purpose, we selected a representative variety from each group and mapped out the count of different SNPs (that was termed as pairwise differentiation) across 10-kb non-overlapping windows within specified regions.
Identification of genomic selective sweep
To detect selective sweeps associated with artificial selection during domestication, we calculated πwild/πcultivated and FST using VCFtools119 (v.0.1.16) with a 100-kb sliding window and a 10-kb step. After this, we used BEDTools132 (v.2.30.0) with the parameter ‘-d 30000’ to merge overlap regions that were identified within the top 5% of two values. It is worth highlighting that our analysis was restricted to Or-IIIa and Or-Ib as representatives of the wild rice groups, given the extensive gene flow observed in Or-Ia from indica. The cultivated rice category encompassed indica, japonica, aus and basmati. Phylogenetic tree construction and archetypal analysis based on the SNPs within identified selective sweeps were in keeping with the above methods for whole genomes. To identify domPAVs, we performed a two-sided Fisher’s test comparing wild and cultivated rice133, considering PAVs with a false discovery rate (FDR)-adjusted P < 0.05 as significant.
To trace the origins of domestication regions in rice, we first identified 566,513 differentiated SNPs between 31 Or-IIIa accessions and 19 Or-Ib accessions, exhibiting an allele frequency greater than 0.8. We then assessed the major alleles, which have a frequency of 90% or higher, at these SNP sites within the Asian cultivated rice groups and Or-Ia. If the major allele matched that of Or-IIIa, the SNP was classified as originating from Or-IIIa; otherwise, it was classified as deriving from Or-Ib. The whole-genome distribution was visualized using RectChr (https://github.com/BGI-shenzhen/RectChr) (v.1.36).
Haplotype analysis and construction of phylogenetic trees of domestication genes
To construct phylogenetic trees of domestication genes, we selected 11 representative genes known for their roles in the early stages of rice domestication, as documented in published literature. Using GMAP89 (v.2021-05-27), we extracted gene region sequences or gene regions along with their specified upstream and downstream regions from the 280 rice accessions in this study. These sequences were then aligned using MAFFT95 (v.7.490), applying the parameter ‘–maxiterate 1000’ to optimize alignment accuracy. For phylogenetic tree construction, we extended model selection followed by tree inference with 1,000 bootstrap replications using IQ-TREE96 (v.2.2.0.3), and set O. officinalis (CC genome), O. longistaminata, O. meridionalis and O. glaberrima as outgroups. In the haplotype analysis phase, we filtered the intron sequences without candidate functional sites or QTNs for haplotype analysis and visualized haplotype networks using the R package geneHapR134 (v.1.1.9) after manually trimming the alignment sequences. Our analysis retained haplotypes with a frequency greater than two and those closely related to the major haplotype for clarity.
Identification of indica–japonica differentiated variations
We identified SNP sites with highly differentiated alleles between 90 indica cultivars and 36 japonica cultivars, requiring that they must have an allele frequency higher than 0.9 in both groups. We identified a total of 855,121 indica–japonica differentiated SNPs. To ascertain the ancestral origin of these SNPs, we analysed the states of major alleles (frequency ≥ 0.6) in their respective ancestral groups and divided them into six categories (Supplementary Table 20). We also applied the same standard to identify 13,853 indica–japonica differentiated PAVs between 26 indica cultivars and 13 japonica cultivars.
Distribution maps
The geographical records of all wild rice in the study were obtained by collecting field samples. Approximate latitude and longitude information on their distribution ranges was used for spatial mapping, and this can be found in Supplementary Table 2. The distribution map was generated using the open-source Python tool GeoPandas135 (v.0.14.4) (BSD-3-Clause licence), with base map layers derived from a public-domain Natural Earth dataset (https://www.naturalearthdata.com/).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.