Establishing Ae. tauschii genomic resources
Plant material
We compiled a database comprising 1,124 Ae. tauschii accessions with associated passport data in Supplementary Table 1 (Supplementary Note 1). Duplicate germplasm bank IDs were identified and passport data collated using the Plant Genetic Resources for Food and Agriculture (PGRFA) database (https://www.genesys-pgr.org/) or other sources as indicated in Supplementary Table 1. From this database, seed of 228 non-redundant accessions were obtained from the Open Wild Wheat Consortium Ae. tauschii Diversity Panel collection deposited at the Germplasm Resource Unit (GRU) of the John Innes Centre; 48 accessions from the Cereal Crop Wild Relatives (Triticeae) collection of the GRU; 19 accessions from the Designing Future Wheat (DFW) Wheat Academic Toolkit collection of the GRU that have been used as synthetic hexaploid wheat D genome donors; 223 accessions from the Wheat Genetics Resource Center (WGRC) of Kansas State University; 34 accessions from the Plant Gene Resources of Canada (PGRC); 84 accessions donated by the Institute of Botany, Plant Physiology and Genetics of the Tajikistan National Academy of Sciences; 20 accessions donated by the Azerbaijan National Academy of Sciences; and 37 accessions donated by Quaid-i-Azam University. Accession P-99.95-1.1 was obtained from the Deposited Published Research Material collection of the GRU.
We also resequenced and analysed 60 hexaploid wheat landraces. The list can be found in Supplementary Table 14. Out of the 60 wheat landraces, 57 were received from the International Maize and Wheat Improvement Center (CIMMYT) and 3 were from the International Center for Agricultural Research in the Dry Areas (ICARDA).
Re-sequencing of Ae. tauschii and hexaploid wheat accessions
In this study, we generated short-read WGS data for 350 Ae. tauschii accessions (Supplementary Table 1) and 59 hexaploid wheat accessions (Supplementary Table 14). We isolated DNA following the CTAB protocol described by Abrouk et al.47 from leaf tissue of 2-week-old seedlings under prior dark treatment for 48âh. DNA was quantified using the Qubit dsDNA HS Assay (Thermo Fisher Scientific) and purity was determined according to 260/280 and 260/230 absorbance ratios using a Nanodrop spectrophotometer. PCR-free paired-end libraries were constructed and sequenced on an Illumina Novaseq 6000 instrument, yielding a median 8.3-fold coverage per sample (ranging from 5.87- to 16.86-fold) for the Ae. tauschii samples and a minimum tenfold coverage for the bread wheat samples (Supplementary Tables 1 and 14). Library preparation and sequencing was performed as a service by Novogene.
Library construction and RNA sequencing
Seedlings of Ae. tauschii accessions TA10171, TA1675 and TA2576 were raised as 5â6 seeds per pot (6âÃâ6âÃâ10âcm) in a growth chamber at 22â24â°C under long-day photoperiods of 16/8âh day/night cycle with high-output white-fluorescent tubes until the third leaf stage (about 2â3 weeks old), and then transferred to a 4â°C growth chamber with a long-day photoperiod for vernalization. After a nine-week vernalization period, all the plants were moved back to the original growth chamber under the controlled conditions mentioned above. In total, 45 tissue samples were collected: From each of the three accessions, three biological replicates were taken from each of: young leaf, root, stem, flag leaf and inflorescence. Samples were collected at the same time of day at approximately 5â6âh after daylight. The seedling leaves and roots were harvested after two weeks of recovery in the original growth chamber and rinsed with water to remove soil particles. When the plants had 4â5 tillers, the stems, flag leaves and inflorescences were harvested together. The green inflorescences were collected immediately after pollination. The 5-cm-long stem sections and youngest flag leaves were measured from the top of the same inflorescences. Samples were placed in liquid nitrogen after harvest and stored at â80â°C.
The samples were ground into a fine powder in liquid nitrogen in a ceramic mortar and pestle to isolate RNA using the Qiagen RNeasy Mini Kit following the manufacturerâs protocol. The quality of RNA was determined on a 1% agarose gel, and RNA concentration was measured using a NanoPhotometer (Implen) at 260ânm and 280ânm. Sample collection time and relative details are listed in Supplementary Table 23. High-quality RNA samples were delivered for RNA integrity test, poly-A mRNA enrichment, library construction and PE100 sequencing using the Illumina NovaSeq system (Génome Québec, Canada).
PacBio HiFi genome sequencing; primary assembly of the Ae. tauschii genomes and CWI 86942
We selected 46 Ae. tauschii accessions, including 11 L1 accessions, 34 L2 accessions and 1 L3 accession. These 46 accessions were selected to span the geographical range of the species (Fig. 1a) and provide a collection of phenotypes related to disease and pest resistance, abiotic tolerance and agromorphological traits of strategic interest to the Open Wild Wheat Consortium for bread wheat improvement (Supplementary Table 4). We included a higher proportion of accessions from L2 relative to L1 based on reported phylogenies showing that L2 is more genetically diverse than L15,14,21. A single L3 accession was selected based on low genetic diversity observed among five non-redundant L3 accessions in the phylogeny reported by Gaurav et al.14. Several accessions were selected to maximize the genetic diversity based on a core subset sampling analysis using Core Hunter (v3)48, using the âaverage entry-to-nearest-entryâ distance measure, aiming to maximally represent the diversity of the panel of 242 non-redundant accessions published by Gaurav et al.14. The bread wheat landrace CWI 86942 was selected based on a high L3 k-mer content.
For the Ae. tauschii accessions, âhigh molecular weightâ genomic DNA was isolated from leaf tissue of three to four-week-old dark-treated seedlings. We followed the high molecular weight DNA isolation protocol optimized by Driguez et al.49 for long-read sequencing. DNA integrity was confirmed using the FemtoPulse system (Agilent). DNA was quantified using the Qubit dsDNA HS Assay (Thermo Fisher Scientific) and purity was determined according to 260/280 and 260/230 ratios using a Nanodrop spectrophotometer. For the bread wheat accession CWI 86942, leaves from two-week old seedlings were collected from two different plants and high molecular weight DNA extraction was performed as mentioned above49. All the library preparation and Circular Consensus Sequencing (CCS) was performed on a PacBio Sequel II instrument, as a service by Novogene.
For Ae. tauschii, HiFi reads were assembled using hifiasm (v0.16.1)50 with parameters â-l0 -u -f38â optimized for homozygous and large genomes (-l0 -f38) and to minimize misassemblies by disabling the post-join contigs step (-u), favouring accuracy over contiguity. Sequencing coverage ranged from 18 to 47-fold depending on the accession, except for the three Ae. tauschii lineage reference accessions (TA10171, TA1675 and TA2576) for which the coverage was increased to 67 – 97-fold. For assembly validation and quality control, we used QUAST (v5.0.2)51 to calculate the assembly metrics, Merqury (v1.3)52 to estimate the base-call accuracy and k-mer completeness based on 21-mer produced from the short-read WGS data14 and BUSCO (v5.3.1)17 with the embryophyta_odb10 database to determine the completeness of each genome assembly. The number of homozygous SNPs and short insertionâdeletion mutations (indels) was determined comparing the HiFi assemblies against the respective WGS data (Supplementary Table 24). They are in the range of 3,416â40,885 homozygous SNPs or indels per accession.
For CWI 86942, we performed the primary contig-level assembly with 484.33âGb of HiFi reads (32-fold coverage) using the LJA assembler (v0.2)53 with default parameters. Assembly metrics and QC were performed with QUAST (v5.0.2)51 and BUSCO (v5.3.1)17 with the embryophyta_odb10 database.
Chromosome conformation capture sequencing and chromosome-scale scaffolding
In situ Hi-C libraries were prepared for TA1675 and TA10171 from two-week-old Ae. tauschii plants according to the previously published protocol54. Libraries were quantified and sequenced (paired-end, 2 Ã111 cycles) using the Illumina NovaSeq 6000 device (Illumina) at IPK Gatersleben55, yielding 316 million paired-end reads (150âbp) for TA1675 and 215 million paired-end reads for TA10171.
For TA2576, two-week-old, dark-treated leaf tissue samples were harvested and cross-linked with formaldehyde for library preparation and Hi-C sequencing by Phase Genomics, yielding 543 million paired-end reads (150âbp). For CWI 86942, two Omni-C libraries were generated and sequenced from two-week-old, dark-treated leaf tissue samples as a service by Dovetail Genomics. The total yield was 715 million paired-end reads (150âbp).
Scaffolding into pseudomolecules for TA10171, TA1675, TA2576 and CWI 86942 was performed from their primary assemblies and their specific Hi-C and Omni-C data, respectively. Hi-C and Omni-C reads were processed with Juicer (v1.6)56 (for the Hi-C reads, parameter: -s DpnII) to convert raw fastq reads to chromatin contacts and remove duplicates. The chromatin contacts were used to scaffold the contig-level assemblies using 3D-DNA (v190716)57 (using run-asm-pipeline.sh with -r 0 parameter). Scaffolds were visualized, manually oriented and ordered using Juicebox (v2.20.00)58.
RagTag assembly of 43 Ae. tauschii accessions
The remaining 43 contig-level assemblies were scaffolded into chromosome-scale assemblies using RagTag (v2.1.0)59 and the three high-quality genomes (TA10171, TA1675 and TA2576) as anchors. In brief, the primary contig-level assemblies were scaffolded using RagTag scaffold against the respective chromosome-scale reference assemblies generated in this study. After running RagTag scaffold, the placed contigs had the exact same lengths as the primary contigs before running RagTag scaffold. Also, the number of gaps in each RagTag assembly corresponds to the number of placed contigs minus seven (number of chromosomes) (Supplementary Table 6). This indicates that RagTag scaffold did not introduce misassemblies or duplicated contig ends. The scaffolded assemblies were validated with dot-plots generated using MashMap (v3.0.6)60 against the corresponding reference assembly. While being great resources for gene discovery and comparative analyses, reference-guided assemblies are limited in their ability to study large structural rearrangements.
Repeat and gene annotation
Paired-end RNA-seq reads for TA10171, TA1675 and TA2576 were first cleaned using Trimmomatic (v0.40)61 with the following settings âILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:30 TRAILING:30 MINLEN:36â. Trimmed paired-end reads were aligned to the corresponding genome assembly using STAR (v2.7.10b)62 with the parameters â–twopassMode basic –outFilterMismatchNMax 5 –outFilterMatchNminOverLread 0.80 –alignMatesGapMax 100000 –outSAMstrandField intronMotif –runMode alignReadsâ and the results were filtered and sorted using SAMtools (v1.10)63. Then, the BRAKER (v3.0.3)64,65,66 pipeline was used to predict de novo gene models using RNA-Seq and protein data mode with the Viridiplantae protein models provided by OrthoDB (v11). Predicted gene annotations obtained from BRAKER were processed using a combination of NCBI BLAST+ (v2.9.0-2)67, AGAT (v1.2.1) (https://github.com/NBISweden/AGAT), InterProScan (v5.64-96.0)68,69, and R (v4.2.0). Outputs from BRAKER3 were first converted to gff3 and CDS and protein sequences were extracted using âagat_sp_extract_sequences.plâ from AGAT package. BLASTn was used to perform a reciprocal BLAST of the predicted CDS against themselves, and a unidirectional BLAST against the Ensembl nrTEplantsJune2020.fa repetitive elements database, using default search parameters. The putative functions for each annotated gene model were predicted using InterProScan with default parameters for the following databases: FunFam, SFLD, PANTHER, Gene3D, PRINTS, Coils, SUPERFAMILY, SMART, CDD, PIRSR, ProSitePatterns, AntiFam, Pfam, MobiDBLite, PIRSF, NCBIfam. R (v4.2.0) (in R studio) was used to visualize and filter these results. Predicted transcripts with fewer than 50 exact and fewer than 150 inexact self-BLAST results were retained. Predicted transcripts were retained from the final de novo annotation if there were (1) no exact matches to the transposable elements database, and (2) at least one domain predicted by any of: FunFam, PANTHER, Gene3D, SUPERFAMILY, ProSitePatterns, Pfam, CDD, InterPro. Predicted genes were considered as âlow confidenceâ if there were no exact matches to the database of original transcript predictions. The remaining annotated genes were considered as âhigh confidenceâ. Validation and annotation completeness was performed using agat_sp_statistics.pl and BUSCO (v5.4.7)17 running in transcriptome mode with the poales_odb10 database. We used OrthoFinder70 (v2.5.4) with default parameters to perform gene family analysis.
Repeat annotation was performed using RepeatMasker (v4.1.2-p1)71 and the Ensembl nrTEplantsJune2020 repetitive elements database72 using the RMBlast engine.
For bread wheat accession CWI 86942, gene model prediction was performed using a lifting approach similarly to the one described in Abrouk et al.73 with a combination of liftoff (v1.6.3)74, AGAT and gffread (v0.11.7)75. In brief, gene model annotations of hexaploid wheat line Chinese Spring, Kariega, Fielder, ArinaLrFor, Julius and Norin61 were independently transferred using liftoff (parameters: -a 0.9 -s 0.9 -copies -exclude_partial -polish) and all the output gff files were merged into a single file using the Perl script agat_sp_merge_annotations.pl. The merged file was then post-processed using gffread tools (parameters: –keep-genes -N -J) to retain transcripts with start and stop codons, and to discard transcripts with (1) premature stop codons, and/or (2) having introns with non-canonical splice sites. In total, 147,646 gene models were predicted for which the putative functional annotations were assigned using a protein comparison against the UniProt database (2021_03) using DIAMOND (v2.1.8)76 (parameters: -f 6 -k1 -e 1e-6). PFAM domain signatures and GO were assigned using InterproScan (v5.55-88.09)68,69. The BUSCO score showed a completeness of 99.2% (96.4% duplicated) with the poales_odb10 database17.
k-mer matrix generation, redundancy and diversity analyses
k-mer matrix generation
We developed an optimized k-mer matrix workflow to generate a presence/absence k-mer matrix for large diversity panels (Supplementary Note 2) (https://github.com/githubcbrc/KGWASMatrix). We counted k-mers (kââ=ââ51) in raw sequencing data for 350 accessions generated in this study, 306 accessions published by Gaurav et al.14, 275 accessions published by Zhou et al.15 and 24 accessions by Zhao et al.6. The 35 accessions with less than fivefold sequencing coverage were discarded to avoid affecting the k-mer count. k-mers with a count of one were discarded prior to generating the k-mer matrix. k-mers were retained in the k-mer matrix by a minimum occurrence of 6 across accessions and a maximum occurrence of (Nâââ6), where N is the total number of accessions.
Redundancy analysis
A redundancy analysis was performed using a subset of 100,000 random k-mers sampled from the k-mer matrix of 920 Ae. tauschii accessions. The complete matrix contained 10,078,115,665 k-mers. The pairwise comparisons between accessions were performed by computing the sum of the presenceâabsence values (0 and 1) per k-mer between 2 accessions of the matrix. To determine the divergence, we computed the total number of 1 present in the summed string, each one corresponding to a difference in the presence/absence of the k-mer in the 2 compared accessions. In the sum, the presence of a k-mer in two accessions would result in a 2 and the absence in both accessions in a 0. A threshold of 96% shared k-mers was used to call redundancy based on control lines determined by Gaurav et al.14 to be genetically redundant based on a SNP analysis.
Estimation of the genetic diversity in the 46 Ae. tauschii accessions selected for high-quality genome assemblies
We computed the k-mer accumulation across the 46 Ae. tauschii accessions by analysing their k-mer presence or absence in the k-mer matrix. First, we extracted a k-mer sub-matrix for the 46 accessions and removed k-mers that were absent from all accessions. The k-mer accumulation was computed by counting the number of k-mers present in the first accession, then adding new k-mers from the second accession (that is, not present in the previous accession) and sequentially adding new k-mers until accession 46. This computation was iterated 100 times using randomly shuffled sub-matrices, and the mean and standard deviation were calculated. The mean cumulative k-mer counts were fitted to a logarithmic function (yâ=âaâ+âbâÃâlog(x)] using the Python function optimize.curve_fit from SciPy library (v1.8.0)77. The fitted data were plotted using the Python seaborn library (v0.11.2)78 to visualize the k-mer-based accumulation curve. We calculated the k-mer frequency across the full panel of 920 Ae. tauschii accessions in comparison to the genetic diversity in the 46 accessions. The k-mers were divided into two groups: k-mers present and absent in the 46 accessions. We extracted k-mer sub-matrices for each group and computed the occurrence of the k-mers across the 920 accessions (Extended Data Fig. 1c). We plotted the square root transformation of the k-mer frequency using the Python seaborn library (v0.11.2)78.
SNP calling
Fastq raw reads were trimmed using Trimmomatic (v0.38)61 with the following settings âILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36â. Cleaned reads were mapped on the TA1675 assembly using BWA mem (v0.7.17)79 and sorted with SAMtools (v1.8)63. Variants were called using BCFtools mpileup (v1.9)63 with the setting â-C 60 -q 5 -Q 20â, and only SNPs were retained as variants. The filtering was performed using BCFtools, retaining only sites with a maximum depth of 40, a quality higher than 100 and an allele count higher than 1. For quality check, we counted the percentage of divergent sites using re-sequencing data from TA1675 against the chromosome-scale TA1675 reference assembly, revealing an error rate of 0.18%. For the assessment of assembly quality (Supplementary Table 24), homozygous indels were also retained (maximum number of raw reads supporting an indel (IDV)â=â3, maximum fraction of raw reads supporting an indel (IMF)â=â0.3, depth between 5 and 40, a quality higher than 30 and an allele count higher than one). We further computed the SNP density across the chromosomes and calculated allele frequency (Extended Data Fig. 8aâc and Supplementary Tables 25â27). In total, 957 Ae. tauschii and 59 wheat landraces accessions reached the quality threshold of a coverage higher than fivefold after trimming and were used for SNP calling. The SNP data set was used for phylogenetic, ancestry, and nucleotide diversity analyses.
The phylogenetic tree was built from the filtered SNPs using vcfkit (v0.1.6)80 with the UPGMA algorithm. Ancestry analysis was performed using the sNMF (Fast and Efficient Estimation of Individual Ancestry Coefficients) approach available in the LEA R package (v3.10.2)81. For each run, we performed 20 repetitions using the following parameters âalpha = 10, tolerance = 0.00001, iterations = 300â up to Kâ=â28. Supplementary Table 28 shows the minimum cross-entropy values for 20 sNMF runs across different values of K.
We estimated the nucleotide diversity (Ï) in the 493 non-redundant accession of Ae. tauschii using the filtered SNP calls against the TA1675 reference assembly. We calculated Ï over 10-kb windows of the genome using VCFtools (v0.1.16)82 (parameter –window-pi 10000).
Structural variant calling
We determined the structural variation across the 46 high-quality assemblies with reference to the chromosome-scale assembly of TA1675. Structural variants of >50âbp in length and up to 100âkb were called using the PacBio structural variant calling and analysis suite (pbsv) (v2.9.0) and following the pipeline described at (https://github.com/PacificBiosciences/pbsv). In brief, HiFi sequencing reads in bam format were aligned to the reference genome using pbmm2 aligner (v1.10.0) (https://github.com/PacificBiosciences/pbmm2). The bam file was indexed as CSI suitable for larger genomes. Signatures of structural variation were detected and structural variants were called per accession in vcf format, then concatenated into a single bed file per lineage.
The Ae. tauschii genomes facilitate gene discovery
k-mer-based genome-wide association in Ae. tauschii
We followed the k-mer GWAS (kGWAS) pipeline described by Gaurav et al.14 using the Python scripts available at (https://github.com/wheatgenetics/owwc/tree/master/kGWAS) and the phenotype data for stem rust and leaf rust available for this panel to specifically run the association mapping and plotting using default parameters. The association mapping analyses showing the effect of assembly quality in Ae. tauschii accession TA1662 were performed using previously published phenotype data for reaction to Pgt race QTHJC14. The kGWAS for leaf rust to identify Lr39 in Ae. tauschii accession TA1675 was performed using phenotype data for reaction to Pt race BBBDB (Supplementary Table 29).
SrTA1662 haplotype analysis
To identify the SrTA1662 locus in the contig-level assembly of Ae. tauschii accession TA1662, we performed a BLASTn (v2.12.0)67 search of the SrTA1662 gene sequence (GenBank ID MW526949.1) published by Gaurav et al.14. To identify the SR33 locus in the contig-level assembly of Ae. tauschii accession CPI 110799 (the original source of Sr33), we searched for the RGA1e (also known as Sr33) gene sequence (GenBank ID KF031291.1) published by Periyannan et al.24. RGA1e gene sequences were also searched against the contig-level assemblies of accessions AUS 18911 (KF031299.1) and AUS 18913 (KF031284.1)24. For all accessions, the genes were found within a single contig that located to the chromosome arm 1DS based on the scaffolding against the TA1675 reference assembly. In the four accessions, we performed BLASTn searches for additional Ae. tauschii resistance gene analogues (RGA1a-d, RGA2a-b, RGA3a) reported by Periyannan et al.24 (GenBank ID KF031285.1âKF031299.1). To confirm that this region is orthologous to the Mla locus in barley, we searched for the presence of the pumilio (Bpm) gene homologue and subtilisin-chymotrypsin inhibitor (CI) genes in gene-lifting annotations for AUS 18911, AUS 18913, CPI 110799 and TA1662. Bpm and CI genes were previously reported flanking Resistance Gene Homologues (RGH) of the Mla locus83. The gene-lifting annotations were generated using liftoff v1.6.174 with default parameters based on the TA1675 genome annotation.
Phylogenetic analysis of RGAs in Ae. tauschii
To provide further evidence for the homology of the SrTA1662 (SR66) and SR33 loci in Ae. tauschii and the Mla locus in barley, we performed phylogenetic analyses of RGA and RGH gene sequences. Clustal algorithm with default parameters was used for the DNA sequence multiple alignment. We used the unweighted UPGMA algorithm with bootstrap testing to support the tree topology with 5,000 replicates. The phylogenetic analyses were performed using MEGA (v11)84,85.
Leaf rust inoculations and association studies
The evaluation of resistance and susceptibility in 149 Ae. tauschii L2 accessions was conducted against the North American Pt race isolate BBBDB 1-186 using seedlings organized in cone racks. Every cone rack housed 98 cones, and each cone was sown with three seeds. The primary leaves of seedlings, aged 8â9 days, were subjected to inoculation by distributing 1âml of inoculum per cone rack, which consisted of 15âmg of spores in 1âml of Soltrol 170 (Chevron Phillips Chemical Company). The delivery to each plant was 0.05âmg of urediniospores. Post-inoculation, the phytotoxicity from the oil carrier, Soltrol 170, was mitigated by mildly fanning the leaves for 2âh under the illumination of 400-Watt HPS bulbs to expedite the evaporation of the carrier oil. The seedlings, once inoculated, were placed in mist chambers maintained at 22â°C, where 100% humidity was sustained using a domestic ultrasonic humidifier for a period of 16â18âh in the absence of light. Subsequently, the seedlings were transferred to a greenhouse with a 16-h day cycle, maintaining nocturnal and diurnal temperatures at 15â°C and 20â°C, respectively. The phenotypic assessment of disease was undertaken at 10 and 12 dpi using an infection type scoring range of 0 to 3â+â, as standardized by Long and Kolmer87, and depicted as the mean of three individual replicates per accession (Supplementary Table 29).
For use in GWAS, the qualitative scores were converted to a quantitative score by assigning numerical values to the infection types. This was achieved by the kGWAS pipeline (https://github.com/wheatgenetics/owwc/tree/master/kGWAS) that performs Stakman IT to numeric scale 1 conversion (RunAssociation_GLM.py with -st parameter).
Bi-parental mapping of LR39 and candidate gene identification
An Ae. tauschii bi-parental mapping population (nâ=â123) was generated by crossing the leaf rust resistant Ae. tauschii accession CPI 110672 (synonymous TA1675) with the leaf rust susceptible accession CPI 110717. The mapping population was segregating for a single dominant leaf rust resistance gene (Pâ=â0.606) when inoculated with the Australian Pt isolate 26-1,3 (PBI culture no 316) and phenotyped at the Plant Breeding Institute, Cobbitty30. Bulk segregant analysis of selected homozygous resistant and susceptible F2 progenies with the 90âK SNP array88 placed the leaf rust resistance locus to chromosome 2DS. The mapping population was further genotyped with markers derived from the 90âK iSelect SNP array, the TA1675 genomic sequence, and a marker closely linked to LR39 (Xgdm35) (Supplementary Table 30 and 31)28,89. Linkage analysis was performed using MapDisto (v2.0)90 with default parameters such as LOD (logarithm of the odds) threshold of 3.0, maximum recombination frequency of 0.3 and removal of loci with 10% missing data. Genetic distances were calculated using the Kosambi mapping function, and the map was created using MapChart (v2.32)91. Markers flanking LR39 were anchored to the TA1675 reference assembly. Annotated high-confidence genes at the delimited physical interval were screened for protein homology using BLASTp to identify diversity between TA1675 and AL8/78 (Supplementary Table 13). The conserved domains and critical residues of WTK and NLR were identified using the amino acid sequences in the NCBI Conserved Domain search (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) and InterPro (https://www.ebi.ac.uk/interpro/search/sequence/) databases. The polymorphic SNP corresponding to R457I in WTK was converted to a KASP marker diagnostic for Lr39 (Supplementary Table 31).
Virus-induced gene silencing
To develop candidate gene-specific VIGS probes, the predicted coding sequences of candidate genes were searched against the TA1675 transcriptome database using siRNA-Finder (siFi21) software (v1.2.3-0008)92. Based on the RNA interference (RNAi) design plot, regions predicted to have a higher number of efficient probes and fewer off-targets were used for designing silencing probes for the WTK (LrSi2:258âbp, LrSi6:254âbp and LrSi7:248âbp) and the NLR (LrSi3:234âbp and LrSi8:257âbp) candidate genes. The silencing probe sequences were verified for specificity using a BLAST search against the TA1675 reference assembly (<80% sequence identity for hits other than the target candidate). Designed probes were flanked by XbaI and ApaI and synthesized at GenScript Biotech followed by cloning into the BSMV-γ vector in an antisense direction. The resulting constructs were transformed into Agrobacterium tumefaciens strain GV3101. The Agrobacterium clones were grown overnight at 28â°C in lysogeny broth with appropriate antibiotics. Cells were collected by centrifugation at 3,500g for 10âmin, then re-suspended using infiltration buffer (10âmM MgCl2, 10âmM MES pH 6.5 with KOH buffer and 150âmM acetosyringone) and adjusted to an OD600 of 1.0 followed by incubation at 28â°C for 3âh. Equal volumes of BSMV-α and BSMV-β were mixed with respective BSMV-γ silencing probes or BSMV-γGFP and infiltrated into Nicotiana benthamiana leaves. Infiltrated leaves were collected 5 days after infiltration and homogenized with virus inoculation buffer (10âmM monopotassium phosphate containing 1% Celite (Thermo Fisher Scientific, 68855-54-9)). The homogenate containing viral particles was rub inoculated onto five to ten seedlings of TA1675 at the three leaf stage. After two weeks of recovery and viral symptoms appearing, the seedlings were inoculated with Pt isolate B9414. Prior to inoculating TA1675, Pt isolate B9414 was propagated on seedlings of the susceptible wheat cultivar Thatcher. Freshly collected urediniospores were suspended in Isopar L and sprayed onto plants using a high-pressure air sprayer. After inoculation, plants were placed in the dark overnight in an incubation box equipped with a humidifier and then transferred to a growth chamber with a 16/8âh day/night cycle, with 21â°C/18â°C growth conditions. Leaf rust phenotypes were recorded at 12 days after inoculation by scanning the leaves at 600âdots per inch on an Epson Perfection V850 Pro scanner. For leaf rust biomass quantification, DNA was extracted from Pt-inoculated leaves using the CTAB method47. DNA concentrations were measured using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific). A 20âµl qPCR reaction containing Power SYBR Green PCR Mix (Applied Biosystems 4367659), ~25âng of DNA, and primers specific to the Puccinia 28âS large subunit or the internal transcribed spacer region93 and Triticeae elongation factor-specific primers94 was run using the ABI QuantStudio 6 Flex Real-Time PCR machine. The 2âÎÎCT method was used to normalize rust gene amplification values relative to the Ae. tauschii elongation factor endogenous control.
PCR conditions
A 20âμl PCR containing 100âng of genomic DNA, 1X GoTaq Flexi green buffer, 1.5âmM MgCl2, 200âμM dNTP, 200ânM primers and 1 unit of Taq polymerase (M829B, Promega) was used for various fragment amplifications. Primer sequences are shown in Supplementary Table 31. A touchdown PCR protocol was used as follows: initial denaturation at 94â°C for 30âs; annealing at 65â°C for 30âs, decreasing by 1â°C per cycle; and extension at 72â°C for 60âs, followed by repeating these steps for 14 cycles. After enrichment, the program continued for 29 cycles as follows: 94â°C for 30âs, 58â°C for 30âs, and 72â°C for 60âs. PCR products of cleaved amplified polymorphic sequence (CAPS) markers were digested with appropriate restriction enzymes by following the manufacturerâs instructions. A 5âμl reaction (2.5âμl of KASP Master Mix (Low ROX KBS-1016-016), 0.07âμl of assay mix and 2.5âμl (25âng) of DNA) was used for KASP markers. PCR cycling was performed in an ABI QuantStudio 6 Flex Real-Time PCR machine as follows: preread at 30â°C for 60âs; hold stage at 94â°C for 15âmin; and then ten touchdown cycles (94â°C for 20âs; touchdown at 61â°C, decreasing by 0.6â°C per cycle for 60âs), followed by 29 additional cycles (94â°C for 20âs; 55â°C for 60âs). The plates were then read at 30â°C for endpoint fluorescent measurement.
Tracing lineage-specific Ae. tauschii haplotype blocks in the bread wheat genome
Missing link finder pipeline
We generated canonical 51-mers for each of the 82,293 genotyped wheat accessions from Sansaloni et al.35 using their respective DArTseq markers and Jellyfish (v 2.3.0)95. For each accession, the k-mers were sorted and stored as text files. From the k-mer matrix available from Gaurav et al.14, k-mers present only in Ae. tauschii L3 were extracted, sorted, and stored as text files. Pairwise comparisons of the sample-specific k-mers from the 82,293 wheat accessions and the L3-specific k-mers were performed using the comm bash command. Jaccard indices were computed with the following formula, where A is the set of k-mers from a single accession and L is the L3-specific k-mer set.
$$J(A,L)=\frac{A\cap L}{A\cup L}$$
The script is available on Github (https://github.com/emilecg/wheat_evolution).
Determining the extent of L3 in wheat lines using whole-genome re-sequencing data of 59 hexaploid wheat landraces
Raw reads were trimmed using Trimmomatic (v0.38)61 with the following settings âILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36â. KMC (v3.1.2)96 was used to generate 31-mer sets for the 59 resequenced wheat landraces (Supplementary Table 14). IBSpy (v0.4.6)13 was run with TA2576 as a reference and the bread wheat landraces as queries with a k-mer size of 31 and a window size of 50,000âbp as parameters. A variation score threshold of â¤150 was used to determine how many windows were in common between the L3 reference and the wheat landraces. IBSpy variation values of â¤150 were determined to be optimal to account for the relatively low intra-lineage variation present in L3 (Extended Data Fig. 9 and Supplementary Table 32). The percentage of matching 50-kb windows was used as a proxy to determine the extent of introgression in the landraces.
Differences in genes in the 135-Mb L3 introgression block on chromosome 1D
The protein sequences of genes annotated in the interval of the introgression on the Kariega genome and on the CWI 86942 genome were compared using DIAMOND and visualized with the Persephone genome browser. In case genes were annotated in both the genomes, their amino acid sequences were aligned using the NeedlemanâWunsch algorithm to determine the percentage of identity. The absence of genes in one of the two annotations was investigated manually with the BLAST algorithm integrated into Persephone. Annotated genes found to be part of transposable elements were excluded from the analysis.
Presence of the 135-Mb L3 haplotype block on chromosome 1D in wheat landraces
The presence of the 135-Mb L3 haplotype block was manually confirmed in 12 out of the 126 wheat landraces (Supplementary Table 16). CWI 86942 and another Georgian landrace (CWI 86929) had the largest block size (Extended Data Fig. 6d).
To further determine how widespread the presence of the chromosome 1D L3 segment was, we downloaded the IBSpy variation file from 1,035 hexaploid wheat accessions (827 landraces and 208 modern cultivars) and L3 line BW_01028 (https://opendata.earlham.ac.uk/wheat/under_license/toronto/WatSeq_2023-09-15_landrace_modern_Variation_Data/IBSpy_variations_10WheatGenomes/) against the Chinese Spring RefSeq v1.0 assembly38. We found an additional 20 wheat accessions that carry at least parts of this segment. We defined the start and end of the L3 segments in these 20 accessions by determining the difference between the variation value of BW_01028 (L3) and the corresponding variation value of the twenty accessions. If the difference was â¤150, we defined the accession to carry the L3 segment.
Bread wheat D genome subpopulations contribution
The approach used for a quantitative estimation of the contributions of the different subpopulations to the D genomes and the estimation of technical artifacts are described in Supplementary Note 3 and Supplementary Table 33. The manual curation process that allowed counting the minimal number of hybridizations required to explain the presence of different haplotypes is described in Supplementary Note 4.
Data visualization
We used the R package karyoploteR (v1.20.3)97 for the haplotype representation of the chromosomes in Figs. 3d,e and 4b and Extended Data Fig. 7. The remaining plots were produced with ggplot2 (v3.4.2)98 and the Python seaborn library (v0.11.2)78. Maps in Figs. 1a, 3c and 4a and Extended Data Fig. 6b were generated using QGIS (v3.32.3).
Germplasm availability
All the 60 wheat landraces analysed in this study listed in Supplementary Table 14 are available upon request from the CIMMYT (https://www.cimmyt.org/) and ICARDA (https://www.icarda.org/) gene banks. Seed of accessions from the Open Wild Wheat Consortium Ae. tauschii Diversity Panel collection, Cereal Crop Wild Relatives (Triticeae) collection, DFW Wheat Academic Toolkit collection and Deposited Published Research Material collection can be obtained from the Germplasm Resource Unit (GRU) of the John Innes Center; seed from accessions with WGRC bank ID as the only primary ID (Supplementary Table 1) can be obtained from the Wheat Genetics Resource Center (WGRC) of Kansas State University; 34 accessions can be obtained from the Plant Gene Resources of Canada (PGRC); 84 accessions donated by the Institute of Botany, Plant Physiology and Genetics of the Tajikistan National Academy of Sciences were deposited in the Wheat Genetics Resource Center (WGRC) as were 37 accessions donated by Quaid-i-Azam University; 20 accessions donated by the Azerbaijan National Academy of Sciences can be made available upon request.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.