Thursday, October 24, 2024
No menu items!
HomeNatureOrigin and evolution of the bread wheat D genome

Origin and evolution of the bread wheat D genome

Establishing Ae. tauschii genomic resources

Plant material

We compiled a database comprising 1,124 Ae. tauschii accessions with associated passport data in Supplementary Table 1 (Supplementary Note 1). Duplicate germplasm bank IDs were identified and passport data collated using the Plant Genetic Resources for Food and Agriculture (PGRFA) database (https://www.genesys-pgr.org/) or other sources as indicated in Supplementary Table 1. From this database, seed of 228 non-redundant accessions were obtained from the Open Wild Wheat Consortium Ae. tauschii Diversity Panel collection deposited at the Germplasm Resource Unit (GRU) of the John Innes Centre; 48 accessions from the Cereal Crop Wild Relatives (Triticeae) collection of the GRU; 19 accessions from the Designing Future Wheat (DFW) Wheat Academic Toolkit collection of the GRU that have been used as synthetic hexaploid wheat D genome donors; 223 accessions from the Wheat Genetics Resource Center (WGRC) of Kansas State University; 34 accessions from the Plant Gene Resources of Canada (PGRC); 84 accessions donated by the Institute of Botany, Plant Physiology and Genetics of the Tajikistan National Academy of Sciences; 20 accessions donated by the Azerbaijan National Academy of Sciences; and 37 accessions donated by Quaid-i-Azam University. Accession P-99.95-1.1 was obtained from the Deposited Published Research Material collection of the GRU.

We also resequenced and analysed 60 hexaploid wheat landraces. The list can be found in Supplementary Table 14. Out of the 60 wheat landraces, 57 were received from the International Maize and Wheat Improvement Center (CIMMYT) and 3 were from the International Center for Agricultural Research in the Dry Areas (ICARDA).

Re-sequencing of Ae. tauschii and hexaploid wheat accessions

In this study, we generated short-read WGS data for 350 Ae. tauschii accessions (Supplementary Table 1) and 59 hexaploid wheat accessions (Supplementary Table 14). We isolated DNA following the CTAB protocol described by Abrouk et al.47 from leaf tissue of 2-week-old seedlings under prior dark treatment for 48 h. DNA was quantified using the Qubit dsDNA HS Assay (Thermo Fisher Scientific) and purity was determined according to 260/280 and 260/230 absorbance ratios using a Nanodrop spectrophotometer. PCR-free paired-end libraries were constructed and sequenced on an Illumina Novaseq 6000 instrument, yielding a median 8.3-fold coverage per sample (ranging from 5.87- to 16.86-fold) for the Ae. tauschii samples and a minimum tenfold coverage for the bread wheat samples (Supplementary Tables 1 and 14). Library preparation and sequencing was performed as a service by Novogene.

Library construction and RNA sequencing

Seedlings of Ae. tauschii accessions TA10171, TA1675 and TA2576 were raised as 5–6 seeds per pot (6 × 6 × 10 cm) in a growth chamber at 22–24 °C under long-day photoperiods of 16/8 h day/night cycle with high-output white-fluorescent tubes until the third leaf stage (about 2–3 weeks old), and then transferred to a 4 °C growth chamber with a long-day photoperiod for vernalization. After a nine-week vernalization period, all the plants were moved back to the original growth chamber under the controlled conditions mentioned above. In total, 45 tissue samples were collected: From each of the three accessions, three biological replicates were taken from each of: young leaf, root, stem, flag leaf and inflorescence. Samples were collected at the same time of day at approximately 5–6 h after daylight. The seedling leaves and roots were harvested after two weeks of recovery in the original growth chamber and rinsed with water to remove soil particles. When the plants had 4–5 tillers, the stems, flag leaves and inflorescences were harvested together. The green inflorescences were collected immediately after pollination. The 5-cm-long stem sections and youngest flag leaves were measured from the top of the same inflorescences. Samples were placed in liquid nitrogen after harvest and stored at −80 °C.

The samples were ground into a fine powder in liquid nitrogen in a ceramic mortar and pestle to isolate RNA using the Qiagen RNeasy Mini Kit following the manufacturer’s protocol. The quality of RNA was determined on a 1% agarose gel, and RNA concentration was measured using a NanoPhotometer (Implen) at 260 nm and 280 nm. Sample collection time and relative details are listed in Supplementary Table 23. High-quality RNA samples were delivered for RNA integrity test, poly-A mRNA enrichment, library construction and PE100 sequencing using the Illumina NovaSeq system (Génome Québec, Canada).

PacBio HiFi genome sequencing; primary assembly of the Ae. tauschii genomes and CWI 86942

We selected 46 Ae. tauschii accessions, including 11 L1 accessions, 34 L2 accessions and 1 L3 accession. These 46 accessions were selected to span the geographical range of the species (Fig. 1a) and provide a collection of phenotypes related to disease and pest resistance, abiotic tolerance and agromorphological traits of strategic interest to the Open Wild Wheat Consortium for bread wheat improvement (Supplementary Table 4). We included a higher proportion of accessions from L2 relative to L1 based on reported phylogenies showing that L2 is more genetically diverse than L15,14,21. A single L3 accession was selected based on low genetic diversity observed among five non-redundant L3 accessions in the phylogeny reported by Gaurav et al.14. Several accessions were selected to maximize the genetic diversity based on a core subset sampling analysis using Core Hunter (v3)48, using the ‘average entry-to-nearest-entry’ distance measure, aiming to maximally represent the diversity of the panel of 242 non-redundant accessions published by Gaurav et al.14. The bread wheat landrace CWI 86942 was selected based on a high L3 k-mer content.

For the Ae. tauschii accessions, ‘high molecular weight’ genomic DNA was isolated from leaf tissue of three to four-week-old dark-treated seedlings. We followed the high molecular weight DNA isolation protocol optimized by Driguez et al.49 for long-read sequencing. DNA integrity was confirmed using the FemtoPulse system (Agilent). DNA was quantified using the Qubit dsDNA HS Assay (Thermo Fisher Scientific) and purity was determined according to 260/280 and 260/230 ratios using a Nanodrop spectrophotometer. For the bread wheat accession CWI 86942, leaves from two-week old seedlings were collected from two different plants and high molecular weight DNA extraction was performed as mentioned above49. All the library preparation and Circular Consensus Sequencing (CCS) was performed on a PacBio Sequel II instrument, as a service by Novogene.

For Ae. tauschii, HiFi reads were assembled using hifiasm (v0.16.1)50 with parameters “-l0 -u -f38” optimized for homozygous and large genomes (-l0 -f38) and to minimize misassemblies by disabling the post-join contigs step (-u), favouring accuracy over contiguity. Sequencing coverage ranged from 18 to 47-fold depending on the accession, except for the three Ae. tauschii lineage reference accessions (TA10171, TA1675 and TA2576) for which the coverage was increased to 67 – 97-fold. For assembly validation and quality control, we used QUAST (v5.0.2)51 to calculate the assembly metrics, Merqury (v1.3)52 to estimate the base-call accuracy and k-mer completeness based on 21-mer produced from the short-read WGS data14 and BUSCO (v5.3.1)17 with the embryophyta_odb10 database to determine the completeness of each genome assembly. The number of homozygous SNPs and short insertion–deletion mutations (indels) was determined comparing the HiFi assemblies against the respective WGS data (Supplementary Table 24). They are in the range of 3,416–40,885 homozygous SNPs or indels per accession.

For CWI 86942, we performed the primary contig-level assembly with 484.33 Gb of HiFi reads (32-fold coverage) using the LJA assembler (v0.2)53 with default parameters. Assembly metrics and QC were performed with QUAST (v5.0.2)51 and BUSCO (v5.3.1)17 with the embryophyta_odb10 database.

Chromosome conformation capture sequencing and chromosome-scale scaffolding

In situ Hi-C libraries were prepared for TA1675 and TA10171 from two-week-old Ae. tauschii plants according to the previously published protocol54. Libraries were quantified and sequenced (paired-end, 2 ×111 cycles) using the Illumina NovaSeq 6000 device (Illumina) at IPK Gatersleben55, yielding 316 million paired-end reads (150 bp) for TA1675 and 215 million paired-end reads for TA10171.

For TA2576, two-week-old, dark-treated leaf tissue samples were harvested and cross-linked with formaldehyde for library preparation and Hi-C sequencing by Phase Genomics, yielding 543 million paired-end reads (150 bp). For CWI 86942, two Omni-C libraries were generated and sequenced from two-week-old, dark-treated leaf tissue samples as a service by Dovetail Genomics. The total yield was 715 million paired-end reads (150 bp).

Scaffolding into pseudomolecules for TA10171, TA1675, TA2576 and CWI 86942 was performed from their primary assemblies and their specific Hi-C and Omni-C data, respectively. Hi-C and Omni-C reads were processed with Juicer (v1.6)56 (for the Hi-C reads, parameter: -s DpnII) to convert raw fastq reads to chromatin contacts and remove duplicates. The chromatin contacts were used to scaffold the contig-level assemblies using 3D-DNA (v190716)57 (using run-asm-pipeline.sh with -r 0 parameter). Scaffolds were visualized, manually oriented and ordered using Juicebox (v2.20.00)58.

RagTag assembly of 43 Ae. tauschii accessions

The remaining 43 contig-level assemblies were scaffolded into chromosome-scale assemblies using RagTag (v2.1.0)59 and the three high-quality genomes (TA10171, TA1675 and TA2576) as anchors. In brief, the primary contig-level assemblies were scaffolded using RagTag scaffold against the respective chromosome-scale reference assemblies generated in this study. After running RagTag scaffold, the placed contigs had the exact same lengths as the primary contigs before running RagTag scaffold. Also, the number of gaps in each RagTag assembly corresponds to the number of placed contigs minus seven (number of chromosomes) (Supplementary Table 6). This indicates that RagTag scaffold did not introduce misassemblies or duplicated contig ends. The scaffolded assemblies were validated with dot-plots generated using MashMap (v3.0.6)60 against the corresponding reference assembly. While being great resources for gene discovery and comparative analyses, reference-guided assemblies are limited in their ability to study large structural rearrangements.

Repeat and gene annotation

Paired-end RNA-seq reads for TA10171, TA1675 and TA2576 were first cleaned using Trimmomatic (v0.40)61 with the following settings “ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:30 TRAILING:30 MINLEN:36”. Trimmed paired-end reads were aligned to the corresponding genome assembly using STAR (v2.7.10b)62 with the parameters “–twopassMode basic –outFilterMismatchNMax 5 –outFilterMatchNminOverLread 0.80 –alignMatesGapMax 100000 –outSAMstrandField intronMotif –runMode alignReads” and the results were filtered and sorted using SAMtools (v1.10)63. Then, the BRAKER (v3.0.3)64,65,66 pipeline was used to predict de novo gene models using RNA-Seq and protein data mode with the Viridiplantae protein models provided by OrthoDB (v11). Predicted gene annotations obtained from BRAKER were processed using a combination of NCBI BLAST+ (v2.9.0-2)67, AGAT (v1.2.1) (https://github.com/NBISweden/AGAT), InterProScan (v5.64-96.0)68,69, and R (v4.2.0). Outputs from BRAKER3 were first converted to gff3 and CDS and protein sequences were extracted using “agat_sp_extract_sequences.pl” from AGAT package. BLASTn was used to perform a reciprocal BLAST of the predicted CDS against themselves, and a unidirectional BLAST against the Ensembl nrTEplantsJune2020.fa repetitive elements database, using default search parameters. The putative functions for each annotated gene model were predicted using InterProScan with default parameters for the following databases: FunFam, SFLD, PANTHER, Gene3D, PRINTS, Coils, SUPERFAMILY, SMART, CDD, PIRSR, ProSitePatterns, AntiFam, Pfam, MobiDBLite, PIRSF, NCBIfam. R (v4.2.0) (in R studio) was used to visualize and filter these results. Predicted transcripts with fewer than 50 exact and fewer than 150 inexact self-BLAST results were retained. Predicted transcripts were retained from the final de novo annotation if there were (1) no exact matches to the transposable elements database, and (2) at least one domain predicted by any of: FunFam, PANTHER, Gene3D, SUPERFAMILY, ProSitePatterns, Pfam, CDD, InterPro. Predicted genes were considered as ‘low confidence’ if there were no exact matches to the database of original transcript predictions. The remaining annotated genes were considered as ‘high confidence’. Validation and annotation completeness was performed using agat_sp_statistics.pl and BUSCO (v5.4.7)17 running in transcriptome mode with the poales_odb10 database. We used OrthoFinder70 (v2.5.4) with default parameters to perform gene family analysis.

Repeat annotation was performed using RepeatMasker (v4.1.2-p1)71 and the Ensembl nrTEplantsJune2020 repetitive elements database72 using the RMBlast engine.

For bread wheat accession CWI 86942, gene model prediction was performed using a lifting approach similarly to the one described in Abrouk et al.73 with a combination of liftoff (v1.6.3)74, AGAT and gffread (v0.11.7)75. In brief, gene model annotations of hexaploid wheat line Chinese Spring, Kariega, Fielder, ArinaLrFor, Julius and Norin61 were independently transferred using liftoff (parameters: -a 0.9 -s 0.9 -copies -exclude_partial -polish) and all the output gff files were merged into a single file using the Perl script agat_sp_merge_annotations.pl. The merged file was then post-processed using gffread tools (parameters: –keep-genes -N -J) to retain transcripts with start and stop codons, and to discard transcripts with (1) premature stop codons, and/or (2) having introns with non-canonical splice sites. In total, 147,646 gene models were predicted for which the putative functional annotations were assigned using a protein comparison against the UniProt database (2021_03) using DIAMOND (v2.1.8)76 (parameters: -f 6 -k1 -e 1e-6). PFAM domain signatures and GO were assigned using InterproScan (v5.55-88.09)68,69. The BUSCO score showed a completeness of 99.2% (96.4% duplicated) with the poales_odb10 database17.

k-mer matrix generation, redundancy and diversity analyses

k-mer matrix generation

We developed an optimized k-mer matrix workflow to generate a presence/absence k-mer matrix for large diversity panels (Supplementary Note 2) (https://github.com/githubcbrc/KGWASMatrix). We counted k-mers (k  =  51) in raw sequencing data for 350 accessions generated in this study, 306 accessions published by Gaurav et al.14, 275 accessions published by Zhou et al.15 and 24 accessions by Zhao et al.6. The 35 accessions with less than fivefold sequencing coverage were discarded to avoid affecting the k-mer count. k-mers with a count of one were discarded prior to generating the k-mer matrix. k-mers were retained in the k-mer matrix by a minimum occurrence of 6 across accessions and a maximum occurrence of (N − 6), where N is the total number of accessions.

Redundancy analysis

A redundancy analysis was performed using a subset of 100,000 random k-mers sampled from the k-mer matrix of 920 Ae. tauschii accessions. The complete matrix contained 10,078,115,665 k-mers. The pairwise comparisons between accessions were performed by computing the sum of the presence–absence values (0 and 1) per k-mer between 2 accessions of the matrix. To determine the divergence, we computed the total number of 1 present in the summed string, each one corresponding to a difference in the presence/absence of the k-mer in the 2 compared accessions. In the sum, the presence of a k-mer in two accessions would result in a 2 and the absence in both accessions in a 0. A threshold of 96% shared k-mers was used to call redundancy based on control lines determined by Gaurav et al.14 to be genetically redundant based on a SNP analysis.

Estimation of the genetic diversity in the 46 Ae. tauschii accessions selected for high-quality genome assemblies

We computed the k-mer accumulation across the 46 Ae. tauschii accessions by analysing their k-mer presence or absence in the k-mer matrix. First, we extracted a k-mer sub-matrix for the 46 accessions and removed k-mers that were absent from all accessions. The k-mer accumulation was computed by counting the number of k-mers present in the first accession, then adding new k-mers from the second accession (that is, not present in the previous accession) and sequentially adding new k-mers until accession 46. This computation was iterated 100 times using randomly shuffled sub-matrices, and the mean and standard deviation were calculated. The mean cumulative k-mer counts were fitted to a logarithmic function (y = a + b × log(x)] using the Python function optimize.curve_fit from SciPy library (v1.8.0)77. The fitted data were plotted using the Python seaborn library (v0.11.2)78 to visualize the k-mer-based accumulation curve. We calculated the k-mer frequency across the full panel of 920 Ae. tauschii accessions in comparison to the genetic diversity in the 46 accessions. The k-mers were divided into two groups: k-mers present and absent in the 46 accessions. We extracted k-mer sub-matrices for each group and computed the occurrence of the k-mers across the 920 accessions (Extended Data Fig. 1c). We plotted the square root transformation of the k-mer frequency using the Python seaborn library (v0.11.2)78.

SNP calling

Fastq raw reads were trimmed using Trimmomatic (v0.38)61 with the following settings “ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36”. Cleaned reads were mapped on the TA1675 assembly using BWA mem (v0.7.17)79 and sorted with SAMtools (v1.8)63. Variants were called using BCFtools mpileup (v1.9)63 with the setting “-C 60 -q 5 -Q 20”, and only SNPs were retained as variants. The filtering was performed using BCFtools, retaining only sites with a maximum depth of 40, a quality higher than 100 and an allele count higher than 1. For quality check, we counted the percentage of divergent sites using re-sequencing data from TA1675 against the chromosome-scale TA1675 reference assembly, revealing an error rate of 0.18%. For the assessment of assembly quality (Supplementary Table 24), homozygous indels were also retained (maximum number of raw reads supporting an indel (IDV) = 3, maximum fraction of raw reads supporting an indel (IMF) = 0.3, depth between 5 and 40, a quality higher than 30 and an allele count higher than one). We further computed the SNP density across the chromosomes and calculated allele frequency (Extended Data Fig. 8a–c and Supplementary Tables 25–27). In total, 957 Ae. tauschii and 59 wheat landraces accessions reached the quality threshold of a coverage higher than fivefold after trimming and were used for SNP calling. The SNP data set was used for phylogenetic, ancestry, and nucleotide diversity analyses.

The phylogenetic tree was built from the filtered SNPs using vcfkit (v0.1.6)80 with the UPGMA algorithm. Ancestry analysis was performed using the sNMF (Fast and Efficient Estimation of Individual Ancestry Coefficients) approach available in the LEA R package (v3.10.2)81. For each run, we performed 20 repetitions using the following parameters “alpha = 10, tolerance = 0.00001, iterations = 300” up to K = 28. Supplementary Table 28 shows the minimum cross-entropy values for 20 sNMF runs across different values of K.

We estimated the nucleotide diversity (π) in the 493 non-redundant accession of Ae. tauschii using the filtered SNP calls against the TA1675 reference assembly. We calculated π over 10-kb windows of the genome using VCFtools (v0.1.16)82 (parameter –window-pi 10000).

Structural variant calling

We determined the structural variation across the 46 high-quality assemblies with reference to the chromosome-scale assembly of TA1675. Structural variants of >50 bp in length and up to 100 kb were called using the PacBio structural variant calling and analysis suite (pbsv) (v2.9.0) and following the pipeline described at (https://github.com/PacificBiosciences/pbsv). In brief, HiFi sequencing reads in bam format were aligned to the reference genome using pbmm2 aligner (v1.10.0) (https://github.com/PacificBiosciences/pbmm2). The bam file was indexed as CSI suitable for larger genomes. Signatures of structural variation were detected and structural variants were called per accession in vcf format, then concatenated into a single bed file per lineage.

The Ae. tauschii genomes facilitate gene discovery

k-mer-based genome-wide association in Ae. tauschii

We followed the k-mer GWAS (kGWAS) pipeline described by Gaurav et al.14 using the Python scripts available at (https://github.com/wheatgenetics/owwc/tree/master/kGWAS) and the phenotype data for stem rust and leaf rust available for this panel to specifically run the association mapping and plotting using default parameters. The association mapping analyses showing the effect of assembly quality in Ae. tauschii accession TA1662 were performed using previously published phenotype data for reaction to Pgt race QTHJC14. The kGWAS for leaf rust to identify Lr39 in Ae. tauschii accession TA1675 was performed using phenotype data for reaction to Pt race BBBDB (Supplementary Table 29).

SrTA1662 haplotype analysis

To identify the SrTA1662 locus in the contig-level assembly of Ae. tauschii accession TA1662, we performed a BLASTn (v2.12.0)67 search of the SrTA1662 gene sequence (GenBank ID MW526949.1) published by Gaurav et al.14. To identify the SR33 locus in the contig-level assembly of Ae. tauschii accession CPI 110799 (the original source of Sr33), we searched for the RGA1e (also known as Sr33) gene sequence (GenBank ID KF031291.1) published by Periyannan et al.24. RGA1e gene sequences were also searched against the contig-level assemblies of accessions AUS 18911 (KF031299.1) and AUS 18913 (KF031284.1)24. For all accessions, the genes were found within a single contig that located to the chromosome arm 1DS based on the scaffolding against the TA1675 reference assembly. In the four accessions, we performed BLASTn searches for additional Ae. tauschii resistance gene analogues (RGA1a-d, RGA2a-b, RGA3a) reported by Periyannan et al.24 (GenBank ID KF031285.1–KF031299.1). To confirm that this region is orthologous to the Mla locus in barley, we searched for the presence of the pumilio (Bpm) gene homologue and subtilisin-chymotrypsin inhibitor (CI) genes in gene-lifting annotations for AUS 18911, AUS 18913, CPI 110799 and TA1662. Bpm and CI genes were previously reported flanking Resistance Gene Homologues (RGH) of the Mla locus83. The gene-lifting annotations were generated using liftoff v1.6.174 with default parameters based on the TA1675 genome annotation.

Phylogenetic analysis of RGAs in Ae. tauschii

To provide further evidence for the homology of the SrTA1662 (SR66) and SR33 loci in Ae. tauschii and the Mla locus in barley, we performed phylogenetic analyses of RGA and RGH gene sequences. Clustal algorithm with default parameters was used for the DNA sequence multiple alignment. We used the unweighted UPGMA algorithm with bootstrap testing to support the tree topology with 5,000 replicates. The phylogenetic analyses were performed using MEGA (v11)84,85.

Leaf rust inoculations and association studies

The evaluation of resistance and susceptibility in 149 Ae. tauschii L2 accessions was conducted against the North American Pt race isolate BBBDB 1-186 using seedlings organized in cone racks. Every cone rack housed 98 cones, and each cone was sown with three seeds. The primary leaves of seedlings, aged 8–9 days, were subjected to inoculation by distributing 1 ml of inoculum per cone rack, which consisted of 15 mg of spores in 1 ml of Soltrol 170 (Chevron Phillips Chemical Company). The delivery to each plant was 0.05 mg of urediniospores. Post-inoculation, the phytotoxicity from the oil carrier, Soltrol 170, was mitigated by mildly fanning the leaves for 2 h under the illumination of 400-Watt HPS bulbs to expedite the evaporation of the carrier oil. The seedlings, once inoculated, were placed in mist chambers maintained at 22 °C, where 100% humidity was sustained using a domestic ultrasonic humidifier for a period of 16–18 h in the absence of light. Subsequently, the seedlings were transferred to a greenhouse with a 16-h day cycle, maintaining nocturnal and diurnal temperatures at 15 °C and 20 °C, respectively. The phenotypic assessment of disease was undertaken at 10 and 12 dpi using an infection type scoring range of 0 to 3 + , as standardized by Long and Kolmer87, and depicted as the mean of three individual replicates per accession (Supplementary Table 29).

For use in GWAS, the qualitative scores were converted to a quantitative score by assigning numerical values to the infection types. This was achieved by the kGWAS pipeline (https://github.com/wheatgenetics/owwc/tree/master/kGWAS) that performs Stakman IT to numeric scale 1 conversion (RunAssociation_GLM.py with -st parameter).

Bi-parental mapping of LR39 and candidate gene identification

An Ae. tauschii bi-parental mapping population (n = 123) was generated by crossing the leaf rust resistant Ae. tauschii accession CPI 110672 (synonymous TA1675) with the leaf rust susceptible accession CPI 110717. The mapping population was segregating for a single dominant leaf rust resistance gene (P = 0.606) when inoculated with the Australian Pt isolate 26-1,3 (PBI culture no 316) and phenotyped at the Plant Breeding Institute, Cobbitty30. Bulk segregant analysis of selected homozygous resistant and susceptible F2 progenies with the 90 K SNP array88 placed the leaf rust resistance locus to chromosome 2DS. The mapping population was further genotyped with markers derived from the 90 K iSelect SNP array, the TA1675 genomic sequence, and a marker closely linked to LR39 (Xgdm35) (Supplementary Table 30 and 31)28,89. Linkage analysis was performed using MapDisto (v2.0)90 with default parameters such as LOD (logarithm of the odds) threshold of 3.0, maximum recombination frequency of 0.3 and removal of loci with 10% missing data. Genetic distances were calculated using the Kosambi mapping function, and the map was created using MapChart (v2.32)91. Markers flanking LR39 were anchored to the TA1675 reference assembly. Annotated high-confidence genes at the delimited physical interval were screened for protein homology using BLASTp to identify diversity between TA1675 and AL8/78 (Supplementary Table 13). The conserved domains and critical residues of WTK and NLR were identified using the amino acid sequences in the NCBI Conserved Domain search (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) and InterPro (https://www.ebi.ac.uk/interpro/search/sequence/) databases. The polymorphic SNP corresponding to R457I in WTK was converted to a KASP marker diagnostic for Lr39 (Supplementary Table 31).

Virus-induced gene silencing

To develop candidate gene-specific VIGS probes, the predicted coding sequences of candidate genes were searched against the TA1675 transcriptome database using siRNA-Finder (siFi21) software (v1.2.3-0008)92. Based on the RNA interference (RNAi) design plot, regions predicted to have a higher number of efficient probes and fewer off-targets were used for designing silencing probes for the WTK (LrSi2:258 bp, LrSi6:254 bp and LrSi7:248 bp) and the NLR (LrSi3:234 bp and LrSi8:257 bp) candidate genes. The silencing probe sequences were verified for specificity using a BLAST search against the TA1675 reference assembly (<80% sequence identity for hits other than the target candidate). Designed probes were flanked by XbaI and ApaI and synthesized at GenScript Biotech followed by cloning into the BSMV-γ vector in an antisense direction. The resulting constructs were transformed into Agrobacterium tumefaciens strain GV3101. The Agrobacterium clones were grown overnight at 28 °C in lysogeny broth with appropriate antibiotics. Cells were collected by centrifugation at 3,500g for 10 min, then re-suspended using infiltration buffer (10 mM MgCl2, 10 mM MES pH 6.5 with KOH buffer and 150 mM acetosyringone) and adjusted to an OD600 of 1.0 followed by incubation at 28 °C for 3 h. Equal volumes of BSMV-α and BSMV-β were mixed with respective BSMV-γ silencing probes or BSMV-γGFP and infiltrated into Nicotiana benthamiana leaves. Infiltrated leaves were collected 5 days after infiltration and homogenized with virus inoculation buffer (10 mM monopotassium phosphate containing 1% Celite (Thermo Fisher Scientific, 68855-54-9)). The homogenate containing viral particles was rub inoculated onto five to ten seedlings of TA1675 at the three leaf stage. After two weeks of recovery and viral symptoms appearing, the seedlings were inoculated with Pt isolate B9414. Prior to inoculating TA1675, Pt isolate B9414 was propagated on seedlings of the susceptible wheat cultivar Thatcher. Freshly collected urediniospores were suspended in Isopar L and sprayed onto plants using a high-pressure air sprayer. After inoculation, plants were placed in the dark overnight in an incubation box equipped with a humidifier and then transferred to a growth chamber with a 16/8 h day/night cycle, with 21 °C/18 °C growth conditions. Leaf rust phenotypes were recorded at 12 days after inoculation by scanning the leaves at 600 dots per inch on an Epson Perfection V850 Pro scanner. For leaf rust biomass quantification, DNA was extracted from Pt-inoculated leaves using the CTAB method47. DNA concentrations were measured using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific). A 20 µl qPCR reaction containing Power SYBR Green PCR Mix (Applied Biosystems 4367659), ~25 ng of DNA, and primers specific to the Puccinia 28 S large subunit or the internal transcribed spacer region93 and Triticeae elongation factor-specific primers94 was run using the ABI QuantStudio 6 Flex Real-Time PCR machine. The 2−ΔΔCT method was used to normalize rust gene amplification values relative to the Ae. tauschii elongation factor endogenous control.

PCR conditions

A 20 μl PCR containing 100 ng of genomic DNA, 1X GoTaq Flexi green buffer, 1.5 mM MgCl2, 200 μM dNTP, 200 nM primers and 1 unit of Taq polymerase (M829B, Promega) was used for various fragment amplifications. Primer sequences are shown in Supplementary Table 31. A touchdown PCR protocol was used as follows: initial denaturation at 94 °C for 30 s; annealing at 65 °C for 30 s, decreasing by 1 °C per cycle; and extension at 72 °C for 60 s, followed by repeating these steps for 14 cycles. After enrichment, the program continued for 29 cycles as follows: 94 °C for 30 s, 58 °C for 30 s, and 72 °C for 60 s. PCR products of cleaved amplified polymorphic sequence (CAPS) markers were digested with appropriate restriction enzymes by following the manufacturer’s instructions. A 5 μl reaction (2.5 μl of KASP Master Mix (Low ROX KBS-1016-016), 0.07 μl of assay mix and 2.5 μl (25 ng) of DNA) was used for KASP markers. PCR cycling was performed in an ABI QuantStudio 6 Flex Real-Time PCR machine as follows: preread at 30 °C for 60 s; hold stage at 94 °C for 15 min; and then ten touchdown cycles (94 °C for 20 s; touchdown at 61 °C, decreasing by 0.6 °C per cycle for 60 s), followed by 29 additional cycles (94 °C for 20 s; 55 °C for 60 s). The plates were then read at 30 °C for endpoint fluorescent measurement.

Tracing lineage-specific Ae. tauschii haplotype blocks in the bread wheat genome

Missing link finder pipeline

We generated canonical 51-mers for each of the 82,293 genotyped wheat accessions from Sansaloni et al.35 using their respective DArTseq markers and Jellyfish (v 2.3.0)95. For each accession, the k-mers were sorted and stored as text files. From the k-mer matrix available from Gaurav et al.14, k-mers present only in Ae. tauschii L3 were extracted, sorted, and stored as text files. Pairwise comparisons of the sample-specific k-mers from the 82,293 wheat accessions and the L3-specific k-mers were performed using the comm bash command. Jaccard indices were computed with the following formula, where A is the set of k-mers from a single accession and L is the L3-specific k-mer set.

$$J(A,L)=\frac{A\cap L}{A\cup L}$$

The script is available on Github (https://github.com/emilecg/wheat_evolution).

Determining the extent of L3 in wheat lines using whole-genome re-sequencing data of 59 hexaploid wheat landraces

Raw reads were trimmed using Trimmomatic (v0.38)61 with the following settings “ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36”. KMC (v3.1.2)96 was used to generate 31-mer sets for the 59 resequenced wheat landraces (Supplementary Table 14). IBSpy (v0.4.6)13 was run with TA2576 as a reference and the bread wheat landraces as queries with a k-mer size of 31 and a window size of 50,000 bp as parameters. A variation score threshold of ≤150 was used to determine how many windows were in common between the L3 reference and the wheat landraces. IBSpy variation values of ≤150 were determined to be optimal to account for the relatively low intra-lineage variation present in L3 (Extended Data Fig. 9 and Supplementary Table 32). The percentage of matching 50-kb windows was used as a proxy to determine the extent of introgression in the landraces.

Differences in genes in the 135-Mb L3 introgression block on chromosome 1D

The protein sequences of genes annotated in the interval of the introgression on the Kariega genome and on the CWI 86942 genome were compared using DIAMOND and visualized with the Persephone genome browser. In case genes were annotated in both the genomes, their amino acid sequences were aligned using the Needleman–Wunsch algorithm to determine the percentage of identity. The absence of genes in one of the two annotations was investigated manually with the BLAST algorithm integrated into Persephone. Annotated genes found to be part of transposable elements were excluded from the analysis.

Presence of the 135-Mb L3 haplotype block on chromosome 1D in wheat landraces

The presence of the 135-Mb L3 haplotype block was manually confirmed in 12 out of the 126 wheat landraces (Supplementary Table 16). CWI 86942 and another Georgian landrace (CWI 86929) had the largest block size (Extended Data Fig. 6d).

To further determine how widespread the presence of the chromosome 1D L3 segment was, we downloaded the IBSpy variation file from 1,035 hexaploid wheat accessions (827 landraces and 208 modern cultivars) and L3 line BW_01028 (https://opendata.earlham.ac.uk/wheat/under_license/toronto/WatSeq_2023-09-15_landrace_modern_Variation_Data/IBSpy_variations_10WheatGenomes/) against the Chinese Spring RefSeq v1.0 assembly38. We found an additional 20 wheat accessions that carry at least parts of this segment. We defined the start and end of the L3 segments in these 20 accessions by determining the difference between the variation value of BW_01028 (L3) and the corresponding variation value of the twenty accessions. If the difference was ≤150, we defined the accession to carry the L3 segment.

Bread wheat D genome subpopulations contribution

The approach used for a quantitative estimation of the contributions of the different subpopulations to the D genomes and the estimation of technical artifacts are described in Supplementary Note 3 and Supplementary Table 33. The manual curation process that allowed counting the minimal number of hybridizations required to explain the presence of different haplotypes is described in Supplementary Note 4.

Data visualization

We used the R package karyoploteR (v1.20.3)97 for the haplotype representation of the chromosomes in Figs. 3d,e and 4b and Extended Data Fig. 7. The remaining plots were produced with ggplot2 (v3.4.2)98 and the Python seaborn library (v0.11.2)78. Maps in Figs. 1a, 3c and 4a and Extended Data Fig. 6b were generated using QGIS (v3.32.3).

Germplasm availability

All the 60 wheat landraces analysed in this study listed in Supplementary Table 14 are available upon request from the CIMMYT (https://www.cimmyt.org/) and ICARDA (https://www.icarda.org/) gene banks. Seed of accessions from the Open Wild Wheat Consortium Ae. tauschii Diversity Panel collection, Cereal Crop Wild Relatives (Triticeae) collection, DFW Wheat Academic Toolkit collection and Deposited Published Research Material collection can be obtained from the Germplasm Resource Unit (GRU) of the John Innes Center; seed from accessions with WGRC bank ID as the only primary ID (Supplementary Table 1) can be obtained from the Wheat Genetics Resource Center (WGRC) of Kansas State University; 34 accessions can be obtained from the Plant Gene Resources of Canada (PGRC); 84 accessions donated by the Institute of Botany, Plant Physiology and Genetics of the Tajikistan National Academy of Sciences were deposited in the Wheat Genetics Resource Center (WGRC) as were 37 accessions donated by Quaid-i-Azam University; 20 accessions donated by the Azerbaijan National Academy of Sciences can be made available upon request.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

RELATED ARTICLES

Most Popular

Recent Comments