Genetic architecture of sugarcane traits in a polyploid genomics framework

Plant materials

For genome assembly and annotation, we used plant materials including, S. hybrid POJ2878, S. spontaneum 82-114 and S. officinarum XZ. These were cultivated in the greenhouse at Shenzhen. Tender leaves were collected for subsequent genome sequencing.

This study also resequenced 981 sugarcane accessions, including 78 S. officinarum, 290 S. spontaneum accessions and 613 cultivated varieties. Among the cultivated varieties, 246 were non-Chinese germplasms, while 327 were core domestic cultivars or breeding lines in China. The plant materials were grown at three experimental sites representing distinct climates: (1) a tropical monsoon climate at the sugarcane research base in Lingao County, Hainan Province (19° 34′–20° 02′ N, 109° 3′–109° 53′ E, altitude: 34 m); (2) a subtropical monsoon climate in Fusui County, Guangxi Zhuang Autonomous Region (22° 38′ 06″ N, 107° 54′ 15″ E; altitude 94 m); and (3) a subtropical plateau climate in Kaiyuan City, Yunnan Province (23° 30′–23° 58′ N, 103° 04′–103° 43′ E; altitude 1,051 m). Buds were first disinfected and then planted using a single-bud method within a randomized complete block design (1.2 m row spacing; 15 cm plant spacing). To ensure uniform germination, the field was subject to standardized management, including regular irrigation and pest and weed control.

DNA and RNA sequencing

DNA extraction

DNA extraction for single-molecule sequencing involved collecting samples and preparing long genomic DNA. The DNA was then purified using the QIAGEN Genomic Kit. For ONT-UL sequencing, the SDS method was used to extract DNA, omitting the purification step to preserve long DNA molecules. The DNA integrity and purity were evaluated using the NanoDrop system. Moreover, the Qubit 4.0 Fluorometer was used to measure DNA concentration.

PacBio sequencing

The library preparation process included gDNA shearing, DNA damage repair, end repair, A-tailing, hairpin adapter ligation, nuclease treatment, size selection and polymerase binding. A total of 15 μg of genomic DNA was used for each sample and sheared using g-TUBEs. We removed single-strand overhangs and performed repair, end-repair and A-tailing on the DNA fragments. Subsequently, we ligated the fragments with hairpin adapters, treated them with nuclease and purified the samples. We used the BluePippin system to select fragments with target sizes and the library size was verified in Agilent 2100. HiFi reads were sequenced on the PacBio Sequel Revio platform.

ONT-UL sequencing

Genomic DNA (8–10 μg per sample) was selected for fragments larger than 50 kb using the SageHLS HMW library system. The ONT-UL reads were sequenced on the PromethION platform.

RNA-seq

High-quality RNA (RIN > 8), extracted and purified using the Omega Bio-tek and Takara kits, was used to construct libraries (NEBNext Ultra) for NovaSeq paired-end sequencing after Agilent 2100 validation.

Isoform sequencing

The full-length RNA libraries were constructed based on the Iso-Seq Express 2.0 kit and the Kinnex kit. The cDNA synthesis was performed, followed by PCR amplification with barcoded Iso-Seq primers. The barcoded full-length cDNA products were pooled and sequenced on the PacBio Revio platform.

Pore-C library construction and sequencing

The Pore-C library was constructed using an enhanced Pore-C (ePore-C) protocol developed in our companion study¹⁴. The ePore-C protocol largely follows the previously published Pore-C methodology¹², with several refinements designed to improve data yield. In brief, approximately 2 g of young plant tissue was finely ground in liquid nitrogen and mixed with a prechilled extraction buffer. Cross-linking was initiated by adding formaldehyde to the homogenate, incubating for 25 min with periodic vacuum release and then stopped with glycine. The mixture was filtered and the filtrate was layered onto a sucrose buffer and centrifuged to isolate nuclei, followed by further purification using gradient centrifugation with sucrose and Percoll. The nuclei were washed with PBS.

Cross-links were reversed using proteinase K, SDS, Tween-20 and RNase A at 56 °C for 18 h. The sample was divided for DNA extraction using phenol–chloroform, and the aqueous phase was combined and precipitated with NaCl, sodium acetate and ethanol. The DNA pellet was washed, air-dried and resuspended in water. Fragments larger than 1 kb were selected using DNA Clean Beads. DNA concentration and quality were assessed with Qubit and confirmed by agarose gel electrophoresis.

Using a modified Nanopore Ligation Sequencing Kit (2.5 µg input; around 1 µg yield), DNA underwent repair and end-prep at 20 °C and 65 °C and ligation and clean-up at 30 °C before flow cell loading for sugarcane genome analysis.

Genome assemblies

Contig-level assembly

Approximately 30× HiFi reads, combined with ONT-UL reads exceeding 100 kb, were processed using the hifiasm⁴⁶ (v.0.19.1-r559) program using its default settings. The resulting p_utg file, which contained haplotype-phased contigs, was then used for subsequent haplotype-resolved scaffolding.

Chromosomal-scale and haplotype-resolved assembly using C-Phasing

The complex hybrid sugarcane genome assembly was successfully achieved using a haplotype-resolved genome scaffolding program, C-Phasing, which was developed in our companion study¹⁴. This program comprises five core modules: Hitig, Methalign, Hyperpartition, Kprune, and Scaffolding. In brief, Hitig uses split-read alignment with ONT Ultra-long reads to identify and correct chimeric and collapsed errors, which are critical for accurate haplotype phasing. Methalign leverages the long reads of Pore-C data to improve alignment accuracy by combining traditional sequence alignment with allele-specific 5mC site information to filter ambiguous mappings and enhance read alignment accuracy. Kprune identifies allelic contigs through pairwise alignments and uses a maximum-matching algorithm to detect and remove interallelic and crossallelic signals, retaining only intrahaplotype interactions for further partitioning. Hyperpartition implements a two-layer partitioning strategy using the Louvain community detection algorithm⁴⁷. It processes the raw hypergraph from Methalign to group allelic contigs into homologous chromosome groups (HCGs). By reweighting edges between allelic and cross-allelic contigs, it refines the hypergraph for accurate haplotype phasing. Scaffolding integrates functionalities from HapHiC⁴⁸ and ALLHiC⁴⁹ to efficiently sort and assemble contig groups using parallel computing. This module addresses the challenge of missing signals in homologous regions, ensuring consistent chromosome orientation across haplotypes. The combined approach substantially enhanced the phasing results, which were then manually corrected using Pore-C interactions in Juicebox⁵⁰. This process ultimately yielded a high-quality, haplotype-resolved genome assembly for hybrid sugarcane.

Assessment of genome assembly

We compared the quality of the POJ2878 genome assembly with three published hybrid sugarcane cultivars (R570 (ref. ⁷), ZZ1 (ref. ⁶), and XTT22 (ref. ⁸)) based on three key criteria: continuity, completeness and accuracy.

Continuity

We measured genome continuity by calculating the contig N50 and the number of gaps for each assembly. This enabled us to compare the structural continuity of the four sugarcane genome assemblies.

Completeness

To evaluate genome completeness, we first used the BUSCO⁵¹ tool with the embryophyta_odb10 database to assess the presence of conserved single-copy orthologous genes. Moreover, we applied Meryl⁵² (v.1.4.1) to generate k-mer databases from next-generation sequencing reads by splitting them into 21-mers. Using these k-mer databases, we calculated k-mer completeness using Merqury⁵² (v.1.3). Using LTR_FINDER_parallel⁵³ (v.1.2) and LTR_HARVEST_parallel (v.1.1; https://github.com/oushujun/LTR_HARVEST_parallel), we identified LTR elements in each genome, which were further processed using LTR_retriever⁵⁴ (v.2.9.6). This library was used to annotate LTRs across the genome and calculate the proportion of intact LTRs relative to total LTRs. To provide a genome-wide measure of LTR completeness, we used the raw LTR Assembly Index⁵⁵. Furthermore, we evaluated genome completeness by analysing the alignment rate of next-generation sequencing reads to the assemblies, the scaffold anchoring rate and the ratio of assembled genome size to estimated genome size. These metrics collectively provided a comprehensive view of the completeness of the four sugarcane genome assemblies.

Accuracy

The base-level accuracy was calculated by Merqury (v1.3). We then assessed structural accuracy using three metrics: Switch error, Collapse error and Hi-C H-trans error.

Switch errors: we extracted the 1,000 longest ultra-long ONT reads and divided them into 5-kb windows. Each window was searched back to the genome using blastn⁵⁶ (v.2.9.0), and windows with a primary alignment score at least 1.5× higher than the secondary alignment (and exceeding a score of 1,000) were considered to be valid. Reads with at least five valid windows were retained for analysis. If at least 50% of the valid windows aligned to a single chromosome, the read was assigned to that chromosome. Any valid window that failed to align to the same chromosome was classified as a switch error. The switch error rate was defined as the proportion of switch error windows relative to all valid windows.

Collapsed regions: to identify collapsed regions, we divided the genome into 50-kb non-overlapping windows. Mosdepth⁵⁷ (v.0.3.9) was used to calculate the depth of each window, denoted as x_i. The set of all x_i values is represented as X, and the frequency of each x_i is denoted as P(x_i). The mode of X was then determined based on the following formula:

$$\mathrmMode(X)=\P(x_i)=\max (P(x_j)),\forall x_j\boldsymbol\in X\$$

If x_i exceeds 1.5× mode(X), the depth of that window is considered abnormally high due to collapse. The extent of genomic collapse was quantified by dividing the total size of windows exhibiting abnormal depths by the size of the entire genome. This ratio is defined as the collapse error.

Hi-C H-trans error: Hi-C alignments by BWA⁵⁸ with a mapping quality score (MAPQ) of 0 were discarded to ensure reliable mapping. We then used C-Phasing to convert the BAM files into contact maps, capturing interaction frequencies between genomic regions. For each pair of homologous chromosomes i and j, we defined intrachromosomal interaction counts as C_ii and C_jj, and interchromosomal interaction counts as C_ij. The Hi-C H-trans error for the chromosome pair was calculated using the formula:

$$E_ij=\fracC_ij(C_ii+C_ij+C_jj)$$

This error rate reflects the proportion of interactions that erroneously link different haplotypes or non-homologous chromosomes, indicating potential misassemblies or chimeric joins. We averaged the H-trans error across all homologous chromosome pairs to obtain a genome-wide estimate.

Identification of telomeric and centromeric sequences

To identify telomeric and centromeric sequences in the POJ2878 genome, we used quarTeT⁵⁹ (v.1.2.2), using its TeloExplorer and CentroMiner modules for targeted analysis.

Telomeric sequence identification

Using the TeloExplorer module, we first identified the most enriched telomeric-like repeat sequences at the ends of each chromosome. The distribution of these repeats was then analysed to determine the precise location and copy number of telomeric regions on each chromosome.

Centromeric sequence identification

For centromeres, we applied the CentroMiner module to detect tandem repeat monomers. Candidate centromeric sequences were selected on the basis of their repeat period and copy number. To reduce redundancy, the candidate monomers were clustered, and representative monomers were aligned back to the chromosomes. Regions with continuous matches were designated as candidate centromeric regions. These regions were further refined by considering match length and the abundance of retrotransposons, which are characteristic of centromeric regions. Finally, centromere types were classified based on the ratio of the distance between the centromere and the chromosome end (D) to the total chromosome length (L). Chromosomes were categorized as: (1) telocentric if D/L < 0.001; (2) acrocentric if 0.001 ≤ D/L < 0.1; (3) submetacentric if 0.1 ≤ D/L < 0.4; and (4) metacentric if 0.4 ≤ D/L ≤ 0.5.

Analysis of genome architecture

Subgenome identification

To dissect the distribution of ancestral genetic components in the genome, this study selected ten individuals from each representative ancestral lineage to generate high-quality resequencing data (with sequencing depth of over 10×). Using the POJ2878 genome as a reference, we used the commercial GTX (v.2.1.0; http://www.gtxlab.com/en/product/cat) software package for SNP detection and implemented stringent quality control measures. Only SNPs meeting the criteria of alignment quality > 30, base quality > 30 and sequencing depth > 5 were retained, while complex variant types were excluded, leaving only biallelic sites. This high-confidence dataset was used for subsequent analyses.

We identified lineage-specific SNPs by defining S. officinarum-specific sites as those where >70% of S. officinarum and S. spontaneum samples were homozygous for the reference (0/0) and alternate (1/1) alleles, respectively. The reciprocal pattern was used to define S. spontaneum-specific SNPs.

We performed a sliding-window analysis with a 50 kb window size and 10 kb step size to examine the distribution of these ancestral components across the genome. We counted the number of S. officinarum-specific and S. spontaneum-specific sites in each window. A window was classified as belonging to the S. officinarum lineage if it contained at least 1.5× more S. officinarum-specific sites than S. spontaneum-specific sites, and vice versa for the S. spontaneum lineage. Regions that did not meet these criteria were labelled as undetermined (UN).

For regions marked as UN, we applied two methods for further classification. The first method used an optimization strategy based on neighbouring information, reflecting the tendency of chromosomal recombination to occur in large, linked segments. If an UN region was flanked by at least three consecutive windows of the same lineage, it was reclassified according to its neighbours. The second method inferred the dominant ancestral affiliation of the entire chromosome. For each chromosome, windows that were not labelled UN and contained at least five S. officinarum– or S. spontaneum-specific SNPs were considered to be valid. If more than 50% of these valid windows belonged to the S. officinarum lineage, the chromosome’s dominant ancestry was classified as S. officinarum, and any UN regions on that chromosome were redefined accordingly; the same process applied for the S. spontaneum lineage.

Identification of recombined chromosomes

The identification of recombined chromosomes was also based on a sliding window approach with the window size set up to one-tenth of the chromosome length and a step size of 1 Mb. If a chromosome contains two or more windows where regions with lower ancestry proportions exceed 60%, it is classified as a recombined chromosome between two subgenomes.

Identification of HEs

The homoeologous exchanges (HEs) were identified using classified genomic windows (that is, S. officinarum– and S. spontaneum-assigned windows) as described above. HEs involving a switch from the S. spontaneum subgenome to the S. officinarum subgenome were detected when two consecutive windows, located within an S. officinarum-derived non-recombined chromosome, showed more than 70% of resequenced S. spontaneum samples supporting their origin from the S. spontaneum species. Similarly, HEs involving a switch from the S. officinarum subgenome to the S. officinarum subgenome were identified following the same criteria in reverse.

Design of oligo-FISH probes and chromosome painting

Haplotype-specific oligonucleotides were designed against the POJ2878 reference genome using Chorus2 software⁶⁰ with parameters ‘-l 40 -homology 93 -d 5 -step 1’. The initial oligonucleotides were subsequently filtered against whole-genome sequencing data to eliminate sequences with high coverage depth, which may represent potential repetitive sequences. The resulting oligonucleotides were designated as POJ2878 haplotype-specific probes. Chromosome painting by oligo-FISH was performed as described previously⁶¹. Probes were labelled by PCR amplification with Cy3-dUTP or FITC-dUTP (Roche Diagnostics). The hybridization cocktail was composed of 10 μl deionized formamide, 2 μl 20× SSC, 4 μl 50% dextran sulfate and approximately 100 ng of oligonucleotide probes. After denaturation at 95 °C for 10 min, the mixture was applied to denatured chromosome slides and incubated at 37 °C for 16 h, followed by standard washing procedures⁶¹. Chromosomes were counterstained with DAPI (Vector Laboratories) and imaged using an Olympus BX53 fluorescence microscope with a cooled CCD camera, then processed through cellSens Dimension 1.9.

Allele-defined gene annotation and expression

Genome annotation

An integrative approach was used for gene prediction, combining evidence from three sources: ab initio prediction, homologous protein alignments and RNA-seq read mapping. This process was implemented using the GETA program, using clean RNA-seq datasets, homologous proteins and the POJ2878 genome as references. The GETA pipeline used sequencing-depth-based coverage thresholds to filter low-quality transcripts and refine intron predictions. These refined structures were then used to guide the iterative training of gene models in AUGUSTUS⁶² (v.3.3.3), a process that continued until the highest predictive scores were achieved. Homologous proteins from six related plant species (S. spontaneum, Oryza sativa, Sorghum bicolor, Erianthus rufipilus, Zea mays and A. thaliana)—retrieved from Phytozome and GWH database—were further refined using GENEWISE⁶³ (v.2.4.1) for protein identification. The final gene models were selected based on sequence conservation, expression levels and AUGUSTUS support scores.

To ensure the reliability of annotations, a stringent filtering process was applied. Genes were aligned against the NR database and the PFAM conserved domain database, and their expression levels were calculated using RNA-seq data. Genes were retained if they met at least one of the following criteria: (1) presence of a homologous protein in the NR database; (2) presence of a conserved domain in the PFAM database; or (3) an FPKM value greater than 3 with sequence coverage exceeding 50%.

Annotation and analysis of TEs

The transposable elements (TEs) of each genome were identified using the EDTA⁶⁴ package using the default parameters. This method provides a whole-genome de novo TE annotation by integrating structural and homology-based approaches, filtering out false positives from raw TE candidates and generating a high-quality, non-redundant TE library for comprehensive genome-wide TE annotations. RepeatMasker (http://www.repeatmasker.org) and RepeatModeler2 (ref. ⁶⁵) were used as the TE libraries for homology-based annotation. Insertion time for full-length LTR-RTs were determined based on the divergence between flanking LTR sequences. Based on the Kimura two-parameter model⁶⁶, we applied the rate⁶⁷ of 6.5 × 10⁻⁹ to the formula T = K/2r to convert genetic distance into absolute time.

Identification of allelic and homoeologous genes

To identify genes with defined alleles, paralogous genes and haplotype-specific genes in the complex polyploid sugarcane genome, we developed a three-step method, as illustrated in Extended Data Fig. 3:

Identification of paralogues within each haplotype genome. We constructed the haplotype genome by selecting one chromosome from each homologous chromosome group in the order of haplotype numbering (that is, from g1 to g13). Paralogues within each haplotype genome were identified using hierarchical clustering combined with tree-cutting methods, based on sequence identity and coverage. To ensure accuracy, we dynamically determined the tree-cutting height to achieve paralogue identities exceeding 90% (category 6). Specifically, sequence alignment was performed using DIAMOND⁶⁸ (–query-cover 60 –subject-cover 60 –id 80), generating a gene–gene matrix where each element represented the product of coverage and identity between gene pairs. Hierarchical clustering was then applied to this matrix using Ward’s method with Euclidean distance as the dissimilarity metric⁶⁹. This approach minimized within-cluster variance, ensuring that genes within the same cluster exhibited high similarity while maintaining clear distinctions between clusters. A dendrogram was constructed to represent the relationships among genes, with branch heights reflecting dissimilarity between clusters. Starting from a tree-cutting height of 2, the threshold was gradually reduced by 0.1 until paralogous gene pairs with over 90% identity were identified. This step was performed within each subgenome to minimize interference in subsequent allele identification.

Identification of syntenic and non-syntenic allelic/homoeologous genes. Alleles were identified using JCVI⁷⁰ to compare synteny blocks across haplotype genomes. Each haplotype genome was used as a reference to ensure comprehensive gene representation. Syntenic blocks identified from different haplotype reference genomes were merged, and low-quality syntenic gene pairs were filtered using an identity threshold of 80%. High-quality syntenic gene pairs were classified as syntenic allelic or homoeologous genes (categories 1, 2 and 3). For non-syntenic homologous genes, if the gene exhibited over 90% sequence identity with any classified allelic/homoeologous gene, it was considered as non-syntenic allelic or homoeologous genes (category 4).

Clustering of unclassified genes and identification of single genes. Unclassified genes were further analysed using Broccoli⁷¹, a tool for identifying orthologous protein groups and pairs through phylogenetic and network-based methods. Broccoli performs rapid phylogenetic analysis, builds orthologous relationship networks and applies machine learning to classify groups. Using Broccoli, those unclassified genes were categorized as either non-syntenic allelic/homoeologous gene pairs (category 5) or haplotype-specific genes (category 7).

This three-step approach enabled the classification of genes into seven categories: categories 1–5 represent genes with defined alleles, category 6 includes paralogous genes and category 7 identifies haplotype-specific genes (Supplementary Table 9).

Allele-specific expression analysis using MAS-ISO-seq

To investigate allele-specific expression patterns in sugarcane, we developed a robust pipeline integrating MAS-ISO-seq data using the program Allele-Express (Supplementary Note 3). This approach enabled precise quantification of gene expression from individual alleles within the complex sugarcane genome.

Initially, raw MAS-ISO-seq data were processed to generate HiFi reads using the ccs (https://github.com/PacificBiosciences/ccs) tool. Stringent quality-control measures were applied, filtering reads based on minimum pass number, length, signal-to-noise ratio and quality score. These high-quality HiFi reads were then segmented using skera (https://github.com/PacificBiosciences/skera) split, guided by a custom primer design file, to separate reads originating from different targeted regions. Primer sequences, poly(A) tails and chimeric sequences were subsequently removed using lima (https://github.com/PacificBiosciences/barcoding) and isoseq refine (https://github.com/PacificBiosciences/IsoSeq), yielding full-length non-chimeric reads for downstream analysis.

To create a non-redundant reference set, similar transcripts were clustered using mmseqs⁷², retaining only one copy of identical sequences. The FLNC reads were then aligned to these clustered reference transcripts using minimap2⁷³, optimized for splice junctions. The sequencing depth at each site was calculated using bedtools⁷⁴, and saturation curves were generated to confirm sufficient sequencing depth.

A key component of our pipeline, Allele-Express, systematically identified and filtered candidate read alignments for each homologous gene group. Reads sharing the same identifier were grouped and a multi-layered filtering strategy was implemented. Initially, a priority alignment was selected on the basis of mapping quality, edit distance and alignment score. For ambiguous alignments, where a read could potentially map to multiple alleles, we developed two complementary approaches:

(1)

Read depth stable point (RDSP) method: this approach identifies the point within a transcript at where the read depth stabilizes, discarding reads aligning beyond this point as ambiguous.
(2)

Read clustering method: this method performs a comprehensive sequence comparison of all full-length MAS-ISO-seq reads using a greedy incremental clustering algorithm. Ambiguous reads are assigned to a specific transcript if other reads within the same cluster align uniquely to that transcript.

Finally, Allele-Express quantified read counts for each transcript and aggregated them to the gene level using transcript-to-gene mappings. Read counts were normalized using TPM to enable comparisons across samples. This integrated pipeline, combining MAS-ISO-seq and Allele-Express, provides a comprehensive analysis of allele-specific expression in sugarcane, offering valuable insights into the regulatory mechanisms governing this complex polyploid genome.

Haplotype counts were determined using allele-aware annotations derived from the haplotype-resolved genome assembly. Although multiple copies of a gene may be annotated, only haplotypes containing at least one mutation, compared with alternative forms, were included in the analysis. For example, while a total of 12 copies of the MNS2 gene were annotated, only eight haplotypes were retained for further analysis: three allelic genes from S. spontaneum and five from S. officinarum.

ASHEGs were identified using the following criteria: within each allele group, the allele with the highest TPM must exhibit an expression level at least 2.5× greater than the second-highest allele. A Grubbs test was conducted to identify outliers, with a P < 0.05 indicating statistical significance. The allele with the highest TPM was considered a candidate if it qualified as an outlier. The second-highest allele in the group must not be classified as an outlier.

Population genetics

Whole-genome resequencing and variant detection

Sequencing data were aligned to the reference genome POJ2878 using GTX (v2.1.0), ensuring unique alignments by filtering out ‘SA’ and ‘XA’ tags in BAM files and retaining reads with alignment scores AS:i > XS:i and mapping quality mapq > 20. Variants were called using HaplotypeCaller tool, and SNP filtering was performed with bcftools⁷⁵ (v.1.10.2) using stringent criteria: QD < 2.0||FS > 60.0||MQ < 40.0||MQRankSum < −12.5||ReadPosRankSum < −8.0||SOR > 3.0. Additional filtering retained only biallelic SNPs with minor allele frequency (MAF) ≥ 0.05 and a missing rate ≤50%. A total of 11,427,420 high-quality SNPs was identified for downstream analysis. SNP annotation was performed using SnpEff⁷⁶ (v.5.2c). Among these SNPs, 656,089 (5.74%) were shared between S. spontaneum and S. officinarum in the POJ2878 genome (defined as sites with call rate >50% in both populations). To assess whether shared loci correspond to conserved homologous sequence contexts across progenitor genomes, 500 loci were randomly sampled from the 656,089 shared set and evaluated for syntenic context using the T2T monoploid assemblies of S. spontaneum 82-114 and S. officinarum XZ. A locus was deemed to be conserved if its SNP coordinate and ±200 bp flanking sequence were identifiable in POJ2878, S. spontaneum 82-114 and S. officinarum XZ within syntenic regions. Based on this criterion, 57.8% (289 out of 500) of loci were conserved across all three genomes, while 27.2% (56 in S. officinarum and 80 in S. spontaneum) were detected in at least one progenitor genome. The remaining unmatched loci probably reflect local assembly gaps, structural or presence–absence variation among genomes, and/or stringent matching thresholds, rather than mapping artefacts.

Phylogenetic analysis and population structure

Phylogenetic relationships were inferred using VCF2Dis (v.1.52; https://github.com/BGI-shenzhen/VCF2Dis) to construct a neighbour-joining tree. Population structure was analysed with Admixture⁷⁷ (v.1.3.0). Linkage disequilibrium (LD) decay was calculated and visualized using PopLDdecay⁷⁸ (v.3.42), and PCA was conducted using VCF2PCACluster⁷⁹ (v.1.40).

Identification of shared IBD regions and favourable haplotypes

To investigate shared IBD regions across sugarcane accessions, we used IBDseq⁸⁰ (r1206) with its default settings. We quantified the contribution of each IBD region by introducing a metric, termed IBD density, which measures the proportion of varieties sharing IBD segments with POJ2878. The formula for calculating IBD density is:

$$\rmIBD\,\rmdensity=\frac\rmNumber\,\rmof\,\rmvarieties\,\rmsharing\,\rmIBD\,\rmsegments\,\rmwith\,\rmPOJ2878\rmTotal\,\rmnumber\,\rmof\,\rmvarieties\,\rmexamined\,\rmin\,\rmthe\,\rmpopulation$$

The genome was analysed in 50-kb non-overlapping windows, and regions within the top 5% of IBD density were designated as hotspots of POJ2878 contributions, representing favourable haplotypes. These hotspots mark genomic regions with substantial genetic contributions from POJ2878, and probably harbour key alleles associated with important agronomic traits in sugarcane.

Selective sweep analysis

Selective sweeps within populations were identified using three complementary methods: SweeD⁸¹ (v.4.0.0), RAiSD⁸² (v.2.9) and nucleotide diversity (π). Genomic regions that ranked in the top 5% by any of these methods were considered to be candidate selective sweep regions. Regions identified by at least two methods were classified as high-confidence selective sweeps, providing robust evidence of natural selection. To detect interpopulation selective signals, F_ST values and XP-CLR scores were calculated using VCFtools⁸³ (v.0.1.16) and XP-CLR (https://github.com/hardingnj/xpclr), respectively. Domestication-related selective sweeps in S. officinarum were identified by comparing it to 43 resequencing individuals of S. robustum, downloaded from a previous publication⁵. Moreover, hybrid sugarcanes were compared against both S. spontaneum and S. officinarum to detect selective signals specific to hybrid populations. The thresholds for selective sweeps were set at the top 5% for hybrids and the top 5% for S. officinarum. All analyses used a sliding-window approach with 50-kb window size and 10-kb step size. Regions supported by both F_ST and XP-CLR were designated as high-confidence selective sweeps, highlighting key genomic signatures linked to domestication and hybridization.

Application of KMERIA in Sugarcane GWAS

Traditional SNP-based GWAS face great challenges when applied to polyploid genomes. These include genotyping errors due to ambiguous read alignments, reliance on biallelic variants despite the prevalence of multiallelic variants in polyploid genomes and the time-intensive nature of the process. To address these challenges, we applied a k-mer based GWAS approach (KMERIA), which has been submitted elsewhere as a companion paper¹⁵. This method leverages short, fixed-length sequence subsequences (k-mers) directly from resequencing data, eliminating the need for reference genome. The algorithm package has key steps. k-mers are counted from whole-genome sequencing reads of all samples using KMERIA with a recommended k-mer size of 31 bp for optimal efficiency. To improve accuracy, k-mers appearing once in a single individual are filtered out. A population k-mer matrix is built from individual k-mer counts using the KMERIA kctm function. Subsequently, k-mers at both very low and high frequencies (representing sequencing errors or repetitive regions) are filtered based on the genome ploidy and k-mer frequency distribution for each individual. k-mers exceeding a predefined individual missing rate are removed to ensure valid population-level associations. Individual sequencing depths are incorporated to account for their influence on k-mer copy number statistics.

To calculate the allele dosage, k-mer abundance values are transformed to better suit the GWAS model and enhance allele dosage estimation. The k-mer abundance below the 5th percentile is set to 0, and values exceeding the 95th percentile are set to 2. The remaining values are linearly transformed between 0 and 2 based on the original k-mer abundance value.

To perform GWAS using the transformed k-mers, we estimate the population structure and kinship relatedness through randomly selecting a small proportion (approximately 0.1%) of recoded k-mer genotypes. The association test further uses a mixed linear model incorporating a series of factors, including fixed effect, genotyped k-mers principal component vector, polygenic random effect vector and residual error.

To achieve precise genomic localization, KMERIA uses a read-based refinement strategy. Instead of mapping k-mers directly, we first retrieved sequencing reads that contain the significant k-mers. These longer, context-rich reads are then searched against the monoploid reference genome by BWA (default parameters) or minimap2 (option -x sr). This approach substantially improves alignment accuracy and enables unique localization of the associated genomic regions with high confidence. Once significant k-mers are anchored through this read-based mapping, KMERIA defines a candidate genomic interval for gene discovery. By default, a 50 kb window is surveyed on both sides of the mapped site, although this can be adjusted according to the LD decay of the studied population. All annotated genes within this interval are considered putative candidates underlying the associated trait.

To distinguish the most likely causal genes from the initial candidates, we integrated statistical causal inference with functional evidence. Fine-mapping was performed within the ±50 kb genomic windows centred on the significantly associated k-mers. We used the CAVIAR⁸⁴ to quantify the causal posterior probability (CPP) for each associated k-mer within the interval by integrating GWAS z-scores with a local LD correlation matrix calculated using PLINK2 using the –r-unphased square option⁸⁵. CAVIAR also identifies a minimal causal set, representing the smallest group of k-mers with a collective confidence level of ≥95% in capturing all true causal variants. Within this set, the top five k-mers with the highest CPP were designated as the primary statistical anchors for gene prioritization. To ensure the biological relevance of the statistically identified candidates, genes within the fine-mapped intervals were prioritized based on the following criteria: (1) priority was given to genes that either overlapped with the k-mer exhibiting the highest CPP or were located within its immediate flanking regions. (2) Candidates were referenced for functional orthologues with genes previously reported to be involved in biosynthesis or regulatory pathways in related species (such as S. bicolor, O. sativa and Z. mays). This integrated approach facilitated the transition from broad association intervals to high-confidence functional candidates.

Association analysis with parenchyma cell traits and Brix

Tissue sections of sugarcane stems

The experiment was conducted using different sugarcane cultivars grown in Lingao County, Hainan Province, China. The sampling was performed in September 2024, during the pre-mature growth stage, approximately 210 days after planting. The samples were collected from the sixth internode above the ground of the sugarcane stem.

Fresh sugarcane stem segments (5 mm × 3 cm) were fixed in 70% FAA (>48 h) and dehydrated through a graded ethanol series (60–95%; 24–72 h per step), followed by two rounds of 100% ethanol (12–24 h each), before resin embedding. Subsequently, dehydrated samples were infiltrated with T7200 photosensitive resin (KULZER, Technovit7200VLC) for 15–25 days, depending on the sample size. To prevent premature resin polymerization, the infiltration process was conducted in the dark. The infiltrated samples were placed in embedding moulds, and T7200 resin was added to cover the samples. Polymerization was carried out in a photocuring embedding machine (EXAKT, E520) for 12 h. The resin-embedded samples were glued to glass slides (Servicebio) using T4000 adhesive (KULZER, Technovit4000). The tissue surface was ground to a smooth and uniform plane using a hard tissue grinding machine (EXAKT, E400CS) with 1200-grit sandpaper. The samples were sectioned into 200 µm slices (EXAKT, E300CP) and polished to a final thickness of 20–30 µm. The sugarcane stem tissue sections were stained using the toluidine blue staining protocol as follows. The sections were immersed in toluidine blue staining solution (Servicebio, G1032) for 5 min. The degree of staining was monitored under a Nikon Eclipse E100 upright optical microscope. Overstained sections were differentiated with 95% ethanol. After staining, the sections were rinsed with tap water for 10 s and dried in an oven at 60 °C. The stained sections were cleared in xylene (SCRC, 10023418) for 30 s and mounted with neutral gum (SCRC, 10004160). The stained sections were observed under a Nikon DS-U3 imaging system. A blue-green hue characterized the lignified cell walls, distinct from the purple-blue of the cellulose walls.

Automated parenchyma cell analysis in sugarcane stems

To quantify parenchyma cell density and size in sugarcane stems, we developed an automated image analysis pipeline for high-resolution scanned virtual slides (SVS). This workflow comprises three key steps.

Image tiling and conversion: the high-resolution SVS files were systematically divided into non-overlapping tiles (2,560 × 2,560 pixels) and converted to a standard image format for efficient processing.

Background filtering and vascular bundle identification: to isolate parenchyma tissue, we first removed background areas. Images were divided into blocks (320 × 320 pixels), and large white blocks were identified as background and excluded. The remaining blocks were analysed to identify vascular bundles based on their distinct staining characteristics after Toluidine Blue treatment and their morphology.

This process involved further dividing the non-background image into smaller blocks (160 × 160 pixels) and calculating the average RGB colour values for each block, creating a colour feature matrix. Toluidine Blue staining creates a strong colour contrast between parenchyma cells and the sieve tube portion of vascular bundles. However, the xylem tissue within the bundles shares similar colour and morphological features with parenchyma cells, making differentiation challenging. To address this, a Gaussian smoothing filter was applied to the colour feature matrix, reducing noise and enhancing colour transitions. This smoothing process effectively corrected the xylem tissue colour to match the rest of the vascular bundle and also reduced dye-induced noise in the parenchyma regions. Finally, K-means clustering was applied to the smoothed colour feature matrix to separate vascular bundles and parenchyma cell regions into two distinct groups.

For cell segmentation and counting, parenchyma cells within the identified regions were segmented using Cellpose⁸⁶, using the cyto2 model with optimized parameters (minimum cell area, 6,400 pixels; cell probability threshold, 2) to ensure accurate detection. The number of parenchyma cells was then quantified based on these segmented regions.

This automated method enables precise identification and quantification of parenchyma cells in sugarcane stem cross-sections. To validate its accuracy, we compared automated counts and size measurements with manual counts from ten randomly selected stem cross-sections. A strong positive correlation was observed for both cell density (Pearson’s r = 0.9828, P < 0.0001) and cell size (Pearson’s r = 0.9155, P = 0.0002), demonstrating the reliability of our automated approach (Extended Data Fig. 7).

Assessment of Brix

For each sugarcane variety, juice was extracted from at least six mature stalks. The soluble solids content was measured using a digital refractometer (PAL-1, ATAGO). The refractometer was calibrated with distilled water and adjusted to 28 °C. The SSC values were expressed in degrees Brix.

Genetic transformation in the hybrid sugarcane cultivar

Vector construction and activation of Agrobacterium strains

The open reading frame sequence of ShSUT2, ShTIP1, ShCBL1 and ShATG18a were amplified by PCR and cloned into the downstream region of the maize Ubi promoter in the pCAMBIA3300 vector through homologous recombination. The construct was transferred into Agrobacterium tumefaciens EHA105 through freeze–thaw transformation. Before transformation, the positive clones were activated twice on YEP agar plates. The bacterial cells were then collected into the infection medium, incubated at 28 °C with agitation (90 rpm) for 60–90 min to achieve an optical density at 600 nm of 0.3–0.6, and subsequently used for callus infection and genetic transformation.

Embryogenic callus induction and transformation

The young leave rolls from the stem apex of sugarcane cultivar XTT22 were sterilized using ethanol and mercury chloride (HgCl₂), and then cut into 1–2-mm-thick sections. These sections were cultured on an induction medium (MS + 30 g l⁻¹ sucrose + 2 mg l⁻¹ 2,4-D) to induce embryogenic callus formation at 28 °C in darkness, changing the medium twice during the induction period until embryogenic calli emerged. Approximately 5 g of embryogenic calli was collected and incubated in Agrobacterium infection solution under agitation (90 rpm) at 28 °C for 30 min. Inoculated calli were dried and incubated at 22 °C in darkness for 3 days. Subsequently, they were transferred to recovery medium (MS + 30 g l⁻¹ sucrose + 2 mg l⁻¹ 2,4-D) for 1 week, followed by selection on screening medium (MS + 30 g l⁻¹ sucrose + 2 mg l⁻¹ 2,4-D + 2 mg l⁻¹ Basta + 300 mg l⁻¹ Timentin) for 25–30 days. Surviving healthy embryogenic calli were selected to culture on differentiation medium (MS + 30 g l⁻¹ sucrose + 1.5 mg l⁻¹ 6-BA + 2 mg l⁻¹ Basta + 300 mg l⁻¹ Timentin) to regenerate shoots, which were further rooted on rooting medium (MS + 30 g l⁻¹ sucrose + 1.5 mg l⁻¹ NAA + 2 mg l⁻¹ Basta + 300 mg l⁻¹ Timentin). Finally, the rooted seedlings were transplanted into plug trays for growth. Transgenic lines were verified using a PAT/bar lateral flow assay test kit (Genesino) using aqueous leaf homogenates (1:20, w/v).

Statistics and reproducibility

All experiments were repeated independently at least three times with similar results. This includes all assays from which representative images are shown, such as micrographs of plant tissue sections, plant phenotypes and gels.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.