Bimodal centromeres in pentaploid dogroses shed light on their unique meiosis

Plant material

For genome sequencing, we used the same individual of R. canina (S27), which has already been cytogenetically analysed³⁴ (voucher: GLM12396) from a natural stand (WGS84: 51.1732° N; 14.6271° E; Weißenberg, Saxony, Germany). A vegetative runner was dug on 28 March 2022 and planted in a pot. Clones of the collected plant specimen were cultivated in a greenhouse at the Max Planck Institute for Plant Breeding Research, Cologne, Germany.

Whole-genome sequencing

High molecular mass genomic DNA was isolated from leaves using the NucleoBond HMW DNA Kit (Macherey Nagel). A HiFi library was prepared according to the manual of the HiFi SMRTbell Libraries using SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences) with initial DNA fragmentation using Diagenode Megaruptor 3 and final library size binning into defined fractions by Blue Pippin with 0.75% agarose cassettes (Sage Science). The size distribution was again controlled by a Femto pulse system (Agilent). Size-selected libraries were then sequenced on the Sequel II device with a Binding Kit 2.0 and Sequel II Sequencing Kit 2.0 for 30 h using two SMRT cells (Pacific Biosciences). Moreover, a chromatin-capture library was prepared from 0.5 g of fresh-weight leaf material input. All treatments were performed according to the recommendations of the Dovetail Omni-C kit for plants (Dovetail Genomics). As a final step, an Illumina-compatible library was prepared (Dovetail) and paired-end 2 × 150 bp deep-sequenced on the HiSeq 3000 (Illumina) device. All libraries were sequenced at the Max Planck Genome Centre Cologne at the Max Planck Institute for Plant Breeding Research.

Genome assembly

A phased chromosome-level genome was assembled using the generated PacBio HiFi and Hi-C data. First, a phased primary assembly was obtained by running Hifiasm⁵⁵ using 50 Gb of PacBio HiFi reads in combination with Dovetail Omni-C reads with the following command: hifiasm -o out.phased.asm.hic –h1 hic.R1.fastq.gz –h2 hic.R2.fastq.gz hifi.reads.fastq.gz. In the default diploid mode, we generated two sets of phased contigs. Each set was further scaffolded to the chromosome scale using Salsa2⁵⁶, followed by successive rounds of manual curation and rescaffolding. We then identified 14 and 21 pseudochromosomes, respectively.

We used Benchmarking Universal Single-Copy Orthologues (BUSCO, v.5.4.0)⁵⁷ to evaluate the completeness of 35 chromosome-level scaffolds and for each of the four subgenomes. The lineage database used for running BUSCO was eudicots_odb10. The protein sequences were converted from the assembly using the GFF analysis toolkit AGAT: agat_sp_extract_sequences.pl -g annotation.gff -f genome.fasta -p –type cds -o protein.fasta. The k-mer-based tool Merqury (v.1.3)⁵⁸ was used to estimate both the completeness and base quality of the chromosome assembly. The quality value of the chromosome assembly was greater than 66.6, and the quality value of each chromosome was at least 62. Read k-mers were built from HiFi sequences by Meryl (v1.3) with a k-mer size of 31 bp.

Rca_S2 assembly correction

During the scaffolding step, we noted the absence of approximately 20 Mb (including the centromere) on chromosome Rca2_S1_h2. Further validation using fluorescence in situ hybridization (FISH) showed that this chromosome should indeed have a large array of CANR4, as found in its homologous chromosome Rca2_S1_h1 (Supplementary Fig. 1). We noted that this assembly error was probably generated by the presence of a small translocation found at the start of the missing region in Rca_S1_h2 compared with Rca_S1_h1. We further mapped the HiFi reads to the region present in Rca_S1_h1 and found robust evidence for the presence of five copies of this region in the genome. We then concluded that this region was incorrectly missing from the Hifiasm assembly due to its high degree of similarity between the S1_h1 and S1_h2 haplotypes. We therefore duplicated the Rca2_S1_h1 region (37265454–54065178 bp) and manually assigned it to the expected position on Rca_S1_h2 (from Rca2_S1_h2: 33518152 bp).

Assembly of DToL datasets

We also downloaded available data from the Darwin Tree of Life (DToL) project for another accession of R. canina (PRJEB79801) and for R. agrestis (PRJEB79880) and performed phased chromosome-level genome assemblies as described above. R. agrestis from DToL revealed two copies of the R4 subgenome; our previous studies suggested that some accessions of R. agrestis were of hybrid origin, which should then have two copies of S1 subgenome^45,59.

k-mer analysis for genome size and ploidy level estimation

k-mer analysis to estimate genome size was performed using jellyfish (v.2.3.0)⁶⁰ and Genomescope (v.2.0)⁶¹. The pentaploidy of R. canina was further confirmed and analysed using Smudgeplot (v.0.2.5)⁶¹.

Chloroplast genome assembly and phylogeny

To clarify the maternal lineage of the allopolyploid R. canina (S27), we assembled the plastid genome of the sequenced individual. GetOrganelle (v.1.7.7.0)⁶² was used to de novo assemble the first draft of the plastid genome using 2× 150 bp Illumina short-read data (Sequence Read Archive (SRA): ERS1370372). This toolkit implements Bowtie 2⁶³ to initially find reads mapped to a plant chloroplast database and SPAdes⁶⁴ for de novo assembly and iterative extension. During the assembly and iteration process, BLAST+⁶⁵ was used to identify off-target contigs, which were then removed or trimmed. The resulting plastid genome was then used as a reference for mapping the original reads back using Geneious Prime v.2023.2.1 (Biomatters), allowing only mapping of paired reads mapped nearby with a minimum overlap of 75 bp and a minimum overlap identity of 98%. The results were manually examined and corrected where necessary.

The initial annotation of the chloroplast genome was performed using GeSeq (v.2.03)⁶⁶. The annotation included the chloroplast inverted repeats (IRs), rps12 interspersed gene, protein-coding sequences, tRNAs and rRNAs using 55% identity as thresholds for annotation of proteins and 90% for DNA as well as RNAs. Furthermore, tRNAscan-SE (v.2.0.7)⁶⁷ and Chloë (v.0.1.0)⁶⁸ were used as additional annotators within GeSeq. The annotations were manually edited using Geneious Prime v.2023.2.1 (Biomatters). The presence of chloroplast genomes differing in the orientation of the single-copy units (large single-copy (LSC) region, small single-copy (SSC) region) was checked by selecting motifs from the border region of the IR and the single-copy units (LSC-trnH-GUG 5′-GGTTCAATTCCCGTCGTTC-3′ or LSC-rps19 5′-GTGACACGTTCACTGAAAAAA-3′ and IRb-rps19-rpl2-IGS 5′-AGACGAAGAAACAAATTCTAT-3′; SSC-ndhF 5′-TGTAATAATATAATAATTGAA-3′ or SSC-ycf1 5′-CGACCCTAAACGATGGAATCG-3′ and IRa-ycf1 5′-TTGAAAAACCCGTTGTAACTAT-3′), noting their relative orientation to each other on the same reads using SeqKit (v.2.6.1)⁶⁹.

The assembled R. canina chloroplast genome had a length of 156,650 bp and a classical quadripartite structure (Supplementary Fig. 18): a LSC of 85,634 bp (~56.57% of the plastid genome), a SSC of 18,878 bp (~12.05%) and two IR regions of 26,069 bp (~16.64% each). Different isomers were found to differ in the orientation of the SSC and LSC (flip-flop configuration).

We computed a chloroplast phylogeny using 37 samples, including sequences downloaded from GenBank and newly assembled data (Supplementary Data 17). The alignment was performed with MAFFT⁷⁰, and the phylogenetic tree was calculated using IQ-TREE⁷¹ with the following settings: -m MFP –con-tree –burnin 250 -B 1000 -T 36 –wbtl.

Identification of the bivalent-forming subgenome and comparative analysis

The assembled chromosomes were subjected to pairwise comparisons presented as dot plots using the Synteny Mapping and Analysis (SyMAP) tool⁷². Multiple alignments within the synteny groups comprising five R. canina chromosomes plus R. chinensis and R. rugosa assemblies were carried out in the CLC Genomics Workbench (CLC) using the ‘whole genome alignment’ plugin with the following parameters: minimum initial seed length of 250; minimum alignment block length of 250. The aligned chromosomes were subjected to pairwise comparisons. The similarity values were calculated as the block fraction of the two genomes that were aligned (that is, the alignment percentage) or as the percentage of exactly matching nucleotides within the aligned blocks (the average nucleotide identity).

Multifasta files containing assembled short-read sequences of pollen SCO loci from eight different dogrose individuals (three samples of R. canina, two samples of R. corymbifera (subsect. Caninae) and three samples of R. rubiginosa (subsect. Rubigineae)) that were not sequenced by long reads (Supplementary Data 17) were used as queries to map them with the software BWA⁷³ with the aln command to the R. canina chromosome assembly. From the sequence alignment map (.sam) file, those chromosome hits with only one alternative were filtered according to the ‘XA:Z:’ flag using a Python script written by GPT-4 (ChatGPT Plus, OpenAI). A bubble map displaying the mean counts of chromosome pairs within different subsections was drawn with ggplot2⁷⁴.

Synteny analysis

Chromosome synteny was analysed with the Synteny and Rearrangement Identifier (SyRI)⁷⁵. For this purpose, chromosomes of subgenomes S1_h1, S1_h2, S3, R3 and R4 were aligned against each other within each linkage group (Rca1–Rca7) by minimap2^76,77 using the following command: minimap2 -ax asm5 –eqx -t 16 genome1.fa genome2.fa | samtools sort -@8 > aln.sorted.bam. Moreover, subgenome S1 was also aligned with R. chinensis (NCBI: GCA_041222415)¹⁶ and R. canina subgenome R4 was aligned with R. rugosa (NCBI: GCA_958449725.1; https://www.darwintreeoflife.org/) to analyse its synteny. To keep all of the chromosomes arranged in the same order as R. canina and for better visualization, chromosomes 2, 5 and 7 of R. chinensis were inverted, and the chromosomes of R. rugosa were reordered to 6, 1 (inverted), 7, 5 (inverted), 2, 3 and 4, corresponding to chromosomes 1–7 in R. canina. SyRI was implemented for all of the aligned genome pairs using the following command: syri -c aln.sorted.bam -r genome1.fa -q genome2.fa -k -F B –nc 16. Visualization revealed only syntenic blocks over 50 kb, which was performed by plots: Python PLOTsr –sr rugosa_R4/syri.out –sr R4_R5/syri.out –sr R5_s3/syri.out –sr S3 _S2/syri.out –sr S2_S1/syri.out –sr S1_chinensis/syri.out –genomes genomes.txt -o out_50k.pdf -S 0.7 -W 10 -H 9 -f 10 –itx -s 50000.

Syntenic orthologues among the primary annotations of diploid strawberry Fragaria nilgerrensis⁷⁸, R. chinensis¹⁶, R. rugosa (GCA_958449725.1, DToL; https://www.darwintreeoflife.org/) and the five sets of chromosomes of R. canina were inferred using the GENESPACE (v.1.2.3)⁷⁹ pipeline with the default parameters. In brief, GENESPACE compares protein similarity scores into syntenic blocks using MCScanX⁸⁰ and Orthofinder (v.2.5.4)⁸¹ to search for orthologues/paralogues within synteny constrained blocks. Syntenic blocks were used to query pairwise peptide differences among progenitor alleles, determine divergence among progenitor orthologues using R. chinensis syntenic anchors and search for specific orthogroups.

Self-synteny and fractionation bias

Synteny information was obtained using the SynMap tool on the CoGe platform^82,83. Only genes within synteny blocks were considered, including not only gene pairs but also singleton genes in each genome that lost their counterpart in the other genome due to fractionation or other gene loss. The identification of syntelogues between species was performed using SynMap2 (https://genomevolution.org/wiki/index.php/SynMap2), which internally uses LAST for sequence alignments⁸⁴, and then fractionation bias was analysed with FractBias⁸⁵.

dN/dS analysis

Protein-coding sequences (CDSs) were extracted for each R. canina subgenome according to coordinates from the gene structural annotation file using GffRead (v.0.12.6)⁸⁶ and translated into amino acid sequences using the transeq command from EMBOSS⁸⁷. Additional amino acid sequences and CDSs of R. rugosa (BioProject: PRJNA1061178; https://www.darwintreeoflife.org/), R. chinensis (NCBI: GCF_002994745.2)¹⁴ and Fragaria vesca (NCBI: GCF_000184155.1)⁸⁸ were downloaded. The CDS and amino acid sequences were validated, for example, for correct start codons or methionine as the first amino acid in the proteins, using Python scripts. The confirmed proteomes were subsequently analysed using OrthoFinder⁸¹ to identify common single-copy orthologues. According to the protein IDs, FASTA files for each orthologue gene containing five proteins of R. canina together with the three of outgroups were aligned with MAFFT (v.7.490)⁷⁰. On the basis of the aligned proteins, corresponding CDSs were codon based aligned using PAL2NAL⁸⁹ and DNA alignments were transformed into PHYLIP format. The PAML pipeline⁹⁰ with yn00 was used in a looped pairwise mode over all PHYLIP files for each subgenome and outgroup to estimate the nonsynonymous (dN) and synonymous (dS) substitution rates, as well as their ratio (dN/dS = ω). The results based on the Yang–Nielsen⁹¹ method were extracted from PAML output files and combined using a Python script and graphical visualized with ggplot2⁷⁴ in the R environment⁹². To covert the relative evolutionary time (Time_t) from yn00 into absolute divergence time (T_Ma) in millions of years ago (Ma), we used F. vesca as a fixed calibration point with its fossil record of 2.96 Ma (refs. ^93,94). The relative divergence time Time_t for each pairwise compared gene was multiplied with a scaling factor as follows:

$$T_\rmM\rma=\rmT\rmi\rmm\rme_t\times \left(\frac2.96\rmM\rme\rma\rmn\,\rmt\rmi\rmm\rme_t\,\rmf\rmo\rmr\,F.\,vesca\,\rmv\rme\rmr\rms\rmu\rms\,Rosa\right)$$

All scripts were developed with the help of ChatGPT-4o (ChatGPT Plus, OpenAI).

Chromosome-level phylogenetic reconstruction

We first generated whole-chromosome multiple alignments in synteny groups 1–7 using the Whole Genome Alignment tool in ClC workbench (Qiagen). The algorithm identifies seeds, that is, short stretches of nucleotide sequences that are shared between multiple genomes but not present multiple times in the same genome. These seeds were then extended using a HOXD scoring matrix, and the HOXD substitution score was combined with an adjustment term based on k-mer frequency to avoid spurious matches to repetitive regions in the genome⁹⁵. The program parameters were as follows: minimum initial seed length, 250; minimum alignment blocks, 250; and mismatches in seeds, allowed. The chromosome phylogenies were constructed from multiple alignments using RAxML (v.8.2.12)⁹⁶ with the GTRGAMMAI model. The diploid accessions were chromosome assemblies of R. chinensis¹⁶ and R. rugosa. For R. rugosa, the original chromosomes were renamed to fit the R. chinensis synteny.

Subgenome-aware phasing of R. canina

We used SubPhaser⁴¹ (default parameters) to phase and partition the subgenomes of the pentaploid R. canina and R. agrestis by assigning chromosomes to subgenomes based on differential repetitive k-mers. These were assumed to have expanded during the period of independent evolution after divergence from the nearest common ancestor and before the stepwise hybridization events (the divergence–hybridization period). A subgenome is considered to be well phased when it displays distinct patterns of both differential k-mers and homoeologous chromosomes, confirming the presence of subgenome-specific features, as expected. As the S1_h1 and S1_h2 chromosomes represent haplotypes of the S1 genome, only the S1_h1 haplotype was used in the phasing analysis together with the sets of S2, R3 and R4 chromosomes.

LTR insertion times were calculated by Subphaser as follows: LTR-TRs were de novo detected using LTRharvest (v.1.6.1)⁹⁷ and LTRfinder (v.1.07)⁹⁸. To reduce false positives, TEsorter (v.1.3.0)⁹⁹ was used to reconstruct the classification of LTR-RTs and further refine this classification. The subgenome-specific k-mer sequences were mapped to the LTR-RT sequences using a substring match procedure to identify the subgenome-specific LTR-RTs using the Fisher’s exact test. Two LTRs of each subgenome-specific LTR-RT were retrieved and the nucleotide divergence was estimated using the Jukes–Cantor 1969 model. The insertion time (T) was calculated using the equation T = K/2r, where r = 1.3 × 10^–8 substitutions per year (default)¹⁰⁰, and K represents the divergence of the LTRs from the LTR-RT.

Flow cytometric determination of the endosperm/embryo ratio

To isolate the nuclei from embryo and endosperm tissue, nutlets from fruits of the sequenced individual of R. canina S27 (voucher: GLM12396) were first cracked with pliers. The embryo and endosperm were then carefully transferred into a droplet of nuclei isolation buffer (CyStain PI Absolute P; Sysmex-Partec) in a Petri dish and chopped with a sharp razorblade. After adding additional nuclei isolation buffer to a final volume of 500 µl, the nuclei suspensions were filtered through 50 µm disposable filters (CellTrics, Sysmex-Partec), stained with 4′,6-diamidino-2-phenylindole (DAPI) to a final concentration of 1.5 µg ml⁻¹ and stored on ice until use. The measurements were performed on a CyFlow Space flow cytometer (Sysmex-Partec) equipped with a high-power UV LED (365 nm).

SCOs

Plant material

To analyse the phylogenetic origin of the subgenomes of allopentaploid R. canina, we sampled 30 rose individuals of 24 diploid species across the genus Rosa. Thus, R. stellata (subgen. Hesperhodos), 10 species from sect. Synstylae, seven from sect. Rosa (Cinnamomeae, including sect. Carolinae), four from sect. Pimpinellifoliae and one individual of R. chinensis (sect. Chinensis) were sampled from the living collection of the Europa-Rosarium Sangerhausen. Moreover, one accession of R. majalis (sect. Rosa) was collected from the Botanical Garden Würzburg, and R. persica (subg. Hulthemia) from Botanical Garden Jena. Species were rechecked using their respective floras^101,102,103, and the material was compared with available herbarium specimens available online (JSTOR Global Plants, https://plants.jstor.org/; Moscow Digital Herbarium, https://plant.depo.msu.ru). Herbarium vouchers were deposited in the Herbarium Senckenbergianum Görlitz (GLM; Supplementary Data 17).

To determine bivalent-forming genomes, we sampled pollen from several individuals of Rosa sect. Caninae (subsect. Caninae: three 5x R. canina, two 5x R. corymbifera; subsect. Vestitae: one 5x R. pseudoscabriuscula; subsect. Rubigineae: three 5x R. rubiginosa; Supplementary Data 17). We collected anthers from 50 to 100 freshly opened flowers under dry weather conditions in early May 2021 in the field, stored them in open glass for 1 day to allow the anthers to open and subsequently transferred them to a 50 ml tube. Owing to electrostatic attraction, the pollen deposited on the walls of the tube. Anthers were carefully removed, and the pollen powder was collected at the bottom of the tube by gentle centrifugation. The pollen powder was then tapped out over clean paper and transferred to tubes with the help of a spatula. This procedure was repeated three times. Pollen grains were stored in a refrigerator until use.

Isolation and flow sorting of pollen nuclei

Nuclei of mature pollen grains were isolated by the filter bursting method¹⁰⁴ using the nuclear isolation buffer as described previously¹⁰⁵. Pollen grains were burst on the surface of a 20 μm disposable CellTrics filter (Sysmex-Partec). The resulting nuclear suspension was stained with propidium iodide (50 µg ml⁻¹, PI) and run on a BD Influx cell sorter (BD Biosciences). Nuclear populations were identified in a dot plot showing the PI fluorescence signal (log scale) versus side scatter signal (SSC, log-scale). A sort gate was defined based on the corresponding fluorescence intensity (lin-scale) histogram. A total of 200,000 individual generative nuclei (volume, around 400 µl) were collected into a 1.5 ml reaction tube using the ‘1.0 Drop Pure’ sorting mode of the BD FACS software (BD Biosciences). After adding 50 µl of 1× TE and 50 µl of NaN₃, the nuclei were sedimented by centrifugation (1,000g for 10 min at 4 °C). Next, 300 µl of the supernatant was removed, and the nuclei with the remaining liquid were stored at −20 °C. The gating strategy to isolate generative nuclei of R. canina is presented in Supplementary Fig. 19.

DNA extraction

DNA from diploid rose species was first extracted from 20 mg of silica-dried leaf tissue according to the ATMAB protocol¹⁰⁶ and subsequently purified using the Mag-Bind Total Pure NGS Kit (Omega Bio-Tek, Nocross) according to the manufacturer’s manual. DNA from flow-sorted pollen nuclei was extracted using the Mag-Bind Plant DNA DS Kit with the modification that permanent but careful mixing was performed during binding and elution because the DNA quantities ranged from 37 ng to 236 ng. The DNA yield was quantified using the Qubit 4 Fluorometer (Thermo Fisher Scientific).

Target construction

To analyse nuclear single-copy regions in rose genomes, we used published SCO tags¹⁰⁷. The SCO tags were originally developed to be amplifiable by PCR and covered coding as well as non-coding regions. We used the 29,000 sequences from additional file 3 from ref. ¹⁰⁷, which consisted of SCO tags of 17 rose species and seven outgroup species of the Rosaceae family. These sequences were filtered for uniqueness so that duplicates were removed and searched with BLAST in the R. chinensis haploid line genome (v.1.0)¹⁵. Owing to the structural gene model annotation of the R. chinensis genome, we were able to identify 923 full-length nuclear genes with single-copy characteristics. The target-capturing baits were designed by the Agilent bioinformatics service (I. Kisakesen, Agilent Technologies) and covered exons + UTRs with flanking regions and small introns of the selected genes in the R. chinensis genome. Finally, the target consisted of 5,794 sequences of different lengths (the shortest at 179 bp and the longest at 6,544 bp) named according to R. chinensis gene prediction and had a total size of 2 Mb (Supplementary Data 12). All target sequences were covered by 2× tiling with a total of 85,670 specific baits.

Sequencing

For target enrichment, we used the SureSelect XT HS2 DNA system with precapture pooling (Agilent Technologies) and target design as described above. For diploid roses, 200 ng of input DNA was used, and for pollen DNA, 36–200 ng of input DNA was sheared with a Bioruptor Pico sonication device (Diagenode) to a recommended fragment size of 180–250 bp. The Illumina short-read libraries were amplified for 9 cycles after adapter ligation, pooled for precapture to 16 samples and then postcapture library pools were amplified again with 12 cycles of PCR amplification. The library pools were sequenced in 150 bp paired-end mode on the Illumina NovaSeq 6000 system by Novogene with approximately 1 GB of data output per sample.

To analyse the ploidy of the samples, in vitro flow cytometry was performed on silica-dried leaflets according to a protocol described previously⁴⁵ using R. arvensis (2n = 2x = 14) as an internal standard. The fluorescence intensity was measured using the CyFlow Ploidy Analyser (Sysmex Partec), and the data were analysed using Flowing Software v.2.5.1 (Turku Bioscience Centre). Each sample was measured three times with a minimum of 3,000 particles.

To estimate the ploidy of the samples in silico, we used K-Mer Counter (KMC) (v.3.1.1)^108,109 to generate a k-mer database from FASTQ sequence files containing short-read data covering SCOs. The setting was a k-mer size of 21, a minimum count for a k-mer to be included of 1 and an upper limit for k-mer counts of 5,000. To avoid noise, KMC database reduction was performed using the transform operation with the L 30 and U 5000 settings. With smudgeplot⁶¹ analysis and its hetkzer operation, the coverages of the identified k-mer pairs were written to a ‘_coverages.tsv’ file. A custom R script with ggplot2⁷⁴ and data.table packages⁹² was used to plot the distribution of frequencies of different SNP ratio classes. For each sample, the ploidy level was then estimated by visual inspection of the plots.

Target back-mapping, variant calling and creating a sample-specific reference

The raw SCO reads were trimmed using Trimmomatic (v.0.39)¹¹⁰ with the following settings: 2:30:8 LEADING:13 TRAILING:13 SLIDINGWINDOW:4:19 MINLEN:36. We updated the script from a previous study¹¹¹ to run it with current package versions and used it for mapping, variant calling and sample-specific reference building. In brief, the trimmed short reads from the target enrichment sequencing were mapped against the SCO targets of the R. chinensis reference genome (5,794 sequences) using the BWA program⁷³. Using SAMtools (v.1.16.1)¹¹², the reads were sorted and indexed, and duplicates were removed. Notably, approximately 98% of trimmed reads were successfully mapped to the target. Hits with exactly one alternative mapping position were subsequently filtered. After mapping, the Genome Analysis Toolkit (GATK) (v.4.1.9.0)¹¹³ was used with the operation HaplotypeCaller¹¹⁴ for variant calling, BaseRecalibrator and ApplyBQSR were used to realign around SNPs and indels, and FastaAlternateReferenceMaker was used to create a sample-specific consensus sequence as a reference for each SCO locus in each sample. The provided ploidy level for the HaplotypeCaller was 2 (diploid) for both the pollen and diploid roses with regular meiosis, and the –max-alternate-alleles flag was set to 6, so that although the pollen is monoploid, it would be possible to call potential variances.

Phylogenetic reconstructions based on SCO markers

The SCO target was used as a query for a local search with BLAST+⁶⁵ in our R. canina S27 genome assembly with a customized output table (-outfmt 6 qseqid sseqid pident length qstart qend sstart send evalue bitscore) and additional in the DToL R. canina (PRJEB79801) and R. agrestis (PRJEB79880; https://www.darwintreeoflife.org/) genomes also assembled by us (see below). Those SCO loci that had only five hits, one each on subgenomes S1_h1, S1_h2, S2, R3, R4 and R4_h1, R4_h2, R3, S2 and S1 for R. agrestis, respectively and within the same linkage group, were filtered and considered single copies. A main list of common single-copy loci for all three genomes was created to preserve the correct order and used to extract the filtered loci from the BLAST outputs with the grep command. The filtered BLAST output was then converted into a BED file containing the sequence coordinates using a bash script written with the help of GPT-4 (ChatGPT Plus, OpenAI). Using the BEDtools (v.2.30.0)¹¹⁵ command getfasta, sequences for each SCO locus were extracted from the R. canina genome assembly and written into a multifasta file. To obtain sequences with the same strand orientation, two locus lists were also created: one of the loci with a positive strand orientation and one with a negative-strand orientation. Loci with negative-strand orientation were identified by calculating the end coordinates minus the start coordinates and filtering according to negative values. According to both lists, the sequences were extracted and stored in two separate multi-FASTA files. Sequences with negative-strand orientation were reversed and complemented with SeqKit¹¹⁶ and combined with the positive strand-oriented SCO sequences in one fasta file. Finally, for each subgenome (S1_h1/S1_h2, S2, R3, R4 for R. canina and R4_h1/h2, R3, S2, S1 for R. agrestis), the extracted SCO sequences were concatenated in the same order according to the main locus list and written to subgenome-specific fasta sequences. The same procedure was used for the haploid genome assemblies of Rubus ideaus (GenBank: GCA_030142095.1)¹¹⁷ and three strawberry species, F. vesca subsp. vesca (GCA_000184155.1)⁸⁸, Fragaria iinumae (GCA_009720345.1)¹¹⁸ and F. nilgerrensis (GCA_010134655.1)⁷⁸ as outgroups. Moreover, the same single-copy loci considered in the genome assembly were extracted and concatenated in the same order with target enrichment samples from nine pollen samples and 30 leaf samples of 26 diploid rose species. The concatenated multilocus sequences were aligned using MAFFT (v.7.490)⁷⁰. Finally, a maximum-likelihood phylogenetic tree was generated by applying IQTREE⁷¹ with ModelFinder using the following settings: iqtree2 -s -m TEST –con-tree –burnin 250 -B 1000 -T 12 –wbtl. The tree figures were graphically finalized with MEGA X¹¹⁹ and Inkscape v.0.92.3 (2405546, 2018-03-11) software.

Analyses of synthetic hybrids

Synthetic hybrid R. canina (seed parent) × R. rubiginosa (pollen parent) (sample ID, D62b_2; SRA: SRR15033882) was a cross between R. canina (sample ID, D3b_2; SRA: SRR15033883) and R. rubiginosa (sample ID, D145b_2; SRA: SRR15033877), and the second synthetic hybrid R. rubiginosa (seed parent) × R. corymbifera (pollen parent) (sample ID, D166b_2; SRA: SRR15033879) was a cross between R. rubiginosa (D145b_2; SRA: SRR15033877) and R. corymbifera (sample ID, D81b_2; SRA: SRR15033881). These hybrids were originally produced by Wissemann and Hellwig⁴³ and kept as a living plant in the Botanical Garden Gießen, Germany. Whole-genome short-read sequencing was performed for both hybrids and their parental plants. The mean coverage of the maternal plant (sample ID, D3b_2) is ~27×, and the paternal plant (sample ID, D145_b2) is ~27×. The hybrid’s (sample ID, D62b_2) coverage is ~29×. The reciprocal hybrid (sample ID, D166_b2) has an average of ~30× coverage, whereas its paternal plant (sample ID, D81_b2) is ~19×.

The reads from these six samples were mapped to the S1 subgenome of our assembled R. canina, respectively, using bowtie2 (v.2.5.1)⁶³ with the default parameters. Filtering was applied for all alignments with the same setup ‘samtools view -F 3340 –min-MQ 1’. The coverage of each sample was calculated by ‘bedtools coverage’ (v.2.30.0) with a 100 kb window size. SNPs were called with the filtered alignments by bcftools (v.1.9)¹¹². Specifically, ‘bcftools mpileup’ ran first with the minimum mapping quality 1, then ‘bcftools call’ ran with flags ‘–keep-alts –variants-only –multiallelic-caller’. In the end, only the unique SNPs in each parent were selected to calculate the SNP contribution in the hybrids.

ModDotPlot analysis

Structural analysis of DNA sequences of whole chromosomes and centromere cuts were performed with ModDotPlot (v.0.9.0)¹²⁰ using the default parameters. ModDotPlot is a dot plot visualization tool designed for large sequences and whole genomes. The method outputs an identity heat map by rapidly approximating the average nucleotide identity between pairwise combinations of genomic intervals.

Gene and repeat sequence annotation

The predicted gene model structures in the nuclear genome were annotated by applying the full-length chromosome sequences to Helixer¹²¹. Moreover, complete LTR retrotransposons were annotated with the DANTE and DANTE-LTR tools implemented in RepeatExplorer2^122,123. R. canina short-read data (SRA: ERR1662939) were subjected to clustering analysis using the RepeatExplorer2 pipeline, and the output library of repeats was subsequently used to annotate the genome with the implemented RepeatMasker¹²⁴. Tandem-repeat annotation and genome abundance estimation were performed using TAREAN and TideCluster implemented in RepeatExplorer2¹²².

RNA sequencing and analysis

Total mRNA was extracted from the leaf tissue of R. canina S27 using the Spectrum Plant Total RNA-Kit (Sigma-Aldrich). The RNA-sequencing library was prepared with poly(A) enrichment and then sent for sequencing on the NextSeq 2000 platform with 2 × 150 bp mode, resulting in 33,594,132 reads. For a more accurate mapping of RNA sequences, the annotated tandem repeats and transposable elements were hard-masked from the genome. RNA alignment was done using hisat2 (v.2.1.0)¹²⁵ with the flag –no-mixed. The output was then filtered by only allowing for tag NM:0 and minimum mapping quality 2. To count the transcripts number for each gene, we converted the masked genome to protein sequences based on Helixer¹²¹ structural annotation, and then functionally annotated the protein sequences by Mercator4 (v.7.0)¹²⁶ with both Prot-scriber and Swissprot databases, then htseq-count (v.2.0.1)¹²⁷ was applied to count the transcripts for all annotated proteins. The gene expression was analysed by DESeq2¹²⁸. As the high homozygosity between the haplotypes of S1 subgenome, RNA reads were aligned to S1_h1, S2, S3, S4 genome, and the expression level of S1 was then halved (Supplementary Fig. 20 and Supplementary Table 4).

CENH3 ChIP–seq experiment and analysis

For detecting the functional centromeres of R. canina S27, we designed a specific polyclonal antibody against its CENH3 protein (ARVKHTAARKDRIKTARRQP-C, AB016310), synthetized by LifeTein with immunization in rabbits. The CENH3 gene of R. canina S27 was identified using BLASTP with the parameter ‘-evalue 1e-5 -qcov_hsp_perc 50’ and the A. thaliana CENH3 protein HTR12 (AT1G01370) was used as the reference. The ChIP experiment was performed as described previously¹²⁹ with a few modifications. Young leaves (around 2–5 g) of R. canina S27 were collected and cross-linked in 4% formaldehyde in 1× PBS on ice with vacuum infiltration applied for 30 min. The quenching was performed applying 1 M glycine in each sample followed by vacuum infiltration at room temperature for 15 min. The material was then macerated in liquid nitrogen and the chromatin was extracted. After extraction, the chromatin was sonicated for 30 min on a Bioruptor (Diagenode) until fragments of around 200–600 bp length were achieved (30 s on; 30 s off; in high mode). The sonicated chromatin was incubated over night at 4 °C with 1 µg of each polyclonal antibodies (anti-CENH3 specific for R. canina (LifeTein, AB016310) raised in rabbit and anti-histone H3 (Active Motif, 39064) raised in mouse). Samples with no addition of primary antibodies were also incubated as input control samples and at least two experimental replications were used for each ChIP combination. After incubation, protein beads (anti-rabbit: rProtein A Sepharose FastFlow 50% slurry; anti-mouse: rProtein G Sepharose FastFlow 50% slurry (GE Healthcare)) were washed and added to each complex protein–antibody and incubated for at least 2 h at 4 °C in slow rotation. The final recovered chromatin was eluted from the beads, followed by a de-cross-linking step and final DNA extraction. After quality control using the 4200 TapeStation System (Agilent Technologies), the samples were forwarded for 150 bp paired-end Illumina sequencing. For the analysis, the raw 150 bp paired-end reads were quality checked and then mapped to the R. canina haplotype phased reference genome using the default parameters in bowtie2⁶³. The BAM file was converted to bigwig using the bamCompare tool from deeptools2¹³⁰, and then normalized to reads per kilobase of transcript per million reads mapped. Peak calling was then performed using the MACS3 pipeline¹³¹ with the inclusion of the parameters –broad -g 1.9e+9. The plots showing the distribution of different genomic features per chromosome or specific region were constructed using pyGenomeTracks¹³². The ChIP–seq signals in metaplots to compare chromosome (Extended Data Fig. 7 and Supplementary Fig. 14) and subgenome CENH3 enrichment (Fig. 3e) were calculated by bamCompare with parameters ‘–ignoreDuplicates –scaleFactorsMethod readCount –operation log2’ to normalize the CENH3/H3 by read coverage.

Functional centromere annotation

Functional centromere regions in the genome assembly of R. canina S27 were annotated based on the detection of CENH3 peaks with MACS3 (see above). The total centromere length was then calculated by the interval between the 5′ and 3′ CENH3 peaks. After alignment to the annotated functional centromeres in R. canina S27, comparable centromeric regions were extracted from DToL R. canina and R. agrestis (https://www.darwintreeoflife.org). The repeat abundance of CANR4 satellite repeats and Ty3/Gypsy ATHILA retrotransposons in the predefined centromeric regions were determined in base pairs for each chromosome of the three investigated Rosa genomes (R. canina S27, R. canina DToL and R. agrestis DToL). To reduce data skewness the data were log-transformed. A Shapiro–Wilk normality test was used to check normal distribution of the data with R (v.4.3.3)⁹² (29 February 2024). A bivariate Bayesian generalized linear mixed model was implemented using the MCMCglmm package¹³³. The model included pairing type (bivalent B, univalent U and univalent in R. canina but bivalent in R. agrestis Ub) as a fixed effect, while subgenome, genome, and synteny group were random effects, with an unstructured covariance structure (us(trait):random_effect) to account for correlations between response variables. MCMC settings included 100,000 iterations, with a 50,000 burn-in and a thinning interval of 50 and the family parameter was set according to ‘gaussian’. Data visualization was performed using the ggplot2⁷⁴, patchwork¹³⁴, tidyr¹³⁵ and dplyr¹³⁶ packages (Supplementary Fig. 15; source data are available in Supplementary Data 13 and 16). The correlation of CANR4 size with CENH3 abundance (Fig. 3e) was calculated by Spearman’s rank correlation as the Shapiro–Wilk normality test resulted in P ≪ 0.5. Linear regression model was fitted using the lm function in R (v.4.4.0), with multiple R² value as 0.842 and adjusted R² value as 0.836.

cenLTR sequence characterization

cenLTR sequences were primarily annotated as tandem repeats using TAREAN and TideCluster implemented in RepeatExplorer2¹²². Further sequence similarity with LTR retrotransposons was performed using the transfer annotation tool of Geneious Prime v.2025.0.2 (https://www.geneious.com) with a minimum sequence similarity threshold of 75%. Using a Geneious Prime plugin for ClustalO¹³⁷, we performed alignments of consensus cenLTR sequences against the regions with the highest similarity found in the R. canina S27 genome, which all corresponded to different ATHILA elements on chromosomes Rca1_R4 and Rca4_R4. Consensus sequences of cenLTR1–4 are available in Supplementary Dataset 14.

DNA methylation sequencing and analysis

To investigate the methylome of R. canina, we performed enzymatic methyl-sequencing (EM-seq). For this, we extracted genomic DNA from young leaves and the samples were then prepared for an Illumina-compatible library using the NEBNext Enzymatic Methyl-seq Kit and further sequenced on the HiSeq 3000 device with paired-end orientation. We ended up with 68,632,618 pairs of 150 bp reads. EM-seq data were first aligned to the S1_h1, S2, R3, R4 combined subgenomes with Bismark (v.0.23.0) with the flag ‘–local’ and duplications were removed by deduplicate_bismark. CpG-, CHG- and CHH-context methylations were then extracted by bismark_methylation_extractor (v.0.23.0). The output was converted to bedgraph by bismark2bedGraph (v.0.23.0) with the flag ‘-CX’ activated for CHG and CHH contexts to visualize the methylations chromosome-wide and on the centromeres.

Metaplots of CENH3 enrichment, DNA methylation, ATHILA and CANR4 density

In the metaplots (Fig. 3d, Extended Data Fig. 7 and Supplementary Fig. 14), all signals were smoothed by the spline.smooth function with spar 0.3 in R (v.4.4.0). CENH3 enrichment was calculated by CENH3 ChIP–seq (log₂[CENH3/H3]) signal normalized by coverage. CENH3 enrichment, DNA methylations, ATHILA density and CANR4 were calculated in 50 kb adjacent windows and averaged by all chromosomes of the corresponding subgenome. All chromosome coordinates were scaled on the basis of their distance to centromere against the distance of centromere to telomere. Centromere position was defined on the basis of where the maximum CENH3 enrichment was located. Mitochondrial sequences were masked when computing the CENH3 enrichment. All signal values (y axis of metaplots) were scaled from 0 to 1 based on the global minimum to global maximum except for DNA methylations, for which the original percentage values were retained. The p- and q-arm values were averaged and mirrored.

Immunodetection of CENH3 and microtubules

For immunodetecting the centromeres of R. canina S27, we used polyclonal antibodies against CENH3 protein (see above) and kinetochore protein KNL1 (C-EDHFFGPVSPSFIRPGRLSD, AB015677-3) described previously⁴⁸, also synthetized by LifeTein and raised in rabbits. To identify the microtubules, we used a commercial antibody against alpha-tubulin (Sigma-Aldrich, T6199) with immunization in mouse. For analysing the distribution of these markers in mitotic cells, root tips were fixed after a pretreatment in 0.2 mM 8-hydroxyquinoline for 4 h at 18 °C. For meiotic stages, the young anthers were directly fixed with no previous antimitotic pretreatment. The immunodetection experiment was performed according to a previously published protocol¹³⁸ with modifications to R. canina material. Young flower buds were collected on ice in buffer A (15 mM PIPES–NaOH, 80 mM KCl, 0.5 mM ethylene glycol tetraacetic acid, 80 mM sorbitol, 20 mM NaCl, 2 mM ethylenediaminetetraacetic acid (EDTA), 0.15 mM spermine, 0.5 mM spermidine and 1 mM dithiothreitol) and next incubated in 4% paraformaldehyde in buffer A for 1 h under vacuum infiltration on ice. After the fixation, the samples were washed three times with buffer A and then digested in enzymatic solution containing 1% cellulase-onozuka, 1% cellulase, 1% pectolyase Y23, 1% cytohelicase, 1% macerozyme and 10% pectinase in citrate buffer for 1 h in a humid chamber at 37 °C. To remove the excess of enzymatic solution, the material was gently washed with buffer A and left on ice until the preparation of the slides. A couple of anthers were placed and dissected in a drop of buffer A on a 18 × 18 mm high-precision coverslips, a few μl of polyacrylamide solution (25 µl 15% polyacrylamide (Sigma-Aldrich, A3574) in buffer A plus 1.25 µl of 20% sodium sulfite and 1.25 µl of 20% ammonium persulfate) were added to the dissected anthers, quickly mixed and a second coverslip was put above the first making a sandwich gently squeezing the anthers with a needle to liberate the meiocytes. The sandwiches were allowed to dry for up to 1 h until complete polymerization. After this, the coverslips were carefully separated and incubated in PBS with 1% Triton X-100 and 1 mM EDTA for at least 1 h, then more 2 h in blocking solution containing 3% BSA in PBS with 0.1% Tween-20. After this period, the primary antibodies were diluted in 1:500 (CENH3 and KNL1) and 1:200 (alpha-tubulin) ratios in blocking solution and applied on each sample, which were sequentially incubated at 4 °C for 48 h. After primary antibody incubation, primary antibodies were detected using secondary antibodies conjugated with specific fluorophores (Alexa Fluor 488 and Abberior StarRed and STAROrange for STED microscopy, also diluted in blocking solution in a proportion of 1:250) and incubated in a dark humid chamber at room temperature for at least 2 h. The material was then washed four to five times for 20 min each in 1× PBS + 0.1% Triton X-100 and then mounted in SlowFade Gold medium containing DAPI. The slides were photographed using a super-resolution STED microscope (Abberior instrument facility line; https://abberior-instruments.com/) and posterior brightness and contrast adjustments were done in Photoshop.

Chromosome preparation and FISH

For mitotic chromosome preparations, root tips and young flower buds from R. canina S27 plants cultivated in the greenhouse were collected and then fixed in methanol:acetic acid solution (3:1 (v/v)) for 2–24 h at room temperature and then kept at −20 °C until use. After fixation, the root tips were pretreated with an enzymatic solution of 2% cellulase R10-onozuka (Duchefa Bioquemie)/20% pectinase (Sigma-Aldrich) in 0.1 M citric acid for 40 min at 37 °C in a humid chamber and then squashed in a drop of LB01 buffer (15 mM Tris, 2 mM Na₂EDTA, 80 mM KCl, 20 mM NaCl, 0.5 mM spermine, 15 mM β-mercaptoethanol, 0.1% Triton X-100 (pH 7.5)) and, after frozen in liquid nitrogen, the coverslips were removed.

For meiotic chromosome preparations, the anthers of R. canina C1 (GLM-P-0181117) were dissected from fixed flower buds around 0.5 cm in length. Anthers were washed with 1% (w/v) polyvinylpyrrolidone 40 (PVP-40; Sigma-Aldrich Chemie) and 0.5% (v/v) Triton X-100 for 15–20 min, followed by enzymatic digestion overnight in a humid chamber at 4 °C in 1% (w/v) cellulase Onozuka R-10 (Serva), 0.2% (w/v) pectolyase Y-23 (Sigma-Aldrich), 0.5% (w/v) hemicellulose (Sigma-Aldrich) and 0.5% (w/v) macerozyme R-10 (Duchefa Biochemie) dissolved in citric buffer (0.04 M citric acid and 0.06 M sodium citrate). The anthers were macerated on a slide, squashed in a drop of 70% acetic acid and fixed by freezing in liquid nitrogen.

For FISH experiments, a 22 bp oligo probe directly labelled with a Cy3 fluorophore at the 5′ terminus was designed, synthesized by Sigma-Aldrich and then used to detect the CANR4 satellite repeat (Cy3-5′-ACCCTAGAAGCAAGAAGTTTGG-3′) or an insert of the plasmid carrying the CANR4 dimer (GenBank MK069593) was used as a FISH probe²⁰. For detection of the centromeric LTR ATHILA retrotransposon sequences, we designed a probe based on clustering analysis of Illumina reads (SRA: ERR1662939) using the RepeatExplorer2 pipeline. It was revealed that cluster 5 (CL5) contig contained Ty3/Gypsy/ATHILA sequences. The CL5 contig was used to design PCR primers amplifying a 180 bp product from R. canina genomic DNA. The primers were as follows: Rcan_centr_CL5_for: 5′- GCAAGCGCATAATTTAACC-3′ and Rcan-centr_CL5_rev: 5′-CAATCAAAAATATCCCCCC-3′. The PCR product was purified and cloned into the pDrive vector (Qiagen) and sequenced by the Sanger dideoxy method using the SP6 primer (Micosynth). Clone 11 was submitted to GenBank (PV030978). The inserts of plasmids were directly labelled in a nick translation reaction with Cy5 d-UTP or Cy3 d-UTP fluorochromes (Jena Bioscience) and used for FISH. To detect the 5S and 35S rDNA loci, the full-length 18S rRNA gene from tomato (GenBank: X51576.1) and the Pta71 clone from Triticum aestivum were used to detect the 35S rDNA region, while a 5S rDNA unit (B variant) from R. canina³⁵ and the D2 clone from Lotus japonicus were used to detect the 5S rDNA locus. rDNA robes were directly labelled by Nick translation using Cy5 d-UTP (Jena Bioscience). The slides were prepared in accordance with the protocols described previously^35,139. In brief, the slides were treated with pepsin solution (1 mg ml⁻¹ diluted in 0.01 N HCl) for 30 min at 37 °C in a humid chamber, washed with 2× SSC (saline sodium citrate, pH 7.0) solution, post-fixed with 4% paraformaldehyde for 10 min at room temperature, washed again with 2× SSC and then dried in 70% and 100% ethanol. After air drying for at least 30 min, the slides were denatured with hybridization mix (50% formamide, 2× SSC, 10% dextran sulfate and ~50 ng of each probe (15 μl per slide)) for 5 min at 75 °C and then incubated for at least 18 h at 37 °C. After hybridization, stringency washes were performed with 2× and 0.1× SSC solutions at 42 °C, achieving around 76% stringency. The slides were then washed at room temperature with 2 × SSC solution and mounted with DAPI in the antifade mounting medium Vectashield (Vector Laboratories).

Alexander staining

Five mature and well open flowers were collected from the plant in the greenhouse. They were shaken above a microscope slide and their pollen was released on top of the slide. Then, 20 µl of Alexander staining solution (Morphisto, 13441.00250) was added and briefly mixed with the pollen by stirring with the pipette tip. A coverslip 24 × 40 mm was put on top of the mix. Pictures were taken with a Labscope microscope by Zeiss, using 10 × magnification. Five snapshots were counted with the help of the ZEN software.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Bimodal centromeres in pentaploid dogroses shed light on their unique meiosis

Plant material

Whole-genome sequencing

Genome assembly

Rca_S2 assembly correction

Assembly of DToL datasets

k-mer analysis for genome size and ploidy level estimation

Chloroplast genome assembly and phylogeny

Identification of the bivalent-forming subgenome and comparative analysis

Synteny analysis

Self-synteny and fractionation bias

dN/dS analysis

Chromosome-level phylogenetic reconstruction

Subgenome-aware phasing of R. canina

Flow cytometric determination of the endosperm/embryo ratio

SCOs

Plant material

Isolation and flow sorting of pollen nuclei

DNA extraction

Target construction

Sequencing

Target back-mapping, variant calling and creating a sample-specific reference

Phylogenetic reconstructions based on SCO markers

Analyses of synthetic hybrids

ModDotPlot analysis

Gene and repeat sequence annotation

RNA sequencing and analysis

CENH3 ChIP–seq experiment and analysis

Functional centromere annotation

cenLTR sequence characterization

DNA methylation sequencing and analysis

Metaplots of CENH3 enrichment, DNA methylation, ATHILA and CANR4 density

Immunodetection of CENH3 and microtubules

Chromosome preparation and FISH

Alexander staining

Reporting summary

Most Popular

Recent Comments

ABOUT US

POPULAR POSTS

POPULAR CATEGORY