Site and specimen
Located in the Haughton impact crater (75° N, Nunavut, Canada), the Haughton Formation comprises the remnants of a large, post-impact lacustrine deposit, dated to the Early Miocene. Previous dating estimates, using fission-track and 40Ar–39Ar furnace step-heating dating, identified an age of 24–21 Ma (refs. 33,51). An Early Miocene age has also been corroborated by (U-Th)/He thermochronology52. Although older age estimates between 30 and 40 Ma have also been suggested53,54,55, there have been no age estimates younger than the Early Miocene. Therefore, we conservatively use the younger Early Miocene age estimates in our analysis and interpretation.
The highly endemic fauna of the Haughton Formation consists of several vertebrate taxa, including a transitional pinniped11, a pair of salmoniform fishes, a swan-like anatid, a small artiodactyl, a leporid rabbit, a heterosocid shrew and a well-preserved rhinocerotid10,36. Although the megafloral assemblage is not particularly rich, the palynofloral assemblage is well-characterized, allowing for reconstruction of local climatic conditions. In the Early Miocene, the Haughton Crater lake and its surrounding environs experienced a significantly warmer annual temperature (8–12 °C) than the present day10,56.
Specimen CMNFV59632 is a nearly complete rhinocerotid skeleton, including skull and dentition, uncovered 10.8 m above the base of the formation36. Our analysis focuses on a single tooth fragment from a lower left m1 (Fig. 1b and SuppIementary Information) that was already separated from the rest of its tooth row because of the fragmenting effectings of cryoturbation36. The dental specimen’s rhinocerotid affinities are further supported by its size and morphology (Supplementary Information), most notably the presence of vertical Hunter–Schreger bands on its enamel, a defining feature of rhinocerotids and found in few other mammals57. A single tusk fragment (left i2) derived from CMNFV59632 was also selected for proteomic extraction. Owing to its thin enamel, only limited peptides were recovered from this tusk fragment, and the sample is thus excluded from further analysis and discussion.
Proteomic extraction
The laboratory workflow for the CMNFV59632 teeth and the Fontana Ranuccio Stephanorhinus tooth (for comparison) generally follows that reported in refs. 2,35. Using a sterilized drill, flakes of enamel were removed from the fragmentary teeth, with care taken to avoid sampling the dentine. The CMNFV59632 tooth enamel sample, weighing 154 mg, was then ground to a fine powder and demineralized overnight using 10% high-performance liquid chromatography (HPLC)-grade trifluoroacetic acid (TFA) (Merck, Sigma-Aldrich) in high-purity liquid chromatography–mass spectrometry (LC–MS) grade water. The CMNFV59632 tusk enamel sample, weighing 90 mg, was processed in the same way. The Fontana Ranuccio (FR sd-295) enamel sample was divided into three subsamples—FR2, FR3 and FR4—weighing 202, 243 and 205 mg, respectively, which were similarly ground to a fine powder, and demineralized using 10% TFA (FR3, FR4) or 10% HCl (FR2) in high-purity LC–MS grade water. For each sample, the demineralization step was repeated a second time to ensure complete demineralization. No enzymatic digestion was performed. Subsequently, peptides were collected and desalted on C18 StageTips58 produced in-house. An extraction blank for each sample set was processed alongside the samples for every step, to control for contamination.
Mass spectrometry
StageTips were eluted with 30 μl of 40% acetonitrile (ACN) and 0.1% formic acid in high-purity LC–MS grade water, into a 96-well plate. To remove ACN and concentrate the samples, the plate was vacuum-centrifuged until approximately 3 μl of sample remained in each well. Next, samples were resuspended in 6 µl of 5% ACN, 0.1% formic acid in high-purity LC–MS grade water, and 4 µl (CMNFV 59632) or 5 µl (FR sd-295 Stephanorhinus), of sample were injected.
Liquid chromatography coupled with tandem mass spectrometry was used to analyse the samples, on the basis of previously published protocols2,59. Samples were separated on a 15-cm column (75 μm inner diameter in-house laser pulled and packed with 1.9-μm C18 beads (Dr Maisch)) on an EASY-nLC 1200 (Proxeon) connected to an Exploris 480 (CMNFV59632) or a Q-Exactive HF-X (Fontana Ranuccio Stephanorhinus) mass spectrometer (both Thermo Fisher Scientific), with an integrated column oven. Buffer A, containing 0.1 % formic acid in MilliQ water, and the peptides were separated with increasing buffer B (80% ACN, 0.1% formic acid in MilliQ water) with a 77-min gradient, increasing buffer B concentration from 5% to 30% in 50 min, 30% to 45% in 10 min, 45% to 80% in 2 min, and maintained at 80% for 5 min before decreasing to 5% in 5 min, and finally held for 5 min at 5%. Flow rate was 250 nl min−1. An integrated column oven was used to maintain the temperature at 40 °C.
The two mass spectrometers were run using the same parameters except where specified, owing to changes in running software. Spray voltage was set to 2 kV, the S-lens RF (radio frequency) level was set to 40%, and the heated capillary was set to 275 °C. Full-scan mass spectra (MS1) were recorded at a resolution of 120,000 at m/z 200 over the m/z range 350–1,400. The AGC (automatic gain control) target value was set to 300% (Exploris) or 3 × 106 (HF-X) with a maximum injection time of 25 ms. HCD (higher-energy collisional dissociation) -generated product ions (MS2) were recorded in data-dependent top-10 mode and recorded at a resolution of 60,000. The maximum ion injection time was 118 ms (Exploris) or 108 ms (HF-X), with an AGC target value of 200% (Exploris) or 2 × 105 (HF-X). Normalized collision energy was set at 30% (Exploris) or 28% (HF-X). The isolation window was set to 1.2 m/z with a dynamic exclusion of 20 s. A wash-blank, using 5% ACN, 0.5% TFA, was run between each sample and laboratory blank to limit cross-contamination.
Database construction
The protein reference alignment given in ref. 2 was used as a starting point to construct a database for sequence reconstruction. Owing to the vast evolutionary distance between CMNFV59632 and any extant taxa (>20 Myr), a broader database was constructed to identify sequence variants that may be known in other mammals. To construct a broader database, we searched UniProt and the National Center for Biotechnology Information for each enamel protein, specifying the taxonomic grouping of ‘Theria’ to include all therian mammals. To supplement available sequences, others were manually extracted from available genomes, following the methodology reported previously60.
To investigate the relationships at the base of Rhinocerotidae, protein sequences translated from Elasmotherium sibiricum genomic data31 were generated. To obtain the corresponding amino acid sequences, we first collapsed the paired-end reads and masked the conflict bases as ‘N’ using adapterRemoval61. We then mapped the collapsed reads against the reference genome of the white rhinoceros (GCF_000283155.1_CerSimSim1) using the BWA MEM function62 with the shorter split hits being abandoned. After that, we removed duplicates using an in-house Perl script following ref. 31. Finally, we extracted the gene sequences according to their locations on the reference genome.
The remaining steps generally follow the workflow outlined in ref. 2. We used ANGSD63 to generate consensus sequences from BAM files corresponding to chromosomes that include genes of interest. To reduce the effects of post mortem aDNA damage, we trimmed the first and last five nucleotides from each DNA fragment. We formatted each consensus sequence as a blast nucleotide database. To recover translated protein sequences, we performed a tblastn alignment64, with the corresponding Ceratotherium simum sequences as queries. Finally, we used ProSplign to recover the spliced alignments, and ultimately, the translated protein sequences65.
Protein identification
Thermo Fisher Scientific .raw files generated using the mass spectrometers were searched with various software using an iterative search strategy to interpret spectra, characterize modifications and ultimately, reconstruct protein sequences. For comparison, .raw files from a medieval ovicaprine (control) and an Early Pleistocene Stephanorhinus generated previously2 were also analysed. Among samples from the Fontana Ranuccio Stephanorhinus, only FR4 was analysed. Although it is possible that some inter-sample variation between CMNFV59632 and these samples is caused by analysis using different mass spectrometer models, according to benchmarking studies66,67, it is unlikely that the overall trends and interpretations would change.
We primarily used MaxQuant68 for sequence reconstruction and other downstream aspects of data analysis. We performed two initial runs: (1) a more focused run using the database we modified from ref. 2, and (2) a broad run using the ‘Theria’-wide database we constructed from publicly available sequences.
In all runs, an Andromeda score threshold of 40 and a delta score of 0 were set for both unmodified and modified peptides. Minimum and maximum peptide lengths were specified as 7 and 25, respectively. The default peptide false discovery rate was used (0.01), whereas the protein false discovery rate was increased to 1 to show possible low-abundance proteins. Error tolerances were kept at the default settings for Orbitrap MS instruments: 20 ppm for the first search, 4.5 ppm for the final search and 20 ppm for the fragment ion. ‘Unspecific’ digestion was specified. No fixed post-translational modifications were set. Several modifications were set as variable modifications in our initial runs: glutamine and asparagine deamidation (delta mass (ΔM) = +0.984016), methionine and proline oxidation (ΔM = +15.9949), N-terminal pyroglutamic acid from glutamine (ΔM = −17.026549) and glutamic acid (ΔM = −18.010565), phosphorylation of serine, threonine and tyrosine (ΔM = +79.966331), and the conversion of arginine to ornithine (ΔM = −42.021798).
Proteins included in the database of common contaminants provided by MaxQuant (for example, proteinaceous laboratory reagents and human skin keratins), as well as reverse sequences, were removed manually and not examined further. In addition, proteins detected in the laboratory blank were also treated as contaminants, and not considered further.
To discover new SAPs and peptide variants not included in our database, we used more search tools. Peaks v.7.0 was used to attempt de novo sequencing and an homology search was performed using the SPIDER algorithm69,70,71. The open search capabilities of openPFind72 and MSFragger73 were also used. When possible, the same settings were selected as in the MaxQuant runs.
With our iterative search strategy, we integrated possible sequence variants from the results of our de novo, homology searches and open searches into hypothetical sequences from closely related taxa, to produce artificial sequences. These artificial sequences were included in a subsequent MaxQuant search, and only incorporated into reconstructed sequences if identified and validated using MaxQuant.
Sequence reconstruction and filtering
Before sequence reconstruction, all non-redundant PSMs were filtered using three criteria to reconstruct only those peptide sequences and amino acid residues that we can confidently assign. Sequences were accepted at two levels, resulting in two different datasets: (1) a minimally filtered dataset, and (2) a strictly filtered dataset. This filtering starts with using Basic Local Alignment Search Tool (BLAST)74 to determine whether peptides match any contaminants, beyond those included in MaxQuant by default, such as soil bacteria and fungi. At this stage, PSMs are discarded if they present a match to any reasonable candidate contaminants, if they are also identified in the blank, or if they present poorly covered ion series. The resulting PSMs are used to reconstruct sequences for the ‘minimally filtered dataset’.
Next, ion series coverage is examined for each PSM. At this stage, peptide sequences are accepted for the strictly filtered dataset only if each amino acid residue is covered (for example, at least y-, b- or a- ion designates the mass of that specific amino acid, plus any identified modifications) by at least two spectra, following the approach outlined previously75. Also, for both strictly and minimally filtered datasets, poorly supported spectra are removed at this stage, and proteins are only submitted for phylogenetic analysis if they are covered by at least two non-overlapping peptides. Finally, under the strict filtering criteria, BLAST is used again on any trimmed sequences, to remove any that match contaminants.
Intra-crystalline protein decomposition analysis
We analysed chiral amino acids on CMNFV59632 to evaluate the overall extent of amino acid degradation in the intra-crystalline fraction of the enamel, enabling comparison with previously analysed specimens2, and samples that had been heated experimentally to between 60 and 80 °C for up to 17,520 h and with samples heated to 200–500 °C for up to 25 min. Enamel chips were drilled using a Dremel 4000 (kit 4000-1/45) drill with a diamond wheel point (4.4 mm (7105) by Dremel) to remove any dentine, which could be identified under a microscope (ZEISS Stemi 305, Axiocam 105 R2). Samples were processed following the methods in ref. 76. To remove excess powders, enamel chips were washed in deionized water and ethanol (analytical grade) before being powdered in an agate pestle and mortar. Powdered samples were weighed into a single plastic microcentrifuge tube and bleached (NaOCl, 12%, 50 μl mg−1 of enamel) for 72 h to remove the inter-crystalline amino acids and any contamination. This bleached sample was washed five times with deionized water and then once with methanol (HPLC grade), before being left to dry overnight.
The dried bleached sample was then divided into four subsamples: two for technical replicate analysis of the FAA and two for replicate analysis of the THAA. The THAA subsamples were dissolved in HCl (7 M, 20 μl mg−1, analytical grade) in a sterile 2-ml glass vial (Wheaton), purged with N2 to reduce oxidation and heated at 110 °C for 24 h in an oven (BINDER series). The acid was then removed by centrifugal evaporation (Christ RVC2-25). THAA and FAA fractions were subjected to a biphasic separation procedure76,77 to remove inorganic phosphate from the enamel samples. HCl was added to both FAA (1 M, 25 μl mg−1) and THAA (1 M, 20 μl mg−1) fractions in separate 0.5-ml plastic microcentrifuge tubes (Eppendorf), and KOH (1 M, 28 μl mg−1) was added into the acidified solutions, which then formed monophasic cloudy suspensions. Samples were agitated and then centrifuged (13,000 rpm for 10 min, Progen Scientific GenFuge 24D) to form a clear supernatant above a gel. The supernatant was removed and dried by vacuum centrifugation. The concentration of the intra-crystalline amino acids and their extent of racemization (d/l value) were then quantified using reverse-phase HPLC (Agilent 1100 series HPLC fitted with HyperSil C18 base deactivated silica column (5 μm, 250 × 3 mm) and fluorescence detector) following a modified method from ref. 78.
For the reverse-phase HPLC analysis, samples were rehydrated with an internal standard solution (l-homo-arginine (0.01 mM), sodium azide (1.5 mM) and HCl (0.01 M)), and run alongside standards and blanks. A tertiary mobile phase system (HPLC grade ACN–methanol–sodium buffer; 21 mM sodium acetate trihydrate, sodium azide,1.3 μM EDTA, pH adjusted to 6.00 ± 0.01 with 10% acetic acid and sodium hydroxide) was used for analysis. The d and l peaks of the following amino acids were separated: aspartic acid and asparagine; glutamic acid and glutamine; serine, alanine, valine, phenylalanine, isoleucine, leucine, threonine, arginine, tyrosine and glycine. During preparation, asparagine and glutamine undergo rapid irreversible deamidation to aspartic acid and glutamic acid respectively79 and hence they are reported together as aspartic acid and asparagine, and glutamic acid and glutamine. One of the experimentally heated samples (300 °C for 10 min) was also analysed using liquid chromatography coupled with tandem mass spectrometry with minor changes to the protocol reported in ref. 2 (Extended Data Fig. 7 and Supplementary Information).
Phylogenetic analysis
A time-calibrated phylogenetic tree was inferred with the Bayesian phylogenetic software RevBayes v.1.2.1 (ref. 50) (https://revbayes.github.io/) under a constant-rate FBD model80,81. The dataset consisted of enamel proteome data for 16 perissodactyl species (10 extant and 6 extinct), totalling 7 proteins and 3,446 amino acids. Phylogenetic analyses were performed with both the strictly filtered and minimally filtered sequences for CMNFV59632, to observe any topological differences between the two datasets and assess whether filtering is warranted. Because no main differences were observed, only the results from the ‘strictly filtered’ dataset are discussed. The proteome dataset was partitioned by protein. A General Time Reversible + Invariant sites (GTR + I) amino acid substitution model—in which stationary frequencies of the 20 amino acids and exchangeability rates among amino acids are free to vary and estimated from data—was applied to each partition. Preliminary unrooted phylogenetic analyses performed on each protein showed evidence for within-protein Γ-distributed rate variation only for MMP20, hence Γ-distributed rate variation was modelled only for the MMP20 partition. A relaxed clock model with uncorrelated lognormal-distributed rates was applied to allow rate variation across branches. The prior on the average clock rate was set as a log uniform distribution (min = 10−8, max = 10−2 substitutions per lineage per million years). The prior on the clock rate standard deviation was set as an exponential distribution with mean equal to 0.587405, corresponding to one order of magnitude of clock rate variation among branches. The FBD tree model allows for placement of extinct species in a phylogenetic tree while simultaneously estimating the rates of speciation, extinction and fossilization (sampling of species in the past). The priors on speciation, extinction and fossilization parameters were set as uniform distributions bounded between 0 and 10. The sampling probability for extant species was fixed to 0.5882353 (\(\frac{10}{17}\)), corresponding to the fraction of extant perissodactyl species included in the analysis, and assuming uniform sampling of extant taxa. The three species of Equidae in the analysis (Equus caballus, Equus przewalskii and Equus asinus) were constrained as outgroup to other perissodactyls (Tapiridae and Rhinocerotidae). Tip ages of fossil taxa were given a uniform prior distribution ranging from the minimum to maximum age of the deposit in which each fossil has been found. The prior on the origin age of the tree was set as a uniform distribution with minimum = 54 Ma, corresponding to the oldest fossil that can be unequivocally assigned to crown Perissodactyla (Cambaylophus vastanensis from the Early Ypresian Cambay Shale82), and maximum = 100 Ma, corresponding to the beginning of the Late Cretaceous and a very lax upper boundary on the origin of placental mammals83. Further constraints on node ages on the basis of the fossil record of perissodactyls were set to improve the precision of divergence age estimates. Each node calibration was set up as a soft-bounded uniform distribution with normally distributed tails, with 2.5% of the distribution younger than the minimum age (allowing for potential misattribution of the oldest fossil of a clade) and 2.5% of the distribution older than the maximum age. Monophyly was not enforced when setting up these node calibrations. The following age constraints have been applied to five nodes: (1) Node = crown Perissodactyla; soft minimum = 54 Ma, with the same justification as the minimum on the origin age prior; soft maximum = 66 Ma, corresponding to the Cretaceous/Palaeogene boundary, before which no unambiguous crown placental fossils are known. (2) Node = Rhinocerotina (crown rhinoceroses); soft minimum = 22.6 Ma, corresponding to the earliest putative appearance of a crown rhinoceros in the fossil record (Gaindatherium cf. browni from the Aquitanian upper member of the Chitarwata Formation84,85); soft maximum = 44 Ma, corresponding to the minimum age of Rhinocerotidae as supported by fossil and phylogenetic evidence31. (3) Node = Diceroti (Ceratotherium + Diceros); soft minimum = 5.3 Ma, corresponding to the minimum age of the oldest deposits yielding Diceros bicornis fossils (Lothagam and Albertine86,87); soft maximum = 7.3 Ma, as in ref. 31. (4) Node = Rhinoceros unicornis + Rhinoceros sondaicus; soft minimum = 1.9 Ma, corresponding to the Early Pleistocene appearance of Rhinoceros unicornis in the fossil record88,89; soft maximum = 5.3 Ma, as in ref. 31. (5) Node = Dicerorhinus + Stephanorhinus + Coelodonta; soft minimum = 13 Ma, corresponding to Middle Miocene remains of Dicerorhinus from the Middle Siwaliks of Pakistan31,90; soft maximum = 22.6 Ma, corresponding to the oldest crown rhinoceros fossil as in the soft minimum of calibration 2.
The Markov chain Monte Carlo was set up as four independent runs, running for 50,000 iterations and sampling every 10, averaging between 262.2 and 279.2 moves per iterations. Convergence between runs was checked by visually inspecting and calculating effective sample sizes of parameter estimates on Tracer v.1.7.2 (ref. 91). A MAP tree was calculated to summarize the posterior distribution of trees, with 20% burn-in. In the analysis of the minimally filtered dataset, one of the four runs was discarded from the MAP tree calculation, as it converged only in the last 10% of the Markov chain Monte Carlo.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.