Whole-genome duplication shaped cell-type evolution in the vertebrate brain

Vertebrate scRNA and snRNA atlas collection, filtering and preprocessing

Cell atlases were retrieved from previous publications^2,3,4,5. Low-quality cells in the human atlas were further filtered based on nCount (UMI) < 400. Low-quality cells in the other atlases were already filtered. To focus on neural cells in the brain, vertebrate datasets were filtered to retain only brain tissues at juvenile or adult stages. To help balance the number of cells for cross-species integration and to accommodate different proportions of neurons and glia, we randomly downsampled the human and lizard atlases to 10⁵ neurons and 10⁵ non-neurons, but retained the full brain atlases for mouse (67,937 neurons and 60,395 non-neuronal cells) and lamprey (18,166 neurons and 41,472 non-neuronal cells). Only protein-coding genes were retained for downstream analyses.

As the original atlases were generated using different pipelines, we applied a standardized preprocessing approach to ensure consistency. We performed SAM analysis on each individual atlas by directly invoking the SAMAP function from the SAMap package (which runs SAM internally)²⁰. Specifically, UMI counts from each cell were first normalized to give the median total count per cell, then log₂-transformed followed by applying the SAM function with the following parameters: preprocessing=“StandardScaler”, npcs=100, weight_PCs=False, k=20, n_genes=3000, weight_mode=‘rms’. The anndata objects were then converted to Seurat format for downstream clustering.

Amphioxus sample collection, scRNA and snRNA library construction and raw data processing

Amphioxus (B. floridae) were obtained from a stock maintained by J.-K. Yu originating from Tampa, Florida. The amphioxus and their offspring were maintained at Xiamen University under previously described conditions⁶⁵. The brain (anterior to the first dorsal ocellus) and neural tube (posterior to the first dorsal ocellus) were dissected as previously described²⁹. We constructed and sequenced one scRNA-seq and one snRNA-seq library for each tissue.

For the scRNA-seq experiment, the dissected brain (from ten adult individuals) and neural tube (from eight adult individuals) tissues were respectively washed three times in ice-cold calcium-free and magnesium-free artificial seawater (CMF-ASW)⁶⁶ and then transferred into 500 µl enzyme mix (10% trypsin and 2 mg ml^–1 collagenase in CMF-ASW) and incubated in a 37 °C incubator with a nutating shaker for approximately 10 min. During digestion, tissues were gently pipetted every 1–2 min to facilitate dissociation, and progress was monitored under an inverted microscope. Digestion was terminated by adding 1 ml of an ice-cold quenching solution (20% fetal bovine serum and 2 mg ml^–1 glycine in CMF-ASW). Cells were passed through a 40 µm cell strainer and centrifuged at 270g at 4 °C for 5 min. The supernatant was removed, and 500 µl RNase-free 0.04% BSA in 3× PBS was added to resuspend the cells. Calcein-AM (BD Biosciences, 564061) was added to the cell suspension to a final concentration of 10 µM and incubated at 37 °C for 5 min. The cells were subsequently placed on ice then immediately processed. scRNA-seq library construction was carried out in accordance with a previous study²⁹. The final libraries were sequenced on an Illumina NovaSeq 6000 platform.

For the snRNA-seq experiment, we used a Nucleus Isolation kit (SHBIO, 52009-10) to obtain single nuclei of the dissected tissues. RNase inhibitors (Sigma, 3335399001) were added to the reagents before use. The samples were cut and transferred to a 5 ml tube containing lysate, mixed and lysed for 2 min on ice, then filtered through a 40 μm cell filter (Sigma, BAH136800040). The nucleus count was estimated using a microscope (Leica) with DAPI reagent. After staining with 0.4% trypan blue (Sangon Biotech E607320-0001), the nucleus was observed under a ×40 microscope (Jiangnan Novel Optics XD-202). Subsequent experiments were performed if the nuclear envelopes were intact and there were few impurities. snRNA-seq libraries were prepared using a SeekOne DD Single Cell 3′ library preparation kit (SeekGene, K00202). In brief, an appropriate number of cell nuclei was mixed with reverse transcription reagent and then added to a sample well in a SeekOne DD chip S3. Subsequently, barcoded hydrogel beads and partitioning oil were dispensed into corresponding wells separately in the chip S3. After emulsion droplet generation, reverse transcription was performed at 42 °C for 90 min and inactivated at 85 °C for 5 min. Next, cDNA was purified from broken droplets and amplified by PCR. The amplified cDNA product was then cleaned, fragmented, end-repaired, A-tailed and ligated to a sequencing adaptor. Finally, indexed PCR was performed to amplify the DNA representing the 3′ polyA part of expressing genes, which also contained the cell barcode and the unique molecular index. The indexed sequencing libraries were cleaned up using VAHTS DNA Clean Beads (Vazyme N411-01), analysed by a Qubit (Thermo Fisher Scientific, Q33226) and a Bio-Fragment Analyzer (Bioptic, Qsep400). The libraries were then sequenced on a GeneMind SURFSeq 5000 with PE150 read length.

Raw reads from scRNA-seq were processed using the BD Rhapsody WTA analysis pipeline (v.1.12.1; https://bitbucket.org/CRSwDev/cwl/src/master/) on the Seven Bridges platform (https://sevenbridges.com/). Raw reads from snRNA-seq were processed using the SeekSoul Tools pipeline. scRNA and snRNA expression matrices for each sample were then filtered and processed using Seurat (v.5.0.0). Cells or nuclei with fewer than 300 detected genes, more than 4,000 detected genes, more than 10,000 UMI detected or more than a 10% MT expression ratio were filtered out (we used stricter parameters for neural tube processed by SeekGene due to its higher ambient RNA).

Clustering and annotation

To find good-quality and high-resolution cell clusters in the SAM preprocessed atlases, we performed hierarchical and iterative clustering for individual vertebrate cell atlases using the scrattch.hicat and scrattch.bigcat packages^67,68 from the Allen Institute. Raw counts (UMI) were first normalized using the cpm function provided in the above packages, followed by log₂ transformation with a pseudo-count added to prevent log₂[0]. Cells were initially classified into broad groups and hierarchically clustered on the basis of the expression of highly variable genes, principal component analysis and Jaccard–Louvain clustering. Clustering was performed iteratively in each group using the iter_clust function, continuing until no further subclusters satisfied predefined thresholds for the number of DEGs or minimum cluster size. As our analysis did not aim to resolve extremely fine-scale cell types, we applied more relaxed parameters than those typically used with this method. DEG thresholds were defined via the de_param settings: padj.th = 0.05, q1.th = 0.4, q2.th = NULL, q.diff.th = 0.5, de.score.th = 100, min.cells = 100, and min.genes = 6. Dimensionality reduction and clustering parameters were specified as follows: dim.method = “pca”, max.dim = 80, method = “louvain”. Minimum cluster sizes were set via split.size as 800, 500, 500 and 500 for human, mouse, lizard and lamprey datasets, respectively. As we did not aim to study cell types at very high resolution, we tuned the split.size parameters for each species to generate cluster numbers at a similar level across vertebrates. Clusters were then checked and merged at the end of the iteration to ensure that they were separable with scrattch.bigcat::merge_cl. We simply used Seurat FindClusters with the Louvain algorithm and resolution = 1 for amphioxus owing to the limited number of cells and nuclei in the datasets.

We next confirmed and refined the annotation of individual vertebrate atlases by examining the expression of canonical markers (Supplementary Table 1 and Supplementary Fig. 2a–d), reference annotation in our clustering and their main dissection locations for vertebrates (Supplementary Table 2). We annotated amphioxus brain cell types on the basis of markers (Supplementary Table 1 and Supplementary Fig. 2e), mapped them to CNS cell types at the late neurula stage with MetaNeighbor (Extended Data Fig. 2a) and summarized the data into Supplementary Table 2.

Atlas integration and cross-species mapping

Homologous gene relationships for initial weighting gene–gene graphs with cross-species edges in SAMap were generated by blast on protein-coding genes using SAMap map_gene.sh. We then performed cross-species mapping using the SAMap run function with five iterations, with the edge weight calculated and updated by Pearson’s correlation (hom_edge_mode = “pearson”) and 30 cross-species edges per cell (crossK = 30). Mutual nearest neighbourhoods were independently calculated between each pair of species (pairwise=True). For chordate comparison, we randomly downsampled 1,500 cells for each major cell-type family in vertebrates. We then used the same parameters for SAMap mapping in chordates but with crossK = 20 owing to the low cell numbers in the amphioxus data. The alignment scores between cell types across species were calculated using get_mapping_scores from SAMap. We next used the GenePairFinder function to identify gene pairs (genes between species) that positively contributed to cross-species correlation between cell types and were differentially expressed in respective atlases.

Identification of cell-type-specific TFs and conserved sets for cell-type families

To identify TF-coding genes for each species, we used DeepTFactor⁶⁹, a deep-learning-based tool optimized for TF prediction. Cell-type-specific TFs were identified for each major cell-type family in vertebrates using NS-Forest (v.4.0)⁷⁰, a method designed to identify minimum combinations of necessary and sufficient marker genes for distinguishing different cell types. This method uses a random forest algorithm on preselected genes by binary scoring, a measurement of binary expression (specificity) for a gene. For our analysis, we used the binary score to rank TF specificity and extracted the top 30 TFs with the highest scores as cell-type-specific TFs with the nsforesting.NSForest function and the following parameters: gene_selection = “BinaryFirst_high”, n_top_genes = 30, n_binary_genes = 30, n_trees = 1500.

We next assigned cell-type-specific TFs to individual orthogroups and defined an orthogroup as a conserved TF orthogroup for cell-type families if at least three (out of four) vertebrates contained these TFs. This approach identified 81 orthogroups (Supplementary Table 3-1), which we manually reviewed for expression patterns across species and summarized in Supplementary Table 3-2 with supporting references. The orthogroups generated by OrthoFinder were uploaded to GitHub (https://github.com/DiracZhu1998/WGD2celltype_evolution).

For the analysis of conserved TF orthogroups between sister cell types (Fig. 3, related analysis), we first subset 5,000 cells per group and identified markers for each sister cell type using Seurat FindAllMarkers with only.pos = T at individual species. For the comparison between oligodendrocytes and ependymo-astrocytes, we limited our analysis to oligodendrocytes and astrocytes, as ependymal cells are far less numerous than astrocytes. Markers were then filtered with adjusted P < 0.05 and average log₂[fold change] ≧ 0.58 and percentage of cells expressing that gene in the foreground > 0.1. We retained only TFs and defined an orthogroup as a conserved TF orthogroup for astrocytes versus ependymal cells if at least three (out of four) vertebrates contained these TFs. For the comparison between astrocytes and oligodendrocytes and between astrocytes and OPC, a conserved TF orthogroup was considered conserved if at least two (out of three) amniotes contained these TFs.

Classification of TFs and TF enrichment analysis

TF families were downloaded from AnimalTFDB (v.4.0)⁷¹ (http://guolab.wchscu.cn/AnimalTFDB4/#/Download). To classify TF gene families in species not represented in the database, we assigned TF family classifications at the orthogroup level using OrthoFinder output. If any human gene in an orthogroup was annotated with a specific TF family, we classified the entire orthogroup under that family. The high overlap (>90%, not shown) in classifications based on model organisms (human, mouse and zebrafish) validated the robustness of this approach. The enrichment of a TF class was assessed using hypergeometric tests with the stats::phyper function for each TF class in each species. P values were further adjusted using the p.adjust(method = “fdr”) from the R Stats package.

Identifying gene relationships for orthologues, paralogues, ohnologues and SSD paralogues

To identify gene relationships, we first collected genome assemblies and gene annotation files for the species listed in Supplementary Table 9. For each protein-coding gene, only the transcript with the longest coding sequence (CDS) was retained. CDSs were extracted from genomes based on gene annotation files and translated into proteins with in-house scripts. We then performed phylogenetic orthology inference with OrthoFinder (v.2.5.5)^72,73. The species tree inferred from orthogroups matched with references (data not shown). Orthologues were identified on the basis of OrthoFinder output, applying a reciprocal best hit criterion. (In-)paralogues were defined as duplicated genes in the same orthogroup for each species. Ohnologues were identified based on Ohnologs (v.2.0)⁷⁴ (details provided at GitHub (https://github.com/SinghLabUCSF/Ohnologs-v2.0), together with updates of ohnologues used (https://github.com/DiracZhu1998/WGD2celltype_evolution/tree/main/2.gene_relationships/ohnolog_inferring and see Supplementary Text for the evaluation of ohnologue detection) with a similar number of vertebrates used and updated genome and annotations. Owing to the limited availability of data for jawless vertebrates and the extensive loss of duplicated genes in this lineage¹, ohnologue identification in lamprey remains challenging. As jawed and jawless vertebrates independently underwent the second round of WGD, we tried ohnologue detection with lamprey and without the inclusion of lamprey and found little difference between the outcomes (<0.2%). We also tried two other methods, doubletrouble⁷⁵ and DupGen_Finder⁷⁶, but they were not comparable to Ohnologs (v.2.0) or with previous results¹ regarding the number of identified ohnologues and stability in different vertebrates (see details of the comparison in the Supplementary Text). Nevertheless, recent studies^7,30 suggest that the second round of WGD in jawed vertebrates probably involved interspecific hybridization, which resulted in asymmetric gene loss. Specifically, genes from the alpha parental lineage were around four times more likely to be retained than those from the beta lineage (based on results in chicken; https://raw.githubusercontent.com/fmarletaz/hagfish/refs/heads/main/Paralogons/Vert_Evt_OGrrA.txt).

To assess the robustness of our ohnologue predictions, we compared our results to the Ohnologs (v.2) database (http://ohnologs.curie.fr), finding that 75% of human and 70% of mouse ohnologues in our dataset were also present in the database, and vice versa (see the methodology comparison in the Supplementary Text for more details). SSD paralogues were defined as paralogues that are not ohnologues.

Paralogue gene age classification

Protein sequences were generated as described above. We aligned two protein sequences for each ohnologue and SSD paralogue pair using MAFFT (v.7.520)⁷⁷ with the L-INS-I option (–localpair –maxiterate 1000) and converted the protein alignment into a codon alignment using PAL2NAL⁷⁸. Then we used KaKs_Calculator (v.2.0)⁷⁹ to calculate K_a (the rate of nonsynonymous substitutions), K_s (the rate of synonymous substitutions) and K_a/K_s values. K_a and K_a/K_s values were used in other analyses. K_s between paralogue pairs is used to estimate their duplication time⁷⁶. For SSD paralogues, we estimated the duplication age based on a previous simulation in which K_s = 0.01 per million years (in other words, K_s = 1 is approximately 100 million years ago)⁸⁰. We also retrieved duplication age information from Ensembl BioMart based on gene trees and compared the two metrics, which showed overall consistency (data not shown). It is worth noting that both approaches involve some imprecision: gene trees depend on the available taxa and on thresholds used to cluster genes into trees (similar to orthogroup classification), whereas K_s reflects the onset of divergence between duplicates (that is, related to the rediploidization time⁸¹).

As the two rounds of WGD in early vertebrate evolution are so close, we cannot use this method to separate 1R and 2R ohnologues. We instead separated ohnologues in jawed vertebrates into alpha and beta categories based on chicken orthology assignments from previous work¹.

Identification of marker genes at the cell-type family and cluster level

To reduce the potential influence of imbalanced cell-type numbers in vertebrates, we randomly subsampled 3,000 cells for each category during marker identification. Owing to the limited cell numbers in amphioxus clusters, we did not subsample amphioxus clusters during marker detection. Marker genes were identified for each species using the FindAllMarkers function of Seurat (v.5.0.0)⁸² with the Wilcoxon rank-sum test (min.pct = 0.01, logfc.threshold = 0.58, test.use = ‘wilcox’, only.pos = TRUE) at both the cell-type family level and cluster level. For related downstream analyses, only marker genes with FDR < 0.01 were used. As a few studies^83,84 previously questioned the quality of the Seurat ‘wilcox’ output, we also identified markers using FindAllMarkers with ROC analysis (test.use = “roc”, only.pos = TRUE), which led to the same conclusions (data not shown but listed in GitHub and Figshare).

Gene regulatory network analysis

We performed gene regulatory network analysis and identified regulons using pySCENIC^85,86. To reduce noise introduced by the imbalance in the number of cells in each major cell-type family, we first randomly subset 2,000 cells for each major cell-type family. To reduce noise of lowly expressed genes, we filtered genes expressed by fewer than 0.5% of cells and with low total UMI (equivalent to 1 UMI detected in 1% cells).

The grn command in pySCENIC was used to infer gene–gene co-expression relationships between TFs and their potential target genes with grnboost2 algorithm. This process returned an adjacency edge list with the TF, its potential target gene and an associated importance score. The adjacency edge list was then used as input for the ctx command to identify regulons, each consisting of a TF and its target genes enriched for the binding motifs of the TF. Human and mouse TF lists were downloaded (https://resources.aertslab.org/cistarget/tf_lists/). The ctx command uses a motif annotation database and ranking databases, both of which were downloaded from Aerts Laboratory’s cistarget resources (motif ranking datasets: https://resources.aertslab.org/cistarget/databases/old/homo_sapiens/hg38/refseq_r80/mc9nr/gene_based and https://resources.aertslab.org/cistarget/databases/old/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/; and motif annotation files: https://resources.aertslab.org/cistarget/motif2tf/). Next, the aucell command was used to compute regulon activity scores for each major cell-type family, and a regulon specificity score (RSS) was calculated using the regulon_specificity_scores function. The top regulons for each cell type were selected on the basis of the RSS.

GO annotations and enrichment analyses

Owing to the lack of recent updates for the GO annotation of lizard (Pogona vitticeps) lamprey (Petromyzon marinus) and amphioxus (B. floridae), we re-annotated the GO annotations for these three species. GO annotations for the protein-coding genes of model organisms (Danio rerio, M. musculus and H. sapiens) were downloaded from Ensembl through BioMart. GO terms were associated with protein-coding genes from Pogona vitticeps, Petromyzon marinus and B. floridae according to their one-to-one orthologues in H. sapiens, M. musculus and D. rerio in an order of priority (human > mouse > zebrafish). The lizard, lamprey and amphioxus genes that could not be annotated using the above method were then BLAST-searched to the UniProtKB database⁸⁷ (release-2024_03) using BLAST (2.9.0+)⁸⁸ with parameters (-evalue ×10^–8). The best hit for each query was selected based on a bit score and its corresponding GO terms (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz) assigned to the respective query. In total, we annotated nearly all protein-coding genes for lizard and over 70% of lamprey and amphioxus. This level of annotation is higher than for GO in Ensembl for another amphioxus species, Branchiostoma lanceolatum, for which more than half of the protein-coding genes are not functionally annotated.

The datasets of GO annotations for lizard, lamprey and amphioxus were built using makeOrgPackage function from the AnnotationForge package⁸⁹. The dataset packages for human and mouse were retrieved from Bioconductor (v.3.20) at org.Hs.eg.db⁹⁰ and org.Mm.eg.db⁹¹, respectively. GO enrichment analysis was performed with clusterProfiler⁹² enrichGO with Benjamini–Hochberg adjustment and cutoff = 0.05. Only protein-coding genes expressed in corresponding datasets were used as background genes in the GO enrichment analysis. The redundancy of enriched terms was filtered by simplify() with the following parameters: cutoff=0.7, by = “p.adjust”, select_fun=min.

Classification of protein class and over-representation analysis

To investigate protein class in ohnologues and SSD paralogues, we used ‘Functional classification viewed in graphic charts’ with bar plots in the PANTHER database⁹³ (v.19.0; https://www.pantherdb.org). Over-representation analysis was conducted using the ‘Statistical Overrepresentation Test’ in the PANTHER website, with protein-coding genes in our datasets as the background. Fisher’s exact test was applied with FDR correction to assess significance. Owing to the absence of corresponding data for lizard, lamprey and amphioxus in the PANTHER database, this analysis was limited to human and mouse.

Cross-species cell-type tree

We filtered orthogroups to retain those containing at least one gene in each of the five species. To minimize the potential influence of high copy-number SSDs, only orthogroups with fewer than or equal to five gene copies for each species were retained. We defined metagenes by summing the UMI counts of all gene copies in each orthogroup for each species. Expression normalization and identification of 3,000 highly variable metagenes were performed using Seurat’s SCTransform function⁸². We retained metagenes that were both highly variable in at least three out of the five species and were TF metagenes (if TFs were in that metagene or orthogroup).

Cross-species comparisons of cell-type-specific gene expression were based on gene specificity indices calculated using a previously developed method¹⁶. In brief, for each metagene g in cell-type c, we computed its specificity index (s) as the mean expression in c divided by its mean expression across all cells:

$${s}_{g,c}=\frac{{g}_{c}}{\left(\frac{1}{N}\sum _{i\in c}{g}_{i}\right)}$$

This formula shows that the number of cells per category matters. To control the cell number imbalance in cell types, we subsampled 500 cells per glial cell-type family in vertebrates and per glia cell type in amphioxus. We then generated a chordate glia tree using pvclust with the following parameters: nboot = 1000, method.hclust = “average”, method.dist = function (z){as.dist (1 – cor (z, use = “pa”, method = “spearman”))}.

RNA velocity and multipotency analyses in amphioxus

We performed RNA velocity based on velocyto.py (v.0.17)⁹⁴ and scVelo (v.0.3.3)⁹⁵. Each sample from N4 and T1 stages was preprocessed to generate a loom file with annotated spliced and unspliced reads, and loom files for the same developmental stage were merged and analysed together. Both spliced and unspliced raw counts (UMI) were first normalized using scvelo.pp.normalize_per_cell, and the top 3,000 genes with the highest variance were selected, followed by log-transformation via scvelo.pp.log1p. Dimensionality reduction and neighbourhood smoothing were performed using scvelo.pp.moments with parameters n_pcs=50, n_neighbors=30. For dynamical modelling of transcriptional kinetics, we applied scv.tl.recover_dynamics and subsequently executed scv.tl.velocity(mode = ‘dynamical’).

To assess cell-type multipotency, amphioxus gene IDs were mapped to one-to-one mouse orthologues based on reciprocal best hit, and CytoTRACE2 (ref. ⁹⁶) was applied with the parameters species = “mouse”, seed = 42.

Generation of amphioxus SoxE mutants

CRISPR–Cas9-mediated gene editing was used to generate SoxE mutants as previously described⁹⁷. A gRNA targeting the sequence (5′-GGCCCATGAACGCCTTCA-3′) at the beginning of HMG-encoding region was selected and synthesized. A PCR primer pair (forward: 5′-TGAGTTTAGCGGCGATCAGT-3′; reverse: 5′-TAGTTTCCCCAGCGTCTTGC-3′) spanning the target site was used to amplify the genomic region. The amplicon was digested with the restriction enzyme XmnI (5′-GAANNNNTTC-3′) to determine the gRNA efficacy and to identify the heterozygous and homozygous mutants. Heterozygotes carrying an 8 bp deletion in the target site were screened and used for the study. Homozygotes were acquired by crossing the heterozygotes.

In situ hybridization chain reaction

Expression patterns of SoxE, OligB, Eaat2 and Syn were detected by HCR (v.3) as previously described⁹⁸. The probe sequence information is provided in Supplementary Table 10. DAPI (Invitrogen, 1 mg ml^–1 in PBST) was used for nuclear staining. After staining, the samples were stored in antifluorescence quencher medium (S2100, Solarbio) and photographed under a confocal laser scanning microscope (LSM980 Airyscan2, Zeiss).

Single-embryo bulk RNA-seq and downstream analyses

Gonadally mature SoxE heterozygous F₁ female and male amphioxus were subjected to the thermo-based method (from 19 to 29 °C) to produce gametes. Fertilized eggs were incubated in an incubator maintained at 30 °C and 95% humidity, in which embryos developed to the N4 and T1 stages. At each stage, 15 embryos were randomly selected and each embryo was carefully placed into a PCR tube, with efforts to remove seawater while ensuring embryo survival. The PCR tube containing one embryo was snap-frozen in liquid nitrogen for 10 min, and the samples were subsequently stored at −80 °C. The samples were later sent to Tenk Genomics for Smart RNA extraction, Smart-seq2-based RNA reverse transcription, cDNA quality assessment, amplification, purification and quantification. The cDNA was returned for SoxE genotype identification. We designed another pair of PCR primers (forward: 5′-GAGCCCACCGAGCTCGA-3′; reverse: 5′-TAGTTTCCCCAGCGTCTTGC-3′) to amplify the SoxE gRNA2.5 target site and used the restriction enzyme XmnI for genotype analysis of each sample. Three wild-type and three mutant samples from the N4 and T1 stages were selected for sequencing library construction using Nextera technology by Tenk Genomics. Sequencing was performed on a BGI T7 platform with a PE150 mode, and the sequencing depth was 6 Gb. Sequencing statistics are provided in Supplementary Table 11 and Supplementary Fig. 22. Genotypes were further confirmed by read alignments (Supplementary Fig. 21). After the removal of low-quality reads and adapter sequences using fastp (v.1.0.1)⁹⁹, clean reads were aligned to the amphioxus genome using STAR (v.2.4.0g1)¹⁰⁰ with a maximum intron length of 10,000 bp (–alignIntronMax 10000), optimized for amphioxus. Gene raw counts of each sample were calculated using featureCounts (v.2.1.1)¹⁰¹ from BAM for DESeq2 (v.1.42.0)¹⁰² normalization and differential gene expression analysis.

Subfunctionalization and neofunctionalization

Gene relationships were assessed on the basis of the above-described OrthoFinder output. To avoid risks of skewing from high copy-number SSDs and uncertainty on orthology inference due to gene turnover, only orthogroups with fewer than or equal to five gene copies for each species were retained. To infer ancestral states without considering gene losses (loss of all copies), we retained orthogroups with at least one copy for each species. This enabled us to do cross-species comparisons directly at the orthogroup level as at least one gene for each species was present for each orthogroup. An orthogroup was classified as an ohnologue orthogroup if it contained one pair of ohnologues in at least three out of four vertebrates, which resulted in 1,872 ohnologue orthogroups. The same approach was applied to SSDs, and 1,050 SSD paralogue orthogroups were identified. Some (339) orthogroups were considered as both an ohnologue orthogroup and SSD paralogue orthogroup.

To predict ancestral states, we next binarized expression matrices using two separate approaches: based on whether genes were classified as markers, and based on gene expression or not determined by the Trinarization score. For the second approach, a gene was considered expressed if it was estimated to be present in at least 10% of the cells, with a posterior error probability of no more than 5%. Details of the Trinarization score have been previously described³. We inferred ancestral states for vertebrate and amniote lineages across homologous cell-type families. For vertebrate ancestral states, a gene family was considered expressed (state = 1) in a given cell-type family if at least three out of four species used one or more copies from that paralogue family in that cell-type family. The same criterion was applied for predicting amniote ancestral states, whereby expression (state = 1) was assigned if at least two out of the three species were being considered. The extent of subfunctionalization and neofunctionalization in gene families was quantified by comparing the binarized expression patterns of individual genes to the inferred ancestral states. Specifically, the difference between the binarized expression of a gene and orthogroup ancestral state was computed, in which a value of –1 indicated subfunctionalization (unless all copies in that species were –1, which indicated loss of function; Fig. 4a) and a value of +1 denoted neofunctionalization.

Expression divergence (dT) among paralogues

Gene relationships were based on the above-described OrthoFinder output. To avoid risks of skewing from high copy-number SSDs and uncertainty on orthology inference due to gene turnover, only orthogroups with fewer than or equal to five gene copies for each species were retained. To perform the pairwise comparison in shared orthogroups, orthogroups with at least one copy for any of the four species were further retained. Paralogue orthogroups were then defined as orthogroups that included one pair of paralogue genes in at least three out of four vertebrates. For a combination of paralogues in orthogroup, we calculated the expression divergence (dT) based on a simple formula¹⁰³ for each species separately. Specifically, dT was first calculated for each pair of paralogues by the fractional difference between the number of cell-type families expressing either paralogue (N_either) and the number of cell-type families expressing both paralogues (N_both) relative to N_either. dT was next averaged in a paralogue orthogroup (when there was more than one pair of paralogues) for each species.

Cell-type nonspecific dominant expression

To compare gene expression levels between paralogues for each species, we first calculated the average normalized expression levels for each gene using the Seurat::AverageExpression function⁸², and the proportion of cells expressing specific genes (pct. exp.) with an expression count greater than 0. These calculations were performed at both the cell-type family and cluster levels. Next, we use the igraph package¹⁰⁴ to construct ohnologue and SSD paralogue families based on previously identified ohnologue pairs and SSD paralogue pairs, respectively. We tested the expression levels and pct. exp. values using the friedman_test function from the rstatix package¹⁰⁵, as the data did not follow a normal distribution. For species pairwise comparisons in individual ohnologue and SSD paralogue families, we applied the rstatix::wilcox_test function with Bonferroni-adjusted P values to identify the highest-expressed (dominant) copy in each gene family and to search for whether their orthologues are the dominant copy in another species. One-to-one orthologue relationships underpinning this were derived from above-described OrthoFinder results.

Variance decomposition and the identification of genes highly contributing to cell-type and/or regional identity

To assess the contribution of a gene to cell-type identity and regional identity, we constructed a sum of UMI in expression matrices with three major cell-type families in the brain (excitatory neurons, inhibitory neurons and astrocytes) along with four brain divisions (telencephalon, diencephalon, mesencephalon and rhombencephalon). The pseudobulk expression was calculated by the sum of gene counts (UMI) for each gene in individual cell-type families. To balance cell number differences, 2,000 cells were randomly selected for each cell-type family in each species. We then used DESeq2 (ref. ¹⁰²) to normalize library sizes and performed LMM for each gene with the lme4 package¹⁰⁶. The restricted maximum likelihood estimators for the random effects of cell-type, regional and residual variance were normalized by their sum to give the variance components (Extended Data Fig. 9c). Genes that contributed >25% of the total variance to cell-type family or regional identity were classified as genes that highly contributed to cell-type signals and regional signals, respectively.

Analysis of the CN

We downloaded human, mouse and chicken CN datasets⁴⁶. These datasets were further filtered to retain only protein-coding genes and excitatory neurons, which show higher regional variants than inhibitory neurons in the CN. We detected DEGs as described above, using FindAllMarkers with parameters (wilcox, only.pos = TRUE), and only DEGs with log₂[fold change] > 0.58 and adjusted P < 0.01 were retained. Scaled average expression was calculated using Seurat AverageExpression⁸² and then normalized by dividing the expression of each gene by its mean among different cell types. The transcriptomic dendrogram was calculated on the basis of scaled average expression of DEGs using pvclust with the following parameters: Spearman’s correlation-based distance 1 – cor() and average linkage with 1,000 bootstrap replicates. Expression profiles were binarized using the Trinarization score, and a gene was considered expressed if it was estimated to be present in at least 20% of the cells, with a posterior error probability of no more than 5%. We used a 20% threshold here rather than the 10% applied in the previous analysis because several documented CN-related TFs showed substantial differential expression with more than 10% cells expressing the gene. For comparisons involving serially homologous structures, such as distinct CN types, a more permissive cutoff was appropriate to avoid excluding biologically meaningful signals. The binarized data were then used to infer ancestral states based on dendrograms using maximum parsimony. Specifically, we used the phangorn package¹⁰⁷, converted the binarized expression into phyDat format and applied the ancestral.pars function with the accelerated transform (ACCTRAN) approach to estimate ancestral states and return probability. Genes in each ancestral node were classified as expressed if the probability exceeded 0.5, and as not expressed otherwise. Finally, we identified gene expression gain and loss events along branching points in the tree to identify candidate genes that might be involved in the cell-type duplication and divergence.

Ethics approval

Work with lamprey embryos was approved by the University of Oxford, Department of Zoology Animal Welfare and Ethical Review Board. Ethical review was not required for work with amphioxus.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.