Mouse lemur cell atlas informs primate genes, physiology and disease

uTAR analysis to identify unannotated genes

To uncover uTARs, we used a previously published workflow¹¹ for scRNA-seq data that identifies TARs, genome regions with abundant transcript alignments. In brief, all mouse lemur 10x datasets were aligned to the genome assembly Mmur 3.0 using STAR with default parameters, without gene annotation indexing. Transcribed regions were predicted using groHMM⁵⁹. TARs within 500 bp of another were combined into a single TAR and kept if they were expressed in at least 2 cells of the 10x atlas dataset. The detected TARs were then separated into aTARs and uTARs on the basis of whether the region is currently annotated as a gene in the NCBI annotation release 101 of Mmur 3.0. This strategy identified that aTARs and uTARs cover 284 and 42 Mbp, respectively, of the mouse lemur genome (2,487 Mbp).

To filter out transcriptional and sequencing noise from biologically significant uTARs, we then examined whether a uTAR was differentially expressed across cell types using Wilcoxon rank-sum tests. This analysis was performed separately for each tissue and individual lemur. A DE uTAR was defined as a uTAR if it met the following criteria: had a significant Bonferroni-corrected P < 0.05 from the two-tailed Wilcoxon rank-sum test; expressed in ≥25% of cells of a cell type; and had a cell-type mean expression level of ≥1.65 (e^0.5) times the average of other cell types in the same tissue. Some of the uTARs passed the differential expression test in multiple cell types and/or tissues. Together, a total of 4,003 DE uTARs were identified. To infer their identity, we applied BLASTn on each of the DE uTARs against the nucleotide collection (nt) database (with a threshold of maximum e value of 0.01 and a minimum bit score of 50) using either the entire length of the uTAR or the peak coverage region (full width at half maximum region around the absolute peak in coverage after Gaussian smoothing in the uTAR location). Occasionally, multiple uTARs aligned to the same gene in another species. The genome location and inferred homology of all DE uTARs and their expression levels across the cells in the atlas (10x data) are provided in Supplementary Tables 1 and 2. There were 30 DE uTARs with a BLASTn result that corresponded to one of the 2,060 human genes without a mouse lemur orthologue, which are probably genes missed from annotations of Mmur 3.0 in the NCBI and Ensemble databases (for example, GSTA3 in tendon cells of the bone, TIGD1 in CD4⁺ T cells and SPRR2G in suprabasal epidermal cells; Supplementary Table 1). Note that 14 out of these 30 genes have no mouse orthologue, which suggests that these are PS genes.

DE uTARs were also classified as protein-coding or non-coding using the custom program Nf-core/predictorthologs (https://github.com/czbiohub-sf/nf-predictorthologs)⁶⁰, with sequences containing 95% or more k-mers (k = 9) matching a reference database of mammalian proteins from UniProt assigned as putatively protein-coding, and otherwise assigned as non-coding. Coding sequences were then annotated using DIAMOND blast⁶¹ and non-coding sequences were annotated using Infernal cmscan (https://www.ebi.ac.uk/Tools/rna/infernal_cmscan/), both algorithms that are built into Nf-core/predictorthologs. Some uTARs had both coding reads and non-coding reads, which potentially represent incompletely spliced transcripts or untranslated regions.

To detect the developmental trajectory of the sperm lineage cells in the testes 10x dataset using the uTAR expression data, we followed the same procedure as for detecting the trajectory through gene expression data, which is described in the accompanying paper¹. The analysis included a total of about 50,000 uTARs that have transcript reads in the testis dataset. Each uTAR was treated as a gene, and data were first normalized for scRNA-seq library size (to 10,000 total uTAR transcripts per cell) and natural log-transformed. The top highly variable uTARs (around 1,500) were then used for principal component analysis. The top 20 principal components that were not driven by extreme outlier data or immediate early genes were used to construct a two-dimensional (2D) UMAP using cell–cell Euclidean distances as input. The pseudotime developmental trajectory was then identified as the density ridge of the data in this 2D UMAP through automated image processing. Cells were assigned to the trajectory on the basis of the shortest connecting distance. The pseudotime trajectory coordinates of the cells were linearly normalized such that the trajectory started at 0 and ended at 1, and then were compared with the pseudotime coordinates derived from the gene-based trajectory using Pearson’s correlation. The uTAR expression data were similarly pre-processed (normalized, scaled and UMAP embedded) in other analyses, including when comparing the UMAP cell distribution patterns of the 10x colon dataset (Extended Data Fig. 1d) and in the silhouette coefficient analysis (Extended Data Fig. 1a).

To measure the consistency of cell distribution patterns for 10x datasets embedded in a UMAP using the gene, aTAR and uTAR expression spaces, silhouette coefficient values were calculated for each dataset (separated by tissue and individual lemur, and sequencing channel). Cells were grouped according to cell-type designation (free annotation). The silhouette coefficient value for each cell i in a dataset was calculated as s(i) = (b(i) – a(i))/max{b(i),a(i)}, where a(i) is the mean in-group distance (mean distance of cell i to the other cells in the same cell type) and b(i) is the minimal out-group distance (minimal distance of cell i to any cell in the tissue of a different cell type). The cell silhouette coefficient values were then averaged across each dataset to derive the dataset-averaged silhouette value, which is an overall score representing how well each cell type co-clusters and separates from other cell types in the UMAP embedded space, with higher positive values representing better separation. The silhouette coefficient values were calculated separately using the cell-to-cell distances in UMAPs based on the gene expression, aTAR expression and uTAR expression spaces, and then compared with a box plot (Extended Data Fig. 1a).

The lists of genes for each category shown in Extended Data Fig. 1f were obtained as follows. Lists of the top n variable NCBI-annotated genes were derived by applying variance-stabilizing transformation to the entire 10x dataset and selecting the genes with top n transformed variance. The list of the genes annotated by Ensembl only (not by NCBI) were detected by comparing the genomic positions of mouse lemur genes annotated in the two databases, searching for the Ensembl-annotated gene with no overlap in NCBI. The PS genes were derived as described below and listed in Supplementary Table 9.

To determine the amount of transcriptomic sequence information provided by the mouse lemur scRNA-seq atlas datasets, we estimated the total number of sequenced base pairs that mapped to the mouse lemur genome. For the 10x datasets, the total number of aligned paired-end reads (around 1.63 × 10¹⁰) was multiplied by the number of base pairs per read (90), which equated to 1.46 × 10¹²bp, although this number is inflated by PCR duplicates. However, summing the number of UMIs across all cells in the atlas (around 1.22 × 10⁹) and multiplying it by the number of base pairs per read (90) equated to 1.10 × 10¹¹bp, which is 10-fold less than the estimate with total reads. For the SS2 datasets, the total number of aligned paired-end reads (1.11 × 10¹⁰) was multiplied by the number of base pairs per read (100), which equated to 1.11 × 10¹²bp (unique reads are not possible to identify in SS2 datasets). By comparison, the total number of bulk RNA-seq base pairs used by the NCBI for gene prediction and annotation of Mmur 3.0 equates to about 3.4 × 10¹¹bp (across around 3.4 × 10⁹ total reads), which was calculated by summing the number of sequenced base pairs for each RNA-seq run (all tissues and generic samples included), information obtained from the Sequence Read Archive (SRA) (biosample identification numbers and corresponding SRA link available at https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Microcebus_murinus/101/). However, this estimate is inflated because it includes base pairs from unaligned reads and does not correct for PCR duplicates.

SICILIAN splicing analysis

To identify splice junctions of mouse lemur transcripts, we used SICILIAN, a statistical method used for unbiased and annotation-free detection of splice junctions^14,62. Raw sequencing reads from the atlas (10x and SS2 datasets) were first aligned to the mouse lemur genome assembly Mmur 3.0 using the STAR algorithm with parameters chimSegmentMin = 12, chimJunctionOverhangMin = 10, chimOutType = “WithinBAM SoftClip Junctions”, and default values for the rest of the parameters. SICILIAN then extracts spliced reads that mapped to discontinuous regions of the genome. These reads could provide evidence for candidate splice junction sites but may also reflect sequencing noise or alignment error. For each of these potentially spliced scRNA-seq reads, SICILIAN estimates a read-level confidence score, which quantifies the probability that the alignment of the read is true based on features that influence sequence alignment (for example, sequence entropy of the read, sequence mismatches and number of mapping locations for the read in the genome). It then incorporates all the read-level scores for the reads aligned to each potential splice junction and computes a final confidence score (empirical P value) for the junction. The empirical P value was computed for each SS2 cell and 10x channel in the dataset, separated by lemur individual, then the median was calculated for these empirical P values across the dataset. Junctions with median empirical P < 0.1 were selected for follow-up analyses. The threshold 0.1 was determined as the optimal point in the receiver operating characteristic (ROC) curve based on simulated data in the article describing the SICILIAN method¹⁴ (see ROC in Fig. 1e), which identified that a threshold of 0.15 maximizes discovery sensitivity and specificity (Youden’s index) using bulk simulated RNA-seq data. Here we prioritized specificity to identify high-confidence novel junctions and therefore used a more stringent threshold (0.1).

The detected splice junctions were compared with the junctions in transcripts annotated in the NCBI annotation release 101 of the mouse lemur genome assembly Mmur 3.0. The detected splice junctions were categorized into five types (Fig. 1g). Type A refers to junctions that matched an annotated splicing pattern. Types B–D refer to junctions that aligned to an annotated gene but the specific splicing pattern is unannotated. Specifically, type B contains junctions in which both the donor (5′) and acceptor (3′) splice sites are annotated but not previously paired (for example, unannotated exon skipping), type C contains junctions in which one site is annotated and the other is not (for example, annotated donor site but unannotated acceptor site) and type D contains junctions in which both splices sites are unannotated. Type E refers to detected junctions that do not align to any annotated gene.

To examine whether the detected splice junctions are conserved in human or mouse genomes, we used the UCSC LiftOver tool from the UCSC genome browser⁶³ and computed the fraction of junctions annotated in the mouse and/or human genome by considering a junction as conserved only if it had successful LiftOver conversion to the other genome (that is, both 5′ and 3′ splice sites in the lemur genome mapped successfully to unique coordinates in the other genome).

To identify cell-type-specific splicing events, we performed two-tailed MANOVA separately on the 10x data from each tissue. Cell types with fewer than ten cells or junctions present in fewer than two cells were removed from the analysis. To highlight splicing events with global effects, we analysed the genes with at least two spliced reads mapping to the gene in each cell type in the tissue. Note, however, our approach can easily be extended to include more genes expressed in only a subset of cell types. Let C be the set of cells, and J be the set of junctions for this gene. Consider cell \(m\in {C}\) and junction \(i\in J\) for a particular gene. Let \({n}_{m}^{(i)}\) be the number of reads mapping to junction \(i\) in cell \(m\). The fraction of junctional reads mapping to junction i in cell m is therefore defined as follows: \({f}_{m}^{(i)}={n}_{m}^{(i)}/\sum _{j{\epsilon }J}{n}_{m}^{(j)}\). The dataset average fraction of junction i was then calculated as follows: \({f}^{(i)}=\sum _{c\in C}{n}_{c}^{(i)}/\sum _{c\in C}\sum _{j\in J}{n}_{c}^{(j)}\). The scaled z score for junction i in cell m was defined as follows: \({z}_{m}^{(i)}=\left(\sqrt{\sum _{j\in J}{n}_{m}^{(j)}}\right)({f}_{m}^{(i)}-{f}^{(i)}\,)/\sqrt{{f}^{(i)}(1-{f}^{(i)}\,)}\). MANOVA was then performed using the cellular z scores of each junction for the gene as input and the cell type (free annotation) as output. This analysis generated, for each gene, a P value of all cell types in the tissue having the same multivariate mean junctional expression. Benjamini–Hochberg correction was then applied to all P values. To identify a list of candidate junctions with the most significant cell-type differential splicing, a stringent threshold (corrected P < 10^–16) was used, which resulted in 545 junctions.

SAMap analysis to study the conservation of gene expression patterns across species

To compare expression patterns of homologous genes across the human, lemur and mouse genomes, we used the SAMap method^17,64. In addition to the lemur 10x cell atlas datasets, mouse and human 10x scRNA-seq datasets were retrieved from Tabula Muris Senis⁶⁵ and Tabula Sapiens⁶⁶, respectively. We applied SAMap to these datasets to compare the cell-type expression patterns of homologous genes across the three species. Although the SAMap algorithm does not require cell-type labels, having comparable annotations simplifies the interpretation of the mapping results. Therefore, we limited the analysis to the tissues (lung and skeletal muscle) that were re-annotated using the same standards as described in the accompanying paper¹. This cross-species data with new unified cell-type designation can be found on Globus (see Data availability).

SAMap was used to simultaneously map genes and cells across the three species (human, lemur and mouse). For each pairwise combination of species, SAMap first detects homologous genes (sequence homologues) through bidirectional BLAST analysis of the transcriptomes of the two species, as annotated by Ensembl and NCBI. A cross-species gene-to-gene graph is then generated, with edges connecting a gene in one species and a homologous gene in the other species and edge weights assigned as sequence similarity of the gene pairs. The homology graphs from all pairwise comparisons of species were combined into one, tripartite adjacency matrix. Using this initial gene graph, SAMap projects the three scRNA-seq datasets into a joint, lower-dimensional manifold representation. This joint manifold enables estimation of similarity between cells and genes across species. Note that the SAMap method considers not only one-to-one orthologues but also integrates many-to-one, one-to-many and many-to-many orthologous genes as well as the non-orthogonal relationship between non-orthologous genes, which are commonly ignored in cross-species comparisons given their complexity. Next, the expression correlations between homologous genes were calculated in the initial joint manifold to re-weight the edges of the gene–gene homology graph. Using the re-weighted homology graph as the new input, SAMap then iterates until convergence to generate a final joint manifold. The expression correlation between homologous genes of the two species calculated across the joint manifold quantifies the similarity of the expression patterns of two genes. Homologous gene pairs with an expression correlation higher than 0.3 were deemed expression homologues; that is, homologues that share similar expression patterns across mapped cells. Triads of mapped expression homologues from human, lemur and mouse datasets were identified.

We then examined for each expression homologue triad whether the three gene pairs were assigned as orthologous genes in NCBI and/or Ensembl (Supplementary Table 8). We further examined for each expression homologue triad whether the lemur gene is named or unnamed in NCBI annotation release 101 of the mouse lemur genome assembly Mmur 3.0 (with only a locus identifier, for example, ‘Loc__’ or ‘orf’). Expression homologue triads were then categorized into three types (Fig. 1l). Type ‘named orthologue’ refers to triads that consist of three orthologous gene pairs, and the lemur gene is named accordingly. Type ‘unnamed orthologue’ refers to triads of three orthologous gene pairs but the lemur gene is unnamed. Type ‘non-orthologue’ refers to triads that contain at least one non-orthologous gene pair, regardless of the naming status of the lemur gene. Supplementary Table 5 lists the expression homologues detected in this study.

Quantification of genes that are named (with a gene symbol), unnamed (only a locus identifier, for example, ‘Loc__’ or ‘orf’, and a suggested gene description) or uncharacterized (unnamed and with no gene description) for Fig. 1k was obtained from the following databases: genome assembly Mmur 3.0 and NCBI annotation release 101 for mouse lemurs; assembly GRCh38.p13 and NCBI annotation release 109 for humans; and assembly GRCm38.p6 and NCBI annotation release 108 for mice.

BCR analysis

To improve the annotation level of mouse lemur BCR loci (which contain immunoglobin genes), we first used BLAST⁶⁷ with human BCR genes (retrieved from ImMunoGeneTics (IMGT)⁶⁸) to search for unannotated variable and constant region genes (for example, IGG) in the mouse lemur genome (Mmur 3.0, NCBI annotation release 101). We then built a custom reference database from these retrieved mouse lemur immunoglobulin genes and the human IMGT sequences to extract transcripts and to assemble the immunoglobulin sequence for each of the 829 B cell and plasma cells from the SS2 data analysed using BASIC⁶⁹. Immunoglobulin sequences obtained through BLAST searches of the transcriptomes from a subset of mouse lemur atlas B cells were added to the custom reference database to further improve alignment. Constant regions from both the heavy and light chain contigs assembled using BASIC were aligned to the reference database using BLAST, and the best hit (with at least 80 nucleotides of overlap) was used to assign the isotype (that is, IGA, IGG, IGM or IGE for the heavy chain, and IGK or IGL for the light chain) for each cell. Putative V and J gene families and the CDR3 sequences from both the heavy and light chain contigs were identified using IgBlast⁷⁰. In some cases, BASIC was not able to generate a contig for the heavy and/or light chain; therefore, the isotype was not assigned for these cells (52 and 10 cells for the heavy and light chains, respectively). In other cases, BASIC was unable to assemble a single continuous contig from both constant and variable region ends of either the heavy and/or light chain, and therefore, we submitted to BLAST and IgBLAST the two contigs constructed from each end (94 and 100 cells for heavy and light chains, respectively) or the only constructed contig from one end (147, 1, 28 and 2 cells with only heavy chain constant, heavy chain variable, light chain constant and light chain variable contigs, respectively). In rare cases, constant-region isotypes were not assigned for cells for which BLAST returned different hits from the variable and constant region ends (16 and 13 cells for heavy and light chains, respectively). Similarly, V gene families were not assigned for cells for which IgBLAST returned different hits (alignment quality V score > 100) from the constant and variable contigs (3 and 59 cells for heavy and light chains, respectively). These discrepancies may be caused by doublets of B cells and plasma cells (although applying the program Scrublet⁷¹ with default parameters identified only 3 out of the 80 cells with 2 different BLAST or IgBLAST hits as possible doublets) or more probably, reflect dual expression as more recently appreciated⁷². CDRH3 lengths were calculated as the number of amino acids between the canonical C at the 5′ end of the sequence and the 3′ sequence WGXG, where X is any amino acid. CDRL3 (including λ and κ chains) lengths were calculated as the number of amino acids between the canonical C at the 5′ end of the sequence and the 3′ sequence FGXG or WGXG, where X is any amino acid.

All immunoglobulin sequences assembled through BASIC were then used to determine the minimum number of constant and variable region alleles in the mouse lemur genome. These sequences were aligned using MAFFT⁷³ and then manually corrected using Geneious Prime (v.2021.1.1; https://www.geneious.com). Because somatic mutation patterns in IGV genes can render a single V gene indistinguishable from separate but closely related alleles, we estimated a minimum number of V genes based on the number of V loci that occur in the current assembly of the genome. Long-read sequences covering this region would help determine the true number of V gene loci.

Clonal lineages were identified as groups (n ≥ 2) of cells in a single individual lemur with the same light chain isotype and identical CDRH3 and CDRL3 lengths, with both having at least 80% identity across the cells.

MHC gene expression analysis

The methods used to examine mouse lemur MHC gene structure, to extract MHC gene expression from the atlas and to re-annotate MHC genes are detailed in a previous study²¹. In brief, raw fastq files from 10x scRNA-seq data from each organ for all individual lemurs were mapped against a MHC reference sequence extracted from the Mmur 3.0 genome assembly according to the bacterial artificial chromosome (BAC) sequences^74,75, as well as the known expressed mouse lemur class I Mimu-W01-04 (ref. ⁷⁶) (GenBank accession numbers are provided in Supplementary Note 3) using bowtie2 (ref. ⁷⁷). The mapping results were assessed for mismapping of reads, allelic variation and the possible presence of additional genes through manual inspection using the Integrative Genomics Viewer (IGV)⁷⁸ and Geneious Prime (v.2021.2.2). The fastq files were also ‘probed in silico’ by searching for reads that contained sequences specific to the known genes. This approach confirmed the absence of expression of particular MHC genes. For the analysis of expression levels, a reference specific to each individual was created and used with bowtie2 to map the reads extracted from the raw fastq files for each tissue. The sequences used were restricted to the final 600 bp (comprising exon 5 through the 3′ UTR) to avoid complications from potential recombinant sequences. Manual inspection of the results from the blood 10x scRNA-seq files was used to determine a mapping quality (MAPQ) threshold for each gene. The sequence alignment map (SAM) file from the mapping was converted to a BAM file and then divided into individual BAM files for each gene. These individual files were then filtered to remove reads below the MAPQ threshold. For the remaining reads, the cell barcode and UMI were counted. The expression level was normalized as read counts per 10,000 UMIs and then natural log transformed. Counts for each MHC gene (raw and normalized) are available in the metadata for every h5ad file in Figshare (https://figshare.com/projects/Tabula_Microcebus/112227). SS2 data were not used owing to limited data available regarding allelic variation and recombination between alleles and/or genes that is prevalent in the MHC. The lack of phase information for the SS2 data made it impossible to accurately assign all sequences to specific genes (only the terminal 3′ 600–650 bp could be assigned with confidence to a particular class I gene). Discarding the upstream information would have biased the expression-level results. Thus, we chose to focus on the 10x dataset, for which the majority of the sequences were obtained from a single region that fell within 600–650 bp of the 3′ end and therefore could be unambiguously assigned to a specific gene.

Chemokine ligand and receptor expression analysis

A list of human chemokine receptors was compiled from the literature^79,80 and their cognate ligands were obtained from CellPhoneDB⁸¹ (Supplementary Table 7). We included the four atypical chemokine receptors, which induce G-protein-independent downstream signalling⁸², as well as the chemerin (encoded by RARRES2) receptors (CMKLR1, GPR1 and CCRL2) given their established dual role in immune and adipokine chemoattraction⁸³. The chemokine CXCL17, without a known receptor, was also included. Of the 25 identified receptors, a corresponding lemur orthologue annotated in NCBI was found for all except CCR2. Of the 45 cognate ligands, a corresponding lemur orthologue was identified for 32. The expression level of each of the lemur orthologues across all cell types in the 10x dataset is summarized in Supplementary Fig. 3c. Cell-type expression levels for each gene were then binarized (that is, expressed or not expressed) based on absolute and relative thresholds for the purpose of building an interaction network, per below. First, an absolute threshold was applied, which required that a gene is expressed at non-zero levels in at least 5% of cells of a cell type and with a mean expression level of at least 0.5 across all cells from that cell type. Second, a relative threshold was applied, whereby for each gene, a ceiling expression level was defined as the expression level of the 99th percentile of all cell types that passed the first threshold (to remove outliers with abnormally high expression levels). Cell types with a receptor gene mean expression level above 5% of the ceiling were deemed to be expressing the receptor. A higher threshold (20%) was applied for ligand genes given that ligands are diffusible and therefore require high levels to be functional.

To build a chemokine interaction network across all cell types in the atlas, connections (edges) were drawn between cell types (nodes) expressing a ligand and cell types expressing the cognate receptor. Self-loops were allowed, wherein a cell type expressed both the ligand and the corresponding receptor. Connections between cell types from different organs (other than blood) were excluded given the short effective intercellular communication distances of chemokine signals. Note that edges are directed such that cell type A expressing a ligand and cell type B expressing the cognate receptor formed a separate edge from cell type B expressing the same ligand and cell type A expressing the cognate receptor. Multiple connections in the same direction between two nodes (that is, two cell types with more than one receptor–ligand interaction) were counted as a single edge. The network density was calculated as the number of edges identified divided by the total number of possible edges in the network: \(\frac{{N}_{{edges}}}{{N}_{{nodes}}^{2}}\). The density was calculated separately for the following networks: interactions across all cell types in the atlas; only immune cell types; only non-immune cell types; and between immune and non-immune cell types.

Cross-organ immune cell analysis

Immune subpopulations were identified and annotated through the systematic subclustering of the lymphoid and myeloid compartment in each tissue for every individual lemur, then adjusted through inspection using cellxgene after integration of tissues across all individuals to ensure consistency of cell-type labelling, as described in the accompanying paper¹. Clusters branching off the main group were labelled with a differentially expressed gene (DEG) (for example, neutrophil (CCL13⁺), neutrophil (IL18BP⁺) and B cell (SOX5⁺)), and cells expressing proliferative markers (MKI67 and TOP2A) were appended with ‘PF’ (for example, B cell (PF)). For macrophages, their identities as tissue-resident macrophages based on published marker genes (see supplementary table 1 in the accompanying paper¹) was indicated by appending the corresponding name (for example, macrophage (Kupffer cell), macrophage (microglial cell)), given that a clear distinction from monocyte-derived macrophages was challenging (with the exception of lung tissue-resident alveolar and monocyte-derived interstitial macrophages, which were confidently distinguished on the basis of canonical markers and labelled as such). Identification of DEGs for each subpopulation was performed using two-tailed Wilcoxon rank-sum tests, selecting genes with log fold change ≥ 1 and P < 0.05 after adjustment by using Benjamini–Hochberg correction.

For the cross-organ monocyte–macrophage analysis, all granulocyte–monocyte precursors, monocytes and macrophages from the atlas were extracted for further analysis. Data were integrated across the four lemur individuals and then across the scRNA-seq methods (10x and SS2 datasets) using the FIRM algorithm⁸⁴ to correct for batch effects. In this integrated UMAP, monocyte populations co-clustered across tissues, whereas macrophage populations were generally separated by tissue. We therefore tried additional FIRM integration across tissues; however, tissue-specific separation of macrophage types and bladder monocytes from lemur L2 remained. Therefore, we did not perform tissue-level FIRM integration for the final UMAP to avoid potential computational bias from overcorrection. We then examined the expression levels of known monocyte and macrophage markers reported in the literature as well as the distribution of monocytes and macrophages from each tissue in the FIRM-integrated UMAP (Extended Data Fig. 8 and Supplementary Fig. 4). In addition to the tissue-specific and tissue-resident populations highlighted in Supplementary Fig. 4, we found that pancreatic and heart macrophages formed separate populations. However, these results were excluded from further analysis because they probably resulted from technical issues. That is, the DEGs for pancreatic macrophages were broadly expressed in other cell types of the same tissue (signal spreading), and the heart sample had overall fewer transcripts per cell (lower quality).

The neutrophil developmental trajectory was based on embedding of neutrophils in the FIRM-integrated UMAP of the entire atlas (as described in the accompanying paper¹). This resulted in co-clustering of neutrophils by the individual and tissue, which enabled recapitulation of the developmental trajectory. We also tried FIRM integration of neutrophils alone (by individual and scRNA-seq methods). However, this resulted in separation of neutrophils by tissue and individual, which was largely driven by batch effects (no biologically meaningful DEGs were identified across most clusters). This result suggests that neutrophils are more molecularly homogeneous across tissues compared with other cell types such as macrophages. The trajectory was obtained using an in-house algorithm that detects the density ridge of the cell distribution on the FIRM-integrated UMAP embedding, as described in the accompanying paper¹, with the direction of the trajectory manually assigned on the basis of the expression of neutrophil maturation markers. Similar to the neutrophils, the FIRM-integrated UMAP of B cells and plasma cells showed global separation of plasma cells and B cells. However, further cell separation was driven by batch effects. In the atlas-wide UMAP, the clear separation between B cells and plasma cells precluded further trajectory analysis.

Endometrial cancer analysis

Uterine cancer was identified by scRNA-seq and later confirmed by histopathology in both of the female lemurs (L2 and L3). Both had metastases, with L2 showing spread to the lung and L3 to an intra-abdominal lymph node. We analysed lung metastasis in lemur L2 and the primary tumour in lemur L3. The uterus of L2 was not analysed by scRNA-seq because we were unaware of the tumour at the time of tissue collection. For lemur L3, we were unable to sequence the metastasis given the liquefactive necrotic nature of the tissue.

To compare the novel lung epithelial cell cluster in L2 (retrospectively identified as endometrial tumour cells metastasized from the uterus) with all other cell types of the atlas, we examined the correlation scores (Fig. 3d) and UMAP embedding (Extended Data Fig. 9e) of their gene expression profiles using the methods described in the accompanying paper¹. Here we extracted results of the metastatic tumour cell type. In brief, to calculate the cell-type pairwise correlation scores with the lung metastatic tumour cell type in lemur L2, atlas data were first integrated across individuals, tissues and scRNA-seq methods (10x and SS2) using FIRM⁸⁴, and FIRM-generated principal component coefficients were calculated for each cell. The coefficients were then averaged across all cells of a cell type and used to calculate the Pearson’s correlation scores between every atlas cell type and the metastatic cells. The lung cell type in L2 that is a hybrid of metastatic and AT2 cells, which could be doublets of the two cell types (although Scrublet⁷¹ only identified one of these five cells as a possible doublet) was excluded from the analysis. The lung metastatic tumour cells in L2 had high correlation (0.94) with the uterine non-ciliated epithelial cells (FXYD4⁺MUC16⁺) in L3, the presumptive primary tumour. Other cell types with high correlation scores to the metastatic cells included kidney ductal and secretory cells (0.80–0.95), pancreatic ductal cells (0.85–0.88), other uterine epithelial cells (0.70–0.89), fat urothelial cells (0.87), liver cholangiocytes (0.86) and brain ependymal (0.65).

To generate the cell-type UMAP (Extended Data Fig. 9e), gene expression levels were averaged across cells for each cell type (10x dataset, excluding low-quality cell types and ones represented by <4 individual cells). Expression levels were normalized (0 to 1 scale) to the maximal value of each gene across all cell types, and the normalized cell-type gene expression matrix was projected onto a 2D space with cosine distances between pairs of cell types as input.

Differential gene expression analysis was performed on lung metastatic cells versus all other lung epithelial cells and on uterine FXYD4⁺MUC16⁺ epithelial cells versus all other uterine epithelial cells (10x datasets) using two-tailed Wilcoxon rank-sum tests (P < 0.05, after adjustment using the Benjamini–Hochberg method), and selected examples are presented in Fig. 3e.

Adipocyte analysis

Adipocytes and adipo-CAR cells were extracted from the FIRM-scaled and integrated data of the entire atlas (1,231 cells, 10x and SS2 datasets, see accompanying paper¹). The top 3,000 highly variable genes in the FIRM-transformed gene count table of adipocytes and adipo-CAR cells were selected, and dimensionality reduction by principal component analysis was performed (top 13 principal components) to generate a 2D UMAP of adipocytes and adipo-CAR cells. Differential gene expression analysis on the UCP1^high and UCP1^low adipocyte populations (L2 and L4, 10x data) was performed using two-tailed Wilcoxon rank-sum tests (P < 0.05, after adjustment using the Benjamini–Hochberg method), and example genes were selected for presentation in Fig. 3i. Similarly, differential gene expression analysis was performed between the adipocytes of each fat depot of L2 (BAT, GAT, MAT and SCAT), and the top ten genes enriched in each depot were selected for presentation in Extended Data Fig. 10e.

Most of the adipocytes in the atlas were isolated from fat depots, for which the tissue-dissociation protocol was designed to enrich for the stromal vascular fraction and exclude adipocytes (see the supplementary methods in the accompanying paper¹). Most were from L2 (Extended Data Fig. 10a), whose adipocytes in fat depots surrounding several tissues (for example, kidney, spine and uterus) were generally small, predominantly multilocular, densely stained and mitochondrial rich (Extended Data Fig. 10d). These are also features of brown or beige adipocytes in humans and mice. They intermingled with small, unilocular adipocytes with a single lipid droplet, which resemble white adipocytes. By contrast, adipocytes from L3 and L4 were generally larger and most were unilocular (Extended Data Fig. 10d). These may be harder to capture using current scRNA-seq protocols, so may have contributed to the lower yield of adipocytes for L3 and L4.

Identification of PS genes and analysis of their expression patterns in lemur and human genomes

A list of human and lemur orthologous genes with no corresponding mouse orthologue was compiled by merging human, mouse lemur and mouse homology assignments from NCBI, Ensembl and MGI databases using a similar method used to compile the list of one-to-one-to-one gene orthologues for the comparison of cell types across the three species in the accompanying paper¹. We began by compiling all human protein-coding genes annotated in NCBI (taxonomy identifier (ID): 9606), then merged the corresponding mouse lemur and mouse orthologues from NCBI (gene_info.gz and gene_orthologs.gz from https://ftp.ncbi.nlm.nih.gov/gene/DATA/, accessed February 2020 and August 2023). We next added Ensembl gene ID numbers, gene names and lemur or mouse orthologue assignments from Ensembl Biomart (Ensembl Genes v.99, February 2020), using the Ensembl gene ID (variable ‘Gene_stable_ID’) for each NCBI gene ID (variable ‘NCBI_gene_ID’) in Ensembl Biomart. MGI mouse gene ID numbers, gene names and orthologue assignments (none provided for lemur) from the Jackson Laboratory (HOM_MouseHumanSequence.rpt from http://www.informatics.jax.org/downloads/reports/, Feb 2020) were added using the MGI homology ID (variable ‘HomoloGene_ID’) attributed to each NCBI gene ID (variable ‘EntrezGene_ID’) in the MGI database. The Online Mendelian Inheritance of Man (OMIM)⁵⁵ genetic disorder phenotype associated with each human gene (genemap2.txt from https://www.omim.org/downloads January 2022, variable ‘Phenotypes’) was added using the gene name (variable ‘Approved_Gene_Symbol’) in the OMIM database.

A human gene was identified as sharing an orthologue with lemurs if at least one such assignment was made by either Ensembl or NCBI, and/or as sharing an orthologue with mouse if at least one such assignment was made by NCBI, Ensembl or MGI (Supplementary Table 8). This approach resulted in 539 human genes with an assigned lemur, but no mouse, orthologue (that is, PS genes), which corresponded to 388 unique lemur Ensembl gene IDs and to 425 unique lemur NCBI genes IDs (not all orthologues are annotated in both NCBI and Ensembl). Note that gene orthology assignments from NCBI, Ensembl and MGI are periodically updated; thus, these numbers may change in the future. Transcripts were detected for 401 out of the 425 PS NCBI-annotated genes, and their expression patterns across all lemur atlas cell types (10x dataset) were visualized in heatmaps and dot plots and qualitatively categorized by whether their expression was enriched (higher or restricted expression) or depleted in one or more tissues or organs or compartments (Supplementary Table 9 and Supplementary Fig. 5). Expressed genes that did not show any of these expression patterns were categorized as ‘not enriched in any category’.

Gene set enrichment analysis of the 539 PS genes was performed using gprofiler2 in R⁸⁵, searching for overrepresented gene sets (relative to all human-annotated genes) in gene ontology terms, biological pathways, regulatory DNA elements, human disease gene annotations and protein–protein interaction networks, using default parameters (for example, user_threshold = 0.05, correction_method = “g_SCS” for Fisher’s one-tailed test with multiple testing correction).

We further analysed evolutionary conservation in the expression patterns of the PS genes that have one-to-one orthology mapping between humans and lemurs (Supplementary Table 9). The analysis followed a similar pipeline as described in the accompanying paper¹, in which we compared across human, lemur, mouse and macaque using one-to-one orthologues (that is, not including any PS genes). We analysed cells from the lung, skeletal muscle, liver (epithelial cell only), testis (germ cell only), as well as bone marrow and spleen (immune cells only). To unify cell-type annotation, data of different species were integrated using Portal with around 15,000 one-to-one orthologues, and cells were re-annotated for consistent designation across all species. Here we applied the same cross-species cell type annotation and compared between human and lemur only, with lemur data from the Tabula Microcebus atlas¹ and human data from the Human Lung Cell Atlas⁷⁹ (lung), ref. ⁸⁶ (testis) and Tabula Sapiens⁶⁶ (rest of the tissues).

With manual curations, we identified a total of 398 PS genes with one-to-one orthology mapping between humans and lemurs. Note that NCBI and Ensembl occasionally have inconsistent orthology assignments. For example, one database may assign a one-to-one mapping, whereas the other database may assign a one-to-many mapping. In such cases, we prioritized NCBI mapping but also maximized coverage by retaining the orthologues with identical gene symbols or description in both species. Next, we analysed 346 of the PS genes that were reported in all scRNA-seq datasets described above. Because the number of annotated genes were different between humans and lemurs, we normalized the transcript counts of the PS genes against the background of all one-to-one orthologues and then log transformed the expression levels (that is, ln(UP10K + 1)). For each PS gene, mean expression (E_max) in the maximally expressed cell type in each species was quantified. Next, we filtered for PS genes with notable expression across the analysed cell types, requiring E_max > 0.5 in each species, or E_max > 0.1 in each species and E_max > 1.5 in at least one species. This resulted in a total of 93 PS genes for which we quantified their expression pattern similarity between humans and lemurs. Mean cell type expression levels across the orthologous cell types were normalized by E_max of the same species, and Pearson’s correlation coefficients between humans and lemurs were calculated and reported in Supplementary Table 9. Most (55%, 51 out of 93) of the analysed PS genes had a correlation coefficient above 0.5, which indicated conservation in their expression patterns between lemurs and humans.

Analysis of natural mutations

We performed whole-genome sequencing for each of the four lemurs (L1–L4) used to create the atlas, along with 31 additional lemurs originating from the same laboratory colony (median 54× coverage). Methods for the whole-genome sequencing and analysis pipeline are detailed in a recent study⁵⁸ and in the accompanying paper¹. In brief, genomic DNA libraries were generated through Tn5-based tagmentation, and indexed and PCR-amplified for 150 bp paired-end short-read Illumina sequencing. Sequencing reads were aligned to the current Mmur 3.0 genome assembly (NCBI RefSeq assembly accession GCF_000165445.2) and germline variants were identified using the Sentieon DNAseq workflow. Variants were annotated and filtered, and functional impact predictions were made using the SnpEff & SnpSift toolbox. This resulted in around 45 million total variants across all 35 lemurs. To identify functionally significant rare nonsense variants relevant for this study, we first filtered for three criteria: (1) allele frequency < 0.5; (2) base call quality > 99.9%; and (3) homozygous or heterozygous variants present in at least one of the lemurs (L1–L4) used for the atlas. This resulted in 6,905 variants. Next, we refined this list by filtering for variants for which their respective genotypes were identified in all sequenced animals, focusing on nonsense variants by looking at those predicted to cause frameshift mutations, alterations in the stop codon and variants computationally predicted to cause NMD. This narrowed our final list to 713 unique variants found in 713 genes (1 per gene).

To analyse the transcriptional impact of these nonsense variants, we compared cell-type-specific gene expression of the affected gene in the four lemurs used to create the atlas. We prioritized genes that were abundantly expressed and potentially functionally important (for example, absent in mice). Cell-type specific scRNA-seq reads were identified by their 10x barcodes, parsed from the original post-alignment BAM files for each lemur and counted using Samtools (v.1.16.1) across the respective gene. This enabled discernment of the number of reads with the reference allele versus those with the alternative allele at the variant locus, along with the total number of reads mapping to the gene. Quantifying and analysing the differing allelic expression patterns of these genes, in the presence and absence of the variant, enabled us to verify nonsense variants linked to significant reductions in gene expression. To compare gene expression in WT and mutant individuals, cells were grouped by their cell-type designation (without distinguishing their tissue of origin). Cell-type average expression was then calculated for each individual separately (excluding cell types with <35 profiled cells (10x)). Cell types with no expression or low expression of the gene (ln(UP10K + 1) < 0.3) in control animals were not plotted in Fig. 5e,i,m and Extended Data Fig. 11c.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Mouse lemur cell atlas informs primate genes, physiology and disease

uTAR analysis to identify unannotated genes

SICILIAN splicing analysis

SAMap analysis to study the conservation of gene expression patterns across species

BCR analysis

MHC gene expression analysis

Chemokine ligand and receptor expression analysis

Cross-organ immune cell analysis

Endometrial cancer analysis

Adipocyte analysis

Identification of PS genes and analysis of their expression patterns in lemur and human genomes

Analysis of natural mutations

Reporting summary

Stem-cell treatment strengthens people with age-related frailty

White House stalls release of approved US science budgets

I will continue the fight for environmental justice in Black communities

Most Popular

Kimora Lee Simmons Wants To Drop Russell’s Last Name (HOLD)

Aerin Launches Wedding Registry

Frontier AI labs’ policies around military use of their AI tools are incoherent, vague, and often prone to change, allowing leadership to preserve “optionality”...

Baby Gear Guide: Strollers, Car Seats

Recent Comments

ABOUT US

POPULAR POSTS

Kimora Lee Simmons Wants To Drop Russell’s Last Name (HOLD)

Aerin Launches Wedding Registry

Frontier AI labs’ policies around military use of their AI tools are incoherent, vague, and often prone to change, allowing leadership to preserve “optionality”...

POPULAR CATEGORY