Methods summary
An overview of all experiments performed for this study is included in Extended Data Fig. 1a and Supplementary Table 1. For the mouse experiments with an available ground truth from the LARRY lentiviral barcoding system (experiments M.1, M.2, M.4, M.5), LARRY barcoding vectors were constructed and lentiviruses were produced (see the section ‘Lentiviral barcoding using the LARRY system’). Stem cells were then collected from mice, transduced with the LARRY lentiviruses and transplanted, and different cellular compartments were collected 5–10 months later for profiling by scTAM-seq (see the section ‘Experimental procedures (mouse study)’). Additional experiments were performed on biological material from non-treated mice of different ages (experiments M.3 and M.6–M.8; see the section ‘Experimental procedures (mouse study)’). For the human study, primary bone marrow samples were analysed (see the section ‘Experimental procedures (human study)’).
All biological material was analysed by scTAM-seq5 (see the section ‘Single-cell DNA methylation profiling with scTAM-seq’). scTAM-seq is a targeted method for DNA methylation analysis based on the Mission Bio Tapestri platform. Specifically, up to 1,000 amplicons 200–400 base pairs in length are amplified from the genomes of single cells. Before this amplification step, scTAM-seq includes a digestion step with a methylation-sensitive restriction enzyme, HhaI. Therefore, CpG dinucleotides in HhaI sites are only effectively amplified if methylated. The selection of the target amplicons comprising individual CpGs is a crucial step in this protocol (see the sections ‘Mouse panel design for scTAM-seq’ and ‘Human panel design for scTAM-seq’). Relevant genetic information (LARRY barcodes or CH mutations) can be read out from gDNA by scTAM-seq in the same cells, specifically by covering these regions with amplicons not containing HhaI cut sites. Surface-antigen expression was read out through the inclusion of oligonucleotide-barcoded antibodies in the protocol. We also included dedicated experiments demonstrating the combination of scTAM-seq with RNA-seq from the same single cell (experiment X.1, see the section ‘Combined profiling of DNA methylation and RNA in the same cell’) or mitochondrial genome sequencing (experiment X.2, see the section ‘Combined profiling of DNA methylation and mitochondrial variants’).
Key steps in the data analyses (see the section ‘Bioinformatic analysis (mouse)’) were to define cell states through data integration and subsequently to identify clones using the EPI-Clone algorithm. This algorithm first identifies CpGs with no surface-antigen association as potentially clone-associated or ‘static’ CpGs, and subsequently performs clustering and dimensionality reduction exclusively on these CpGs (see the section ‘EPI-Clone’). For the analyses of the human data, the same overall strategy was used. Additional steps and adjustments included mutation calling and definition of a consensus set of static CpGs across donors (see the section ‘Bioinformatic analysis (human)’).
A detailed protocol for performing scTAM-seq for clonal tracing with EPI-Clone is available from protocols.io51.
Lentiviral barcoding using the LARRY system
Construction of lentiviral pLARRY vectors
Barcode libraries were constructed according to a previously established protocol (https://www.protocols.io/view/barcode-plasmid-library-cloning-4hggt3w). First, the T-Sapphire or eGFP coding sequences and the EF1a promoter sequence were PCR-amplified from pEB1-T-Sapphire and pLARRY-eGFP with primers homologous to the vector insertion site in a custom lentiviral plasmid backbone (Vectorbuilder) using Gibson assembly (Gibson assembly master mix, NEB, E2611L). After magnetic bead purification, ligated vectors were transformed into NEB10-beta electroporation ultracompetent Escherichia coli cells (NEB 10-beta electrocompetent E. coli, NEB, C3020K) and grown overnight on LB plates supplemented with 50 μg ml–1 carbenicillin (carbenicillin disodium salt, Thermo Scientific Chemicals, 11568616). Colonies were scraped using LB medium and pelleted by centrifugation. Plasmid maxipreps were performed using an Endotoxin-Free Plasmid Maxi kit (Macheray Nagel), following the manufacturer’s protocol. pEB1-T-Sapphire was a gift from P. Cluzel (Addgene plasmid 103977). pLARRY-eGFP was a gift from F. Camargo (Addgene plasmid 140025).
Barcode lentivirus library generation and diversity estimation
To barcode pLARRY plasmids and generate a library, a spacer sequence flanked by EcoRV restriction sites was cloned into the plasmid after the WPRE element of the vector. Custom PAGE-purified single-strand oligonucleotides with a pattern of 20 random-bases (GTTCCANNNNTGNNNNCANNNNGTNNNNAGNNNN) and surrounded by 25 nucleotides homologous to the vector insertion site were synthesized by IDT DNA Technologies. The assembly of these components and subsequent purification steps were carried out through Gibson assembly (Gibson assembly master mix, NEB, E2611L). Six electroporations of the bead-purified ligations were performed into NEB10-beta E. coli cells (NEB 10-beta electrocompetent E. coli, New England Biolabs, C3020K) using a Gene Pulser electroporator (Bio-Rad). After transformation, the cells were incubated at 37 °C for 1 h at 220 r.p.m. After incubation, the transformed cells were plated in six large LB–ampicillin agar plates overnight at 30 °C. Colonies from all six plates were collected by scraping with LB–ampicillin and then grown for an additional 2 h at 225 r.p.m. and 30 °C. Cultures were pelleted by centrifugation, and plasmids were isolated using an Endotoxin-Free Plasmid Maxi kit (Macheray-Nagel), following the manufacturer’s protocol.
For estimating diversity, LARRY barcode amplicon libraries were prepared by PCR amplification of the lentiviral library maxiprep using flanking oligonucleotides carrying TruSeq read1 and read2 adaptors using 10 ng of the library (Supplementary Table 4). We used the minimal number of cycles that we could detect by quantitative PCR to avoid PCR amplification bias (10–12 cycles). After bead purification, 10 ng of the first PCR product was used as a template for a second PCR to add Illumina P5 and P7 adaptors and indexes (Supplementary Table 4). Two independent PCRs were sequenced on an Illumina NovaSeq 6000 S4 platform (Novogene) to confirm diversity after correction of errors through collapsing with a Hamming distance of 4. After collapsing, libraries were confirmed to contain at least 50 million different barcodes, with enough diversity for uniquely labelling up to 100,000 HSCs with a minimal false-positive rate.
Lentivirus production and barcode labelling
Lentivirus production and HSPC transduction were performed as previously described8.
Experimental procedures (mouse study)
Mice and animal guidelines
All procedures involving animals adhered to the pertinent regulations and guidelines. Approval and oversight for all protocols and strains of mice were granted by the Institutional Review Board and the Institutional Animal Care and Use Committee at Parque Científico de Barcelona under protocols CEEA-PCB-22-001-ARF-P1 and CEEA-PCB-22-002-ARF-P2. The study followed all relevant ethical regulations. CD45.1 (CD45.1, B6.SJL-Ptprca Pep3b/BoyJ, 002014, The Jackson Laboratory) mice were used as transplantation recipients for CD45.2 (BL6/J) donor cells. Mice were kept under specific-pathogen-free conditions for all experiments. We used 12–100-week-old male and female mice for our experiments. Neither randomization nor blinding was used. Experiments were performed with one or two biological replicates of mice, and no statistical methods were used for sample size choice. To minimize distress, euthanasia was performed by administering isoflurane inhalation, followed by cervical dislocation to ensure the animals were fully deceased.
LARRY lentiviral barcoding and transplantation
Following euthanasia, bone marrow was collected from the femur, tibia, pelvis and sternum through mechanical crushing, ensuring the retrieval of most of the cells. The collected bone marrow cells were then sieved through a 40 μm strainer and cleansed with a cold ‘Easy Sep’ buffer containing PBS, 2% FBS, 1 mM EDTA and penicillin–streptomycin followed by lysis of red blood cells using RBC lysis buffer (BioLegend, 420302). At first, mature lineage cells were selectively depleted using a Lineage Cell Depletion kit, mouse (Miltenyi Biotec, 130-110-470), and the resulting LIN– (lineage-negative) fraction was then enriched for KIT expression using CD117 MicroBeads (Miltenyi Biotec, 130-091-224). These KIT-enriched cells were washed, blocked with FcX and stained with the following fluorescently labelled antibodies: APC anti-mouse CD117, clone ACK2 (BioLegend, 105812); PE/Cy7 anti-mouse Ly6a (SCA1) (BioLegend, 108114); Pacific Blue anti-mouse Lineage cocktail (BioLegend, 133310); PE anti-mouse CD201 (EPCR) (BioLegend, 141504); PE/Cy5 anti-mouse CD150 (SLAM) (BioLegend, 115912); and APC/Cy7 anti-mouse CD48 (BioLegend, 103432). For transplants, EPCR+LIN–SCA1+KIT+CD48–CD150+ HSCs were sorted by FACS with a BD FACSAria Fusion with a 70 µm nozzle.
In vitro cultures of HSCs were done under self-renewing F12-PVA-based conditions as previously described52. To culture HSCs, 96-well flat-bottom plates from Thermo Scientific were coated with a layer of 100 ng ml–1 fibronectin (bovine fibronectin protein, 1030-FN) for 30 min at room temperature. After the sorting process, HSCs were transferred into 200 µl complete HSC medium supplemented with 100 ng ml–1 recombinant mouse TPO (PeproTech Recombinant Murine TPO, 315-14) and 10 ng ml–1 recombinant mouse SCF (PeproTech Recombinant Murine SCF, 250-03) and grown at 37 °C with 5% CO2. During lentiviral library transduction, the first medium change took place 24 h after transduction. Three days after labelling, the cultured HSCs were collected and subsequently transplanted into conditioned CD45.1 mice. The CD45.1 recipient mouse was preconditioned with a lethal X-ray radiation dose, administered as two separate sessions amounting to 5 Gy each, with a 4-h interval between them. To assess the engraftment of donor cells, the percentage of CD45.2+ peripheral blood leukocytes (and the percentage of fluorescent-protein-labelled cells) was determined. All mice demonstrated stable long-term engraftment until the experimental end point. Engraftment analysis, along with the measurement of labelling frequency, was carried out using BD FACS Fusion.
Collection of cells for single-cell characterization
In all single-cell experiments, unless described otherwise in the subsequent sections, transplanted or untreated mice were euthanized at specified ages and time points after transplant, and a KIT-enriched cell fraction was isolated from the femur, tibia, pelvis and sternum, per the protocol described above. This KIT-enriched cell population was stained with FcX block to prevent nonspecific binding and subsequently stained again with the following panel of fluorescently labelled antibodies: APC anti-mouse CD117 (clone ACK2, BioLegend, 105812); PE/Cy7 anti-mouse Ly6a (SCA1) (BioLegend, 108114); and Pacific Blue anti-mouse Lineage cocktail (BioLegend, 133310). In all mouse experiments, cells were also labelled with a custom TotalSeq-B antibody cocktail (Supplementary Table 2). After staining, distinct cellular compartments were sorted as illustrated in Supplementary Fig. 1 and profiled by scTAM-seq (see below).
LARRY experiments
For validating EPI-Clone using a ground-truth genetic lineage-tracing experiment, we performed two experiments: the main LARRY experiment (M.1) and the LARRY replicate experiment (M.2) (Figs. 1 and 2 and Extended Data Figs. 2 and 3). For M.1, two donor mice were killed, and HSCs were labelled with LARRY constructs containing a GFP label in one case and LARRY constructs containing a Sapphire label in the other case. Subsequently, labelled cells from each donor were transplanted into two recipient mice each. Accordingly, the dataset contains cells from four mice that contain two sets of clones, labelled with GFP and Sapphire, respectively. GFP and Sapphire clones did not mix on EPI-Clone UMAPs (Extended Data Fig. 3f), which further demonstrates that clones identified using EPI-Clone are individual-specific. We profiled all four recipient mice after allowing full blood reconstitution over 5 months. We also repeated this experiment again for validating the computational method (experiment M.2) using only one donor mouse. For both experiment M.1 and experiment M.2, we collected LSK and LK cells from the bone marrow and mixed them at 60,000 (LK) plus 50,000 (LSK) before analysing the cells using the Tapestri platform (Supplementary Table 1).
Native haematopoiesis
In this experiment (M.3; Fig. 1), we killed a 12-week-old wild-type BL6/J (CD45.2) mouse, extracted 120,000 LK cells and subjected them to scTAM-seq (Supplementary Table 1 and Supplementary Fig. 1).
Mature myeloid cell experiment
For profiling tissue-resident myeloid cells (experiment M.4; Extended Data Fig. 4), a single LARRY-transplanted mouse was anaesthetized 10 months after transplantation and perfused. Subsequently, lungs were extracted from the chest cavity, and a single-cell suspension was prepared using a protease and DNAse solution from a Lung Dissociation kit (Miltenyi Biotech, 130–095-927) followed by mechanical dissociation using gentleMACS ‘C’ columns (Miltenyi Biotech, 130–093-237) according to the manufacturer’s instructions. The dissociated cells were filtered using a 70 μm strainer and centrifuged at 400g for 5 min at room temperature. The supernatant was removed by aspiration and red blood cell lysis was performed using RBC lysis buffer (BioLegend, 420302). Cells were then washed with FACS buffer and pelleted at 400g for 5 min at 4 °C. The supernatant was removed, and the pellet was resuspended in FACS buffer before being passed through a 40 μm strainer and stained for the mature myeloid cell marker. Cells were stained with the following fluorescently labelled antibodies: PerCP/Cyanine5.5 anti-mouse/human CD11b (BioLegend, 101227; clone M1/70) and PE/Cyanine7 anti-mouse CD45.2 (BioLegend, 109829; clone 104). Cells were also labelled with TotalSeq-B antibody cocktail. We then sorted CD45.2+CD11b+LARRY(GFP)+ immune cells from lung. In parallel, we also sorted and stained LSK and LK cells and mature CD11b+ populations from both bone marrow and peripheral blood, followed by single-cell profiling.
Mature immune cell experiment
For this experiment (M.5; Fig. 2 and Supplementary Fig. 4), a single LARRY-transplanted mouse was euthanized 5 months after transplantation, and cells from the spleen and bone marrow were collected as described above. After red blood cell lysis, equal amounts of cells from both organs were pooled, washed and blocked with FcX. The cells were then stained with the following fluorescently labelled antibodies: Pacific Blue anti-mouse FcεRIα (BioLegend, 134313; clone MAR-1); PE/Cyanine5 anti-mouse CD19 (BioLegend, 115509; clone 6D5); Brilliant Violet 605 anti-mouse CD11c (BioLegend, 117333; clone N418); PerCP/Cyanine5.5 anti-mouse/human CD11b (BioLegend, 101227; clone M1/70); APC/Cyanine7 anti-mouse Ly-6G (BioLegend, 127623; clone 1A8); APC anti-mouse CD3 (BioLegend, 100235; clone 17A2); and PE/Cyanine7 anti-mouse CD115 (CSF-1R) (BioLegend, 135523; clone AFS98). We then sorted the following populations from LARRY+ live cells based on their surface markers: T cells (CD3+CD19–), B cells (CD3–CD19+), neutrophils (CD11b+CD3–CD19–Ly6G+), monocytes (CD11b+CD3–CD19–Ly6G–CD115+) and eosinophils and basophils (CD11b+CD3–CD19–FcεR1a+).
Lung EC experiment
For profiling ECs from 100-week-old mice (experiment M.6; Extended Data Fig. 5), dissociated lung cells were collected as described above. The resultant cell population was then enriched for CD31 expression using CD31 MicroBeads (mouse, 130-097-418, Miltenyi Biotec) per the manufacturer’s guidelines. These CD31-enriched cells were then washed, blocked with FcX and stained with the following fluorescently labelled antibodies: PE anti-mouse CD31 (BioLegend, 102507; clone MEC13.3) and PE/Cyanine7 anti-mouse CD45.2 (BioLegend, 109829; clone 104). Following staining, CD31+ and CD45.2– cells were sorted as illustrated in Extended Data Fig. 5a and Supplementary Table 1.
Native haematopoiesis experiments in old and young mice
For this experiment (M.7; Fig. 3 and Supplementary Figs. 5–7), the KIT-enriched cell fraction was stained and subsequently sorted to collect LSK and LK populations as described above. Samples were collected from two young (12-week-old) BL6/J (CD45.2) mice and two aged (100-week-old) BL6/J (CD45.2) mice.
LARRY transplant experiments in old mice
For this experiment (M.8; Fig. 3 and Extended Data Fig. 6), half of the HSC population from a 100-week-old mouse that was profiled as part of experiment M.7 were labelled with LARRY lentivirus and transplanted into lethally irradiated mice. Six months after transplant, mice were euthanized, and a KIT-enriched cell fraction was isolated from the femur, tibia, pelvis and sternum, following the protocol outlined above. This KIT-enriched cell population was stained with FcX block to prevent nonspecific binding and subsequently stained again with the following panel of fluorescently labelled antibodies: APC anti-mouse CD117 (clone ACK2, BioLegend, 105812); PE/Cy7 anti-mouse Ly6a (SCA1) (BioLegend, 108114); and Pacific Blue anti-mouse Lineage cocktail (BioLegend, 133310). After staining, LK and LSK cells were sorted as described above.
Experimental procedures (human study)
Human samples and their previous characterization by genomic assays
Bone marrow samples were obtained from different sources. Samples A.1, A.6 and A.7 were bone marrow aspirates from healthy volunteers collected at the Heidelberg University Hospital after informed written consent. This study was approved by the Ethics Committee of the Medical Faculty of Heidelberg University (S-480/2011). Sample A.4 was a TBM sample obtained through the Banc de Sang i Teixits (Barcelona, Spain) and approved by the Ethics Committee of the Hospital Clinic de Barcelona (HCB/2023/0367). Samples B.1–B.5 and X.1 were commercially available samples of purified CD34+ cells from organ donors (Ossium Health). No genomic characterization was performed on these samples before this study. Samples A.2, A.3 and A.5 were collected after informed written consent from individuals undergoing elective total hip replacement surgery at the Nuffield Orthopaedic Centre under the ‘Mechanisms of Age-Related Clonal Haematopoiesis’ (MARCH) study. This study was approved by the Yorkshire and The Humber–Bradford Leeds Research Ethics Committee (NHS REC ref: 17/YH/0382). These samples were screened for somatic mutations with a variant allele frequency of ≥0.01 by targeted DNA sequencing of a panel covering 97 genes (347 kb) recurrently mutated in myeloid malignancies and CH, as previously described43. Samples with somatic mutations in DNMT3A and PPM1D were selected for analyses. Finally, sample X.2. was a peripheral blood sample collected and characterized by mt-scATAC–seq as previously described46. Informed consent was given and approved for genomics profiling by the Stanford Institutional Review Board (number 14734).
All experiments involving human samples were approved by the corresponding ethics committees and were in accordance with the Declaration of Helsinki.
Bone marrow samples were thawed and stained using CD34 and CD3 sorting antibodies (BioLegend, 343517) and a pool of oligonucleotide-conjugated antibodies from the TotalSeq-D Heme Oncology Cocktail from BioLegend (MB53-0053) as well as additional TotalSeq-D antibodies from BioLegend (Supplementary Table 6). Samples were then sorted for CD34+ and CD34– populations and subjected to scTAM-seq (see below). For details on sorting, see Supplementary Table 1.
Multiplexing
Samples B.1 and B.5, B.2 and B.4, and A.2 and A.5 were in pairs, multiplexed into single Tapestri lanes. Demultiplexing was performed on the basis of germline single-nucleotide polymorphisms on autosomes with vireo53. Chromosome Y and pre-characterized somatic single-nucleotide variants (SNVs) were used as controls (Supplementary Fig. 10 and see the section on ‘Bioinformatic analysis (human)’).
Single-cell DNA methylation profiling with scTAM-seq
Single-cell DNA methylation profiling
For profiling DNA methylation at single-cell resolution, we used scTAM-seq5, which leverages the Mission Bio Tapestri technology to investigate up to 1,000 CpGs in 1,000s of cells per experiment. In brief, we loaded 120,000–140,000 cells into the Tapestri machine and followed the default Mission Bio DNA+Protein protocol for V2/V3 chemistry for the experiments (v.2: https://missionbio.com/wp-content/uploads/2021/02/Tapestri-Single-Cell-DNA-Protein-Sequencing-V2-User-Guide-PN_3360A.pdf; v.3: https://missionbio.com/wp-content/uploads/2023/08/Tapestri-Single-Cell-DNA-Protein-Sequencing-v3-User-Guide_MB05-0018.pdf; see also Supplementary Table 1), but with the following modifications: (1) we added a DNA methylation-sensitive restriction enzyme (HhaI) to digest non-methylated targets (CpGs) before amplification; and (2) in the case of the mouse experiments, we used TotalSeq-B antibodies and different primers for the amplification of antibody oligonucleotide tags. The default Mission Bio protocol uses a different type of oligonucleotide tag, TotalSeq-D, which we used here for the experiments using human samples, but which are currently not available for mouse antigens.
For the mouse samples stained with TotalSeq-B, we added 5 μl of highly concentrated HhaI (150,000 U ml–1, NEB) enzyme and 2 µl of 30 µM of a custom antibody tag primer specific for the amplification of the oligonucleotide tags of TotalSeq-B antibodies (ACTCGCAGTAGTCTTGCTAGGACCGGCCTTAAAG) to the Tapestri barcoding mix reagent. An incubation at 37 °C for 30 min was added to the start of the targeted PCR thermal cycling program to allow for the restriction enzyme digest to take place before the PCR amplification step. The use of TotalSeq-B antibodies primarily affected the ‘Protein Library Cleanup I’ section of the protocol, for which we replaced the 2× binding and washing (B&W) buffer from the kit with the following buffer prepared with nuclease-free water: Tris-HCl (final concentration 10 mM, pH 7.5), EDTA (final concentration 1 mM) and NaCl (final concentration 2 M). We used 2 µl of 5 µM of our custom biotin oligonucleotide (/5Biosg/GTGACTGGAGTTCAGACGTGTG/3C6/) to isolate the antibody tags. Moreover, during the isolation of antibody tags, we performed the second wash of streptavidin beads with 1 ml nuclease-free water instead of 1× B&W buffer. Finally, each tube of streptavidin beads was resuspended in 45 µl of nuclease-free water then transferred and combined into a new tube for a total of 90 µl. To amplify the final protein target library, we used 5 µl of 4 µM of each custom indexed primers (forward: CAAGCAGAAGACGGCATACGAGAT[i7 index]GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT; reverse: AATGATACGGCGACCACCGAGATCTACAC[i5 index]TCGTCGGCAGCGTC). Typically, we performed twice as many reactions to amplify the DNA target library, but this may be increased to achieve sufficient yield. Last, we adjusted the AMPure XP reagent-to-sample ratio in the second size-selection step in the ‘DNA Library Cleanup II’ section from 0.72× to 0.65×.
For the human samples stained with TotalSeq-D, we followed the scTAM-seq protocol as previously described5.
Using the stained cells that we used as input to scTAM-seq, we also performed 10x Genomics Chromium Single Cell 3′ for transcriptomic profiling of the cells, following the standard protocol. This step was exclusively performed for experiment M.1 (Supplementary Table 1). For the transcriptomic data, LARRY barcodes were later amplified using a modified version of the protocol8 (see Supplementary Table 4 for an updated list of primers).
Mouse panel design for scTAM-seq
We aimed to design a panel with CpGs dynamically methylated in HSCs, as well as in more committed progenitors (MPPs). We collected bulk whole-genome bisulfite sequencing data from a previous publication22 profiling DNA methylation in three replicates of HSCs (LSK and CD135−CD48−CD150+CD34−), MPP1 (LSK and CD135−CD48−CD150+CD34+), MPP2 (LSK and CD135−CD48+CD150+CD34+), and a mixture of MPP3 (LSK and CD135−CD150−CD48+CD34+) and MPP4 (LSK and CD135+CD150−CD48+CD34+). Using these data, we selected CpGs that were variably methylated in HSPCs (Extended Data Fig. 1b,c) using three criteria: (1) CpGs differentially methylated between the HSCs and the different MPP populations (DMCs); (2) CpGs intermediately methylated within HSCs (IMCs); and (3) CpGs harbouring within-sample heterogeneity in HSCs (WSHs). The code for selecting CpGs is available from GitHub (https://github.com/veltenlab/EPI-CloneSelection).
For DMCs, we used RnBeads54 to determine CpGs that were specifically methylated in one of the HSPCs (that is, in HSC, MPP1, MPP2 or MPP3/MPP4) but not methylated in all the remaining HSPC populations. We only focused on CpGs that were covered by at least 10 sequencing reads in all samples and that had a methylation difference of at least 0.2 between the target cell type and the average of the remaining cell types.
IMCs had to be non-overlapping with DMCs and were then defined by a DNA methylation level in the bulk samples (HSCs) between 0.25 and 0.75. Such CpGs may be differentially methylated between two sub-cell types of HSCs. IMCs were required to have a low proportion of discordant reads (PDRs)55 together with a high quantitative fraction of discordant read pairs (qFDRPs)56. PDR and qFDRP are measures of WSH in bulk bisulfite sequencing data and quantify the concordance of methylation states on the same sequencing read (PDR) or of multiple CpGs across different sequencing reads (qFDRP).
CpGs with high WSH were non-overlapping with DMCs and IMCs. The CpGs were then identified on the basis of the high levels of both PDR and qFDRP. These CpGs were therefore located in regions showing variable methylation profiles in bulk sequencing data and might represent regions with stochastic methylation in HSCs.
After identifying all CpGs that fulfilled the above criteria, we excluded any CpGs that were not in the context of a HhaI cut site and enriched the selected CpGs for those located in the vicinity (100 bp) of at least one TFBS of an important haematopoietic transcription factor (Supplementary Table 5). We then selected 105 CpGs specifically methylated in HSCs, 70 in MPP1, 70 in MPP2, 75 in MPP3/MPP4, 210 IMCs and 80 WSH (Extended Data Fig. 1b,c). We also included the following control amplicons: 20 constitutively methylated, 20 constitutively unmethylated and 50 amplicons without a HhaI cut site. Control amplicons were required to identify cells from the data because the remaining amplicons were digested depending on their methylation state. We uploaded this list to the Mission Bio Designer tool (https://designer.missionbio.com/) to receive a final list of 663 amplicons and corresponding primer sequences (Supplementary Table 5). The CpGs were further annotated according to their location in the genome with respect to chromatin states as previously defined57. From the 573 non-control amplicons, a subset of 453 amplicons with low dropout rate in an experiment without HhaI digest was used for analysis.
For amplifying the LARRY barcodes, we spiked in an additional primer into the primer pool targeting the LARRY barcode sequence (forward: GCATCGGTTGCTAGGAGAGA; reverse: GGGAGTGAATTAGCCCTTCCA). We could therefore read out the LARRY barcode together with information about the DNA methylation state from the same single cell.
Human panel design for scTAM-seq
The design for the human panel closely followed the strategy applied for the mouse panel. Two previously published datasets21,58 were used to similarly profile DMCs, IMCs and, additionally, CpGs with interindividual heterogeneity (IIH). Sites were selected to not include single-nucleotide polymorphisms according to dbSNP v.151 and to be located in the HhaI cut sequence.
For DMCs, we considered peripheral blood and bone marrow samples from a previous study21. Samples with an average coverage across all CpGs below 1 were removed. DMCs between HSCs, MPPs, multilymphoid progenitors (MLPs; combining MLP0, MLP1, MLP2 and MLP3), common lymphoid progenitors (CLPs), common myeloid progenitors (CMPs) and GMPs were computed using RnBeads. CpGs with a mean methylation difference higher than 0.1 between the cell types were identified as DMCs.
We performed IMC detection on HSC-enriched lineage-negative (LIN−CD34+CD38−) samples from eight male donors using a previously published dataset58. To deal with data sparsity, we set the maximum quantile of missing values per site to 0.005 and removed any sites that exceeded this threshold. IMCs were defined as CpGs with a DNA methylation level between 0.25 and 0.75 in at least 5 samples. When checking for a HhaI cut site, we allowed for a maximum of 25 CpG sites in the extended region around the IMC.
For CpGs with IIH, we used the same dataset to identify CpGs with a variance higher than 0.1 across all individuals of the dataset from ref. 58.
We also created genotyping amplicons that cover mutations in ASXL1, DNMT3A, TET2, TP53, JAK2, IDH2, PPM1D, SF3B1, IDH1 and SRSF2. We used 62 amplicons covering these genes from the Tapestri single-cell DNA myeloid panel by Mission Bio (https://missionbio.com/products/panels/myeloid/) as a base panel, excluding amplicons with the HhaI restriction sequence GCGC. We designed further amplicons for exons in the aforementioned genes that had a coverage of less than 60% in the default myeloid panel. To prevent these amplicons from having a recognition site, we performed a virtual digestion of the exonic sequences using the HhaI cut sequence. We then uploaded a list containing the fragmented genomic regions to the Mission Bio Designer tool, which resulted in 82 additional amplicons. We also included 20 amplicons targeting chromosome Y and 50 control amplicons without a HhaI cut-sequence. We uploaded the CpG targets and readily designed genotyping, chromosome Y, and control amplicons using the Mission Bio Designer tool. The final list comprises 665 amplicons and corresponding primer sequences. The resulting 448 CpG targeting amplicons are divided into 215 DMC, 145 IMC and 88 IIH amplicons (Supplementary Table 6).
Sequencing
Libraries were sequenced on an Illumina NovaSeq 6000 with 2 × 100 bp (scTAM-seq mouse), 2 × 150 bp (scTAM-seq human), 2 × 50 bp (scRNA-seq) and 2 × 50 bp (protein libraries) reads. For an overview of the sequencing statistics, see Supplementary Table 7.
Combined profiling of DNA methylation and RNA in the same cell
To jointly profile DNA methylation and RNA in the same cell (scTAMARA-seq, experiment X.1), we took advantage of the recently published SDR-seq method45 and combined it with scTAM-seq. In total, we profiled 120 RNA and 367 gDNA (200 DNA methylation and 167 genotyping) targets (Supplementary Table 6). DNA methylation targets were a subset of the original set, excluding amplicons that were not identified as consensus static or dynamic CpGs in the total bone marrow original cohort and low-performing amplicons. RNA targets were selected from a RNA-seq reference map59 using LASSO regression to identify 120 RNAs most predictive of cell state in the CD34+ compartment. We followed the SDR-seq protocol45 using the glyoxal fixation condition. Once the cells had been fixed, permeabilized and reverse-transcribed, they were loaded onto the Mission Bio Tapestri platform and processed as for scTAM-seq. The final RNA and DNA sequencing libraries were individually generated as previously described45.
Combined profiling of DNA methylation and mitochondrial variants
For this experiment (scTAMito-seq, experiment X.2), we performed scTAM-seq using the same 367 DNA methylation and genotyping amplicons as for scTAMARA-seq. We spiked in a pre-designed mitochondrial panel (https://designer.missionbio.com/catalogpanels/Virtual-mtDNA) at a ratio of 1:20 as previously described60.
Bioinformatic analysis (mouse)
Data processing
For processing of raw data, we used a modified pipeline that was based on the originally described pipeline for scTAM-seq5 (https://github.com/veltenlab/scTAM-seq-scripts), which is available from GitHub (https://github.com/veltenlab/EPICloneProcessing). In brief, cellular barcodes were extracted from the raw sequencing files before alignment to the reference genome subset to the CpG panel. Reads mapping to each of the amplicons were quantified to generate a count matrix, and DNA methylation states were determined using a cut-off value of one sequencing read as in the original scTAM-seq publication5. We used those cellular barcodes that had more than 10 sequencing reads in at least 70% of the control (non-HhaI) amplicons. Doublets were removed using the DoubletDetection tool (https://zenodo.org/record/2678042).
At the single-cell level, we differentiated methylated from unmethylated CpGs through the presence of at least one sequencing read for the corresponding amplicon as in the original publication of scTAM-seq5. Sequencing reads can uniquely originate from amplicons with methylated CpGs, whereas the lack of sequencing reads from an amplicon originates either from an unmethylated CpG or from a dropout. To minimize the effect of dropout, we determined the primer combinations that reliably amplified in our panel using a single experiment without the restriction enzyme. For this experiment in mice, LIN−KIT+ cells from a young, wild-type mice (12 weeks) were used and we determined that 453 out of the 573 non-control amplicons (79%) amplified in more than 90% of the cells. These amplicons were used for subsequent analyses.
For the surface-protein data, the Mission Bio pipeline was used to extract sequencing reads for a particular cell-barcode–antibody-barcode combination. We restricted analyses of the protein data to those cellular barcodes identified in the DNA methylation library.
Processing of LARRY barcodes
LARRY barcodes could be directly identified from the scTAM-seq sequencing library because an additional primer pair capturing the LARRY barcode was included (see the section ‘Mouse panel design for scTAM-seq’). Sequencing reads mapping to the amplicon with the LARRY barcode were extracted from the raw sequencing reads using the fluorophore sequence GCTAGGAGAGACCATATGGGATCCGAT. The LARRY barcode was determined using the base pairs following the GFP sequence, given that the sequence matches the rules by which the LARRY barcode was constructed (see the section ‘Barcode lentivirus library generation and diversity estimation’). Barcode extraction was performed using a modified version of the scripts provided in the original LARRY publication8 (https://github.com/AllonKleinLab/LARRY). Barcodes supported by fewer than five sequencing reads were discarded, and LARRY barcodes with a Hamming distance lower than three were merged for each of the experimental batches individually.
Notably, each cell can have more than one unique LARRY barcode owing to multiple lentiviral infections. In these cases, groups of LARRY barcodes were jointly passed on to the progeny. To call clones in this setting, we computed for any pair of LARRY barcodes the extent to which these two barcodes were observed in an overlapping set of cells (formally a Jaccard index). LARRY barcodes were then clustered according to this distance metric. We used a permutation test to determine LARRY barcodes that are merged together to a clone. When LARRY barcodes were merged, cells were assigned to the merged clone if any constituent LARRY barcode was observed.
Data integration and annotation of cell states
We constructed Seurat61 objects for each of the scTAM-seq samples individually using the binary DNA methylation matrix. To integrate all the samples from experiments M.1–M.3, we used Seurat’s IntegrateData62 function. Then we used Seurat’s standard workflow without normalization to obtain a low-dimensional representation of our data using UMAP. We removed cells in low-density parts of the UMAP because we found that these cells were of lower quality using the non-digested control amplicons. To annotate the cell-type clusters we obtained as result of the Seurat workflow, we inspected the expression of surface proteins, the DNA methylation states of CpGs in bulk data and the DNA methylation states of important lineage-specific transcription factors. To compare single-cell to bulk DNA methylation we computed the relative methylation state by dividing the average methylation state of all CpGs in the given group of bulk data (for example, HSC-specific) by the mean methylation state of all CpGs. To that end, we performed differential analysis for each cell-type cluster individually and selected CpGs with a log fold change larger than 1 for each cluster. For those sites, we investigated whether they are in the vicinity (100 bp) of any of the 39 transcription factors in Supplementary Table 5 and computed enrichment P values with the Fisher exact test. A full vignette is available from GitHub (https://github.com/veltenlab/EPI-clone).
All remaining experiments (M.4–M.8) were analysed without batch correction, as the samples were processed as single batches. Annotation of the cell-type clusters was performed in dynamic CpG space (see below) using bulk methylation values, demethylation of TFBSs and surface-protein expression. In addition to this information, for experiments M.7 and M.8, we projected cell-type labels from the initial analysis (experiments M.1–M.3) using scmap63. EPI-Clone was then used with the standard parameters as described below.
For processing the protein data of scTAM-seq, we used the centred-log-ratio (CLR) normalization methods. To generate a low-dimensional representation of the protein data only, we opted to use SCTransform64, which produced an improved cell-state resolution.
For the scRNA-seq dataset, we used cellranger to generate transcriptomic and surface-protein count matrices, which were used as input to Seurat. Harmony65 was used for batch integration and the cell-type annotation was performed using known haematopoietic marker genes together with the expression of surface proteins.
EPI-Clone
The EPI-Clone algorithm is divided into three steps: (1) identification of static CpGs, (2) identification of cells from expanded clones and (3) clustering of cells from expanded clones. A detailed, step-by-step vignette is available from GitHub (https://github.com/veltenlab/EPI-clone; v.2.0 used in this article). A brief description is given below.
-
1)
To identify static CpGs, for each combination of CpG and surface protein, EPI-Clone performs a Kolmogorov–Smirnov test to investigate whether cells with methylated CpG differ in surface-antigen expression relative to cells with unmethylated CpG. CpGs with no significant antigen association (determined by the lowest P value for any of the surface proteins) according to a Bonferroni criterion were then selected if their average methylation across all cells was less than 90% but higher than 25% in mouse and higher than 5% in human. In the main LARRY experiment in Figs. 1 and 2, this resulted in the identification of 110 CpGs, which we annotated for enhancer–heterochromatic regions57 and early–late-replicating domains66.
-
2)
To identify cells from expanded clones, cells stemming from an expanded clone should be in a higher density region of the space defined by the static CpGs than cells stemming from non-expanded clones. A density estimate was therefore computed as follows. PCA was performed on all static CpGs from step (1). In the reduced dimensional space obtained by the first n = 100 principal components, the average Euclidean distance to the k = 5 nearest neighbours was determined. Effects of cell state, batch and sequencing depth on this measure of local density were then removed by linear regression. We observed that smoothing the resulting quantity locally over 20 nearest neighbours additionally improved performance. Optimal parameters n and k of this step, as well as the density threshold for a cell to be classified as stemming from an expanded clone, were identified through a systematic grid search on experiment M.1, using LARRY barcodes as a ground truth. Here clones >0.25% in size were defined as expanded.
-
3)
To cluster expanded clones, cells from expanded clones were clustered using the standard Seurat workflow, again in a space spanned by n = 100 principal components.
The parameters of steps (2) and (3) were established on the basis of the original LARRY experiment (M.1: LARRY main experiment) and used for all subsequent analyses of the mouse haematopoietic system (M.2–M.5, M.7 and M.8) without further adjustments. Static CpGs were defined in experiment M.1 and used for all remaining experiments. In particular, the performance on a replicate LARRY ground-truth experiment (M.2) is analysed in Extended Data Fig. 3.
In the native ageing experiment (M.7), we opted for a more conservative threshold for defining large, expanded clones (1%), as native haematopoiesis is more polyclonal than the transplantation setting. This threshold resembled what we found in human native haematopoiesis, using CH mutations and mitochondrial mutations as a partial ground truth (Fig. 4).
For when no or only a partial ground truth was available (mouse endothelia (M.6), see below for more details, and the human analysis), we opted instead for a parameter-free approach to identify expanded clones. We used a recently published clustering method, CHOIR44, which automatically determines clusters that have statistical support in the data. Unlike the density-based criterion, CHOIR does not have free parameters (for example, number of principal components, density threshold, number of nearest neighbours to consider). We confirmed that on the mouse LARRY experiment, CHOIR had a similar quantitative performance to the density-based criterion at optimal parameter values (Extended Data Fig. 3g).
EPI-Clone of endothelial data
As the first step of this experiment (M.6), all CpGs were used for dimensionality reduction and clustering. Consequently, we identified a cluster of contaminating non-endothelial (CD31–) cells that we removed. We then used the dynamic CpGs defined in experiment M.1 to construct a cell-state map of ECs. The CLR-transformed protein levels enabled us to annotate ECs as capillary, Car4 or lymphatic, in concordance with transcriptomic references. Finally, the 110 static CpGs defined in experiment M.1 were used to identify clones in these lung ECs. Binary data were used as input for CHOIR44 using false-discovery rate adjustment. Only clones with a relative clone size greater than 1% are highlighted in the figures. For comparison with transcriptomic data, the Mouse LungMAP31 was downloaded from CELLxGENE Datasets (Mus musculus + Lung + 10×3′ v.2 + Smart-seq2) and subset for adult samples. The lung EC atlas67 was also downloaded (https://endotheliomics.shinyapps.io/ec_atlas/).
EPI-Clone of transplantation
To understand whether EPI-Clone robustly identifies clones before and after transplantation, we investigated replicate 2 (old mouse) of the M.7 experiment together with the transplanted mouse (experiment M.8). Notably, the HSCs that were barcoded with LARRY and used for transplantation were obtained from replicate 2 (old mouse) of experiment M.7 (donor mouse). We then performed EPI-Clone on the combined Seurat object of the transplanted mouse with its donor using the static CpGs identified in experiment M.1 without further adjustments of the parameters. Moreover, and to estimate the false-positive rate of this approach, we performed the same analysis using replicate 1 (old mouse) of experiment M.7. This mouse had no relationship with the transplanted mouse and we would not expect clonal clusters to have cells from both samples to appear from a joint analysis.
Bioinformatic analysis (human)
Data processing, demultiplexing and mutation calling
Data processing followed the methods described for mice. For sample pairs that had been multiplexed into a single Tapestri lane (B.1 and B.5, B.2 and B.4, and A.2 and A.5), vireo53 was used for donor deconvolution based on germline SNVs. SNVs were called with cellsnp-lite using a minimum allele frequency of 0.05 and a count threshold of 5. Donor assignments were validated by detecting the presence of the Y chromosome in cases when male and female donors had been multiplexed, and/or the presence of previously known donor-specific CH mutations (Supplementary Fig. 9).
For samples with previously characterized CH mutations, the mutational status of each cell was determined using a custom script written in pysam. Any cell for which more than 5% of reads covering the relevant genomic site displayed the CH mutation were classified as mutant. Cells with a low number of reads covering the site were excluded, using a read threshold that was determined as a function of total site coverage.
Additional CH mutations were identified using SComatic68 based on the assumption that T cells and B cells are depleted from CH mutations. For that purpose, the BAM files of the TBM cohort were split by cell type. Base counts per cell type were calculated using BaseCellCounter with a minimum mapping quality of 30 and a maximum depth proportional to the number of cells in each group. Beta-binomial parameters were estimated across 35,000 genomic sites to model the distribution of reference and alternate alleles. Final mutation calling was performed using BaseCellCalling, considering all identified cell groups and estimated beta-binomial parameters. Mutations of interest were then identified by comparing T cells to myeloid cells. This strategy led to the identification of CH mutations in donors A.4 and A.7. Finally we repeated the same analysis for the TBM and CD34+ cohorts and comparing cells that EPI-Clone had annotated as expanded clones to cells annotated as stemming from non-expanded clones. This enabled us to identify the CH mutation in donor B.5, and the DNMT3A(C666Y) variant in donor A.4.
Data integration
Unlike in mouse data, data integration across all CpGs in the human dataset did not effectively remove interindividual differences (for example, large CH clones still clustered apart). However, a larger set of CITE-seq antibodies was included in the human cohort. We therefore identified surface-protein associated (‘dynamic’) CpGs across all cells in the TBM cohort and performed data integration using three approaches: CITE-seq data alone, dynamic CpG data alone or a combination of both modalities concatenated into a single feature matrix. All three strategies produced similar results (Extended Data Fig. 7c). Notably, the inclusion of both modalities provided more consistency across donors than dynamic CpGs alone, and was less susceptible to technical variation within the donors than CITE-seq data alone. Data integration was performed using scanorama69, as it offered a higher biological resolution of cell types or cell states compared with Seurat integration. In the TBM cohort, we identified a cluster of overstained cells (positive for all antibodies) that were removed before further analyses.
EPI-Clone applied to human samples
The same strategy was followed as for mouse; however, several adjustments were made as described below.
EPI-Clone analyses were performed while excluding mature T cells and B cells, unless denoted otherwise.
For identification of static CpGs, we proceeded as described above for each donor from cohort A (TBM) individually. We then defined consensus static CpGs as those CpGs that were identified as ‘static’ in at least five donors. Eventually, the same set of 94 consensus static CpGs was used for EPI-Clone analysis in all samples. The use of consensus static CpGs in some donors led to substantial improvements in the performance of EPI-Clone with respect to the ground-truth clonal labels (for example, CH mutations). Moreover, it eliminated the need for a static CpG identification step in future studies, as it established a reference set of static CpGs.
We used CHOIR44 with false-discovery rate adjustment for identifying expanded clones, see the section on EPI-Clone (above)
Analysis of scTAMARA-seq data
For this experiment (X.1), DNA methylation data were projected to the CD34+ reference (Fig. 4b) using scmap63. The RNA-seq reads were processed as previously described45, and data were analysed using default Seurat routines. EPI-Clone was used on the DNA methylation data with identical settings to all other human samples.
Analysis of scTAMito-seq data
For this experiment (X.2), cell types were identified by clustering on all surface antigens. EPI-Clone was then applied using consensus static CpGs. Heteroplasmies of mitochondrial mutations that had previously been identified for that sample using mt-scATACseq46 were called in single cells using pysam by dividing the number of reads supporting the mutant allele by the total number of reads covering the site. Cells with fewer than ten reads on the site were excluded as potential dropout.
Data visualization
Plots were generated using the R packages ggplot2 (ref. 70) and ComplexHeatmap71. Boxplots are defined as follows: the middle line corresponds to the median; the lower and upper hinges correspond to first and third quartiles, respectively; the upper whisker extends from the hinge to the largest value no further than 1.5× the interquartile range (or the distance between the first and third quartiles) from the hinge and the lower whisker extends from the hinge to the smallest value at most 1.5× the interquartile range of the hinge. Data beyond the end of the whiskers are called ‘outlying’ points and are plotted individually. For computing lineage-specific output as shown in Fig. 3, we defined output as the fraction of all HSC/MPP1 or myeloid cells per EPI-Clone cluster compared with all HSC/MPP1 or myeloid cells per experiment. In the bubble plots of Fig. 3 and Supplementary Fig. 7, the radius of the circles scales with the square of frequency.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.