Genetic elements promote retention of extrachromosomal DNA in cancer cells

Cell culture

The GBM39 neurosphere cell line has been previously described⁶⁰: it is derived from a patient with glioblastoma undergoing surgery at the Mayo Clinic. The COLO320DM and K562 cell lines were purchased from the American Type Culture Collection (ATCC), and the GM12878 cell line was purchased from the Coriell Institute for Medical Research. The colorectal cancer cell line COLO320DM and the immortalized chronic myelogenous leukaemia cell line K562 were cultured in RPMI 1640 medium with GlutaMAX (Thermo Fisher Scientific, 61870127) supplemented with 10% FBS (Thermo Fisher Scientific, A3840002) and 1% penicillin–streptomycin (Thermo Fisher Scientific, 15140163). GBM39 cells were maintained in DMEM/F12 (Thermo Fisher Scientific, 11320082), B-27 supplement (Thermo Fisher Scientific, 17504044), 1% penicillin–streptomycin, human epidermal growth factor (EGF, 20 ng ml^–1; Peprotech, AF-100-15), human fibroblast growth factor (FGF, 20 ng ml^–1; Peprotech, AF-100-18B) and heparin (5 µg ml^–1; Sigma-Aldrich, H3149). The lymphoblastoid cell line GM12878 was grown in RPMI 1640 with GlutaMAX supplemented with 15% FBS and 1% penicillin–streptomycin. The COLO320DM live-cell imaging line was cultured in DMEM (Corning, 10-013-CV) supplemented with 10% FBS and 1% penicillin–streptomycin–glutamine (Thermo Fisher Scientific, 10378016). GBM39 neurospheres were previously authenticated by the Mischel Laboratory using metaphase DNA-FISH¹²; other cell lines obtained from the ATCC and Coriell were not authenticated. All cell lines tested negative for mycoplasma contamination.

Analysis of ecDNA hitchhiking in IF–DNA-FISH of anaphase cells

Analysis of ecDNA hitchhiking in IF–DNA-FISH of anaphase cells was performed on raw images used in a previous publication⁵. Mitotic cells were identified using Aurora kinase B, which marks daughter cell pairs undergoing mitosis, as previously described⁵,⁶. Colocalization analysis for ecDNAs with mitotic chromosomes in GBM39 cells (EGFR ecDNA), PC3 cells (ecMYC), SNU16 cells (FGFR2 ecDNA and ecMYC) and COLO320DM cells (ecMYC) described in Fig. 1 was performed using Fiji (v.2.1.0/1.53c)⁶¹. Images were split into the FISH colour + DAPI channels, and the signal threshold was manually set to remove background fluorescence. DAPI was used to mark mitotic chromosomes, and FISH signals overlapping with mitotic chromosomes were segmented using watershed segmentation. Colocalization was quantified using the ImageJ-Colocalization Threshold program, and individual and colocalized FISH signals in dividing daughter cells were counted using particle analysis.

Retain-seq

We cloned random genomic sequences into the pUC19 plasmid backbone for the Retain-seq experiments. pUC19 is a simple, small (about 2.7 kb) vector that lacks a mammalian origin of replication and contains few sequences that could be immunogenic or have mammalian promoter or enhancer activity. Therefore, we considered that pUC19 represents an inert and selectively neutral backbone. Consequently, changes in plasmid persistence can be more confidently ascribed to insert sequences as opposed to backbone components under selection. To generate a pool of random genomic sequences, we first fragmented the gDNA of GM12878 cells via transposition with Tn5 transposase, produced as previously described⁶², in a 50-µl reaction with TD buffer⁶³, 50 ng DNA and 1 µl transposase. The reaction was performed at 37 °C for 5 min, and transposed DNA was purified using a MinElute PCR Purification kit (Qiagen, 28006). GM12878 human B lymphoblastoid cells were selected as the genome of origin owing to their relatively low copy-number variability and the presence of an EBV genome as a positive control; the majority of inserts ranged from 600 to 1,300 bp. The resulting mixture of gDNA fragments was then amplified using 500 nM forward (p5_pUC19_SmaI_20bp) and reverse (p7_pUC19_SmaI_20bp) primers using NEBNext High-Fidelity 2× PCR master mix (NEB, M0541L) followed by gel purification of DNA fragments between 400 bp and 1.5 kb. To insert the mixture of gDNA fragments into a plasmid, the pUC19 vector (Invitrogen) was linearized with SmaI, purified using NucleoSpin Gel and PCR Clean-up (Macherey-Nagel, 740609.250) and the genomic fragments were inserted into the backbone using Gibson assembly (New England Biolabs, NEB). 
The DNA product was electroporated into Endura Competent Cells (Biosearch Technologies, 60242-2) using a MicroPulser electroporator (Bio-Rad; default bacteria setting) following the manufacturer’s protocol, and the resulting mixed episome library was prepared using a HiSpeed Plasmid Maxi Kit (Qiagen, 12663). The analysis of representation of DNA sequences in this mixed episome library and the retained episomes in transfected cells is described below.

COLO320DM and K562 cells were seeded into a 15 cm dish per biological replicate at a density of 1 × 10⁷ cells in 25 ml of medium. GBM39 cells were seeded into a T75 flask at a density of 5 × 10⁶ cells in 25 ml of medium. Each cell line was incubated overnight. COLO320DM, GBM39 and K562 cells were transfected with 15 µg of an input mixed episome library using Lipofectamine 3000 transfection reagent following the manufacturer’s directions. In brief, 1.5 × 10⁷ GM12878 cells were electroporated with 50 µg input mixed episome library using the Neon Transfection system (Thermo Fisher Scientific, MPK5000). The cells were counted, centrifuged at 300g for 5 min and washed twice with PBS before resuspension in Neon Resuspension buffer to a density of 4.2 × 10⁶ in 70 µl of buffer. The input mixed episome library was also diluted to a density of 14 µg in 70 µl with Neon Resuspension buffer. Next, 70 µl of cell suspension and 70 µl of library were mixed and electroporated according to the manufacturer’s instructions using a 100 µl Neon pipette tip under the following settings: 1,200 V, 20 ms, 3 pulses. Five electroporation reactions were pooled per replicate of GM12878 Retain-seq screens.
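The GM12878 electroporation quantities above can be reconciled with a short calculation. This is only a sketch: it assumes (the text does not state this) that each 100 µl Neon tip draws 100 µl of the 140 µl cell–library mixture, and the function name is hypothetical.

```python
def amount_in_tip(amount_in_half, half_volume_ul, tip_volume_ul):
    """Amount (cells or µg DNA) drawn into one tip.

    Assumes two equal-volume halves (cells and library) are mixed and that the
    tip takes up `tip_volume_ul` of the mixture. This uptake assumption is an
    illustration, not a detail given in the protocol text.
    """
    mixed_volume_ul = 2 * half_volume_ul
    return amount_in_half * tip_volume_ul / mixed_volume_ul


# Values from the text: 4.2e6 cells in 70 µl, 14 µg library in 70 µl, 100 µl tip.
cells_per_reaction = amount_in_tip(4.2e6, 70, 100)
dna_per_reaction_ug = amount_in_tip(14, 70, 100)

# Five reactions are pooled per replicate.
cells_per_replicate = 5 * cells_per_reaction
dna_per_replicate_ug = 5 * dna_per_reaction_ug
```

Under this assumption, the per-replicate totals match the 1.5 × 10⁷ cells and 50 µg of library stated at the start of the electroporation description.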

Cells were incubated for 2 days before the first subculture to allow recovery from transfection, and then subcultured every 3–4 days afterwards as dictated by the doubling time of each cell line. Once each cell line reached a count of 100–400 million cells per replicate, we collected all but 10 million cells, which were maintained in culture and passaged in the same manner until all subsequent time points had been collected (for a maximum of 3 time points per cell line). Thus, COLO320DM cells were collected at days 7, 14 and 21 after transfection, with a total cell count of approximately 4 × 10⁸ cells at each time point, per replicate. GBM39 cells were collected at days 10, 20 and 30, with total cell counts of approximately 1.5 × 10⁸ per replicate. K562 cells were collected at days 6, 12 and 18, with cell counts of approximately 4.5 × 10⁸ per replicate. GM12878 cells were collected at day 12, with a cell count of approximately 2 × 10⁸.

The output plasmid library was extracted using a HiSpeed Plasmid Maxi kit (Qiagen, 12663) and concentrated to a final volume of 50 µl by isopropanol precipitation. DNA was precipitated with a 1:10 volume of 3 M sodium acetate and 2 volumes of isopropanol, chilled at 4 °C for 10 min and centrifuged at 15,000g for 15 min at 4 °C. The pellet was washed with 500 µl ice-cold 70% ethanol and dissolved in 50 µl Buffer EB (Qiagen, 19086).

To enrich for input mixed episome library inserts, a preliminary PCR amplification (PCR1) of 10 cycles using primers (at 500 nM) annealing to the pUC19 vector (forward: pUC19_SmaI_5prime_fwr; reverse: pUC19_SmaI_3prime_rev) was performed on the concentrated DNA using NEBNext High-Fidelity 2× PCR master mix (NEB, M0541L). Each PCR1 reaction used a maximum of 2 µg concentrated DNA as template, with reactions assembled successively until all concentrated DNA was consumed; all reactions for a given sample were pooled following PCR1 and purified using a NucleoSpin Gel & PCR Clean-up kit (Macherey-Nagel, 740611), resulting in PCR product 1. Owing to variabilities in the insert size and the amount of retained plasmid DNA in the output library, artificial over-representation of fragments caused by PCR overcycling represented a concern for subsequent sequencing. Thus, we used qPCR to identify the cycle before saturation and halted amplification at this point. For qPCR, 50 ng of DNA from PCR product 1, NEBNext High-Fidelity 2× PCR master mix, 500 nM forward and reverse primers (forward: p5_adapter_only; reverse: p7_adapter_only) and 1 µl of 25× SYBR Green I (diluted from 10,000× stock; Thermo Fisher Scientific, S7563) were used in a 50 µl reaction. The SYBR Green signal of amplification products was measured in technical triplicates per reaction using a Lightcycler 480 (Roche) and plotted against the cycle number to identify the PCR cycle before saturation. According to the cycle numbers identified by this qPCR step, we then performed PCR2 by amplifying PCR product 1 (50 ng DNA) using the same primers as for the qPCR with the following number of cycles: 5, 10 and 12 PCR cycles for days 7, 14 and 21, respectively, of the COLO320DM experiment; 5, 11 and 18 PCR cycles for days 10, 20 and 30, respectively, of the GBM39 experiment; 5, 11 and 17 PCR cycles for days 6, 12 and 18, respectively, of the K562 experiment; and 10 PCR cycles for day 12 of the GM12878 experiment. 
We also collected a day-17 time point from the GM12878 experiment (amplified using 16 PCR cycles) that was specifically used to study retention of the EBV FR element, as this time point was assumed to be more comparable to the second time point in other cell lines. Next, output DNA from this step (PCR product 2) was purified using a MinElute PCR Purification kit (Qiagen, 28006) and then transposed with Tn5 transposase produced as previously described⁶² in a 50 µl reaction with TD buffer⁶³, 50 ng DNA (PCR product 2) and 1 µl transposase. The reaction was performed at 50 °C for 5 min, and transposed DNA was purified using a MinElute PCR Purification kit (Qiagen, 28006). The above PCR steps and transposition were also carried out on the input mixed episome library originally used for cell transfection, but with 25 ng of input mixed episome library for PCR1. According to the cycle numbers identified by this qPCR step, we then amplified PCR product 1 (1 ng DNA) over 9 PCR cycles (PCR2). Finally, the previous PCR steps and transposition were also performed on a dilution series of 10 ng, 1 ng, 0.1 ng and 0.01 ng of input mixed episome library as PCR1 template DNA to standardize analysis of screen output across varying DNA amounts.

Sequencing libraries were generated by five rounds of PCR amplification of the transposed PCR product 2 using NEBNext High-Fidelity 2× PCR master mix (NEB, M0541L) and primers carrying i5 and i7 indices, purified using a SPRIselect reagent kit (Beckman Coulter, B23317) with left-sided size selection (1.2×) and quantified using an Agilent Bioanalyzer 2100. Libraries were diluted to 4 nM and sequenced on an Illumina NovaSeq 6000 platform.

Primer sequences are listed in Supplementary Table 2.

Retain-seq analysis

Adapter content in sequenced episome library reads was trimmed using Trimmomatic (v.0.39)⁶⁴. Reads were aligned to the hg19 genome using BWA MEM (v.0.7.17-r1188)⁶⁵ and PCR duplicates were removed using MarkDuplicates in Picard (v.2.25.3). Read counts were then obtained for 1-kb windows across the reference hg19 genome using bedtools (v.2.30.0). Windows with fewer than 10 reads in 1 kb in the input episome library were filtered out.

Next, read counts were normalized to total reads and scaled to counts per million. We filtered out blacklist regions of the genome⁶⁶ and windows with extreme outlier read counts in the input episome library (more than three standard deviations above the mean read count). To determine how genome coverage is affected by the input DNA amount, we measured read counts of 1-kb genomic bins from sequencing of serial dilutions of the input episome library. This serial dilution experiment showed consistent representation of DNA sequences down to 0.1 ng of input DNA, at which the genome representation was nearly identical to 1 ng and 10 ng of input DNA in the top 50% of genomic bins (Extended Data Fig. 1b; 0.01 ng showed substantial library dropout and signs of skewing). Therefore, we focused our subsequent analyses of Retain-seq data on time points at which at least 50% of genomic bins are represented (that is, above 10 reads in a 1-kb window). Data from GBM39 cells at day 30 showed low genome representation and were excluded from subsequent analyses. Data from K562 cells at day 18 showed a large drop in genome representation and were excluded from subsequent analyses (Extended Data Fig. 2a).
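The window filtering and normalization steps described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names are hypothetical, blacklist filtering is omitted, and the sketch assumes that the outlier cutoff is computed over windows that already pass the 10-read minimum and that a population standard deviation is used.

```python
from statistics import mean, pstdev


def counts_per_million(counts):
    """Scale raw per-window read counts to counts per million (CPM)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("library has no reads")
    return [c * 1_000_000 / total for c in counts]


def keep_windows(input_counts, min_reads=10, n_sd=3.0):
    """Return indices of 1-kb windows that pass the input-library filters.

    A window is kept if it has at least `min_reads` reads and is not more than
    `n_sd` standard deviations above the mean count. Computing the mean and
    standard deviation over the covered windows only is an assumption.
    """
    covered = [i for i, c in enumerate(input_counts) if c >= min_reads]
    if not covered:
        return []
    values = [input_counts[i] for i in covered]
    cutoff = mean(values) + n_sd * pstdev(values)
    return [i for i in covered if input_counts[i] <= cutoff]


def fraction_represented(counts, min_reads=10):
    """Fraction of windows with at least `min_reads` reads in a sample."""
    return sum(c >= min_reads for c in counts) / len(counts)
```

In this sketch, a time point would be carried forward only if `fraction_represented` for that sample is at least 0.5, mirroring the 50% representation criterion stated above.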

We then calculated the log₂[fold change] of each genomic window in each sample over the input episome library by dividing the respective counts per million followed by log-transformation. Regions of the background genome with copy-number amplification in cells that retain the episome library can increase the background sequencing reads that align to those regions. To remove such background genomic noise, we calculated the median log₂[fold change] values of the neighbouring windows ±5 kb from each 1-kb window and normalized the log₂[fold change] of each 1-kb window to its corresponding neighbour average. Thus, any enriched episome sequence was required to have increased signal both compared with the input level and with its neighbouring sequences in its position in the reference human genome. z scores were calculated using the formula z = (x – m)/s.d., where x is the log₂[fold change] of each 1-kb window, m is the mean log₂[fold change] of the sample, and s.d. is the standard deviation of the log₂[fold change] of the sample. z scores were used to compute upper-tail P values using the normal distribution function, which were adjusted with p.adjust in R (v.3.6.1) with the Benjamini–Hochberg procedure to produce false discovery rate values. To identify episomes enriched in various cell lines, we identified 1-kb windows with false discovery rate values of <0.1 in two biological replicates at any of the time points for sample collection.
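The enrichment statistics described above can be illustrated with a short sketch. It is written in Python rather than R and makes several assumptions that are not stated in the text: windows are contiguous and ordered along one chromosome, "normalized to" means subtraction in log space, the sample standard deviation is used (as R's `sd` would compute), and the optional pseudocount is purely illustrative. All function names are hypothetical.

```python
import math
from statistics import mean, median, stdev


def log2_fold_change(sample_cpm, input_cpm, pseudocount=0.0):
    """log2(sample / input) per window; zero counts need a pseudocount or prior filtering."""
    return [math.log2((s + pseudocount) / (i + pseudocount))
            for s, i in zip(sample_cpm, input_cpm)]


def neighbour_normalize(lfc, flank=5):
    """Subtract the median log2 fold change of up to `flank` windows on each side (±5 kb for 1-kb windows)."""
    out = []
    for k, value in enumerate(lfc):
        neighbours = lfc[max(0, k - flank):k] + lfc[k + 1:k + 1 + flank]
        out.append(value - median(neighbours) if neighbours else value)
    return out


def z_scores(values):
    """z = (x - m) / s.d. across all windows of one sample."""
    m, sd = mean(values), stdev(values)
    return [(x - m) / sd for x in values]


def upper_tail_p(z):
    """Upper-tail P values from the standard normal distribution."""
    return [0.5 * math.erfc(v / math.sqrt(2)) for v in z]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = n - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

A window would then be called enriched when its adjusted value is below 0.1 in both biological replicates at a given time point. Note that a uniformly elevated block of windows, as expected for a background copy-number gain, is flattened to zero by `neighbour_normalize`, whereas an isolated peak is preserved.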

Plasmid cloning

To individually validate retention elements, pUC19 (empty vector) was digested with SmaI. Then, the following six retention element sequences were PCR amplified via a two-step nested PCR from gDNA derived from the GM12878 cell line: RE-A, chromosome 7 (55,321,959–55,323,480); RE-B, chromosome 7 (55,432,848–55,434,854); RE-C, chromosome 8 (127,725,819–127,727,938); RE-D, chromosome 7 (56,032,209–56,033,389); RE-E, chromosome 7 (55,086,476–55,088,263); and RE-F, chromosome 7 (55,639,062–55,640,378). Each retention element was inserted into the empty vector by Gibson assembly using NEBuilder HiFi 2× DNA Assembly master mix (NEB, E2621L) in accordance with the manufacturer’s protocol. The resulting plasmids were named pUC19_RE-A, pUC19_RE-B, pUC19_RE-C, pUC19_RE-D, pUC19_RE-E and pUC19_RE-F, respectively.

To clone pUC19 plasmids containing the EBV tether (pUC19_FR) or the entire viral origin (tether and replicator; pUC19_oriP), the viral tether (FR element; EBV: 7,421–8,042) and viral origin (oriP; EBV: 7,338–9,312) sequences were PCR-amplified using the pHCAG-L2EOP plasmid (Addgene, 51783)⁶⁷ as a template and inserted into SmaI-digested pUC19 by Gibson assembly.

To clone pUC19 plasmids with two or three copies of a retention element (RE-C, chromosome 8 (127,725,819–127,727,938); pUC19_2RE and pUC19_3RE), we digested pUC19_RE-C with HindIII and inserted a second copy of the retention element (amplified by PCR primers pUC19_2RE forward and pUC19_2RE reverse) by Gibson assembly to generate pUC19_2RE. To generate pUC19_3RE (three copies of the retention element), pUC19_2RE was digested with SacI and a third copy of the retention element (amplified by PCR primers pUC19_3RE forward and pUC19_3RE reverse) was inserted by Gibson assembly.

To clone the pUC19 plasmid containing the CMV promoter (pUC19_CMV), the CMV promoter was PCR-amplified (primers pUC19_CMV forward and pUC19_CMV reverse) using the pGL4.18 CMV-Luc plasmid (pGL4; Addgene, 100984)⁶⁸ as a template and inserted into HindIII-digested pUC19 by Gibson assembly. To clone the pGL4 vector containing a retention element (RE-C, chromosome 8 (127,725,819–127,727,938); pGL4_RE-C), we digested pGL4 with MfeI and BamHI for the backbone and PCR-amplified the retention element sequence from GM12878 gDNA (primers pGL4_RE1 forward and pGL4_RE1 reverse). The PCR product was gel purified, digested with BsaI and BamHI, and ligated to the vector backbone using the DNA Ligation Kit v.2.1 (Takara Bio, 6022) following the manufacturer’s protocol.

For cloning individual overlapping tiles of a retention element (RE-C, chromosome 8 (127,725,819–127,727,938)), tiles were each 500 bp in length (with the first 250 bp overlapping with the previous tile and the latter 250 bp with the subsequent tile), and each tile was amplified by PCR using pUC19_RE-C as a template. pUC19 was digested with SmaI and each tile sequence was inserted by Gibson assembly.
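The tiling scheme can be illustrated with a short sketch that enumerates 500-bp tiles offset by 250 bp. The function name is hypothetical, and how the final partial stretch of the element was handled is not stated in the text; this sketch simply drops any trailing remainder shorter than one tile.

```python
def overlapping_tiles(start, end, tile=500, step=250):
    """Return 1-based inclusive (start, end) coordinates of overlapping tiles.

    Each tile shares its first `tile - step` bp with the previous tile. A
    trailing remainder shorter than `tile` is dropped (an assumption).
    """
    tiles = []
    pos = start
    while pos + tile - 1 <= end:
        tiles.append((pos, pos + tile - 1))
        pos += step
    return tiles


# RE-C coordinates as given in the text (chromosome 8).
re_c_tiles = overlapping_tiles(127_725_819, 127_727_938)
```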

The plasmids for live-cell imaging were designed on the basis of a previously published pGL4 vector for a dual luciferase assay²³. The vector contains a retention element, termed RE-G (chromosome 8 (128,804,981–128,806,980), hg19), that overlaps with the PVT1 promoter. To insert LacO repeats for imaging, we first inserted multiple enzyme sites (GTCGACTGTGCTCGAGAACACGGATCCTATGCTCGTACG) by Gibson assembly following digestion with BamHI. Next, the vector was digested with SalI and BsiWI and ligated with an array of 256 LacO copies that was obtained through the digestion of a pLacO-ISce1 plasmid (Addgene, 58505)⁶⁹ with SalI and Acc65I. To create a control plasmid that does not contain the retention element, the vector was digested with KpnI and BglII. The plasmid sequences were verified by Sanger sequencing. The LacO repeats in the plasmids were further verified by agarose gel electrophoresis because of their large size. All enzymes and Gibson assembly mix were purchased from NEB. All primer sequences are listed in Supplementary Table 2.

qPCR analysis of plasmid retention

To assess the retention of individual plasmids transfected into cells, we seeded K562 or COLO320DM cells into 6-well plates at a density of 3 × 10⁵ cells in 3 ml of medium per well and incubated the cells overnight. The next morning, cells were transfected with 0.5 µg plasmid per well using Lipofectamine 3000 transfection reagent (Thermo Fisher Scientific) following the manufacturer’s protocol. In total, 6 × 10⁵ GM12878 cells were electroporated with 2 µg plasmid per well using a Neon transfection system. Cells were counted, centrifuged at 300g for 5 min and washed twice with PBS before resuspension in Neon resuspension buffer to a density of 4.2 × 10⁵ in 7 µl of buffer. The plasmid was also diluted to a density of 1.4 µg in 7 µl with Neon resuspension buffer. Next, 7 µl of cell suspension and 7 µl of plasmid were mixed and electroporated according to the manufacturer’s instructions using a 10 µl Neon pipette tip under the following settings: 1,200 V, 20 ms, 3 pulses. Two electroporation reactions were pooled per replicate and plated into a 12-well plate in 1.5 ml medium per well. Cell cultures were split every 2–4 days and fresh medium was added. To quantify plasmid DNA in cells at various time points, gDNA was extracted from cells using a DNeasy Blood & Tissue kit (Qiagen, 69504). qPCR was performed in technical duplicates using 50–100 ng gDNA, 2× LightCycler 480 SYBR Green I master mix (Roche, 04887352001) and 125 nM forward and reverse primers (primers pUC19_F and pUC19_R, annealing to the pUC19 vector backbone; for plasmids with the pGL4 vector backbone, primers pGL4_F and pGL4_R were used). Relative plasmid DNA levels were calculated by normalizing to GAPDH controls (primers GAPDH_F and GAPDH_R). DNA levels were further normalized to the day 2 levels to account for variability in transfection efficiencies and to cells transfected with an empty plasmid vector control. 
P values were calculated in R using Student’s t-tests by comparing the relative fold change of biological replicates at various time points with respect to the input levels at day 2. Primer sequences are listed in Supplementary Table 2.
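The normalization chain described above can be sketched as follows. The text does not give the exact formula, so this example assumes the common 2^(−ΔCt) approach with approximately 100% primer efficiency; all function names are hypothetical.

```python
def relative_plasmid_level(ct_plasmid, ct_gapdh):
    """Plasmid signal relative to GAPDH, assuming 2^-(delta Ct) quantification."""
    return 2.0 ** -(ct_plasmid - ct_gapdh)


def retention_vs_day2(level_day_n, level_day2):
    """Normalize to the day 2 level to account for transfection efficiency."""
    return level_day_n / level_day2


def fold_over_empty_vector(retention_test, retention_empty):
    """Normalize to cells transfected with the empty vector control."""
    return retention_test / retention_empty


# Illustrative Ct values only (not data from the study).
test_day2 = relative_plasmid_level(ct_plasmid=20.0, ct_gapdh=22.0)
test_day8 = relative_plasmid_level(ct_plasmid=24.0, ct_gapdh=22.0)
empty_day2 = relative_plasmid_level(ct_plasmid=20.0, ct_gapdh=22.0)
empty_day8 = relative_plasmid_level(ct_plasmid=26.0, ct_gapdh=22.0)

fold = fold_over_empty_vector(
    retention_vs_day2(test_day8, test_day2),
    retention_vs_day2(empty_day8, empty_day2),
)
```

With these made-up values, the test plasmid is retained fourfold better than the empty vector between day 2 and day 8.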

Analysis of potential genomic integration of plasmids

COLO320DM cells were seeded into 2 wells of a 6-well plate, transfected with 0.5 µg of pUC19 or pUC19_RE-C per well and passaged as described in the section ‘qPCR analysis of plasmid retention’. At day 8, high-molecular-mass gDNA was extracted from cells with a Puregene Cell Core kit (Qiagen, 158046) and long-read sequencing libraries were prepared using a Ligation Sequencing Kit v.14 (Oxford Nanopore Technologies, SQK-LSK114) in accordance with the manufacturer’s protocol. Libraries were loaded onto R10.4.1 flow cells (Oxford Nanopore Technologies, FLO-PRO114M) and sequenced on a PromethION platform (Oxford Nanopore Technologies). Basecalling from raw POD5 data was performed using the high accuracy DNA model in Dorado (Oxford Nanopore Technologies, v.0.5.2). Fastq files were generated using samtools bam2fq (v.1.6)⁷⁰, aligned to a custom reference (hg19_pUC19) comprising the pUC19 sequence appended to the hg19 genome using minimap2 (v.2.17)⁷¹ and sorted and indexed using samtools. Alignments shorter than 1 kb and with mapping quality below 60 were discarded. Structural variants were then called using Sniffles (v.2.2)⁷² with the hg19_pUC19 reference and the following parameters: “--allow-overwrite --output-rnames --non-germline --long-ins-length 3000”. Integration events were identified from Sniffles output (.vcf) as Breakends (Translocations) between the pUC19 sequence and chromosomes.
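The final step, identifying breakends that join the plasmid to a chromosome, can be sketched as a small VCF parser. This is an illustration only: the contig name `pUC19` and the function name are assumptions, and the parser handles only the standard VCF breakend notation in the ALT column (for example `N[chr8:12345[`).

```python
import re

# Matches the mate locus inside a VCF breakend ALT string, e.g. "N[chr8:12345[".
BND_MATE = re.compile(r"[\[\]]([^\[\]:]+):(\d+)[\[\]]")


def plasmid_breakends(vcf_lines, plasmid_contig="pUC19"):
    """Return (chrom, pos, mate_chrom, mate_pos) for plasmid-chromosome breakends.

    A record is reported when exactly one side of the breakend lies on the
    plasmid contig. The contig name is hypothetical and must match the name
    used in the custom reference.
    """
    events = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        match = BND_MATE.search(alt)
        if match is None:
            continue  # not a breakend record
        mate_chrom, mate_pos = match.group(1), int(match.group(2))
        if (chrom == plasmid_contig) != (mate_chrom == plasmid_contig):
            events.append((chrom, pos, mate_chrom, mate_pos))
    return events
```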

ENCODE data integration

To perform meta-analysis of protein-binding sites in retention elements, ENCODE data were downloaded in bigWig format using the files.txt file returned from the ENCODE portal (https://www.encodeproject.org) and the following command: “xargs -n 1 curl -O -L < files.txt”. Retention element coordinates in K562 cells were converted from the hg19 build to the hg38 build using the UCSC LiftOver tool (R package liftOver, v.1.18.0). To plot heatmaps of protein binding in retention elements, we used the ‘computeMatrix’ function in deepTools (v.3.5.1) with the ‘scale-regions’ mode, specified each ‘bigWig’ file using “--scoreFileName”, and a .bed file containing hg38 retention element coordinates using “--regionsFileName”, along with the following parameters: “--regionBodyLength 5000 --beforeRegionStartLength 5000 --afterRegionStartLength 5000 --binSize 20 --skipZeros”. Each resulting matrix was aggregated by computing column means using the colMeans function in R and rescaled to 0–1 using the ‘rescale’ function in the scales (v.1.3.0) package in R.
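The aggregation step (column means followed by rescaling to 0–1) can be sketched in Python as a stand-in for the R functions named above. The function names are hypothetical, and returning 0.5 for a constant profile is a choice made for this sketch.

```python
def column_means(matrix):
    """Mean of each column of a regions-by-bins matrix (list of equal-length rows)."""
    return [sum(col) / len(col) for col in zip(*matrix)]


def rescale01(values):
    """Linearly rescale values so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # constant profile: midpoint chosen for this sketch
    return [(v - lo) / (hi - lo) for v in values]


# Toy matrix: 2 regions x 3 bins (illustrative values only).
profile = rescale01(column_means([[1.0, 2.0, 3.0], [3.0, 6.0, 9.0]]))
```

With the stated parameters (5,000 bp body plus 5,000 bp on each side, in 20-bp bins), each row of the real matrix would be expected to contain 15,000/20 = 750 columns.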

To analyse overlap of various genomic annotation classes in retention elements, coordinates of each genomic annotation type were first obtained using the R packages TxDb.Hsapiens.UCSC.hg19.knownGene (genes; v.3.2.2) and TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts (lncRNAs; v.3.2.2). ‘All promoters’ comprised sequences 1,500 bp upstream to 200 bp downstream from the TSS for all transcripts in the TxDb objects, extracted using the ‘promoters’ function. 5′ UTR, 3′ UTR, intron and exon sequences were extracted using the ‘fiveUTRsByTranscript’, ‘threeUTRsByTranscript’, ‘intronicParts’ and ‘exonicParts’ functions, respectively, whereas coding and lncRNA promoters were each subsets of the total promoters list. Downstream intergenic regions represent nongenic sequences within 1,500 bp of each TTS, whereas distal intergenic regions were classified as nongenic sequences beyond 1,500 bp of the TSS and 1,500 bp of the TTS. Coordinates were computed using the ‘flank’ and ‘setdiff’ functions in the R package GenomicRanges (v.1.46.1).

To analyse enrichment of transcription-factor-binding sites in retention elements, uniformly processed transcription factor ChIP–seq data (aligned to the hg38 genome) from the K562 cell line were downloaded as a batch from the Cistrome Data Browser (Cistrome DB)⁷³. Datasets that failed to meet more than one of the following quality thresholds were excluded: raw sequence median quality score (FastQC score) ≥25; ratio of uniquely mapped reads ≥0.6; PBC score ≥80%; union DNase I hypersensitive site overlap of the 5,000 most significant peaks ≥70%; number of peaks with fold change above 10 ≥500; and fraction of reads in peaks ≥1%. Individual ChIP–seq datasets were imported as GenomicRanges (v.1.46.1) objects from narrowPeak or broadPeak files. For transcription factors with multiple ChIP–seq datasets, datasets were aggregated into a union peak set for subsequent analyses. To identify transcription factors that were enriched for binding in retention elements relative to random genomic intervals, a fold change value was computed for each transcription factor comparing the percentage of retention element intervals overlapping with at least one transcription factor ChIP–seq peak (>50% peak coverage) against the percentage of overlapping 1-kb genomic bins. P values were computed in R (function ‘phyper’) using hypergeometric tests for over-representation and adjusted for multiple comparisons with the Bonferroni correction.
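The enrichment statistics for one transcription factor can be sketched as below, using Python in place of R's `phyper`. The mapping onto the hypergeometric model is an assumption: all 1-kb bins form the population, bins overlapping a peak are the successes, and retention elements are treated as draws from that population. Function names and counts are hypothetical.

```python
from math import comb


def fold_change(re_hits, re_total, bin_hits, bin_total):
    """Ratio of the fraction of retention elements with a peak to the fraction of bins with a peak."""
    return (re_hits / re_total) / (bin_hits / bin_total)


def hypergeom_upper_tail(k, successes, population, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    Uses exact integer arithmetic; this is fine for a sketch but slow for
    genome-scale populations.
    """
    upper = min(successes, draws)
    numerator = sum(comb(successes, i) * comb(population - successes, draws - i)
                    for i in range(k, upper + 1))
    return numerator / comb(population, draws)


def bonferroni(pvalues):
    """Bonferroni-adjusted P values, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]
```

In R, the corresponding upper-tail probability would typically be obtained with `phyper(k - 1, ..., lower.tail = FALSE)`; the exact call used in the study is not given in the text.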

Origins of replication overlap

Coordinates (in the hg19 reference) of origins of replication identified in the K562 cell line across five replicates of SNS-seq were published in another study⁷⁴ and deposited into the NCBI Gene Expression Omnibus (GEO) under accession GSE46189. Retention elements or 1-kb genomic bins were considered overlapping if an origin of replication covered at least 25% of the queried interval (calculated in R using the package GenomicRanges, v.1.46.1). The enrichment P value was computed in R using a hypergeometric test for over-representation.
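The 25% coverage criterion can be expressed as a simple interval test. This sketch assumes half-open coordinates on a single chromosome and requires one individual origin to reach the threshold; whether coverage by several origins was combined is not stated in the text. The function name is hypothetical.

```python
def overlaps_origin(query, origins, min_fraction=0.25):
    """True if any single origin covers at least `min_fraction` of the query interval.

    `query` and each origin are (start, end) tuples in half-open coordinates
    on the same chromosome (an assumption of this sketch).
    """
    q_start, q_end = query
    q_len = q_end - q_start
    for o_start, o_end in origins:
        shared = min(q_end, o_end) - max(q_start, o_start)
        if shared > 0 and shared / q_len >= min_fraction:
            return True
    return False
```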

GRO-seq analysis

GRO-seq data of COLO320DM were published in another study⁷⁵ and deposited into the NCBI GEO under accessions GSM7956899 (replicate 1) and GSM7956900 (replicate 2). The subset of retention element coordinates from the COLO320DM, GBM39 or K562 cell lines located in the amplified intervals of the COLO320DM ecDNA was divided into three categories on the basis of overlap with genomic annotations: (1) retention elements located entirely in coding gene promoters (within 2 kb of a coding gene TSS); (2) retention elements located elsewhere within the limits of coding genes; and (3) retention elements located in noncoding regions. Coordinates of these retention elements were then converted from the hg19 build to hg38 build using the UCSC liftOver package (v.1.18.0) in R. GRO-seq signals within 3 kb of the midpoint of each retention element were presented in separate heatmaps using the EnrichedHeatmap package (v.1.24.0) for each strand and for each retention element category.

Motif enrichment

A curated collection of human motifs from the CIS-BP database⁷⁶ (‘human_pwms_v2’ in the R package chromVARmotifs, v.0.2.0)⁷⁷ was first matched to the set of 1-kb bins spanning the hg19 reference to identify all such intervals of the human genome containing instances of each motif. Enrichment of each motif in retention elements was then calculated as a log₂[fold change] of the fraction of retention element intervals (identified by Retain-seq in each cell type) containing motif instances compared with all genomic intervals.
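The enrichment score for a single motif reduces to a ratio of two fractions, as sketched below with a hypothetical function name and made-up counts.

```python
import math


def motif_log2_enrichment(re_with_motif, re_total, bins_with_motif, bins_total):
    """log2 of (fraction of retention elements with the motif) over
    (fraction of all 1-kb genomic bins with the motif)."""
    re_fraction = re_with_motif / re_total
    bin_fraction = bins_with_motif / bins_total
    return math.log2(re_fraction / bin_fraction)


# Illustrative counts only: motif in 50 of 100 elements versus 250 of 1,000 bins.
score = motif_log2_enrichment(50, 100, 250, 1000)
```

A motif absent from all retention elements would give a fraction of zero, for which the logarithm is undefined; how such cases were handled is not described in the text.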

Live-cell imaging

The live-cell imaging cell line was engineered from COLO320DM cells obtained from the ATCC, as described in a previous publication⁶. TetO ecDNAs were labelled with TetR-mNeonGreen. On the basis of the overlap between MYC and TetO FISH foci in metaphase spreads, 50–80% of ecDNA molecules in a given cell were typically labelled (Extended Data Fig. 6a). The cells were further infected with the LacR-mScarlet-NLS construct and sorted for mScarlet-positive cells to enable stable expression of LacR-mScarlet protein. These cells were then subjected to nucleofection with one of the following plasmids: a control plasmid with LacO repeats; a plasmid containing a retention element (RE-G) with LacO repeats; or an in vitro CpG-methylated retention element (RE-G) plasmid with LacO repeats. Specifically, 1 μg of plasmid was nucleofected into 400,000 cells following the standard nucleofection protocol from Lonza (Nucleofection code, CM-138) to visualize plasmid signals. Cells were seeded onto 96-well glass-bottom plates (Azenta Life Sciences, MGB096-1-2-LG-L) (coated with 10 μg ml^–1 poly-d-lysine; Sigma-Aldrich, A-003-E) immediately after nucleofection and were imaged 2 days later. FluoroBrite DMEM (Gibco, A1896701) supplemented with 10% FBS and 1× GlutaMAX, along with 1:200 Prolong Live antifade reagent (Invitrogen, P36975), was replenished 30 min before time-lapse imaging. Cells were imaged on a top-stage incubator (Okolab) fitted onto a Leica DMi8 wide-field microscope with a ×63 oil objective, and the temperature (37 °C), humidity and CO₂ (5%) were controlled throughout the imaging experiment. z stack images were acquired every 30 min for a total of 4–18 h. The images were processed using Small Volume Computational Clearing before maximum-intensity projections were made for all frames.

Live-cell imaging analysis

Maximum-intensity projections were exported as TIFF files from the .lif files using ImageJ. To analyse colocalization of LacR–LacO–plasmid foci or TetR–TetO–MYC ecDNA foci with mitotic chromosomes during anaphase, images of cells entering anaphase and telophase were exported for mitotic cells that had shown at least five distinct plasmid foci at the beginning of mitosis. The exported images were split into the different colour channels, and the signal threshold was manually set to remove background fluorescence using Fiji (v.2.1.0/1.53c)⁶¹. Fluorescence signals were segmented using watershed segmentation. The H2B-emiRFP670 signal was used to mark the boundaries of mitotic chromosomes of dividing daughter cells. All colour channels except H2B were stacked, ROIs (regions of interest) were manually drawn to identify the two daughter cells, and a third ROI was drawn around the space occupied by the pair of dividing daughter cells. Next, the colour channels were split again and image pixel areas occupied by fluorescence signals were analysed using particle analysis. Fractions of ecDNAs colocalizing with mitotic chromosomes were estimated by fractions of FISH pixels in the ROIs of daughter cell chromosomes.

To perform time-resolved DNA segregation analysis, TIFF files were analysed using Aivia (v.12.0.0) by first segmenting the condensed chromatin (labelled by H2B-emiRFP670), TetR–TetO–MYC foci and LacR–LacO–plasmid foci of the mitotic cell, using a trained pixel classifier that recognizes each of the elements. Each segmented chromatin and focus of interest was then manually selected and output as an object. The relative distance of each focus to its corresponding periphery of the segmented chromatin was output using the Object Relation Tool by setting the ‘TetR/PVT1’ object as the primary set and its corresponding ‘Chromatin’ object as the secondary set using default settings. The resulting data were exported to R (v.3.6.1). TetR–TetO–MYC foci or LacR–LacO–plasmid foci with more than 75% overlapping area with the ‘Chromatin’ object were considered colocalized, and their relative distances to their corresponding segmented chromatin were replaced with 0. For each dividing cell, the fractions of plasmid or ecDNA foci colocalizing with mitotic chromosomes were calculated.
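The per-cell colocalization rule can be sketched as follows, in Python rather than R. The input is assumed to be, for each focus in one dividing cell, its fractional area overlap with the chromatin object and its relative distance to the chromatin periphery; the function names and data layout are hypothetical.

```python
def zero_colocalized_distances(distances, overlap_fractions, threshold=0.75):
    """Replace the distance with 0 for foci whose overlap with chromatin exceeds the threshold."""
    return [0.0 if f > threshold else d
            for d, f in zip(distances, overlap_fractions)]


def colocalized_fraction(overlap_fractions, threshold=0.75):
    """Fraction of foci in one dividing cell that count as colocalized with chromatin."""
    if not overlap_fractions:
        raise ValueError("cell has no foci")
    return sum(f > threshold for f in overlap_fractions) / len(overlap_fractions)


# Illustrative values for one cell with four foci (not data from the study).
overlaps = [0.9, 0.8, 0.75, 0.1]
distances = [0.2, 0.1, 0.4, 1.5]
cleaned = zero_colocalized_distances(distances, overlaps)
fraction = colocalized_fraction(overlaps)
```

Because the rule is "more than 75%", a focus with exactly 75% overlap is not counted as colocalized in this sketch.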

Hi-C

For mitotic Hi-C, COLO320DM cells were seeded into a 6 cm dish at a density of 0.5 × 10⁶ cells in 8 ml RPMI medium (11875-119) containing 10% fetal bovine serum (Fisher Scientific, SH30396.03) and 1% penicillin–streptomycin (Gibco, 15140-122) and incubated overnight. Nocodazole (M1404-10MG) was dissolved in DMSO and added directly to the cells in the medium to reach a final concentration of 100 ng ml^–1 (8 μl of 100 ng μl^–1 nocodazole was added to 8 ml RPMI medium). After 16 h of nocodazole treatment, both suspension and adherent cells were collected for Hi-C analysis and flow cytometry analysis for cell cycle staining using propidium iodide (Invitrogen, 00699050). Flow cytometry verified that the cell population consisted mainly of cells with 4n DNA content after mitotic arrest. For interphase Hi-C of GBM39 (GBM39ec) cells, GBM39 cells were cultured as described above (section ‘Cell culture’).

To perform each Hi-C experiment, 10 million cells were fixed in 1% formaldehyde in aliquots of 1 million cells each for 10 min at room temperature and combined after fixation. We performed the Hi-C assay following a standard protocol to investigate chromatin interactions⁷⁸. Hi-C libraries were sequenced on an Illumina HiSeq 4000 with paired-end 75 bp reads for mitotic Hi-C of COLO320DM cells and an Illumina NovaSeq 6000 with paired-end 150 bp reads for interphase Hi-C of GBM39 cells⁷⁹.

Hi-C analysis

Paired-end Hi-C reads were aligned to the hg19 genome with the HiC-Pro pipeline⁸⁰. The pipeline was run with default settings, assigning reads to DpnII restriction fragments and filtering for valid pairs. The data were then binned to generate raw contact maps, which then underwent ICE normalization to remove biases. Visualization was done using Juicebox (https://aidenlab.org/juicebox/). Hi-C data from asynchronous COLO320DM and GBM39 cells were generated and processed in the same way in parallel with the mitotically arrested cells. Asynchronous COLO320DM cell data were separately published⁸¹ and deposited into the NCBI GEO under accessions GSM8523315 (replicate 1) and GSM8523316 (replicate 2).

To analyse chromatin interactions with retention elements on ecMYC, the combined set of retention elements identified was overlapped with the known ecMYC coordinates: chromosome 8, 127,437,980–129,010,086 (hg19). To analyse chromatin interactions with chromosome bookmarked regions, we used previously identified bookmarked regions that retained accessible chromatin throughout mitosis in single-cell ATAC–seq data of L02 human liver cells³⁷ and filtered out regions that overlap with the known ecMYC coordinates and other ecMYC co-amplified regions: chromosome 6, 247,500–382,470; chromosome 8, 130,278,158–130,286,750; chromosome 13, 28,381,813–28,554,499; chromosome 16, 32,240,836–32,471,322; and chromosome 16, 33,220,985–33,538,549. The resulting ecMYC retention elements and chromosome bookmarked regions were used as anchors to measure pairwise interactions using APA with the .hic files in Juicer (v.1.22.01) and the ‘apa’ function with 5-kb resolution and the following parameters: “-e -u”. Summed percentile matrices of pairwise interactions from ‘rankAPA.txt’ are reported. Analyses for the EGFR ecDNA in the GBM39 cell line were performed in the same manner, using the ecDNA coordinates chromosome 7, 54,830,901–56,117,000 (hg19).

To analyse interactions between ENCODE-annotated classes of regulatory sequences, the retention elements that overlapped with ‘dELS’, ‘PLS’ or ‘pELS’ annotations were categorized as distal enhancers, promoters or proximal enhancers, respectively. Those overlapping with both pELS and PLS annotations were categorized as promoters, whereas those overlapping with both pELS and dELS annotations were categorized as proximal enhancers. To extract Hi-C read counts corresponding to interactions between different classes of elements on ecDNA and chromosomes, the Juicer Tools (v.1.22.01)⁸² dump command was used to extract read count data from the .hic files with 1-kb and 5-kb resolution with ‘observed NONE’. The resulting outputs were converted into GInteractions objects using the InteractionSet (v.1.14.0) package in R. To remove chromosomal regions with increased signal due to copy-number changes (and not occurring on ecDNA), we filtered out chromosomal regions that overlapped with copy-number-gain regions identified in WGS of COLO320DM using the ReadDepth (v.0.9.8.5) package. GInteractions objects containing Hi-C read counts between genomic coordinates in 1-kb resolution were overlapped with a GInteractions object containing pairwise interactions between chromosome bookmarked regions and ecMYC retention elements using the findOverlaps function in the InteractionSet package in R. Resulting read counts of these pairwise interactions were used to calculate read counts per kb using the formula: read counts per kb = 1,000 × read counts/size of retention element bin in bp. Read counts per kb of each combination of interactions between different classes of elements were summed and divided by the total number of pairwise interactions belonging to each combination of interactions to obtain read counts per kb per interaction.
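The categorization precedence and the read-count normalization can be written out as a short sketch. This is an illustration in Python rather than the R workflow used in the study; the function names and the tuple layout are hypothetical, and the handling of an element carrying PLS and dELS labels without pELS is an assumption, because the text does not specify that case.

```python
def classify_element(annotations):
    """Map a set of ENCODE cCRE labels to one class, following the stated precedence."""
    labels = set(annotations)
    if "PLS" in labels:
        return "promoter"           # PLS alone or PLS + pELS (PLS + dELS: assumption)
    if "pELS" in labels:
        return "proximal enhancer"  # pELS alone or pELS + dELS
    if "dELS" in labels:
        return "distal enhancer"
    return None                     # no overlapping annotation


def read_counts_per_kb(read_counts, bin_size_bp):
    """read counts per kb = 1,000 x read counts / size of retention element bin in bp."""
    return 1000 * read_counts / bin_size_bp


def per_interaction_mean(interactions):
    """interactions: iterable of (combination_label, read_counts, bin_size_bp).

    Returns {combination_label: summed read counts per kb / number of interactions}.
    """
    sums, counts = {}, {}
    for label, reads, size in interactions:
        sums[label] = sums.get(label, 0.0) + read_counts_per_kb(reads, size)
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}
```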

Curation of candidate bookmarking factors

Candidate bookmarking factors were curated from three recently published studies^37,39,83. Candidate bookmarking factors identified in ref. ³⁹ were identified in mouse cells. Their orthologues were identified using the Mouse Genome Informatics database (http://www.informatics.jax.org/downloads/reports/HOM_MouseHumanSequence.rpt), and those not annotated as ‘Depleted’ on mitotic chromosomes were included. Candidate bookmarking factors identified in ref. ³⁷ were identified on the basis of single-cell ATAC–seq analysis of mitotic chromosomes. Finally, candidate bookmarking factors identified in ref. ⁸³ were selected by focusing on protein factors that met the following criterion: log₂[(C + 1)/(P + 1)] > 0, where C denotes the mean protein enrichment values in mitotic cells from fractionated chromatin (chromatome), and P denotes the mean protein enrichment values in the proteomes of mitotic cells.
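As a worked form of the selection criterion for ref. ⁸³, the following Python sketch evaluates log₂[(C + 1)/(P + 1)] > 0 for one protein. The function and argument names are hypothetical. For non-negative enrichment values the criterion is equivalent to C > P; the +1 pseudocount only keeps the ratio defined when either value is 0.

```python
import math


def passes_chromatome_filter(chromatome_mean, proteome_mean):
    """True if log2[(C + 1)/(P + 1)] > 0 for mean enrichment values C and P."""
    return math.log2((chromatome_mean + 1) / (proteome_mean + 1)) > 0
```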

Importance analysis of bookmarking factors

To interrogate whether retention elements contain disproportionately more binding sites of some bookmarking factors than others, we computed importance scores in R for each bookmarking factor to explain the observed set of retention elements. First, we generated 1,000 random permutations of the top 20 most enriched bookmarking factors in retention elements compared with random intervals. For each permuted list, we computed the incremental number of retention elements explained by (containing binding sites of) each bookmarking factor in the cumulative distribution. The mean of this value across all permutations represents the importance score for each bookmarking factor.
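The permutation procedure can be sketched as follows. The original computation was done in R; this Python version is illustrative, and the input format (a mapping from each factor to the set of retention elements containing its binding sites) and the fixed random seed are assumptions.

```python
import random


def importance_scores(factor_to_elements, n_permutations=1000, seed=0):
    """Mean incremental number of retention elements explained by each factor.

    factor_to_elements: {factor: set of retention element IDs with a binding site}.
    """
    rng = random.Random(seed)
    factors = list(factor_to_elements)
    totals = {factor: 0 for factor in factors}
    for _ in range(n_permutations):
        order = factors[:]
        rng.shuffle(order)            # one random ordering of the factors
        explained = set()             # elements explained so far in this ordering
        for factor in order:
            newly_explained = factor_to_elements[factor] - explained
            totals[factor] += len(newly_explained)
            explained |= newly_explained
    return {factor: totals[factor] / n_permutations for factor in factors}
```

Within each permutation the increments sum to the number of elements explained by at least one factor, so the scores partition that total among the factors.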

CRISPR–Cas9 knockouts of bookmarking factors

Cas9–gRNA ribonucleoprotein (RNP) complexes were first assembled for each gRNA by mixing 30 µM gRNAs (Synthego) targeting CHD1, SMARCE1 and HEY1 and 2 nontargeting control gRNAs (2 separate guides per target; guide sequences are provided in Supplementary Table 1) separately with 20 µM SpCas9 2NLS Nuclease (Synthego) at a 6:1 molar ratio. Complexes were then incubated for 10 min at room temperature. In brief, COLO320DM cells were counted, centrifuged at 300g for 5 min and washed twice with PBS before resuspension in Neon resuspension buffer to a density of 4.2 × 10⁵ in 7 µl of buffer. Next, 7 µl of cell suspension and 7 µl of RNP were mixed and electroporated per reaction according to the manufacturer’s instructions using a 10 µl Neon pipette tip under the following settings: 1,700 V, 20 ms, 1 pulse. Three electroporation reactions were plated for each replicate (2 per condition) into 6-well plates in 3 ml of medium per well.

IF–DNA-FISH of knockout mitotic cells

About 1 million cells were seeded onto 22 × 22 mm poly-d-lysine-coated coverslips 2 days after transfection. The next day, cells were washed once with 1× PBS and fixed with 4% paraformaldehyde for 10 min at room temperature, followed by permeabilization with 1× PBS–0.25% Triton-X for 10 min at room temperature. Samples were blocked in 3% BSA diluted in 1× PBS for 1 h at room temperature, followed by an overnight incubation at 4 °C with the following primary antibodies: Aurora kinase B antibody (Novus Biologicals, NBP2-50039; 1:1,000); CHD1 (Novus Biologicals, NBP2-14478; 1 μg ml^–1); HEY1 (Novus Biologicals, NBP2-16818; 1:1,000); and SMARCE1 (Sigma-Aldrich, HPA003916; 1 μg ml^–1). Cells were washed in 1× PBS and incubated with fluorescently conjugated secondary antibodies (F(ab′)2-goat anti-rabbit IgG (H+L) cross-adsorbed secondary antibody, Alexa Fluor 488 (Invitrogen, A-11070) and donkey anti-mouse IgG (H+L) highly cross-adsorbed secondary antibody, Alexa Fluor 647 (Invitrogen, A-31571)) at 1:500 for 1 h at room temperature. The samples were then washed in 1× PBS and fixed with 4% paraformaldehyde at room temperature for 20 min. A subsequent permeabilization step using 1× PBS containing 0.7% Triton-X and 0.1 M HCl was performed on ice for 10 min, followed by acid denaturation for 30 min at room temperature using 1.9 M HCl. The samples were then washed once with 1× PBS and then 2× SSC, followed by washes with an ascending ethanol concentration of 70%, 85% and 100% for 2 min each. MYC FISH probes (Empire Genomics) were diluted with hybridization buffer and subjected to heat denaturation at 75 °C for 3 min before being applied to the fully air-dried coverslips for overnight hybridization at 37 °C. The next day, the coverslips were washed once with 0.4× SSC, then with 2× SSC-0.1% Tween 20 and counterstained with DAPI at 50 ng ml^–1 for 2 min at room temperature. After rinsing in ddH₂O, the samples were air-dried and mounted onto frosted glass slides with ProLong Diamond antifade mountant (Invitrogen). Samples were imaged on a Leica DMi8 wide-field microscope. z stack images were collected and subjected to small volume computational clearing on LAS X.

Analysis of IF–DNA-FISH of knockout mitotic cells

We first created a CellProfiler (v.4.2.7)⁸⁴ analysis pipeline to quantify protein expression levels after targeted knockout. In brief, we split each image into four colour channels (DAPI, Aurora kinase B, target protein and ecDNA FISH), and used DAPI to segment nuclei (40–150 pixel units) with global Otsu’s thresholding (two-class thresholding). We then identified cells by starting from the nuclei as seed regions and growing outward using the protein staining signals via propagation with global minimum cross-entropy thresholding. The mean intensity of protein staining in cells was used to determine knockout efficiency of target proteins compared with controls.

Next, we created a CellProfiler analysis pipeline to quantify ecDNA tethering to mitotic chromosomes after protein knockout. In brief, we identified mitotic daughter cell pairs using pairs of cells with Aurora kinase B marking the mitotic midbody as previously described⁶. We segmented nuclei using DAPI as described above and then identified cells by starting from the nuclei as seed regions and growing outward using the protein staining signals via propagation with three-class global Otsu’s thresholding (with pixels in the middle intensity class assigned to the foreground). We separately identified ecDNA foci as primary objects using adaptive Otsu’s thresholding (two-class) and intensity-based declumping. Masks were then created for ecDNA foci overlapping with nuclei (with at least 30% overlap) and ecDNA foci overlapping with cytoplasm (with at least 70% overlap), which were defined as tethered and untethered ecDNA, respectively. The sum of pixel areas was calculated for each group of ecDNA foci and used to calculate tethered ecDNA fractions.
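The final fraction can be illustrated with a small Python sketch operating on per-focus measurements. The tuple layout is hypothetical, and two points are assumptions not stated in the text: a focus meeting both overlap criteria is counted as tethered, and the denominator is the summed area of tethered plus untethered foci.

```python
def tethered_fraction(foci, nuclear_min=0.30, cytoplasmic_min=0.70):
    """foci: iterable of (area_px, nuclear_overlap, cytoplasmic_overlap).

    Returns summed tethered area / (tethered + untethered area), or NaN if no
    focus meets either criterion.
    """
    tethered = untethered = 0.0
    for area, nuclear_overlap, cytoplasmic_overlap in foci:
        if nuclear_overlap >= nuclear_min:
            tethered += area          # assumption: nuclear criterion takes precedence
        elif cytoplasmic_overlap >= cytoplasmic_min:
            untethered += area
    total = tethered + untethered
    return tethered / total if total else float("nan")
```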

Evolutionary modelling of ecDNAs

To simulate the effect of retention and selection on ecDNA copy number in growing cell populations, we implemented a new forward-time simulation in Cassiopeia⁸⁵ (https://github.com/YosefLab/Cassiopeia). The simulation framework builds on a previously described forward-time evolutionary model⁶. Specifically, each simulation tracked the copy-number trajectory of a single ecDNA and was initially parameterized using the following factors: (1) initial ecDNA copy number (denoted as k_init); (2) selection coefficients for cells with no ecDNA (s₀) or at least one copy of ecDNA (s₁); (3) a base birth rate (λ_base = 0.5); (4) a death rate (µ = 0.33); and (5) a retention rate ($\nu \in [0,1]$) that controls the efficiency of passing ecDNA on from generation to generation.

Starting with the parent cell, a birth rate is defined on the basis of the selection coefficient acting on the cell (s = s₀ or s₁, depending on its ecDNA content) as λ₁ = λ_base × (1 + s). Then, a waiting time to a cell division event is drawn from an exponential distribution: t_b ∼ exp (–λ₁). Simultaneously, a time to a death event is also drawn from an exponential distribution: t_d ∼ exp (–µ). If t_b < t_d, a cell division event is simulated and a new edge is added to the growing phylogeny with edge length t_b; otherwise, the cell dies and the lineage is stopped. We repeated this process until 25 time units were simulated and at least 1,000 cells were present in the final population.
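The competition between division and death for a single cell can be sketched in plain Python. This is not the Cassiopeia implementation; the function name and return format are hypothetical, and the exponential waiting times are drawn with rates λ₁ and µ using the standard library.

```python
import random

LAMBDA_BASE = 0.5  # base birth rate from the text
MU = 0.33          # death rate from the text


def next_event(ecdna_copies, s0, s1, rng):
    """Draw the next event for one cell: ('division', t_b) or ('death', t_d)."""
    s = s1 if ecdna_copies > 0 else s0       # selection depends on ecDNA content
    birth_rate = LAMBDA_BASE * (1 + s)       # lambda_1 = lambda_base x (1 + s)
    t_birth = rng.expovariate(birth_rate)    # waiting time to division
    t_death = rng.expovariate(MU)            # waiting time to death
    if t_birth < t_death:
        return "division", t_birth
    return "death", t_death
```

Under this scheme a cell divides before dying with probability λ₁/(λ₁ + µ), which is about 0.60 for s = 0 and rises as s₁ increases.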

During cell division, ecDNAs are split among daughter cells according to the retention rate, ν, and the ecDNA copy numbers of the parent cell. Following previous observations of ecDNA inheritance⁵, ecDNA is divided into daughter cells according to a random binomial process after considering the number of copies of ecDNA that are retained during mitosis. Specifically, with n_i being the number of ecDNA copies in daughter cell $i$ and N being the number of copies in the parental cell:

$${n}_{1}=\mathrm{Binomial}(2N\nu ,\,0.5)$$

$${n}_{2}=2N\nu -{n}_{1}$$

where Binomial is the binomial probability distribution.
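The segregation step can be sketched in plain Python, reading the number of binomial trials as the product 2Nν (the replicated copies that are retained). Rounding 2Nν to the nearest integer is an assumption made here so that the binomial draw is well defined; the text does not state how non-integer values are handled, and the function name is hypothetical.

```python
import random


def segregate(parent_copies, retention, rng):
    """Split retained ecDNA copies between two daughter cells.

    n1 ~ Binomial(2 * N * nu, 0.5); n2 = 2 * N * nu - n1.
    """
    retained = round(2 * parent_copies * retention)  # assumption: nearest integer
    n1 = sum(rng.random() < 0.5 for _ in range(retained))  # binomial via coin flips
    n2 = retained - n1
    return n1, n2
```

With ν = 1 every replicated copy is passed on, whereas with ν = 0.5 the two daughters together inherit only as many copies as the parent carried before replication.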

In our experiments, we simulated populations of at least 1,000 cells over 25 simulated time units across ecDNA selection coefficients ${s}_{1}\in [0,0.8]$ (where s₁ = 0 indicates no selective advantage for cells with ecDNA) and ecDNA retention rates $\nu \in \{0.5,\,0.6,\,0.7,\,0.8,\,0.9,\,0.95,\,0.97,\,0.98,\,0.99,\,1.0\}$. Selection on cells with no ecDNA was kept at s₀ = 0. We simulated ten replicates per parameter combination and assessed the mean copy number and frequency of ecDNA-positive cells for each time step.

Analysis of ecDNA sequences in patient tumours

Focal amplification calls predicted by AmpliconArchitect⁸⁶ from tumour samples in The Cancer Genome Atlas and the Pan-cancer Analysis of Whole Genomes cohorts were downloaded from the AmpliconRepository (https://ampliconrepository.org)⁸⁷. A dataset was constructed for ecDNA, BFB and linear amplicons containing the following information for every amplified genomic interval in each amplicon: the corresponding sample, the amplicon number (in that sample), the amplicon ID (assigned in AmpliconRepository), the amplicon classification (ecDNA, BFB or linear), the chromosome, the start and end coordinates, the width, the number of overlapping retention elements and the overlapping oncogenes.

Local retention element density was also computed in R for each amplified interval by dividing the number of retention elements found within 2.5 Mb of the midpoint of the interval by the local window width (5 Mb). Local retention element density was calculated for each amplicon as an average of the local densities of the intervals, weighted by the interval width.
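The density calculation can be illustrated as follows. The original computation was done in R; this Python sketch assumes that each retention element is represented by a single coordinate on the same chromosome as the interval, and it reports density per Mb. The function names and input layout are hypothetical.

```python
WINDOW_BP = 5_000_000        # local window width (5 Mb)
HALF_WINDOW_BP = WINDOW_BP // 2


def local_density(interval, element_positions):
    """Retention elements within 2.5 Mb of the interval midpoint, per Mb of window."""
    start, end = interval
    midpoint = (start + end) / 2
    n = sum(1 for pos in element_positions if abs(pos - midpoint) <= HALF_WINDOW_BP)
    return n / (WINDOW_BP / 1_000_000)


def amplicon_density(intervals, element_positions):
    """Average of the interval densities, weighted by interval width."""
    total_width = sum(end - start for start, end in intervals)
    weighted = sum(
        local_density((start, end), element_positions) * (end - start)
        for start, end in intervals
    )
    return weighted / total_width
```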

To analyse co-amplification of retention element-negative intervals with retention element-positive intervals, all amplified intervals that lacked retention elements were first identified. If the amplicon corresponding to a given interval contained other intervals with retention elements, then the amplicon was considered co-amplified. Each amplicon was only counted once, regardless of the number of co-amplified retention element-negative intervals. The percentage of amplicons with a co-amplification event was computed for each amplicon class, and P values were calculated between classes using a one-sided test of equal proportions.
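The counting rule can be sketched in Python as below (the statistical test is omitted). The tuple layout is hypothetical, and taking all amplicons of a class as the denominator of the percentage is an assumption, because the text does not state the denominator explicitly.

```python
def coamplification_percent(intervals):
    """intervals: iterable of (amplicon_id, amplicon_class, n_retention_elements).

    An amplicon is counted once as co-amplified if it has at least one interval
    without retention elements and at least one interval with them.
    Returns {amplicon_class: percentage of amplicons with co-amplification}.
    """
    has_negative, has_positive, amplicon_class = {}, {}, {}
    for amplicon_id, cls, n_elements in intervals:
        amplicon_class[amplicon_id] = cls
        if n_elements == 0:
            has_negative[amplicon_id] = True
        else:
            has_positive[amplicon_id] = True
    totals, hits = {}, {}
    for amplicon_id, cls in amplicon_class.items():
        totals[cls] = totals.get(cls, 0) + 1  # assumption: all amplicons in the class
        if has_negative.get(amplicon_id) and has_positive.get(amplicon_id):
            hits[cls] = hits.get(cls, 0) + 1
    return {cls: 100 * hits.get(cls, 0) / n for cls, n in totals.items()}
```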

Predicted ecDNA amplicon intervals containing EGFR and CDK4, the two most frequently amplified oncogenes in AmpliconRepository samples, were analysed for co-amplification of oncogenes with retention elements. For each oncogene-containing ecDNA interval, 100 random oncogene-containing intervals of the same width were simulated by varying the starting point of the amplified region. For each retention element located within 500 kb of the midpoint of the genomic coordinates of the oncogene, the frequency of inclusion of that retention element in observed oncogene-containing ecDNA intervals was compared with the expected frequency based on the random intervals. Enrichment was computed as a fold change of the observed frequency compared with the expected frequency. P values comparing the distributions were calculated in R using a two-sided Fisher’s exact test and adjusted for multiple comparisons with the Benjamini–Hochberg method.
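The observed-versus-expected comparison for one retention element can be sketched as follows. The original analysis was done in R; this Python version omits the Fisher's exact test and multiple-testing adjustment. The function name and inputs are hypothetical, and sampling the start of each random interval uniformly among positions that keep the whole oncogene inside the interval is an assumption about how the starting point was varied.

```python
import random


def inclusion_enrichment(observed, oncogene, element_pos, n_random=100, seed=0):
    """Fold enrichment of a retention element in oncogene-containing intervals.

    observed: list of (start, end) ecDNA intervals that contain the oncogene.
    oncogene: (start, end) of the oncogene; element_pos: element coordinate.
    Returns (observed_frequency, expected_frequency, fold_change).
    """
    rng = random.Random(seed)
    onc_start, onc_end = oncogene
    obs_hits = exp_hits = n_sim = 0
    for start, end in observed:
        obs_hits += start <= element_pos <= end
        width = end - start
        for _ in range(n_random):
            # random interval of the same width that still contains the oncogene
            sim_start = rng.randint(onc_end - width, onc_start)
            exp_hits += sim_start <= element_pos <= sim_start + width
            n_sim += 1
    obs_freq = obs_hits / len(observed)
    exp_freq = exp_hits / n_sim
    fold = obs_freq / exp_freq if exp_freq else float("inf")
    return obs_freq, exp_freq, fold
```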

DNA methylation analysis in nanopore sequencing data

Nanopore sequencing data of GBM39 cells were published in another study⁸⁸ and deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession PRJNA1110283. Bases were called from fast5 files using guppy (Oxford Nanopore Technologies, v.5.0.16) in Megalodon (v.2.3.3), and DNA methylation status was determined using Rerio basecalling models with the configuration file ‘res_dna_r941_prom_modbases_5mC_v001.cfg’ and the following parameters: “--outputs basecalls mappings mod_mappings mods per_read_mods --mod-motif m CG 0 --write-mods-text --mod-output-formats bedmethyl wiggle --mod-map-emulate-bisulfite --mod-map-base-conv C T --mod-map-base-conv Z C”. In downstream analyses, methylation status was computed over 1-kb intervals for retention elements and other matched-size intervals in the EGFR ecDNA.

CRISPRoff

CRISPRoff experiments were performed as described previously⁵¹, with modifications. In brief, we first cloned a plasmid (cargo plasmid) that simultaneously expresses five guides targeting the five unmethylated retention element sequences found on the EGFR ecDNA of the GBM39 cell line under U6 promoters in an array format using a previously described CARGO approach⁸⁹ (guide sequences are provided in Supplementary Table 1). We also cloned a second plasmid (NTC plasmid) containing only a single LacZ-targeting guide, with expression also driven by a U6 promoter, as a nontargeting control. The cargo plasmid or the NTC plasmid was co-transfected with the CRISPRoff-v2.1 plasmid (Addgene, 167981) into 1.5 × 10⁷ GBM39 cells using the Neon transfection system in accordance with the manufacturer’s protocols. In brief, cells were dissociated to a single-cell suspension with 0.5× TrypLE, counted, centrifuged at 300g for 5 min and washed twice with PBS before resuspension in Neon resuspension buffer to a density of 4.2 × 10⁶ in 70 µl of buffer; 14 µg CRISPRoff-v2.1 and 7 µg cargo or NTC plasmids were also mixed with Neon resuspension buffer to a total volume of 70 µl. Next, 70 µl of cell suspension and 70 µl of plasmids were mixed and electroporated according to the manufacturer’s instructions using a 100 µl Neon pipette tip under the following settings: 1,250 V, 25 ms, 2 pulses. Five electroporation reactions were pooled per replicate of each condition and cultured in T75 flasks. Cells were further cultured for 2 days, and double-positive cells (mCherry from the cargo plasmid and BFP from CRISPRoff-v2.1, or eGFP from the NTC plasmid and BFP from CRISPRoff-v2.1) were sorted using a BD Aria II instrument. The sorted cells were immediately plated on laminin-coated coverslips in a 24-well plate at a density of 1 × 10⁵ in 450 µl medium in preparation for imaging (see the section ‘Imaging validation of CRISPRoff’). The remaining sorted cells were cultured for an additional 3 days and collected for gDNA extraction using a DNeasy Blood & Tissue kit (Qiagen, 69504). ecDNA levels were quantified by WGS (see the section ‘WGS’).

Imaging validation of CRISPRoff

Two days after sorting, a total of 100,000 cells were seeded onto laminin (10 µg ml^–1)-coated 12 mm circular coverslips for each transfection condition. Cells were allowed to recover for another 24 h. Cells were washed once with PBS and fixed with 4% paraformaldehyde at room temperature for 10 min, followed by permeabilization with 1× PBS containing 0.5% Triton-X for another 10 min at room temperature. To further enhance fixation and permeabilization, three additional washes with Carnoy’s fixative (3:1 methanol and glacial acetic acid) were performed. The samples were then rinsed briefly with 2× SSC buffer and subjected to dehydration with ascending ethanol concentrations of 70%, 85% and 100%. The coverslips were completely air-dried before the application of a FISH probe mixture (Empire Genomics), which comprised 0.25 µl EGFR FISH probe and 4 µl hybridization buffer. The samples were denatured at 75 °C for 3 min and then hybridized overnight at 37 °C in a humidified, dark chamber. Following hybridization, the coverslips were transferred to a 24-well plate and washed once with 0.4× SSC, then 2× SSC 0.1% Tween-20 and then 2× SSC, for 2 min each. DAPI (5 ng ml^–1) was applied to the samples for 2 min to counterstain nuclei. The samples were then washed with 2× SSC and ddH₂O before air drying and then mounted with ProLong Diamond. The samples were imaged on a Leica DMi8 wide-field microscope using a ×63 oil objective lens. z stacks were acquired (total range = 10 µm, step size of 0.27 µm, 38 steps) and subjected to small volume computational clearing on LAS X software. ImageJ was used to generate maximum-intensity projections for image analysis to quantify total EGFR FISH copy number per nucleus.

To quantify total EGFR FISH copy number per nucleus, deep-learning-based pixel classifiers were trained on the DAPI and EGFR FISH channels to create a smart segmentation and confidence mask, respectively, using Aivia Software (Leica Microsystems). The masks were used to create a protocol to segment FISH foci and assign FISH foci to their corresponding nucleus. The following measurements were exported for quantification: area, circularity and cell ID for nuclei; area and cell ID for FISH foci. Dead cells and mis-segmented cells, identified as nuclei with areas >200 or <75 or with circularities <0.7, were excluded from the analysis. The number of cells with untethered FISH foci (that is, FISH foci that were not within the nucleus boundaries in viable cells) was manually counted for each transfection condition.
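The exclusion criteria can be illustrated with a short Python sketch applied to the exported measurements. The tuple layouts and function names are hypothetical, area values are taken in whatever units Aivia exported (the text does not give them), and counting foci per retained nucleus is an assumption about how copy number was summarized.

```python
def keep_nucleus(area, circularity, min_area=75, max_area=200, min_circularity=0.7):
    """True for nuclei that are neither dead nor mis-segmented by the stated cut-offs."""
    return min_area <= area <= max_area and circularity >= min_circularity


def foci_per_nucleus(nuclei, foci):
    """nuclei: iterable of (cell_id, area, circularity); foci: iterable of (cell_id, area).

    Returns {cell_id: number of FISH foci} for nuclei that pass the filter.
    """
    counts = {cid: 0 for cid, area, circ in nuclei if keep_nucleus(area, circ)}
    for cid, _focus_area in foci:
        if cid in counts:          # foci assigned to excluded nuclei are dropped
            counts[cid] += 1
    return counts
```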

WGS

WGS libraries were prepared by DNA tagmentation as previously described⁶. We first transposed gDNA from sorted CRISPRoff cells with Tn5 transposase produced as previously described⁶² in a 50-µl reaction with TD buffer⁶³, 10 ng DNA and 1 µl transposase. The reaction was performed at 50 °C for 5 min, and transposed DNA was purified using a MinElute PCR Purification kit (Qiagen, 28006). Libraries were generated through 7 rounds of PCR amplification using NEBNext High-Fidelity 2× PCR master mix (NEB, M0541L) with primers bearing i5 and i7 indices, purified using a SPRIselect reagent kit (Beckman Coulter, B23317) with double-sided size selection (0.8× right, 1.2× left), quantified using an Agilent Bioanalyzer 2100, diluted to 4 nM and sequenced on an Illumina NextSeq 550. Adapter content was trimmed from reads using Trimmomatic⁶⁴ (v.0.39), reads were aligned to the hg19 genome using BWA MEM (v.0.7.17-r1188)⁶⁵, and PCR duplicates were removed using MarkDuplicates in Picard (v.2.25.3).

Plasmid in vitro methylation

To measure the effects of CpG methylation on retention element activity on a plasmid, we performed in vitro methylation of plasmids using M.SssI (NEB, M0226M) for 4 h at 37 °C. Plasmids were then extracted using phenol–chloroform and precipitated using ethanol. Purified plasmids were transfected into cells and assayed using qPCR or live-cell imaging as described above in the sections ‘qPCR analysis of plasmid retention’ and ‘Live-cell imaging’, respectively.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.