Long-read metagenomics reveals phage dynamics in the human gut microbiome

November 27, 2025

4

Study population and ethics statement

For time point 1, stool samples were collected from six adult volunteers living in the Bay Area, CA, USA. This study involving humans was approved by our institutional internal review board (Stanford IRB 42043; principal investigator: A.S.B.), and informed consent was obtained from all participants. For time point 2, the same individuals were recontacted for a second stool donation, 2 years after the initial one.

The samples for time point 1 were included in the publication of Maghini et al.⁶², and short-read metagenomic sequencing reads are available at the National Center for Biotechnology Information’s Sequence Read Archive under the identifier PRJNA940499.

Sample collection and processing

Stool samples were collected without a preservative and stored at −80 °C. All DNA extractions were performed using the QIAamp PowerFecal Pro DNA Kit (QIAGEN; 51804) according to the manufacturer’s instructions, with the exception of using the EZ-Vac Vacuum Manifold (Zymo Research) instead of centrifugation. DNA concentration was measured using a Qubit 3.0 Fluorometer (Thermo Fisher Scientific) with the dsDNA High Sensitivity kit.

Metagenomic short-read sequencing

The samples from T1 had already been sequenced; therefore, we generated new libraries only for samples from T2. Metagenomic sequencing libraries were pooled, and 2 × 150 bp reads were generated using the NovaSeq 6000 platform (Illumina; 20012850) to a final depth of 6 Gb per sample.

Metagenomic long-read sequencing

All samples from both time points underwent long-read metagenomic sequencing using the ONT platform. DNA fragment distribution was assessed using a TapeStation (Agilent; G2992AA). Samples with apparent fragmentation were cleaned up using a bead-based protocol before library preparation⁶³.

Libraries were prepared using the Native Barcoding Kit V24 (ONT; SQK-NBD114.24) using 1,000 ng of DNA as input. In the pooling step, four samples were combined, resulting in three total libraries. The libraries were loaded onto PromethION R10.4.1 flow cells (ONT; FLO-PRO114M) and sequenced until exhaustion of the flow cells.

Short-read data processing

For short-read sequencing, all raw reads were processed with our in-house NextFlow pipeline (v.22.10.5; ref. ⁶⁴; https://github.com/bhattlab/bhattlab_workflows_nf). In short, reads were deduplicated using HTStream SuperDeduper (v.1.3.3), and low-quality bases were trimmed using TrimGalore (v.0.6.7). Reads were then mapped against the human genome (hg38) using bwa (v.0.7.17 31), and all matching reads were discarded. For comparability, T1 samples were downsampled to final library sizes randomly drawn from the distribution of T2 library sizes after preprocessing.

For each sample, metagenomic assembly was performed using MetaHIT (v.1.2.9), and genes were predicted using Bakta (v.1.8.2). Assemblies were binned into draft genomes using MetaBAT (v.2.5), CONCOCT (v.1.1.0) and MaxBin (v.2.2.7), followed by bin consolidation using DAS Tool (v.1.1.6). Bin quality was assessed using CheckM (v.1.2.2), and taxonomic classification was performed using GTDB-Tk (v.2.3.0) using the Genome Taxonomy Database (GTDB) r214.

Long-read data processing

For long-read sequencing, POD5 files were basecalled and de-multiplexed using Dorado (v.0.5.3) using the ‘super-high accuracy’ model (v.3.4.0) to create the final set of FASTQ files. Read quality and length distribution were assessed using NanoPlot (v.1.41.6) before and after the removal of human reads (read mapping against the human genome (v.38) using minimap2 (v.2.26-r1175)). Metagenomic assembly was performed using metaFlye (v.2.9.2-b1786) using the nano-hq flag to use only reads of quality Q20 or higher for initial assembly. Binning and taxonomic classification of bins were performed as described for short reads. The workflows for long-read data processing are also available at https://github.com/bhattlab/bhattlab_workflows_nf.

To better compare the short and long reads, the long reads were subsampled to the same mean depth as the short reads using a custom script, which randomly selected read IDs from the processed reads up to a specified amount of total sequencing. Seqtk subseq (v.1.4-r130) was used to subsample the original FASTQ files for the selected read IDs. Read quality and length distribution were assessed with NanoPlot (v.1.41.6). Reads were assembled and binned using the same workflow described above.

Phage prediction

To predict phages, we applied geNomad (v.1.7.6; ref. ²³), VIBRANT (v.1.2.1; ref. ²¹), VirSorter2 (v.2.2.4; ref. ²²) and Cenote-Taker3 (v.3.4.0; ref. ²⁴) to the short-read and long-read assemblies for each metagenomic sample. For each predicted phage, we used CheckV (v.1.0.1; ref. ³⁴) to assess the quality and to adjust the boundary predictions on the basis of CheckV host trimming. VIBRANT, geNomad, Cenote-Taker3 and CheckV are included in the NextFlow project available at https://github.com/bhattlab/bhattlab_workflows_nf. VirSorter2 was run separately because it relies on Snakemake (v.5.26.0) for execution. All predicted phages, including full phage contigs and predicted prophages, were collated from the four tools.

Comparison across short-read and long-read sequencing

To compare the phage predictions across short-read and long-read sequencing, we mapped short-read contigs against the (subsampled) long-read assembly using blast+ (v.2.2.31), filtering alignments for identity (99%) and query coverage (90%). To determine if an integrated long-read phage was fully covered by short-read contigs, we required a single short-read contig to align to at least 95% of the phage region. To determine the short-read coverage of integrated phages, we mapped the short reads to the long-read assembly using Bowtie 2 (v.2.5.4) and calculated the per-base coverage using SAMtools (v.1.21).

To compare short-read and long-read phage predictions in more detail, we adapted the segment overlap metric (originally developed to measure overlaps between predicted biosynthetic gene clusters⁶⁵) to measure which fraction of long-read phages was covered by the predicted short-read phages and vice versa (Extended Data Fig. 4). This metric calculates recall by considering each long-read phage prediction as a positive instance. A long-read phage was considered a true positive if covered (up to a variable cutoff of x%) by alignments of one or more short-read phage contigs; otherwise, it was classified as a false negative. Recall, defined as true positives over the sum of all positives, quantifies the fraction of long-read phages that are found by short-read sequencing. Precision was defined on the basis of the short-read phages; those overlapping (to at least x%) a long-read phage were considered true positives, whereas those not overlapping were considered false positives. Precision was then calculated as true positives over the sum of true and false positives, quantifying the fraction of short-read phages found by long-read sequencing.

Clustering of phages across time points

Prophages were clustered across time points within each individual by mapping all phages of T1 and T2 against each other using blast+ (v.2.2.31), and genome identity and coverage were computed using the CheckV companion scripts³⁴. To cluster the host regions, we extracted 20-kb regions on either side of integrated prophages, concatenated these phage-surrounding regions and performed the same clustering analysis with the CheckV companion scripts. Additionally, we mapped the original phage region and their host regions against the assembly of the other time point to prevent erroneous classification of dynamic phages not annotated in the other time point. Finally, we quantified the median read coverage in both time points for all phage regions and filtered out all phages with a median coverage of less than ten in both time points.

Using these clustering results, we iteratively consolidated the clustering in the following way (Extended Data Fig. 5): phages with high identity (98%) and genome coverage (90%) were clustered together. If the hosts clustered together as well, they were classified as stably integrated; otherwise, they were classified as phages found in several host contexts. If two phages failed to cluster together but their host regions did, we relaxed the cutoff for genome coverage to classify them as stably integrated because we observed large SVs between phages integrated into the same host context. For all singleton clusters, we checked whether the host region or the phage itself had a high-identity and high-coverage mapping to the assembly of the other time point. Because we observed many false-positive dynamic phages at contig edges, we discounted singleton phages found at the edges of contigs. If a phage and its host region mapped well to the same contig in the other time point, we classified this phage as stably integrated; otherwise, we classified them as either dynamic phages or singletons lost/gained with their host if we found the host region in the assembly of the other time point.

Strain replacement analysis

To evaluate the relationship between paired MAGs within individuals, we identified all high-quality and high-coverage bins that shared the same taxonomic classification on the basis of GTDB-Tk (v.2.3.0) across time points. High-coverage bins were quantified by calculating the median contig coverage on the basis of metaFlye (v.2.9.2-b1786) output within each bin. The shared ANI of paired bins was calculated using FastANI (v.1.34; ref. ⁶⁶) and default one-to-one parameters. Mean proportion of strain replacement across individuals was then calculated by setting a minimum shared ANI for the same strain and determining the number of paired MAGs that fall above this value per individual. This was done using an ANI cutoff of all unique ANI values to calculate the strain replacement for different ANI cutoffs.

Identification of SVs overlapping predicted phage regions

To find SVs in our data, we mapped the reads of each time point against the assembly of the other time point within an individual using NGMLR (v.0.2.7; ref. ⁶⁷). Binary Alignment Map files were sorted with SAMtools (v.1.9), and structural variants were identified using Sniffles2 (v.2.2; ref. ³⁶).

Assignment of host taxonomy for integrated prophages

We used iPHoP (v.1.3.3; ref. ³⁵) to predict hosts for all phages using default parameters and database version iPHoP_db_Aug23_rw. For integrated long-read phages, we annotated their hosts by classifying each annotated gene on a given contig using the mmseqs taxonomy module from MMseqs2 (v.14.7e284; ref. ⁶⁸). This module provides a taxonomic annotation on the basis of the GTDB (v.214.1; ref. ⁶⁹) database for each gene. For each contig, we then combined the annotations for all genes not located in predicted phage regions at each taxonomic level. Annotations were accepted if more than 50% of genes agreed, disregarding genes without taxonomic annotation.

The iPHoP, MMseqs2-based and binning taxonomic assignments were evaluated for consensus at each taxonomic level (Extended Data Fig. 4). To determine a subset of integrated phages for which we had high-confidence host assignment, we filtered our results for agreement between high-quality bins (more than 90% completeness and less than 5% contamination; ref. ⁷⁰) and MMseqs2-based assignment down to the family level.

Clustering of phages on species level for host-range analysis

To evaluate host range within phage species, we clustered all geNomad-annotated phages in all samples at a minimum of 95% ANI and 80% alignment fraction using CheckV (v.1.0.1; ref. ³⁴) supporting scripts. We specified a minimum of 80% query and target coverage (–min_tcov 80 –min_qcov 80) because we expect phages to be assembled more contiguously by long reads. The resulting cluster membership information was merged with the host annotations from our MMseqs2 taxonomy approach and bin taxonomy, as described above.

To visualize phages found in several bacterial families, we compared bins using AliTV⁷¹, filtering alignments by length and nucleotide identity. Additionally, we calculated their read coverage (removing reads less than 500 bp in length and with more than 1% mismatches to prevent spurious mappings) and the number of read alignment ends per genomic position. A high number of read alignment ends could point towards potential mis-assemblies, as explored in a previous study⁴⁰.

To gain orthogonal evidence for the integration of the same phage species into distinct bacterial families, we also assembled our data using myloasm (v.0.1.0; ref. ⁴¹), an alternative assembler to Flye. We used standard parameters, except for –min-reads-contig 3. For all putative phage species in distinct bacterial families, we identified the corresponding myloasm contig and tried to verify the prophage integration (Supplementary Table 2). In the case of two clusters, the phages were found at the edges of contigs, but their integration into the bacterial species could be verified by the broader host context derived from the myloasm assembly (Extended Data Fig. 8).

Gene content for integrated prophages

To assess gene content across all integrated prophages with structural variant evidence, we used pharokka (v.1.7.3) against the pharokka (v.1.4.0) database⁷² with the –meta, –skip-mash and –split flags.

Synteny and gene content visualization

Visualization of all gene content and synteny was done using LoVis4u (v.0.1.4.1; ref. ⁷³). Two of the 22 IScream phages were smaller than 10 kb and therefore were removed from this analysis. For IScream phage synteny visualization, gff files of the identified full-length IScream phages, generated by pharokka annotation (described above), were used as input for LoVis4u visualization with default configuration. For host context visualization, gff files from Bakta were parsed to visualize specified windows and to convert them to a format compatible with LoVis4u. The reformatted gff3 files were used as input for LoVis4u visualization using an updated configuration file specifying mmseqs_min_seq_id = 0.8.

Identification of potential integrase enzymes

To identify potential integrase enzymes in our assemblies, we used a set of Pfam hidden Markov models described in an earlier exploration of mobile genetic elements⁷⁴: PFPF07508 for large serine recombinases; PF00239 for small serine recombinases; PF00589 for tyrosine recombinases; and PF00665, PF13333 and PF13683 for DDE recombinases. We used the hmmsearch command (with the —cut_ga flag) from HMMER (v.3.4; ref. ⁷⁵) against all predicted proteins. We then annotated each phage by counting which type of potential integrase was present within the phage boundaries using only phages with evidence from SVs.

Identification of IScream phages in MGV and Clostridia genomes

To explore the prevalence and abundance of IScream phages, we searched for potential IScream phages in public datasets. We ran ISEScan (v.1.7.2.3; ref. ⁷⁶) on all viral genome assemblies from MGV and identified genomes that contained one IS30 element starting less than 1 kb from the beginning of the assembly and one IS30 element ending less than 1 kb from the end of the assembly.

We next searched for existing bacterial isolates containing integrated IScream phages. We downloaded all 2,160 class Clostridia genome assemblies annotated as complete or chromosomal level using the NCBI Datasets tool. We ran ISEScan and geNomad on all assemblies and identified genomes that contained a geNomad-annotated prophage region with one IS30 element starting less than 1 kb from the beginning of the prophage region and one IS30 element ending less than 1 kb from the end of the prophage region, which included B. hansenii ATCC 27552 (see Supplementary Table 3 for a complete list of potential IScream phages).

Additionally, we analysed the species-level taxonomic profiles for two public datasets, generated with Phanta (v.1.0; ref. ⁴⁸), which included a viral database built on representative genomes from MGV (see the original Phanta publication for details about data processing). Each phage species from this database was classified as potential IScream phage if a genome contained in this species bin was identified to be an IScream phage.

The dataset from Liang et al.⁴⁷ included bulk metagenomic sequencing and VLP-enriched sequencing of infants. In this dataset, we quantified the relative taxonomic abundance of IScream-containing phage species in paired bulk and VLP sequencing to identify potential particle formation of IScream phages. Finally, the non-cancer control samples from the dataset of Yachida et al.⁷⁷ was used to calculate prevalence and mean abundance of phage species in healthy individuals.

Clustering of IS30 elements

To cluster IS30 elements across IScream phages, we used the ete3 toolkit (v.3.1.3; ref. ⁷⁸). First, we built a tree for all IS30 proteins on all IScream phages using the ‘standard_fasttree‘ workflow from ete3, consisting of a multiple sequence alignment with Clustal Omega (v.1.2.4) and tree construction using FastTree (v.2.1.8) (Extended Data Fig. 10). Because this analysis showed a clear separation between outward-directed and inward-directed IS30 proteins, we subsequently focused only on the outward-directed IS30. To gain insights into the evolutionary history of IScream phages, we extracted all outward-directed IS30 proteins from our IScream phages, all potential MGV IScream phages and all bona fide bacterial IS30 elements from the ISfinder database. For the MGV phages, we filtered the outward-directed IS30 proteins to be longer than 600 and shorter than 2,200 nucleotides. All proteins were clustered at 70% amino acid similarity over 80% of the alignment, using MMseqs (v.14.7e284; ref. ⁷⁹). A tree was then constructed on the cluster representatives using the ‘standard_fasttree’ workflow in ete3, modified to trim positions in the multiple sequence alignment with more than 90% gaps using trimAl (v.1.4.rev6).

Screening for potential phage IS30 elements in ancient stool metagenomes

To identify potential IScream phages in ancient stool samples, we downloaded the raw data from Wibowo et al.⁴⁶, containing sequencing of desiccated paleofecal samples (1,000–2,000 years old) from the southwestern USA and Mexico. Raw reads were processed and assembled as described above, phages were identified with geNomad, and IS elements were detected with ISEScan. For all contigs that contained IS30 elements within predicted phages, we assessed their DNA damage level with DamageProfiler (v.1.1; ref. ⁸⁰). To prevent inclusion of modern IScream phage IS30 proteins, we filtered the phage-overlapping IS30 proteins for being on contigs recruiting more than 1,000 reads and showing an estimated 5′C>T damage level or an estimated 3′G>A damage level over 1%, resulting in six potential IScream IS30 open reading frames. Note that these filtering steps are rather specific because many more IS30 proteins are identified on shorter contigs, which are not predicted to be of phage origin. The six ancient IS30 proteins were added to the clustered IS30 proteins from MGV, IScream phages and ISfinder, and the tree was recomputed as described above.

Culturing of the B. hansenii IScream phage

B. hansenii (ATCC 27552) was grown anaerobically (90% nitrogen, 5% carbon dioxide and 5% hydrogen) in an anaerobic chamber (Sheldon Manufacturing) in Brain Heart Infusion (Sigma) supplemented (BHIS) with hemin (5 μg ml⁻¹), l-cysteine (1 mg ml⁻¹) and sodium bicarbonate (0.2%).

B. hansenii was grown overnight in pre-reduced BHIS. The culture was pelleted by means of centrifugation (Eppendorf; 5920R) at 4,000g for 15 min, and DNA was extracted using a DNeasy Blood and Tissue Kit (QIAGEN; 69504) following the manufacturer’s instructions for Gram-positive bacteria. Bacterial genome sequencing was performed by Plasmidsaurus using ONT with custom analysis and annotation.

PEG precipitation of B. hansenii phage particles

B. hansenii was grown overnight in 4 l of BHIS. The supernatant from the culture was harvested by means of centrifugation (Eppendorf; 5920R) at 4,198g for 15 min at 4 °C. Sodium chloride was added to the supernatant to a final concentration of 5 M, and PEG 8000 was added to a final concentration of 10%. This solution was stirred for 30 min at 4 °C to dissolve and then left overnight at 4 °C with no agitation. PEG-precipitated phage particles were harvested by centrifugation at 10,000g for 10 min at 4 °C. The supernatant was removed, and the resulting pellets were air-dried for 3–5 min in an inverted position. The pellets were combined and resuspended in a total of 10 ml of SM buffer (100 mM NaCl, 8 mM MgSO₄·7H₂O and 50 mM Tris–HCl (pH 7.5)). An equal volume of chloroform was added, mixing by inverting and centrifuged at 12,000g for 10 min at 4 °C. The aqueous phase was collected, and chloroform treatment was repeated.

PCR validation of phage particles from B. hansenii IScream phage

PCR primers were designed using NCBI Primer-BLAST under default settings. The PCR product size was targeted to be between 150 bp and 250 bp. PCR primers were designed to target (1) B. hansenii gmk gene; (2) an internal region of the IScream phage; and (3) the upstream junction site of the integrated phage and bacterial chromosome, as well as to span the junction of circularization of the IScream phage region (Fig. 4e; the primers are listed in Supplementary Table 4). PCRs were performed on PEG-precipitated samples directly with and without DNase treatment. For the DNase-treated samples, 50 μl of PEG Prep was treated with 5 μl of 10X TURBO DNase buffer, 1.13 μl of TURBO DNase (Invitrogen; AM2239) and 1 μl of RNase (1 mg ml⁻¹; Invitrogen; AM2270), incubated at 37 °C for 1 h and then heat inactivated at 70 °C for 10 min.

PCRs were performed using Q5 high-fidelity DNA polymerase (annealing temperature of 67 °C, annealing time of 30 s and extension time of 5 s at 72 °C). PCRs were run on a 2% agarose gel. A raw gel image is shown in Supplementary Fig. 1.

Transmission electron microscopy of B. hansenii lysate

A 500-ml culture of B. hansenii was grown overnight, as described above. The supernatant from the culture was harvested after 24 h by means of centrifugation (Eppendorf; 5920R) at 4,198g for 15 min at 4 °C. The supernatant was then filtered using a 0.22-μm filter (Fisher Scientific; 09-719C). The supernatant (5 μl) was placed on glow-discharged 200-mesh carbon/Formvar-coated Cu grids (FF300-Cu) and allowed to settle for 3 min. Grids were washed by touching the sample side with two drops of water. Three drops of 1% uranyl acetate in double-distilled water were then added, and the third drop was incubated on the grid for 1 min at room temperature. The remainder of the last drop was removed with filter paper, and the grid was allowed to dry. The samples were then observed on the JEOL JEM-1400 transmission electron microscope at 120 kV, and photographs were taken using a Gatan Orius digital camera.

Statistical analysis and visualization

All statistical tests were performed in R (v.4.2.2). Data visualization was performed using ggplot2 (v.3.5.1), which is part of the tidyverse (v.2.0.0) suite of tools⁸¹.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.