Birth of protein folds and functions in the virome

August 26, 2024

Preparation of protein sequences

Protein sequences for eukaryotic viruses present in RefSeq⁵⁴ were collected through the NCBI Viruses portal (https://www.ncbi.nlm.nih.gov/labs/virus) in July 2022. GenPept files were downloaded for viruses that were annotated by NCBI as having a eukaryotic host. Because not all viruses have a host labelled by NCBI, GenPept files of human-infecting viruses annotated by ViralZone (https://viralzone.expasy.org/678) were also downloaded. Finally, proteins from all coronaviruses present in RefSeq, regardless of NCBI-labelled host, were downloaded.

Each GenPept file was processed such that polyproteins with defined "mature peptide" fields produced separate protein sequences for each mature peptide. GenPept files without a mature peptide field were output as full amino acid sequences. These processing steps are present in the vpSAT GitHub repository (https://github.com/jnoms/vpSAT) in the process_gbks.py file. Proteins larger than 1,500 residues, or in some cases 1,000 residues, were excluded. Only 1,706 proteins were excluded for this reason.
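The splitting and size-filtering logic described above can be illustrated with a minimal sketch. It is not the code in process_gbks.py: the `Record` class, the `split_record` function, and the `accession__product` naming scheme are hypothetical, and GenPept parsing is assumed to have already produced 1-based inclusive mature-peptide coordinates.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_LENGTH = 1500  # residue cutoff from the text (1,000 was used in some cases)


@dataclass
class Record:
    """Hypothetical stand-in for one parsed GenPept entry."""
    accession: str
    sequence: str
    # (start, end, product) spans, 1-based inclusive; empty if none are annotated
    mature_peptides: List[Tuple[int, int, str]] = field(default_factory=list)


def split_record(record: Record, max_length: int = MAX_LENGTH) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs: one per mature peptide, else the full protein."""
    if record.mature_peptides:
        pieces = [
            (f"{record.accession}__{product}", record.sequence[start - 1:end])
            for start, end, product in record.mature_peptides
        ]
    else:
        pieces = [(record.accession, record.sequence)]
    # Drop anything longer than the cutoff
    return [(name, seq) for name, seq in pieces if len(seq) <= max_length]


if __name__ == "__main__":
    poly = Record("POLY1", "MAAAKKKGGG", [(1, 4, "nsp1"), (5, 10, "nsp2")])
    print(split_record(poly))
```

A polyprotein that itself exceeds the cutoff still contributes its mature peptides in this sketch, because the length check is applied to each output sequence rather than to the input record; whether the original script behaves the same way is not stated in the text.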

Structure prediction

MSAs were generated with MMseqs2 release version b0b8e85f3b8437c10a666e3ea35c78c0ad0d7ec2. To increase MSA generation speed, the RefSeq virus protein database (downloaded on 6 June 2022) was used as the target database for MSA generation. Structures were predicted with ColabFold¹⁵ (downloaded 22 June 2022). The majority of samples used three recycles, three models, stop_at_score=70, and stop_at_score_below=40. MMseqs2 and Colabfold_batch were run with a Nextflow⁵⁵ pipeline, and all parameters used can be found at https://github.com/jnoms/vpSAT. Information on all viruses and structures included in this manuscript is present in Supplementary Table 1.

Protein cluster generation

All proteins were initially clustered with MMseqs2, with a requirement of at least 20% sequence identity and 70% query and target coverage. MMseqs2 cluster mode 0 was used, meaning that many but not all pairs of aligned proteins are placed into the same sequence cluster. Predicted structures for each sequence cluster representative were subjected to an all-by-all alignment using Foldseek¹⁷, requiring the alignment to consist of at least 70% query and target coverage and an alignment E-value less than 0.001. The resultant structural alignment file was then filtered using SAT aln_filter to keep alignments with a TMscore of at least 0.4. Clusters were generated from this alignment file using SAT aln_cluster in a similar manner to Foldseek cluster mode 1, wherein all query-target pairs are assigned to the same cluster. Cluster information from sequence and structure clustering was merged using SAT aln_expand_clusters. Taxonomic count information was generated using SAT aln_taxa_counts, producing a "tidy" table for each cluster_ID with the number of members of each taxon at multiple taxonomy levels. Taxonomy information was also added directly to the merged cluster file using SAT aln_add_taxonomy.
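As an illustration of the filter-then-cluster step, the sketch below applies the TMscore threshold of 0.4 and then groups every surviving query-target pair into the same cluster using a union-find structure. It is not the SAT implementation: the function name `cluster_alignments` and the `(query, target, tmscore)` tuple layout are assumptions made for this example.

```python
from collections import defaultdict


def cluster_alignments(alignments, min_tmscore=0.4):
    """Group proteins so that every retained query-target pair shares a cluster.

    alignments: iterable of (query, target, tmscore) tuples (assumed layout).
    Returns a list of sets, largest cluster first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, target, tmscore in alignments:
        root_q, root_t = find(query), find(target)  # registers both proteins
        if tmscore >= min_tmscore and root_q != root_t:
            parent[root_q] = root_t  # merge the two clusters

    clusters = defaultdict(set)
    for member in parent:
        clusters[find(member)].add(member)
    return sorted(clusters.values(), key=lambda c: (-len(c), sorted(c)))


if __name__ == "__main__":
    pairs = [("a", "b", 0.50), ("b", "c", 0.45), ("c", "d", 0.30)]
    for cluster in cluster_alignments(pairs):
        print(sorted(cluster))
```

Because clustering is transitive, "a" and "c" end up together even though they were never aligned directly, while "d" stays a singleton because its only alignment falls below the threshold.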

Cluster purity analysis

To determine the structural consistency of the clusters, all clusters with at least 100 members were selected for analysis. DALI was used to align the cluster representative with each cluster member. Clusters whose members were on average smaller than 150 residues were excluded. This led to the analysis of 49 clusters. Cluster members that failed to align to their representative were assigned a z value of 0. For each cluster, the average z-score between the representative and each member was determined and plotted. All scripts used to run DALI can be found in vpSAT's dali_format_inputs.sh and dali.sh files. Dalilite version 5 was used. DALI output files were parsed into a tabular format using SAT's aln_parse_dali.
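The selection criteria and the averaging rule above can be written out as a short sketch. The function names and input layouts are hypothetical; the only rules taken from the text are the 100-member minimum, the 150-residue mean length cutoff, and the z value of 0 for members that fail to align.

```python
def is_eligible(member_lengths, min_members=100, min_mean_length=150):
    """True if a cluster is large enough and its members average >= 150 residues."""
    if len(member_lengths) < min_members:
        return False
    return sum(member_lengths) / len(member_lengths) >= min_mean_length


def mean_z_to_representative(members, z_scores):
    """Average DALI z-score between the representative and each member.

    members: member identifiers to score against the representative.
    z_scores: dict of member -> z-score; members missing from the dict
              failed to align and count as z = 0.
    """
    if not members:
        raise ValueError("cluster has no members to score")
    return sum(z_scores.get(member, 0.0) for member in members) / len(members)


if __name__ == "__main__":
    print(is_eligible([200] * 100))
    print(mean_z_to_representative(["m1", "m2", "m3"], {"m1": 10.0, "m2": 20.0}))
```

Counting failed alignments as 0 rather than dropping them lowers the average for clusters with structurally inconsistent members, which is the behaviour the purity analysis is meant to expose.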

Phylogenetics

Phylogenetic reconstructions were conducted using all sequence cluster representatives, or in the cases of clusters 56 and 735, all members within each cluster. For the nucleoside transporter tree, all herpesvirus sequence representatives of cluster 119, as well as a F. catus gammaherpesvirus 1 protein (YP_009173937) from a singleton cluster, were used as queries. Iterative sequence similarity searches against the NCBI non-redundant database were performed using standalone PSI-BLAST v2.15.0, using the following parameters⁵⁶: -num_iterations 10, -max_hsps 1, -subject_besthit, -gapopen 9, -inclusion_ethresh 1e-15, -evalue 1e-10, and -qcov_hsp_perc 70. For the LigT-like PDE tree, this search was restricted to only viral targets. Each of these protein sets was then clustered using mmseqs2 v15.6f452 with high sensitivity (command line option: -s 7.5) to collapse highly similar sequences into cluster representatives. Subsequently, these sequence sets were aligned using Clustal Omega v1.2.4 with default settings⁵⁷. Comprehensive taxonomic information for each aligned sequence was integrated into the unique sequence identifiers using the biopython v1.81 package⁵⁸. Phylogenetic trees were reconstructed using IQTREE v2.3.3⁵⁹ with -m TEST -B 1000 options for model testing and bootstrapping. The best model for each tree was selected based on the Bayesian Information Criterion (BIC). The selected models were as follows: nucleoside transporters, VT + F + G4; LigTs, VT + F + G4; cluster 28, VT + G4; cluster 55, VT + I + G4; cluster 56, VT + G4; cluster 735, VT + I + G4. Trees were visualized with the Interactive Tree of Life (iTOL)⁶⁰. Code used for this analysis can be found at https://github.com/Doudna-lab/nomburg_j-LigT_phylogeny.

Structural alignments against the AlphaFold databases

In Fig. 1i, Foldseek was used to align a protein representative from every viral protein cluster against 2.3 million protein cluster representatives from the AlphaFold database³. For Fig. 3, all 67,715 viral protein structures were searched against the pre-made Foldseek databases of the original release of the AlphaFold database, consisting of proteins from 48 organisms and including members of the bacterial, eukaryote, and archaeal superkingdoms. For this search, the full AlphaFold database of over 200 M structures was not used because it contains many viral proteins misannotated as non-viral proteins (these misannotations reflect errors in Uniprot metadata). Alignments were filtered to keep only those with a minimum TMscore of 0.4 and an E-value of less than 0.001.

DALI alignments of specific non-viral proteins against the viral protein database

Following Foldseek alignments against the AlphaFold database, specific hits of interest (for example, ENT4) were selected. These structures were downloaded and imported to the DALI database format using vpSAT's dali_format_inputs.sh. They were then aligned against the full viral protein structure database using vpSAT's dali.sh, which lists all parameters. Dalilite version 5 was used. DALI output files were parsed into a tabular format using SAT's aln_parse_dali.

Identification of annotated protein sequence clusters

Each protein in the database was searched against the Pfam²³, CDD²⁴, and TIGRFAM²⁵ databases using InterProScan²². A sequence cluster was considered annotated if more than 25% of members had any InterProScan alignment, and was considered unannotated if otherwise. Note that some proteins without an InterProScan alignment have existing annotations through other methods, including manual curation. Values of RMSD in Fig. 3 were calculated using DALI.
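The annotation rule reduces to a single threshold test, sketched below under the assumption that each cluster is represented as a list of booleans recording whether each member had any InterProScan alignment. The function name `is_annotated` is hypothetical.

```python
def is_annotated(member_has_hit, threshold=0.25):
    """True if more than 25% of cluster members have any InterProScan alignment."""
    if not member_has_hit:
        raise ValueError("cluster has no members")
    n_hits = sum(1 for has_hit in member_has_hit if has_hit)
    return n_hits / len(member_has_hit) > threshold


if __name__ == "__main__":
    print(is_annotated([True, True, False, False]))   # 50% of members have a hit
    print(is_annotated([True, False, False, False]))  # exactly 25% of members
```

With a strict "more than 25%" comparison, a cluster in which exactly one-quarter of members have a hit is treated as unannotated.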

DALI alignments to identify shared domains

This analysis used the structure representatives from clusters with at least 2 members, resulting in 5,700 cluster representatives. Structures from these representatives were imported to the DALI database format using vpSAT's dali_format_inputs.sh. To compare eukaryotic virus protein cluster representatives, an all-by-all alignment was conducted using vpSAT's dali.sh, which lists all parameters.

Dalilite version 5 was used. DALI output files were parsed into a tabular format using SAT's aln_parse_dali. All DALI alignments were filtered for an alignment length of at least 120, and for a z-score greater than or equal to (alignment length/10) − 4.
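The length-dependent z-score cutoff can be made concrete with a small sketch. The function name `passes_dali_filter` is hypothetical; the two thresholds are the ones stated above.

```python
def passes_dali_filter(alignment_length, z_score, min_length=120):
    """Keep alignments of >= 120 positions with z >= (alignment length / 10) - 4."""
    return alignment_length >= min_length and z_score >= alignment_length / 10 - 4


if __name__ == "__main__":
    # At the minimum length of 120 the required z-score is 120/10 - 4 = 8;
    # at length 200 it rises to 200/10 - 4 = 16.
    for length, z in [(120, 8.0), (120, 7.9), (200, 16.0), (119, 50.0)]:
        print(length, z, passes_dali_filter(length, z))
```

Scaling the z-score requirement with alignment length means that longer alignments must be proportionally stronger to be retained.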

MSA generation using the full ColabFold MMseqs2 database

We selected the protein cluster representatives from the top 100 protein clusters by size, as well as 100 randomly selected singleton clusters, for analysis. ColabFold was used with FASTA inputs, such that MSAs were generated using the MMseqs2 ColabFold server (which maps each sequence against UniRef, BFD and Mgnify), and this MSA was used for structure prediction.

Benchmarking sequence and structure methods

For all protein clusters with at least two sequence clusters, we conducted all-by-all alignments between members using MMseqs2 (version b0b8e85f3b8437c10a666e3ea35c78c0ad0d7ec2), DIAMOND blastp⁶¹ (version 0.9.14), or jackhmmer⁶² (version 3.1b2). These alignments and the subsequent clustering were performed separately for each protein cluster. From these alignments, we conducted connected-component clustering using sat.py aln_cluster. Here, all proteins that align are assigned to the same resultant cluster. Thus, each original protein cluster (determined through our approach, combining sequence alignment with MMseqs2 and structure alignment with Foldseek) now has a set of clusters identified through each of the sequence-only methods. We then measured, for each original protein cluster, how many clusters were created by each of the sequence-only methods and how many proteins fell into the largest cluster generated by these sequence methods.
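The two benchmark measurements, number of resulting clusters and size of the largest one, can be sketched for a single original protein cluster and a single sequence-only method. This is not sat.py aln_cluster: the function name `summarize_method` and its inputs are hypothetical, and aligned pairs are assumed to involve only proteins from the cluster being analysed.

```python
from collections import defaultdict, deque


def summarize_method(members, aligned_pairs):
    """Return (number of connected components, size of the largest component).

    members: all proteins in one original protein cluster.
    aligned_pairs: (protein_a, protein_b) pairs aligned by one sequence method.
    """
    adjacency = defaultdict(set)
    for a, b in aligned_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    sizes = []
    for start in members:
        if start in seen:
            continue
        # Breadth-first search collects one connected component
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        sizes.append(size)
    return len(sizes), max(sizes)


if __name__ == "__main__":
    print(summarize_method(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")]))
```

A method that recovers the original cluster perfectly would return one component containing every member; more components and a smaller largest component indicate relationships that the sequence-only method missed.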

For benchmarking virus–non-virus alignments, we conducted sequence alignments (again using MMseqs2, DIAMOND blastp, and jackhmmer) analogous to the DALI structural alignments present in Extended Data Fig. 4, using the same query against all viral proteins included in the dataset. We then determined the fraction of DALI-identified targets that were also identified for each non-viral query by each sequence method.

For the comparison between hhPred⁴³ and DALI, we identified 4,409 sequence clusters that contained more than 1 member and for which fewer than one-quarter of members had an InterProScan alignment. We then identified sequence cluster representatives that were well folded, with an average pLDDT of at least 70. This resulted in a final set of 1,326 proteins. We used DALI to align each of these proteins against the PDB25 database provided by the DALI authors. Alignments were considered high-confidence if they had a z-score of at least 7. DALI alignments were conducted with vpSAT's dali.sh. For hhPred searches, we established a local pipeline using HHsuite's (v3.3.0) HHblits and HHsearch modules. For each query protein, we first used HHblits to align it against the Uniref30 HMM database provided by the HHsuite authors, using the flags -n 2 and -cov 20. We then used HHsearch to align each resultant MSA against the HHsuite-provided PDB database with the flag -cov 20. Alignments were considered high-confidence if they had an E-value of less than or equal to 0.001.
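One way to tabulate the comparison is to apply each method's high-confidence threshold and then partition the queries by which method found a hit. The sketch below assumes hits have already been parsed into `(query, score)` tuples; the function name and the three output categories are illustrative and are not taken from the original analysis code.

```python
def compare_high_confidence(dali_hits, hhsearch_hits, min_z=7.0, max_evalue=1e-3):
    """Partition queries by which method yields a high-confidence alignment.

    dali_hits: iterable of (query, z_score) tuples (assumed layout).
    hhsearch_hits: iterable of (query, e_value) tuples (assumed layout).
    """
    dali = {query for query, z in dali_hits if z >= min_z}
    hhsearch = {query for query, e in hhsearch_hits if e <= max_evalue}
    return {
        "both": dali & hhsearch,
        "dali_only": dali - hhsearch,
        "hhsearch_only": hhsearch - dali,
    }


if __name__ == "__main__":
    dali_hits = [("p1", 9.2), ("p2", 7.0), ("p3", 5.1)]
    hhsearch_hits = [("p1", 1e-8), ("p3", 1e-3), ("p4", 0.5)]
    result = compare_high_confidence(dali_hits, hhsearch_hits)
    for category in ("both", "dali_only", "hhsearch_only"):
        print(category, sorted(result[category]))
```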

Searching the TCDB

We used a map of PDB accession to TCDB classification (https://www.tcdb.org/cgi-bin/projectv/public/pdb.py) to download all experimental structures associated with TCDB classifications. For subsequent processing, we used a maximum of five structures per TCDB classification. One structure was excluded (PDB: 1HXI) as it is highly truncated. Nine additional structures failed to import to DALI database files, typically due to small protein size. For PDB entries that contained multiple chains, we selected the first chain for alignment. Due to the absence of experimental structures, the AlphaFold models for ENT3 (AF-Q9BZD2-F1-model_v4) and ENT4 (AF-Q7RTT9-F1-model_v4) were added to the dataset. For the 46 protein structures with multiple classifications, one classification was chosen at random. This ultimately resulted in a dataset of 1,812 structures from 485 classifications, with an average of 3.7 structures per classification. Structures were imported to the DALI database format using vpSAT's dali_format_inputs.sh. The predicted structure of EBV BMRF2 (YP_001129455) was aligned against this structure database using dali.sh.
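The subsampling rules can be sketched as follows. Several details are assumptions rather than statements from the text: the order of the two steps (classification choice before capping), the choice of which five structures to keep (the first five in sorted order here), the fixed random seed, and the function name `build_tcdb_set`. The classification labels in the example are placeholders.

```python
import random
from collections import defaultdict


def build_tcdb_set(structure_to_classes, max_per_class=5, seed=0):
    """Assign each structure to one classification, then cap each class at five.

    structure_to_classes: dict of structure id -> list of TCDB classifications.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for structure_id in sorted(structure_to_classes):
        classes = sorted(structure_to_classes[structure_id])
        # Structures with multiple classifications get one chosen at random
        chosen = rng.choice(classes) if len(classes) > 1 else classes[0]
        by_class[chosen].append(structure_id)
    # Assumption: keep the first five structures per classification
    return {cls: ids[:max_per_class] for cls, ids in by_class.items()}


if __name__ == "__main__":
    mapping = {f"S{i}": ["class_A"] for i in range(7)}
    mapping["T1"] = ["class_B"]
    subset = build_tcdb_set(mapping)
    print({cls: len(ids) for cls, ids in subset.items()})
    # Reported dataset size: 1,812 structures across 485 classifications
    print(round(1812 / 485, 1))
```

The final line reproduces the reported average: 1,812 / 485 ≈ 3.74, which rounds to 3.7 structures per classification.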

PDE cloning and activity assays

Two tandem STREP2 tags, following a GGS linker, were appended to the end of each putative LigT-like PDE. Sequences were codon-optimized for humans, and gBlocks encoding each product were ordered from IDT and cloned into a custom lentiviral expression vector. PDE mutants have dual H>A mutations of the catalytic histidines (or, in the case of MHV NS2a and pigeonpox PDE, one H>A and one H>R mutation).

The 293T cells were seeded into 96-well plates at 20,000 cells per well. The 293T cells were kindly provided by the Ott laboratory, and were originally from ATCC. The 293T cells were screened for Mycoplasma within the last year, and were not otherwise authenticated. The day after plating, each well was transfected with 15 ng STING (pMSCV-hygro-STING R232, Addgene 102608), 20 ng firefly luciferase driven by an IFNB promoter (IFN-Beta_pGL3, Addgene 102597), 5 ng Renilla luciferase (pRL-TK, Promega E2241), and 20 ng of each putative PDE using the Mirus TransIT-X2 transfection reagent. After at least 4 h, cells were treated with 0.1 μM diABZI (Invivogen) or transfected with 10 μg ml⁻¹ 2′,3′-cGAMP (Invivogen) using TransIT-X2. The next day, firefly and Renilla luciferase were measured using the Promega Dual-Glo luciferase assay system. Three wells were transfected per condition, and experiments are representative of at least two independent experiments. The "no STING" conditions were transfected with both reporters and a noncoding transgene, but no STING plasmid.

PDE western blots

The 293T cells were plated in 6-well dishes at 5 × 10⁵ cells in 2 ml per well. The next day, each well was transfected with 200 ng of the indicated transgene using Mirus TransIT-X2. The following day, cells were lysed using RIPA buffer (ThermoFisher) supplemented with protease/phosphatase inhibitor (ThermoFisher), and lysate protein concentrations were determined using the Pierce BCA assay kit. All samples were then normalized to the same protein concentration. Bio-Rad Criterion 4%–20% acrylamide gels were loaded with 30 µg of protein per well, followed by transfer to a 0.2-µm nitrocellulose membrane. For visualization of the Strep-tagged PDEs, the Streptactin HRP (IBA 2-1502-001IAB) antibody was used (1:100,000 dilution, 1 h at room temperature). For visualization of GAPDH, we used Santa Cruz Biotech Mouse anti GAPDH (sc-365062) primary (1:1,000 dilution, incubation at 4 °C overnight) and ECL Anti-mouse IgG (Amersham NXA931) secondary (1:5,000 dilution, 1 h at room temperature).

Recombinant protein expression and purification

Expression plasmids for pigeon poxvirus PDE (wild-type and H72A/H167R), MHV nonstructural protein 2A (NS2A), and T4 anti-CBASS protein 1 (Acb1) were cloned into custom pET-based vectors by Gibson assembly to yield N-terminal His₁₀-MBP-TEV constructs. Proteins were expressed from 4 l Escherichia coli Rosetta 2 (DE3) pLysS by growing to an OD₆₀₀ of 0.4–0.6 in 2× yeast extract tryptone medium at 37 °C and inducing with 0.5 mM isopropyl β-d-1-thiogalactopyranoside. After induction, cells expressing each protein were grown overnight at 16 °C to an OD₆₀₀ of 1.2–1.4. Cells were collected by centrifugation for 20 min at 4,000 rpm at 4 °C and resuspended in 20 mM Tris-HCl, pH 8.0, 10 mM imidazole, 2 mM MgCl₂, 500 mM KCl, 10% glycerol, 0.5 mM tris(2-carboxyethyl)phosphine (TCEP) and Roche protease inhibitor. Cells were lysed by sonication and cell lysate was clarified by centrifugation at 17,000g, 4 °C for 0.5 h. The supernatant was bound to 5 ml Nickel-NTA affinity resin for 1 h at 4 °C. Supernatant was discarded and resin was washed with 5 × 30 ml wash buffer (20 mM Tris-HCl, pH 8.0, 500 mM KCl, 30 mM imidazole, 10% glycerol and 0.5 mM TCEP). Protein was eluted in 10 ml elution buffer (20 mM Tris-HCl, pH 8.0, 500 mM KCl, 300 mM imidazole, 10% glycerol, and 0.5 mM TCEP). Each protein was concentrated to 10 mg ml⁻¹ during buffer exchange to storage buffer (20 mM Tris-HCl, pH 8.0, 500 mM KCl, 30 mM imidazole, 10% glycerol and 0.5 mM TCEP) using a 10 kDa MWCO centrifugal filter (Amicon). A total of 5–15 mg target protein fused to N-terminal His₁₀–MBP–TEV was stored at −80 °C.

In vitro characterization of PDEs

Recombinant enzymes were assessed for PDE activity by in vitro cGAMP degradation reactions and downstream analysis by TLC. Reactions were initiated by the addition of recombinant enzyme (40 μM) in reaction buffer (50 mM Tris, pH 8.0, 10 mM MgCl₂, 100 mM NaCl) to 1.25 mM 2′,3′-cGAMP or 3′,3′-cGAMP (Biolog). The reaction mixture was incubated at 37 °C for 18 h and stopped by vortexing for 20 s.

Silica gel TLC plates (5 cm × 10 cm) with fluorescent indicator 254 nm were spotted with 2 μl in vitro enzymatic reaction. Separation was performed in an eluent of n-propanol/ammonium hydroxide/water (11:7:2 v/v/v). The plate was allowed to dry fully and visualized with a short-wave ultraviolet light source at 254 nm.

Data analysis and plotting

All analysis, plotting, and statistical tests used R version 4.0.3. The genome type and average genome size were determined from information downloaded from the NCBI Virus portal (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
