Tuesday, June 2, 2026

‘Virtual cells’ aim to turn raw data into predictive models of biology

As every gamer knows, computers can plausibly simulate just about anything from the routine concerns of a household to the crises confronting a multiplanetary civilization. Simulating the fundamental unit of life — the cell — should be a walk in the park. But it’s not.

Each cell is a complex ecosystem of biomolecules that interact with one another and react to external cues in ways that remain poorly understood. And what’s true of one cell type isn’t necessarily true of another. But there is an order to the chaos.

“The cell is a complex system, and a highly robust and resilient system,” says Emma Lundberg, a bioengineer at Stanford University in California. “But it’s also a highly structured system — the cell has an architecture.” Over the past few years, researchers have begun reverse-engineering that architecture to convert vast repositories of molecular data into ‘virtual cells’ — models that simulate the internal environment of cells both at rest and when responding to external triggers.

Several teams are now tapping into deep reservoirs of transcriptomic (gene expression) and other data sets to build models that could reveal the underlying biological bases of disease and possible angles for therapeutic intervention. “We have to think about virtual cells as a means of getting towards a specific goal, and for me, that goal is to be able to accelerate the hypothesis search process,” says Yusuf Roohani, a machine-learning researcher at the Arc Institute in Palo Alto, California.

The field remains far short of a fully functional virtual cell, however. “I don’t think people would sensibly want to claim that they have built a virtual cell unless they need to sell a start-up,” says Fabian Theis, a computational biologist at Helmholtz Centre Munich in Germany. Current models can capture static cell states but struggle to accurately predict dynamic changes. Reaching higher levels of in silico evolution will require ever-greater volumes of diverse data and smart strategies for combining them.

A strong foundation

The artificial-intelligence revolution has been a potent accelerant for enthusiasm around virtual cells, but scientists have grappled with how to build computational cell models for decades. “Even 20-something years ago, we had ‘virtual cell 1.0’, where people were trying to use differential equations to describe systems biology,” says Bo Wang, an AI specialist at the University of Toronto in Canada.

Such models have the advantage of being grounded in measurable, well-understood biochemical and biophysical principles — threading together equations that describe cellular functions including metabolism, communication and movement. “You actually have mechanistic understanding — you can interpret them correctly, and that is very attractive,” says Lundberg.
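
As a rough illustration of the differential-equation approach, the sketch below models one gene with two equations: mRNA is produced at a constant rate and decays, and protein is translated from mRNA and decays. It is a toy example, not taken from any model named in this article; the function name, the rate constants and the simple Euler integration are all assumptions chosen for readability.

```python
# Toy "virtual cell 1.0" sketch: two coupled ordinary differential equations
# for a single gene. All names and parameter values are hypothetical.
#   dm/dt = k_tx - d_m * m        (mRNA: transcription minus decay)
#   dp/dt = k_tl * m - d_p * p    (protein: translation minus decay)

def simulate(k_tx, k_tl, d_m, d_p, t_end=400.0, dt=0.01):
    """Integrate the two equations with the forward Euler method.

    Starts from an empty cell (m = p = 0) and returns the mRNA and
    protein levels reached at time t_end.
    """
    m, p = 0.0, 0.0
    for _ in range(int(t_end / dt)):
        dm = k_tx - d_m * m
        dp = k_tl * m - d_p * p
        m += dm * dt
        p += dp * dt
    return m, p


# Baseline cell versus a perturbed cell in which transcription is halved.
baseline = simulate(k_tx=2.0, k_tl=1.0, d_m=0.2, d_p=0.05)
knockdown = simulate(k_tx=1.0, k_tl=1.0, d_m=0.2, d_p=0.05)

# Analytically, the steady state is m = k_tx / d_m and p = k_tl * m / d_p,
# so every parameter has a direct biochemical interpretation.
print("baseline  (mRNA, protein):", baseline)
print("knockdown (mRNA, protein):", knockdown)
```

Because each term corresponds to a named process, the effect of the perturbation can be traced back to a specific rate constant. That is the interpretability Lundberg describes, and it is also why such models can only be as complete as the biology written into them.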

A sophisticated mathematical model announced in March by a team led by Zaida Luthey-Schulten at the University of Illinois at Urbana-Champaign, for instance, realistically replicated cell division in a highly modified version of Mycoplasma bacteria [1]. And Paul Macklin, an engineer at Indiana University in Bloomington, and his team have spent more than a decade developing a framework called PhysiCell to simulate how human cells and tissues respond to diverse environmental stimuli. This simulator has proved useful for modelling cancer biology, including factors driving progression or response to immunotherapy, Macklin says.

Paul Macklin demonstrating 3D tumour-immune simulations, Indiana University. Credit: Photo courtesy of Indiana University

These successes notwithstanding, mathematical models are inherently limited by researchers’ understanding of cell biology. Initiatives such as the Human Cell Atlas have produced vast amounts of gene-expression and other data, including proteomics and epigenetics, but it’s extremely difficult to extract biological meaning from thousands upon thousands of molecular interactions. This is when AI models shine, says Maria Brbić, an AI researcher at the Swiss Federal Institute of Technology in Lausanne: “They’re really good at exploring combinatorial space.”

Opinions vary about which capabilities would define a true virtual cell, but any meaningful simulation should at least be able to represent the baseline state of a given cell type, and then project how a particular perturbation alters that state. Many attempts have relied on deep-learning-based ‘foundation models’, in which AI algorithms identify patterns in vast collections of unlabelled experimental data.

Roohani draws a parallel with ChatGPT, a chatbot powered by a foundation model that uses patterns gleaned from Internet text to produce coherent responses to almost any user query. “You can create more general-purpose representations across a broader range of cellular and biological contexts,” he says. In a best-case scenario, a biological foundation model would be able to extrapolate how various cell types will respond to conditions that are not included in the original training set, and even make meaningful predictions for cell types that it hasn’t encountered before.

Single-cell gene-expression data are currently the preferred way of educating biological foundation models about different cell types, and such data are readily available. Roohani and his colleagues have developed a database called scBaseCount, which uses AI to continually collect and uniformly process transcriptomic data for model-training purposes. The collection includes around half a billion cells, and counting. “That’s a few times more than the next-largest single-cell data repository,” says Roohani.

But one cannot simply build a representation based solely on a cell’s defining features — known in the context of AI models as an embedding. A virtual cell must also learn how different perturbations affect the cellular environment. Fleshing out these details requires experiments in which researchers systematically inactivate different genes or expose the cells to a diverse range of drugs. “We should have causal data to build causal models,” says Wang. One such collection is the X-Atlas/Pisces data set, compiled by Xaira Therapeutics, a drug company in South San Francisco, California. Available on the open-source AI platform HuggingFace, Pisces comprises gene-expression data from 25.6 million cells of various lineages that had undergone targeted gene disruption.

The pitfalls of perturbation

In theory, the resulting models could help users to infer which genetic abnormalities drive the growth of a particular tumour type or to pinpoint drug categories that stabilize metabolic issues in diseased cells, and some foundation models are on the cusp of achieving such capabilities.

In January, for example, Roohani and his colleagues described Stack [2], a model trained on the scBaseCount data set. The researchers were able to use these data to produce a ‘perturbation atlas’ that predicted the effects of different drug treatments in 28 distinct human tissues. And in March, Xaira announced its X-Cell model [3], trained on the company’s Pisces data set. According to Wang, who is also head of biomedical AI at Xaira, X-Cell was able to predict changes in gene expression underlying the activation of immune T cells even though it had not been trained on that process. This allowed the company’s scientists to predict mechanisms for switching off that activation, a potentially useful intervention in inflammatory disorders or other immune conditions. “We not only confirmed known inactivators, such as CD3 and its family, we also found a few putative T-cell inactivators,” says Wang.

Predicting the effects of cellular perturbation remains challenging, however, and Wang cautions that these models are only early steps in that direction. “So far, everybody’s just focusing on cell lines, which are relatively simple biological systems,” he says. These models might not accurately map to actual organs and tissues, and collecting training data from primary cell types — those taken directly from human samples — at a meaningful scale is daunting.

Researchers have also struggled to demonstrate clear performance gains from transcriptome-based foundation models relative to simpler mathematical methods. In 2025, the Arc Institute hosted the Virtual Cell Challenge, giving teams an opportunity to test the predictive performance of their models head-to-head. Although a success in terms of enthusiasm and engagement — Roohani says the event attracted some 5,000 participants from more than 100 countries — none of the pure AI models prevailed over those that incorporated conventional statistical methods.

Brbić has dealt with similar issues in assessing the robustness of deep-learning models. One problem, in her view, is that conventional performance metrics focus on capturing broad transcriptomic differences between perturbed and unperturbed cells. This means that small but biologically meaningful changes might be drowned out by irrelevant background variation between samples, confounding AI analysis. “Single-cell RNA sequencing data is noisy,” says Brbić. “The kind of differences that we observe may be true biological differences but might also be caused by experimental artefacts or other sources of variation.”

In 2025, Brbić and her colleagues released a benchmarking tool called Systema, which allows users to eliminate noise and home in on perturbation-specific changes in gene expression [4]. A perturbation-prediction model from Roohani’s team, called State, which is trained to recognize the inherent variability in cell populations [5], also addresses this problem. By combining this approach with a performance metric that, like Systema, zooms in on perturbation-specific effects rather than overall gene expression, State was able to accurately predict about one-third of the genes most strongly affected by a given perturbation in a test data set. That’s a big improvement on the 7% achieved using conventional methods.
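
To make the idea of scoring only the most strongly affected genes concrete, here is a minimal sketch of a top-k overlap score. It is an illustration under stated assumptions, not the metric used by Systema or State: the function names, the gene labels and the expression changes are all invented, and each change is assumed to be the perturbed value minus the unperturbed control value for that gene.

```python
# Hypothetical top-k overlap score for perturbation predictions.
# Inputs map gene name -> change in expression (perturbed minus control).

def top_k_genes(delta, k):
    """Return the k genes with the largest absolute expression change."""
    ranked = sorted(delta, key=lambda gene: abs(delta[gene]), reverse=True)
    return set(ranked[:k])


def top_k_overlap(true_delta, pred_delta, k):
    """Fraction of the k most affected genes that the prediction recovers."""
    return len(top_k_genes(true_delta, k) & top_k_genes(pred_delta, k)) / k


# Made-up measured and predicted changes for six placeholder genes.
measured = {"G1": 2.0, "G2": -1.5, "G3": 0.1, "G4": 0.05, "G5": -0.9, "G6": 0.0}
predicted = {"G1": 1.2, "G2": -0.1, "G3": 0.8, "G4": 0.0, "G5": -0.7, "G6": 0.05}

score = top_k_overlap(measured, predicted, k=3)
print(f"top-3 overlap: {score:.2f}")
```

A score of this kind ignores the many genes whose expression barely moves, so a model cannot do well simply by reproducing the background expression profile shared by perturbed and unperturbed cells.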

Maria Brbić uses artificial intelligence to incorporate transcriptome data into virtual cells. Credit: Maria Brbić

Completing the picture
