I didn’t set out to build a career on other people’s data. But five years after my first secondary-data analysis, I’m still doing it.
In 2018, halfway through my PhD programme at the University of Queensland in Brisbane, Australia, I discovered a previously unknown virus in the laboratory’s Aedes aegypti mosquito cell lines. Insect cells often harbour persistent, unnoticed viral infections, so the finding wasn’t entirely surprising. But this new virus was uncharacterized. We found that it couldn’t infect mammalian cells and, unexpectedly, that it modestly reduced replication of the dengue virus. That drew our attention — insect-specific viruses that interfere with human pathogens could have implications for understanding, and potentially disrupting, how mosquitoes transmit disease.
My doctoral adviser, molecular virologist Sassan Asgari, excitedly pointed me towards other data sets from our lab, and encouraged me to widen the search. He wanted to know how common this virus was across A. aegypti cells in our lab and others. Luckily, transcriptomic data sets were available from mosquito researchers all over the world. Before long, I had downloaded and examined around 3,000 of them and had traced the virus's evolutionary history across the globe [1].
Then, towards the end of my PhD programme, I came across data from the lab of Alexander Khromykh, a virologist at the University of Queensland, where I still work. Khromykh studies the role of non-coding RNA in extracellular vesicles during viral infections. Taking a fresh look at his lab’s published data, I found something unexpected: viruses seemed to be chopping up cellular RNA in ways that hadn’t been seen before. That reanalysis led to an introductory e-mail, then a conversation, then a collaboration. Alex and I are now co-investigators on a national grant built on that initial finding.
For early-career researchers, already-published data are a golden opportunity — a way to generate data for publications and funding applications at little to no cost. Doing so requires only a question, a laptop with the programming languages R or Python installed and a willingness to look at old data from a new angle.
Most researchers, in my experience, are glad to know that their data are being used in this way. Some of my e-mails have led to collaborations, and others have led authors to share metadata that weren’t included in their original publication. Occasionally, the original authors have the samples or equipment necessary to test the results in a way that you can’t; a quick experiment on their end might confirm an association that becomes preliminary data for your next grant.
The data were free
Genomic data of the type I was mining are particularly suited to secondary analysis. The Sequence Read Archive (SRA), hosted by the National Center for Biotechnology Information, part of the US National Institutes of Health, holds more than 50 petabytes of data, much of which is deposited and rarely used again. In 2022, a project called Serratus aligned billions of these reads against viral reference genomes to identify thousands of new viral sequences [2], expanding known RNA virus diversity by an order of magnitude. These large-scale efforts demonstrate what becomes possible when secondary data analysis is taken seriously.
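At a toy scale, the core idea behind such screens can be sketched in a few lines of Python. This is an illustration only, not the Serratus pipeline: the sequences below are invented, and real screening relies on dedicated aligners rather than exact k-mer matching.

```python
# Toy sketch: flag sequencing reads that share k-mers with a viral reference.
# All sequences here are invented for illustration; real pipelines use
# purpose-built aligners, not exact substring matching.

def kmers(seq, k=21):
    """Return the set of all k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def screen_reads(reads, reference, k=21, min_hits=1):
    """Return the reads sharing at least `min_hits` k-mers with the reference."""
    ref_kmers = kmers(reference, k)
    return [r for r in reads if len(kmers(r, k) & ref_kmers) >= min_hits]

# Example with a short made-up reference and two made-up reads:
ref = "ATGCGTACGTTAGCCGATAA"
hits = screen_reads(["CGTACGTTAG", "TTTTTTTTTT"], ref, k=5)
```

Scaled up to billions of reads and thousands of reference genomes, the same matching-against-a-reference logic, implemented far more efficiently, is what lets a project recover viral sequences from archived data sets their depositors never screened.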
And the pattern holds across the sciences. Many clinical-trial data sets, ecological surveys and medical-imaging archives are available online and ripe for the picking. Published analyses often just scratch the surface of what the data can reveal.

Taking a new look at old data can open doors, says Rhys Parry. Credit: Rhys Parry / UQ
Funding agencies and publishers require researchers to archive their data, to ensure reproducibility and verification of results. But reproducibility is not the only thing that archived data are good for; every data set contains associations beyond those found by the researchers who generated it. New methods arise, new hypotheses emerge and fields shift in ways that can make old data new again. There is an opportunity to bring fresh angles to existing data, to find new associations and, ideally, to validate them.
The most interesting reanalyses tend to involve combining data types — proteomics with transcriptomics, say, or satellite imagery and survey data. Start with data sets for which you understand the underlying science but can ask a question that the original authors did not. But first, check the metadata. If you can’t understand the system, treatment, time point, replicates and platform without too much detective work, reanalysing these data is probably not worth the effort.
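In practice, combining data types often starts with nothing fancier than joining two tables on a shared identifier. The sketch below uses invented gene names and values to show the shape of such a cross-layer join, matching transcript and protein abundance per gene:

```python
# Minimal sketch of combining two data types by a shared identifier.
# Gene names and abundance values are hypothetical examples.

def join_by_gene(transcripts, proteins):
    """Inner-join two {gene: value} tables into {gene: (rna, protein)}.

    Genes present in only one table are dropped -- a reminder to check
    the metadata: mismatched identifiers silently shrink the join.
    """
    shared = transcripts.keys() & proteins.keys()
    return {g: (transcripts[g], proteins[g]) for g in sorted(shared)}

# Hypothetical example values:
rna = {"gene_A": 120.0, "gene_B": 980.0}
protein = {"gene_A": 3.2, "gene_C": 7.1}
combined = join_by_gene(rna, protein)  # only gene_A appears in both
```

Real reanalyses would use a data-frame library rather than plain dictionaries, but the failure mode is the same: if the identifiers or metadata don't line up cleanly, most of the data vanish at the join.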
Not all data sets or analyses will provide something new. I’ve downloaded thousands of data sets that went nowhere. But the cost of searching is low, and null results can be as informative as positive ones. A well-executed secondary analysis can be published, cited and used as preliminary data, on a par with any other scientific output.

