Wednesday, July 16, 2025

Low-quality papers based on public health data are flooding the scientific literature


Data from large, open databases appear in hundreds of low-quality, repetitive papers. Credit: Artem Cherednik/iStock via Getty

Data from five large open-access health databases are being used to generate thousands of poor-quality, formulaic papers, an analysis has found. Its authors say that the surge in publications could indicate the exploitation of these databases by people using large language models (LLMs) to mass-produce scholarly articles, or even by paper mills — companies that churn out papers to order.

The findings, posted as a preprint on medRxiv on 9 July [1], follow an earlier study [2] that highlighted an explosion of such papers that used data from the US National Health and Nutrition Examination Survey (NHANES). The latest analysis flags a rising number of studies featuring data from other large health databases, including the UK Biobank and the US Food and Drug Administration’s Adverse Event Reporting System (FAERS), which documents the side effects of drugs.

Between 2021 and 2024, the number of papers using data from these databases rose from around 4,000 to 11,500 — around 5,000 more papers than expected on the basis of previous publication trends.

The study’s authors warn that a large number of these papers — many of which have repetitive, template-like titles — are likely to be of low quality and could flood the scientific literature. Their analysis is intended as “an early warning system … so that peer reviewers, editors and researchers can understand where the vulnerabilities in the system lie”, says co-author Matt Spick, a biomedical scientist at the University of Surrey in Guildford, UK.

Unexpected growth

Spick and his colleagues analysed changes in publication counts, title wording and author affiliations for papers that were based on data from 34 open-access health databases. The team used an algorithm to predict the growth in the numbers of papers expected for each data set from 2021 to 2024, a period during which text-generating LLM tools such as ChatGPT and Gemini became mainstream.

When they compared their predictions with actual publication rates, the researchers identified five data sets that had significantly exceeded the growth rates predicted by the algorithm. All but one also showed a rise in the number of papers with ‘template-like’ titles. These data sets were NHANES, UK Biobank, FAERS, the Global Burden of Disease (GBD) study and the Finnish genetic database FinnGen. Between 2021 and 2024, for example, the number of papers using FinnGen data grew nearly 15-fold, while those using FAERS increased nearly 4-fold and those using UK Biobank 2.4-fold.
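
The article does not say which forecasting algorithm the team used, so the following is only a rough sketch of the general idea of comparing observed publication counts with a trend-based expectation. It assumes a plain least-squares straight line fitted to earlier years as a stand-in for the real method, and the yearly counts are made up for illustration; the function names `fit_linear_trend` and `excess_over_trend` are hypothetical.

```python
def fit_linear_trend(years, counts):
    """Fit an ordinary least-squares line through (year, count) points."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def excess_over_trend(baseline, observed):
    """Compare observed counts with a linear extrapolation of the baseline.

    baseline: {year: paper_count} used to fit the trend
    observed: {year: paper_count} to check against the trend
    """
    years = sorted(baseline)
    slope, intercept = fit_linear_trend(years, [baseline[y] for y in years])
    report = {}
    for year, actual in observed.items():
        expected = slope * year + intercept
        report[year] = {
            "expected": expected,
            "actual": actual,
            "excess": actual - expected,
        }
    return report


# Illustrative, invented counts for one database (NOT figures from the study).
baseline = {2017: 100, 2018: 120, 2019: 140, 2020: 160, 2021: 180}
observed = {2024: 600}
report = excess_over_trend(baseline, observed)
print(report[2024])
```

A data set would be flagged when its excess is large relative to the usual scatter around the trend; how the authors defined "significantly exceeded" is not stated in the article, so no threshold is assumed here.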

The researchers also uncovered some dubious papers, which often linked complex health conditions to a single variable. One paper used Mendelian randomization — a technique that helps to determine whether a particular health risk factor causes a disease — to study whether drinking semi-skimmed milk could protect against depression, whereas another looked into how education levels affect someone’s chances of developing a hernia after surgery.

“A lot of those findings might be unsafe, and yet they’re also accessible to the public, and that really worries me,” says Spick.
