Low-quality papers based on public health data are flooding the scientific literature

July 15, 2025

Data from large, open databases appear in hundreds of low-quality, repetitive papers. Credit: Artem Cherednik/iStock via Getty

Data from five large open-access health databases are being used to generate thousands of poor-quality, formulaic papers, an analysis has found. Its authors say that the surge in publications could indicate the exploitation of these databases by people using large language models (LLMs) to mass-produce scholarly articles, or even by paper mills — companies that churn out papers to order.

The findings, posted as a preprint on medRxiv on 9 July¹, follow an earlier study² that highlighted an explosion of such papers that used data from the US National Health and Nutrition Examination Survey (NHANES). The latest analysis flags a rising number of studies featuring data from other large health databases, including the UK Biobank and the US Food and Drug Administration’s Adverse Event Reporting System (FAERS), which documents the side effects of drugs.

Between 2021 and 2024, the number of papers using data from these databases rose from around 4,000 to 11,500 — around 5,000 more papers than expected on the basis of previous publication trends.
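The arithmetic behind those figures can be checked with a short sketch. It uses only the approximate numbers quoted above; the variable names are illustrative and not taken from the preprint.

```python
# Approximate figures quoted in the article.
papers_2021 = 4_000
papers_2024 = 11_500
excess_2024 = 5_000  # papers above the trend-based expectation

# Implied number of papers the earlier trend would have predicted for 2024.
expected_2024 = papers_2024 - excess_2024

# Overall growth and the share of 2024 output that was unexpected.
growth_factor = papers_2024 / papers_2021
excess_share = excess_2024 / papers_2024

print(expected_2024)           # 6500
print(f"{growth_factor:.1f}")  # 2.9
print(f"{excess_share:.0%}")   # 43%
```

On these rounded figures, a little under half of the 2024 papers were above what the earlier trend would have predicted.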

The study’s authors warn that a large number of these papers — many of which have repetitive, template-like titles — are likely to be of low quality and could flood the scientific literature. Their analysis is intended as “an early warning system … so that peer reviewers, editors and researchers can understand where the vulnerabilities in the system lie”, says co-author Matt Spick, a biomedical scientist at the University of Surrey in Guildford, UK.

Unexpected growth

Spick and his colleagues analysed changes in publication counts, title wording and author affiliations for papers that were based on data from 34 open-access health databases. The team used an algorithm to predict the growth in the numbers of papers expected for each data set from 2014 to 2024 — a period during which text-generating LLM tools such as ChatGPT and Gemini became mainstream.
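The article does not say which forecasting algorithm the team used. As a rough illustration of the general idea only, the sketch below fits a straight-line trend to a baseline period and measures how far later counts sit above it. The linear model, the function names and all of the yearly counts are assumptions made up for this example, not data or methods from the preprint.

```python
def fit_linear_trend(years, counts):
    """Ordinary least-squares fit of counts against years; returns (slope, intercept)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def excess_over_trend(baseline, observed):
    """Observed count minus the extrapolated baseline trend, for each observed year."""
    years = sorted(baseline)
    slope, intercept = fit_linear_trend(years, [baseline[y] for y in years])
    return {year: count - (slope * year + intercept)
            for year, count in observed.items()}


# Hypothetical paper counts for one database (NOT real data).
baseline = {2014: 100, 2015: 120, 2016: 140, 2017: 160,
            2018: 180, 2019: 200, 2020: 220}
observed = {2021: 250, 2022: 320, 2023: 480, 2024: 700}

excess = excess_over_trend(baseline, observed)
for year in sorted(excess):
    print(year, round(excess[year]))
```

A data set would be worth a closer look when its excess keeps widening year on year, as it does in this invented series.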

When they compared their predictions with actual publication rates, the researchers identified five data sets whose growth had significantly exceeded the rates predicted by the algorithm: NHANES, UK Biobank, FAERS, the Global Burden of Disease (GBD) study and the Finnish genetic database FinnGen. All but one also showed a rise in the number of papers with ‘template-like’ titles. Between 2021 and 2024, for example, the number of papers using FinnGen data grew nearly 15-fold, those using FAERS nearly 4-fold and those using UK Biobank 2.4-fold.

The researchers also uncovered some dubious papers, which often linked complex health conditions to a single variable. One paper used Mendelian randomization — a technique that helps to determine whether a particular health risk factor causes a disease — to study whether drinking semi-skimmed milk could protect against depression, whereas another looked into how education levels affect someone’s chances of developing a hernia after surgery.

“A lot of those findings might be unsafe, and yet they’re also accessible to the public, and that really worries me,” says Spick.
