Thursday, May 14, 2026

Hallucinated citations highest in social sciences preprints site

A magnifying glass resting on a plain surface amongst numerous crumpled paper balls.

Analyses of research repositories are estimating the rates of hallucinated citations in research papers. Credit: patpitchaya/iStock via Getty

The problem of artificial intelligence models ‘hallucinating’ non-existent citations has recently shot to prominence. Now a team of researchers has sifted through 2.5 million papers and preprints to provide the best assessment yet of how prevalent such citations are.

Their audit encompassed 111 million references in papers and preprints hosted on major repositories, including arXiv, bioRxiv, the Social Science Research Network (SSRN) and PubMed Central, and found 146,932 hallucinated citations in material published in 2025 alone.

The analysis also suggests that the prevalence of hallucinated citations depends on the area of research. SSRN, a preprint server for social sciences research, had the highest rate of hallucinated citations at nearly 2%, almost five times higher than any other major repository.

“We were really amazed by the overall magnitude and dynamics of the whole body of hallucinated citations,” says Yian Yin, assistant professor of information science at Cornell University in Ithaca, New York state, and a co-author of the study.

The analysis was posted on the preprint server arXiv (ref. 1) and has not been peer-reviewed.

Bibliographic hallucinations

Yin and his colleagues were prompted to investigate the scale of the problem after spotting some references to unfamiliar work, supposedly authored by researchers they knew. “I know these authors,” says Yin, “and I’m 90% sure they don’t have a paper on that.”

To quantify the scale of the problem, the researchers extracted reference titles from millions of manuscripts and checked them against Semantic Scholar, OpenAlex and Google Scholar. References that could not be found in any of these databases, and that a large language model (LLM) judged to be intended as academic sources, were flagged as unmatched. Because bibliographic errors have always existed, the researchers only counted faulty references appearing in material published after 2022, the year in which OpenAI launched its LLM-based chatbot ChatGPT.
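The matching step described above can be illustrated with a minimal sketch. This is not the study's pipeline: the function names, the 0.9 similarity threshold and the toy title lists are all assumptions made for illustration, and the sketch leaves out the LLM check of whether a reference was intended as an academic source. It only shows the general idea of normalizing titles and flagging those with no close match in a reference index.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return " ".join(cleaned.split())


def best_match_score(title: str, index_titles: list[str]) -> float:
    """Return the highest similarity (0 to 1) between a title and any indexed title."""
    norm = normalize(title)
    return max(
        (SequenceMatcher(None, norm, normalize(t)).ratio() for t in index_titles),
        default=0.0,
    )


def flag_unmatched(
    reference_titles: list[str],
    index_titles: list[str],
    threshold: float = 0.9,  # assumed cut-off, not taken from the study
) -> list[str]:
    """Return the reference titles that have no sufficiently close match in the index."""
    return [
        t for t in reference_titles
        if best_match_score(t, index_titles) < threshold
    ]


# Toy stand-in for a bibliographic database (hypothetical, two entries only).
index_titles = [
    "Attention Is All You Need",
    "Deep Residual Learning for Image Recognition",
]

# One real title with different casing and punctuation, and one invented placeholder.
references = [
    "Attention is all you need.",
    "An Entirely Made-Up Paper Title For Illustration",
]

unmatched = flag_unmatched(references, index_titles)
print(unmatched)
```

A real audit would query the databases at scale rather than compare against an in-memory list, and would need to handle translated titles, truncated entries and other ordinary bibliographic noise before treating a miss as a likely hallucination.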

The analysis found that the rates of hallucinated citations varied between different repositories. SSRN ranked first with 1.91% of citations in studies posted there by August 2025 deemed to be hallucinations. ArXiv, a physical sciences repository, ranked second, with 0.39% of its citations incorrect or referring to non-existent papers or researchers.

The PubMed Central biomedical-science database had a rate of 0.27% hallucinated citations in peer-reviewed publications. BioRxiv, a preprint server specializing in biological sciences, had a rate of 0.21%.
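The reported rates can be compared directly with a few lines of arithmetic. The sketch below uses only the four percentages quoted in this article; the variable names are illustrative.

```python
# Hallucinated-citation rates (percent of citations) as reported in the article.
rates = {
    "SSRN": 1.91,
    "arXiv": 0.39,
    "PubMed Central": 0.27,
    "bioRxiv": 0.21,
}

# Rank repositories from highest to lowest rate.
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
top_name, top_rate = ranked[0]

# How many times higher the top rate is than each of the others.
ratios = {name: top_rate / rate for name, rate in ranked[1:]}

for name, ratio in ratios.items():
    print(f"{top_name} vs {name}: {ratio:.1f}x")
```

The SSRN-to-arXiv ratio works out to roughly 4.9, which matches the description of SSRN's rate as almost five times that of the next-highest repository.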

Hallucinated citations are more prevalent in work authored by researchers with little pre-2022 publication history. When fake citations occur, they disproportionately credit already established, highly cited, often male authors, the study found.

Safeguards vary
