
Web-scraping AI bots cause disruption for scientific databases and journals

Some websites have been overwhelmed by the sheer volume of bot traffic. Credit: Marco VDM/Getty

In February, the online image repository DiscoverLife, which contains nearly 3 million photographs of different species, started to receive millions of hits to its website every day — a much higher volume than normal. At times, this spike in traffic was so high that it slowed the site down to the point that it became unusable. The culprit? Bots.

These automated programs, which attempt to ‘scrape’ large amounts of content from websites, are increasingly becoming a headache for scholarly publishers and researchers who run sites hosting journal papers, databases and other resources.

Much of the bot traffic comes from anonymized IP addresses, and the sudden increase has led many website owners to suspect that these web-scrapers are gathering data to train generative artificial intelligence (AI) tools such as chatbots and image generators.

“It’s the wild west at the moment,” says Andrew Pitts, the chief executive of PSI, a company based in Oxford, UK, that provides a global repository of validated IP addresses for the scholarly communications community. “The biggest issue is the sheer volume of requests [to access a website], which is causing strain on their systems. It costs money and causes disruption to genuine users.”

Those that run affected sites are working on ways to block the bots and reduce the disruption they cause. But this is no easy task, especially for organizations with limited resources. “These smaller ventures could go extinct if these sorts of issues are not dealt with,” says Michael Orr, a zoologist at the Stuttgart State Museum of Natural History in Germany.
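The blocking approaches that site operators are turning to generally come down to identifying and throttling heavy automated traffic. As a rough illustration only (the article does not describe any organization's actual setup), below is a minimal per-IP sliding-window rate limiter in Python; the window size, request budget and block duration are assumed, illustrative values.

```python
# Minimal sketch of per-IP rate limiting, one common bot mitigation.
# All thresholds are illustrative assumptions, not values from the article.
import time
from collections import defaultdict, deque
from typing import Optional

WINDOW_SECONDS = 60     # look-back window for counting requests
MAX_REQUESTS = 120      # assumed per-IP request budget within the window
BLOCK_SECONDS = 600     # how long an offending IP stays blocked

_recent = defaultdict(deque)   # IP -> timestamps of recent requests
_blocked_until = {}            # IP -> time at which its block expires

def allow_request(ip: str, now: Optional[float] = None) -> bool:
    """Return True if the request should be served, False if it should be throttled."""
    now = time.time() if now is None else now

    # Refuse requests from IPs that are still inside an active block.
    if _blocked_until.get(ip, 0) > now:
        return False

    # Discard timestamps that have fallen outside the sliding window.
    window = _recent[ip]
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()

    window.append(now)
    if len(window) > MAX_REQUESTS:
        # Over budget: block this IP for a while.
        _blocked_until[ip] = now + BLOCK_SECONDS
        return False
    return True
```

A web front end would call allow_request() with the client's address at the start of each request and answer with HTTP 429 (Too Many Requests) when it returns False; real deployments typically layer this kind of throttling with user-agent checks, robots.txt rules and allow-lists such as the validated IP ranges that PSI maintains.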

A flood of bots

Internet bots have been around for decades, and some have been useful. For example, Google and other search engines have bots that scan millions of webpages to identify and retrieve content. But the rise of generative AI has led to a deluge of bots, including many ‘bad’ ones that scrape without permission.

This year, the BMJ, a publisher of medical journals based in London, has seen bot traffic to its websites surpass that of real users. The aggressive behaviour of these bots overloaded the publisher’s servers and led to interruptions in services for legitimate customers, says Ian Mulvany, BMJ’s chief technology officer.

Other publishers report similar issues. “We’ve seen a huge increase in what we call ‘bad bot’ traffic,” says Jes Kainth, a service delivery director based in Brighton, UK, at Highwire Press, an Internet hosting service that specializes in scholarly publications. “It’s a big problem.”

The Confederation of Open Access Repositories (COAR) reported in April that more than 90% of the 66 members it surveyed had experienced AI bots scraping content from their sites, and roughly two-thirds had experienced service disruptions as a result. “Repositories are open access, so in a sense, we welcome the reuse of the contents,” says Kathleen Shearer, COAR’s executive director. “But some of these bots are super aggressive, and it’s leading to service outages and significant operational problems.”

Training data

One factor driving the rise in AI bots was a revelation that came with the release of DeepSeek, a Chinese-built large language model (LLM). Prior to that, most LLMs required a huge amount of computational power to create, explains Rohit Prajapati, a development and operations manager at Highwire Press. But the developers behind DeepSeek showed that an LLM rivalling popular generative-AI tools could be built with far fewer resources, kick-starting an explosion of bots seeking to scrape the data needed to train this type of model.
