Web-scraping AI bots cause disruption for scientific databases and journals

Short Summary

AI-driven web-scraping bots are generating enormous traffic spikes on scientific databases and journals, leading to site slowdowns and operational challenges. This situation highlights emerging issues related to automated data collection for AI training.

Long Summary

In early 2025, DiscoverLife, an extensive online image repository hosting nearly three million photographs of various species, experienced unprecedented spikes in daily website visits. These traffic surges, caused by AI-driven web-scraping bots gathering training data, overwhelmed the platform to the extent that it became temporarily unusable. The surge in activity underscores a growing trend where automated programs extract large volumes of data from academic and scientific resources.

Scientific databases and journal websites, which traditionally serve researchers and academics, now face significant challenges in managing the increased load from machine-driven data collection. The bots, built to mine vast datasets for artificial-intelligence applications, inadvertently generate heavy traffic that threatens service stability and accessibility for human users.

This trend raises concerns about the balance between open access to scientific data and the operational capacity of research infrastructures. As AI technologies proliferate, the demand for large, high-quality datasets grows, placing mounting pressure on scientific publishing platforms and data repositories. The implications extend beyond technical disruptions to encompass broader discussions on data usage policies and infrastructure investments.

Academic stakeholders are increasingly aware of the need to implement measures to mitigate the impact of such automated traffic. Solutions may involve enhanced bot detection technologies, revised access protocols, and collaborations between publishers, data hosts, and AI developers to establish sustainable data-sharing frameworks. The article highlights an urgent need for strategies that accommodate AI’s data needs without compromising platform performance or access equity.

Further discussion revolves around the ethical and practical challenges presented by AI's hunger for data, including potential overuse of publicly available resources and the requirement for transparent data governance. This situation exemplifies a pivotal moment in the intersection between technology advancement and the operational realities of scientific communication infrastructure.

The article also situates this issue within the wider context of machine learning, publishing, communication, and information technology sectors, highlighting the interconnected effects of AI on scientific dissemination and research ecosystems.


Full content

In February, the online image repository DiscoverLife, which contains nearly three million photographs of different species, started to receive millions of hits to its website every day — a much higher volume than normal. At times, this spike in traffic was so high that it slowed the site down to the point that it became unusable. The culprit? Bots.