
Amid fears of research data disappearing for political and technical reasons, scientists, librarians and archivists are calling out weaknesses in how the scholarly record is being preserved.
Teams that run data repositories are being urged to have contingency plans that would protect their data in the event of emergencies. These might include forming networks with other repositories, which could then step in if a failure occurs. Publishers, too, are being asked to use archiving services to ensure the safety of their content.
Research data go missing more often than people might expect. For example, a 2023 study1 of more than 3,000 research-data repositories showed that 191 of them had shut down. Of those, 90 did not maintain access to their data or name a repository that had taken over custody of the data, which could indicate that the data have been lost. “Repository shutdown poses a real threat to the perpetual availability of research data,” write lead author Dorothea Strecker, who researches open science and research-data management at the Humboldt University of Berlin, and her colleagues.
Databases such as PubMed, the Web of Science and Scopus are an essential resource for researchers, because they help to improve the accessibility, reliability and reusability of data. But as a March outage of PubMed made clear, these resources are not infallible. A 2015 analysis2 of 326 databases found that, over an 18-year period, more than 60% had disappeared, were non-functional or had limited functionality, and only 14% had been archived.
Jonas Recker, an archivist at the Leibniz Institute for the Social Sciences in Mannheim, Germany, says that the possibility of closure can be overlooked by teams that run repositories. “We often don’t think enough about worst-case scenarios, such as loss of funding or the change of the hosting organization’s mission,” he says.
Recker says that the removal of some data sets related to gender and diversity from US government websites in February has reminded the scientific community of the need to build “strong networks” with diverse “geolocation, technology and technical providers, host organizations and funding sources” to ensure that scientific knowledge is preserved.
Distributing data
Many repositories are shut down owing to a lack of funding, technical difficulties or organizational changes, Strecker and her colleagues found. Because preservation “often rests solely on individual repositories”, Strecker says she would like to see a more “distributed” approach to preservation, such as the development of networks that preserve data if a repository is shut down.
Recker agrees, adding that this approach can alleviate the pressure on individual repositories. He recommends that repositories connect with others that are “similar” to them in some way, such as sharing a research discipline, data type or audience. “A group of repositories from the German Network of Educational Research Data is currently working on a ‘cessation checklist’ template, which highlights practical things a repository can do to prepare for a situation where transfer of data to another repository might become necessary,” he says. Having this prepared can help staff to work towards a more formal agreement, such as a memorandum of understanding, says Recker, or help to clarify “which parts of the collection can be transferred, and under which conditions, in case quick action is necessary”.
Preservation networks are already in place for text publications. LOCKSS (Lots Of Copies Keep Stuff Safe), for example, is an open-source program developed by Stanford University in California that makes multiple copies of content and stores them on the servers of libraries around the world, which pay an annual fee to have their collections preserved. 'Dark' archives, in which research papers and other content remain mostly inaccessible to the public and are used mainly for restoration or recovery in the event of data corruption or other significant disruptions, are another option. The CLOCKSS (Controlled Lots Of Copies Keep Stuff Safe) dark archive, which is also run out of Stanford, opens access to its holdings only in response to specific trigger events, such as a publisher ceasing operations or content being unavailable for an agreed period.
Martin Eve, a researcher in literature, technology and publishing at Birkbeck, University of London, urges publishers to engage with such services. A study he published last year of almost 7.5 million research papers found that more than one-quarter of them had not been properly archived and preserved3. “Publishers need to be brought up to speed on what they need to do to make it work,” he says. “Otherwise, what we’re going to have in 50 years’ time is a bunch of links that we thought were persistent, that have just vanished from use, and a system that’s not great for transmitting knowledge.”
Eve’s research found that wealthier publishers were more likely than smaller, less well-resourced ones to have good digital-preservation measures in place. That finding does not surprise Juan Pablo Alperin, a scholarly-communications researcher at Simon Fraser University in Vancouver, Canada. The cost of services such as CLOCKSS, which charges publishers an annual participation fee based on publishing revenue, “can be prohibitive for some journals, especially those operating in places such as the global south”, says Alperin. He adds that it is important that options remain available to publishers and repositories in poorer regions.