No one knows for sure exactly what ChatGPT — the most famous product of artificial intelligence — and similar tools were trained on. But millions of academic papers scraped from the web are among the reams of data that have been fed into large language models (LLMs) that generate text, and similar algorithms that make images (see Nature 632, 715–716; 2024). Should the creators of such training data get credit — and if so, how? There is an urgent need for more clarity around the boundaries of acceptable use.
Few LLMs — even those described as ‘open’ — have developers who are upfront about exactly which data were used for training. But information-rich, long-form text, a category that includes many scientific papers, is particularly valuable. According to an investigation by The Washington Post and the Allen Institute for Artificial Intelligence in Seattle, Washington, material from the open-access journal families PLOS and Frontiers features prominently in a data set called C4, which has been used to train LLMs such as Llama, made by the technology giant Meta. It is also widely suspected that, just as copyrighted books have been used to train LLMs, so have non-open-access research papers.
One fundamental question concerns what is allowed under current laws. The World Intellectual Property Organization (WIPO), based in Geneva, Switzerland, says that it is unclear whether collecting data or using them to create LLM outputs is considered copyright infringement, or whether these activities fall under one of several exemptions, which differ by jurisdiction. Some publishers are seeking clarity in the courts: in an ongoing case, The New York Times has alleged that the tech firms Microsoft and OpenAI — the company that developed ChatGPT — copied its articles to train their LLMs. To avoid the risk of litigation, more AI firms are now, as recommended by WIPO, purchasing licences from copyright holders for training data. Content owners are also adding machine-readable rules to their websites, most commonly in a robots.txt file, that tell tools scraping data for LLMs whether they are allowed to do so.
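In a robots.txt file, a site lists which crawlers may fetch which pages; AI-training crawlers such as OpenAI's GPTBot and Common Crawl's CCBot identify themselves by name and can be blocked this way. The sketch below, a minimal illustration using Python's standard urllib.robotparser module, shows how such rules can be checked; the publisher domain and the rule set are invented for the example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Rules a publisher might serve at https://example-publisher.org/robots.txt
# (the site and the rules are illustrative; real publishers set their own policies).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # in practice: parser.set_url(...) then parser.read()

ARTICLE = "https://example-publisher.org/articles/some-open-access-paper"

# Ask whether a named crawler (here, AI-training bots) may fetch the page.
for agent in ("GPTBot", "CCBot", "*"):
    allowed = parser.can_fetch(agent, ARTICLE)
    print(f"{agent:>6}: {'allowed' if allowed else 'disallowed'}")
```

Compliance with such rules is voluntary on the crawler's part, which is one reason the opt-out and transparency mechanisms discussed below matter.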
Things get much fuzzier when material is published under licences that encourage free distribution and reuse, but that can still have certain restrictions. Creative Commons, a non-profit organization in Mountain View, California, that aims to increase sharing of creative works, says that copying material to train an AI should not generally be treated as infringement. But it also acknowledges concerns about the impact of AI on creators, and how to ensure that AI that is trained on ‘the commons’ — the body of freely available material — contributes to the commons in return.
These broader questions of fairness are particularly pressing for artists, writers and coders, whose livelihoods depend on their creative outputs and whose work risks being replaced by the products of generative AI. But they are also highly relevant for researchers. The move towards open-access publishing explicitly favours the free distribution and reuse of scientific work — and this presumably applies to LLMs, too. Learning from scientific papers can make LLMs better, and some researchers might rejoice if improved AI models could help them to gain new insights.
Credit where it is due
But others are worried about principles such as attribution, the currency by which science operates. Fair attribution is a condition of reuse under CC BY, a commonly used open-access copyright licence. In jurisdictions such as the European Union and Japan, copyright rules include exemptions for certain uses, such as text and data mining in research, in which sources are analysed automatically to find patterns. Some scientists see data-scraping to build proprietary LLMs as going well beyond what these exemptions were intended to allow.
In any case, attribution is impossible when a large commercial LLM uses millions of sources to generate a given output. But when developers create AI tools for use in science, a method known as retrieval-augmented generation could help. This technique doesn’t apportion credit to the data that trained the LLM, but does allow the model to cite papers that are relevant to its output, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle.
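In outline, retrieval-augmented generation couples a model with a search step: relevant documents are fetched at query time, the answer is generated from them, and those documents can be cited alongside the output. The toy sketch below illustrates only that flow; the corpus, the word-overlap scorer and the generate() stub are hypothetical stand-ins, and a real scientific assistant would use a proper retriever and an actual language model rather than anything shown here.

```python
from collections import Counter

CORPUS = [  # hypothetical papers the assistant is allowed to cite
    {"id": "doi:10.0000/paper-a", "text": "CRISPR base editing in wheat improves drought tolerance."},
    {"id": "doi:10.0000/paper-b", "text": "Large language models can summarize biomedical literature."},
    {"id": "doi:10.0000/paper-c", "text": "Attribution norms shape credit and reuse in open science."},
]

def score(query: str, doc_text: str) -> int:
    """Crude relevance score: the number of query words shared with the document."""
    q = Counter(query.lower().split())
    d = Counter(doc_text.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Return the k most relevant papers for the query."""
    return sorted(CORPUS, key=lambda doc: score(query, doc["text"]), reverse=True)[:k]

def generate(query: str, sources: list[dict]) -> str:
    """Stand-in for an LLM call: compose an answer and cite the retrieved papers."""
    citations = ", ".join(doc["id"] for doc in sources)
    return f"Answer to '{query}', grounded in retrieved sources [{citations}]."

if __name__ == "__main__":
    question = "How do language models summarize literature?"
    papers = retrieve(question)
    print(generate(question, papers))
```

The point of the design is that credit attaches at query time: whatever the model was trained on, the papers it actually drew on for a given answer are surfaced and citable.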
Giving researchers the ability to opt out of having their work used in LLM training could also ease their worries. Creators have this right under EU law, but it is tough to enforce in practice, says Yaniv Benhamou, who studies digital law and copyright at the University of Geneva. Firms are devising innovative ways to make it easier. Spawning, a start-up company in Minneapolis, Minnesota, has developed tools to allow creators to opt out of data scraping. Some developers are also getting on board: OpenAI’s Media Manager tool, for example, allows creators to specify how their works can be used by machine-learning algorithms.
Greater transparency can also play a part. The EU’s AI Act, which came into force on 1 August 2024, requires developers to publish a summary of the works used to train their AI models. This could bolster creators’ ability to opt out, and might serve as a template for other jurisdictions. But it remains to be seen how this will work in practice.
Meanwhile, research should continue into whether there is a need for more-radical solutions, such as new kinds of licence or changes to copyright law. Generative AI tools are using a data ecosystem built by open-source movements, yet often ignore the accompanying expectations of reciprocity and reasonable use, says Sylvie Delacroix, a digital-law scholar at King’s College London. The tools also risk polluting the Internet with AI-generated content of dubious quality. By failing to redirect users to the human-made sources on which they were built, LLMs could disincentivize original creation. Without putting more power into the hands of creators, the system will come under severe strain. Regulators and companies must act.