http://ipkitten.blogspot.com/2025/02/guest-post-uks-ai-and-copyright.html
The IPKat has received and is pleased to host the following commentary from Katfriends Adrian Aronsson-Storrier and Sam Berriman (both Lewis Silkin LLP), pondering on the implications of a potential UK reform of the existing text and data analysis defence in section 29A CDPA and tackling what is often an overlooked angle in copyright debates: data protection law. Here’s what they write:
The UK’s AI and copyright consultation – will data protection law render any commercial TDM exception ineffective?
by Adrian Aronsson-Storrier and Sam Berriman
As regular readers of the IPKat will be aware, the UK government is currently undertaking a consultation on AI and copyright, previously covered here and here. As a part of the consultation, the government has indicated that its preferred ‘proposed approach’ is to introduce a new commercial text and data mining (TDM) exception into UK law, which would apply to data mining for any purpose, but only apply where the relevant copyright owner has not reserved their rights in relation to the work through an agreed mechanism. Such rights reservations are often informally described as ‘opt outs’.
The Information Commissioner’s Office (ICO) response to its consultation series on generative AI may however undermine – or at least complicate – any expected advantage from the government’s proposed opt-out commercial TDM exception in permitting AI developers “to train on large volumes of web-based material without risk of infringement”. Under the ICO guidance, AI developers may be constrained in their ability to lawfully engage in web scraping and may instead need to use licensed personal data sets for their training activities – unless they are able to evidence why they cannot do so – undermining any proposed increased flexibility under copyright law for AI developers to train on web-based material. Will the ICO’s approach to website scraping be enough to persuade the government to rethink its planned approach to the proposed TDM exception?
The ICO response to the consultation series on generative AI
In January 2024, the ICO launched a consultation series on how aspects of data protection law should apply to the development and use of generative AI models, and it released its outcomes report in December 2024. The consultation addressed a range of emerging questions relating to the application of UK data protection law to the development and use of generative AI, including how data protection law applies to scraping web data to train AI models.
By way of brief overview of data protection law, under the UK GDPR, personal data means any information relating to an identified or identifiable natural person. This can include information such as a person’s name, address or IP address, or their appearance or voice. Such information does not have to be in written form: photos and audio or video recordings of a person are likely to capture personal data. Equally, a significant portion of online material is likely to contain personal data, and there is likely to be a significant overlap in relation to specific pieces of content which are protected by both copyright law and data protection law. For example, a vocal recording in a musical performance is likely to be protected under copyright law and to constitute personal data. Similarly, this blog post is protected as a literary work under copyright law, but also contains personal data about us as its authors, including for example our names and where we work.
In order to process personal data, a controller must have a lawful basis for doing so. At the outset of the consultation, the ICO took the preliminary view that only the ‘legitimate interest’ basis for data processing would apply to web scraping to develop AI models, and that training generative AI models on web-scraped data could be feasible on that lawful basis where three conditions were met: developers could identify a legitimate interest for processing the web-scraped personal data; the processing was necessary for that legitimate interest; and the individual’s interests did not override the interest being pursued. While the ICO had initially considered that “most generative AI training is only possible using the volume of data obtained though large-scale scraping” and that web scraping would therefore meet the necessity test, the consultation response showed that “the necessity of using web-scraped personal data to train generative AI is not a settled issue”. The ICO noted submissions from the creative industries highlighting the availability of other methods of data collection, such as properly licensed data sets of personal data. Where such licensed data sets exist, the ICO noted that controllers would need to justify why they could not use licensed material as a source of training data, rather than engaging in web scraping.
A consequence of the ICO position is that an AI developer indiscriminately scraping web content to train generative AI may breach the UK GDPR, as such training approaches may not be strictly necessary where licensed sets of personal data are available instead. It may however be open to a developer to argue that it is still necessary to scrape web-based data – for example, if the licensed data set would not provide enough data for the model outputs to achieve a sufficient level of statistical accuracy, or where licensing would incur disproportionate costs. In any event, it is important to distinguish between the ability of an owner of a large collection of copyright-protected works, such as a newspaper publisher, to enter into a copyright licence with AI developers in relation to that collection, and its ability to license the personal data within that collection. It may be that journalists who have contributed to individual articles in the collection have assigned their copyright, but it may be less clear whether the newspaper has the right to license their personal data (or the personal data of the subjects of, contributors to, or commenters on an article) for AI training purposes.
What does this mean for the AI and copyright consultation?
The UK government’s preferred approach in the AI and copyright consultation, namely introducing a commercial TDM exception subject to rightholder opt-out, is designed to “support AI innovation” and change the law to permit AI companies “wide and lawful access to high-quality data” through web scraping without the risk of copyright infringement. If, however, the regulatory position on data protection means that AI developers may not in practice be able to train their models on web-scraped content and must instead enter into licensing agreements to obtain training data, there is seemingly no practical benefit in requiring copyright owners to implement opt-outs. Instead, any opt-out TDM exception would only impose additional transaction costs and complexity on both rightholders and AI developers. For copyright owners this would include the costs of developing rights reservation standards and then implementing those rights reservations on their content. For AI developers the additional costs would include the set-up costs of developing rights reservation standards and ensuring that these were respected by their AI training tools. On the other hand, even if the ICO accepts that it may still be necessary in many instances to scrape web-based data – despite the existence of data sets which can be licensed – the flexibility offered by the TDM exception is still somewhat hampered by the additional compliance burden on developers, who would need to evidence why other sources of data cannot be used.
The ICO outcomes report therefore acts as a timely reminder that copyright is not the only area of law that needs to be considered in relation to the legality of website scraping for AI development, and that a government reform approach that looks to introduce only a copyright and database rights exception may not achieve its intended benefits of “supporting AI innovation”. Indeed, as we have previously written in the context of AI generated outputs and synthetic data markets, where online content is not protected by copyright (for example, if it lacks originality) but is protected by restrictive contractual terms which forbid AI training on that content, a copyright exception will not act as a defence to a breach of contract claim. This is a live issue, with recent allegations that the Chinese AI company DeepSeek may have used outputs from OpenAI’s models when training its own, in breach of the OpenAI Terms of Use; those allegations are framed in contract rather than as claims under copyright law.
These complications also raise an open question as to whether the UK approach to AI regulation is sufficient. The AI Opportunities Action Plan considers the notion of establishing a central body, but with an intention of having a “mandate and higher risk tolerance to promote innovation across the economy”, including powers to issue pilot regulatory sandbox licences which override sector regulations for AI products. Arguably, the regulatory disconnect on issues such as web scraping highlights the need for at least some central guiding body to join the regulatory dots, rather than merely promoting innovation or overriding sectoral regulations. In the meantime, the ICO announced a new AI strategy and Code of Practice for “those who are developing AI products” which will “allow them to innovate and invest responsibly while safeguarding people’s rights”. While the ICO may refine its position, only time will tell whether any refined position reflects a wider appreciation of the challenges faced by AI developers.
Content reproduced from The IPKat as permitted under the Creative Commons Licence (UK).