http://ipkitten.blogspot.com/2024/12/guest-post-ai-training-data-copyright.html
Earlier this week, the IPKat announced the release of the long-awaited UK consultation on Artificial Intelligence (AI) and copyright. Now, we are pleased to host a further commentary by Angela Daly (University of Dundee). Here’s what she writes:
AI training data, copyright and the UK consultation*
by Angela Daly
This week, the UK Government released its latest consultation on AI and copyright, with a particular focus on the inputs and outputs of AI models, as covered in this breaking IPKat post. The consultation demonstrates the UK’s current ‘betwixt and between’ situation on this issue post-Brexit: aligned with neither the EU nor the US, and trying to reconcile a stand-off over access to data for training AI models between its creative industries on the one hand and, on the other, its AI industry (which facilitates the UK’s geopolitical aspirations to be an AI leader). However, in striking this balance, the voices of creators and performers themselves need to be heard, as does the broader public interest, in an era in which AI companies are becoming ever more powerful politically and economically. Although this consultation concentrates on copyright specifically, these bigger issues are also at stake.
Training data
While the consultation addresses other aspects of the AI and copyright relationship, in this post I focus on its relevance to training data, as I have been working on the regulation and governance of data sources for training AI for some time – including but not limited to copyright. Indeed, there are broader debates on regulating AI training data involving privacy and data protection; biases in data which can potentially give rise to discrimination or discriminatory outcomes; the working conditions of the human input needed to train AI; and the environmental costs of AI in terms of energy and other natural resources. Against this backdrop, larger issues emerge as regards the power wielded by AI companies, especially in a consolidated market, over social, economic and political aspects of our lives in the medium to long term.
Data for training AI models can come from various sources: data held internally within an organization, data bought from external sources, open datasets, and data scraped from the internet. The last of these, scraping internet data, has been the most controversial so far, giving rise to concerns from both intellectual property and data protection perspectives, and to nascent litigation in various jurisdictions where the scraping has been done without the permission of the copyright/database right holders.
Text and data mining exceptions
One mechanism permitting data scraping can be found in the text and data mining (TDM) exceptions to copyright infringement which exist in both the UK and the EU. The UK has a limited exception for non-commercial TDM in section 29A of the Copyright, Designs and Patents Act 1988, introduced in 2014. The EU equivalents are contained in Articles 3 and 4 of Directive 2019/790 on Copyright in the Digital Single Market. Prior to the Directive, some Member States (including the UK, a former Member State) had similar exceptions in their national laws. Art 3 concerns TDM for scientific research, whereas Art 4 is a broader exception which allows TDM without restriction on purpose, so long as the content is lawfully accessible and rightsholders have not explicitly opted out of allowing their content to be used for mining purposes.
The UK currently lacks an equivalent of Art 4 of the EU CDSM Directive. The previous UK Government wanted to add an exception for commercial TDM, but after a consultation during which rightsholders expressed strong opposition to the proposals, it abandoned the plans in 2023. A further attempt, under the auspices of the UK IPO, to devise a code of practice between rightsholders and the AI industry in 2023/4 failed to reach agreement and was also abandoned. With a change of government earlier in 2024, but continuity in promoting the AI industry as part of the UK’s post-Brexit global positioning, it has fallen to the new Labour administration to revive the proposal in the current consultation.
Litigation
Reopening this issue in the UK is timely, as there are at least two attempts to litigate some of the broader issues in the UK courts: Getty v Stability AI and Mumsnet v OpenAI. Neither of these proceedings appears to involve the UK’s existing non-commercial TDM exception, presumably because the alleged infringing activities are commercial in nature. Instead, in the Getty case (which seems more advanced than the Mumsnet one and looks likely to go to trial in summer 2025), Stability AI’s defence appears to be based on the UK’s existing fair dealing exception (in particular, pastiche) and on the claim that no infringement took place because the AI development occurred outside of the UK.
[Image: From training data to training Kats]
In any case, even in jurisdictions where commercial TDM exceptions exist, there are questions about precisely which acts these exceptions cover: the data scraping itself, or also other activities such as making available the models created from the mined data. We have seen – and the IPKat has covered – what seems to be the first litigation on AI and the EU TDM exceptions in the German case Kneschke v LAION, where the court found that the activities did fall within the EU’s TDM exceptions. The decision has been criticized for its incompleteness, especially vis-à-vis the subsequent act of making the dataset resulting from the text and data mining publicly available on its website.
UK consultation
Turning now to the UK’s new consultation, as regards accessing training data for AI, the Government makes the following proposals:
- A broad text and data mining exception, similar to that contained in Art 4 of the Copyright DSM Directive (noting that there is some uncertainty about the operation of that exception), which would allow commercial text and data mining while permitting rightsholders to opt out of TDM and require licences for the use of their material. The UK Government would seek further clarity on what a rights reservation would entail, and proposes a standardized machine-readable format for such reservations for works online (an illustrative sketch of what such a reservation might look like in practice follows this list).
- Licensing requirements: the UK Government acknowledges that copyright holders may not be the original creators and performers of works, and requests input on their position when it comes to agreements between copyright holders and AI developers on training data.
- Transparency requirements for AI developers as regards the provenance of the content on which their models are trained: the UK Government acknowledges the model for this in Article 53(1)(d) of the EU’s AI Act, and seeks views on whether that model should be followed in the UK. It suggests that any new regulatory obligations may be introduced as part of future AI regulation in the UK rather than as an amendment to existing copyright law.
- Clarification of wider copyright law, including the position of models trained in other jurisdictions: this implicitly seeks to address Stability AI’s argument, in the proceedings brought against it by Getty, that UK copyright law did not apply because it had trained its model outside of the UK (although there was evidence to the contrary). The Government acknowledges that the territoriality of copyright means that UK copyright law would not apply to models trained outside the UK, but considers that its proposals would make the UK attractive for training. It would also seek to encourage those offering AI models in the UK to comply with UK law, even if this will not be a legal requirement.
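To make the idea of a machine-readable rights reservation more concrete, the following is a minimal sketch of how an AI crawler might check for one before mining a web page. It is not the consultation’s specification: it assumes the conventions of one existing candidate format, the W3C TDM Reservation Protocol (TDMRep), under which a reservation can be signalled via an HTTP “tdm-reservation” response header or an HTML meta tag, and the function name and URL below are purely illustrative.

import re

import requests


def tdm_rights_reserved(url: str) -> bool:
    """Return True if the page at `url` signals a TDM rights reservation.

    Illustrative only: assumes TDMRep-style signals, not any format the
    UK consultation has actually prescribed.
    """
    response = requests.get(url, timeout=10)

    # 1. HTTP response header, e.g. "tdm-reservation: 1"
    if response.headers.get("tdm-reservation", "").strip() == "1":
        return True

    # 2. HTML meta tag, e.g. <meta name="tdm-reservation" content="1">
    #    (simplified pattern: assumes the name attribute precedes content)
    meta = re.search(
        r'<meta[^>]*name=["\']tdm-reservation["\'][^>]*content=["\']1["\']',
        response.text,
        re.IGNORECASE,
    )
    return meta is not None


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    if tdm_rights_reserved(page):
        print("Rights reserved: skip mining or seek a licence.")
    else:
        print("No reservation found: mining may fall within a TDM exception.")

The design point illustrated here is that the burden of detecting and honouring a reservation falls on the party doing the mining, which is why a standardized, widely recognised signal matters: crawlers can only respect reservations they can reliably read.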
The UK Government does seek ‘international interoperability’ in these proposals for UK copyright law. However, rather than the UK leading the way on AI and copyright, it looks like the Government is playing catch-up with the EU, which, for better or for worse, has already enacted broad(er) TDM exceptions and AI training data transparency requirements. Differing substantially from these would be ill-advised from the perspective of transnational interoperability. The US approach, based on its fair use defence, which is also in the process of being litigated, is too distant in form from the UK’s status quo, and a proposal to bring UK law into line with the US by e.g. adopting a fair use exception is unlikely to pass muster with the UK’s content industries, which have already come out in favour of strong copyright protections vis-à-vis AI training data. In any case, the reach of TDM exceptions to acts beyond the initial training also needs to be clarified, both in the UK and the EU.
Concluding thoughts
As regards training data, the UK’s legal and policy position does need to be clarified. However, opening up more UK data to a commercial TDM exception has so far not proved popular with the content industries and creators. It is unclear what benefit – socially, economically, politically – facilitating easier access to training data will actually achieve for the UK itself, despite the boosterism from the previous and current UK governments around AI. The role and position of creators and performers themselves need more consideration: for example, many academic authors have been outraged at the decision of major publisher Taylor & Francis to license data to AI firms, which it seems the publisher is legally able to do under the terms of the licences agreed to by authors. Alignment with the EU on at least some of these matters would be pragmatic, but larger policy questions remain about the balance of power (and revenue) between creators and performers, copyright holders and AI developers. Moreover, a more comprehensive regulatory framework is also needed in the UK, one which addresses training data for AI in a more complete fashion.
* This post is based on presentations which Prof Daly has given at the University of Malta Tricontinental IP conference in December 2024 and during her research visit at the University of Stellenbosch Anton Mostert Chair of Intellectual Property in August 2024.
Content reproduced from The IPKat as permitted under the Creative Commons Licence (UK).