Saturday 18 May 2024

Will the AI Act Bring More Clarity to the Regulation of Text and Data Mining in the EU?



Maryna Manteghi, PhD researcher, University of Turku, Finland


Photo credit: mikemacmarketing and Liam Huang, on Flickr via Wikimedia Commons





The Artificial Intelligence Act (AIA), “the first-ever legal framework on AI, which addresses the risks of AI and positions Europe to play a leading role globally” (according to the European Commission), contains two provisions relevant to copyright. In particular, Article 53(1)(c) and (d) requires providers of general-purpose AI models, first, to comply with “Union law on copyright and related rights…in particular to identify and comply with…a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790,” and, second, to “draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model…”. These provisions were added to the text of the Act to address the risks associated with the development and exploitation of generative AI (GenAI) models such as ChatGPT, Midjourney, DALL-E, GitHub Copilot and others (see the Draft Report of the European Parliament).


TDM in the context of copyright


AI systems have to be trained on huge amounts of existing data, including copyright-protected works, to be able to perform a wide range of challenging tasks and generate different types of content (e.g., texts, images, music, computer programs) (for technical aspects see e.g., Avanika Narayan et al). In other words, GenAI models have to learn the inherent characteristics of real-world data to generate creative content on demand. AI developers employ various automated analytical techniques to train their systems on actual data. One example is text and data mining (TDM), a concept that covers the techniques and methods needed to extract new knowledge (e.g., patterns, insights, trends) from Big Data (for a general overview of TDM techniques and methods see e.g., Jiawei Han et al). A computer typically makes copies of the collected works in order to mine them and train AI algorithms.
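To give a rough intuition of what this means in practice, the toy sketch below mimics the two steps just described: the computer first makes copies of the collected texts (the reproduction step that is relevant for copyright) and then extracts a simple pattern, namely the most frequent terms. This is a deliberately minimal illustration, not an actual GenAI training pipeline; the corpus and function name are invented for the example.

```python
from collections import Counter
import re

def mine_corpus(documents):
    """Toy TDM: copy each document into memory (the reproduction step),
    then extract a simple pattern -- the most frequent terms."""
    copies = [doc.lower() for doc in documents]  # copies made for analysis
    tokens = []
    for text in copies:
        tokens.extend(re.findall(r"[a-z]+", text))  # crude tokenisation
    return Counter(tokens).most_common(3)           # the "extracted knowledge"

corpus = ["The mining of text.", "Text and data mining of text."]
print(mine_corpus(corpus))  # e.g. [('text', 3), ('mining', 2), ('of', 2)]
```

Even this trivial analysis requires reproducing the works in full before any pattern can be extracted, which is precisely why TDM intersects with the reproduction right discussed below.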


TDM requires the processing of huge amounts of data, so training datasets may also contain copyright-protected works (e.g., books, articles, pictures). However, unauthorised copying of protected works may potentially infringe one of the exclusive rights of copyright holders, in particular the right of reproduction granted to authors under Article 2 of the Directive on copyright in the information society (the InfoSoc Directive). To prevent the risk of copyright infringement, providers of GenAI have to negotiate licenses over protected works or rely on the so-called “commercial” TDM exception provided under Art. 4 of EU Directive 2019/790 on copyright in the digital single market (CDSM), which, as we have seen above, is referred to in the AI Act. The provision was adopted alongside the “scientific research” TDM exception (Art. 3 of CDSM) to provide more legal certainty specifically for commercially operating organisations.


However, providers of GenAI models have to meet a two-fold requirement to enjoy the exception of Art. 4 of CDSM. First, they need to obtain “lawful access” to the data they wish to mine, through contractual agreements, subscriptions, open access policies or other lawful means, or use only materials which are freely available online (Art. 4 and Recital 14 of CDSM). Second, AI developers have to check whether rightholders have reserved the use of their works for TDM by machine-readable means, including metadata and the terms and conditions of a website or a service, or through contractual agreements or unilateral declarations (Art. 4(3) and Recital 18 of CDSM).
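The Directive does not prescribe a single format for such machine-readable reservations; in practice, one common mechanism for signalling crawler restrictions is a site's robots.txt file. The sketch below, using Python's standard-library robots.txt parser, shows how a miner might check such a signal before fetching a work. The robots.txt content, the bot name and the URL are all hypothetical, and robots.txt is only one of several possible reservation mechanisms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical reservation expressed in a site's robots.txt:
# all crawlers are disallowed from the /articles/ section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /articles/
"""

def tdm_allowed(robots_txt: str, bot_name: str, url: str) -> bool:
    """Return True if the given bot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot_name, url)

# The /articles/ path is reserved; other paths are not.
print(tdm_allowed(ROBOTS_TXT, "ExampleTDMBot", "https://example.com/articles/1"))
print(tdm_allowed(ROBOTS_TXT, "ExampleTDMBot", "https://example.com/about"))
```

Whether a plain robots.txt entry satisfies the legal standard of an "appropriate" machine-readable reservation under Art. 4(3) CDSM is itself debated; the sketch only illustrates the technical check, not a legal conclusion.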


The copyright-related obligations of the AI Act: a closer look


It appears that Article 53(1)(c) of the Artificial Intelligence Act ultimately dispelled all doubts regarding the relevance of Article 4 of CDSM to AI training by obliging providers of GenAI to comply with the reservation right granted to rightholders under this provision. Arguments in favour of this reading can also be derived from the broad definition of TDM included in the text of CDSM (“any automated analytical technique aimed at analysing text and data in digital form in order to generate information…”, Article 2(2) CDSM) and from the aim of Article 4 of CDSM, which is to enable the use of TDM by both public and private entities for various purposes, including the development of new applications and technologies (Recital 18 of CDSM) (see e.g., Rosati here and here; Ducato and Strowel; and Margoni and Kretschmer).


Further, the new transparency clause of the AI Act, requiring providers of GenAI models to reveal the data used for pre-training and training of their systems (Article 53(1)(d) of AIA and Recital 107), could also bring more certainty in the context of AI training and copyright. Recital 107 of the Act clarifies that providers of GenAI models would not be required to provide a technically detailed summary of every source from which mined data were scraped; it would be sufficient to list “the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used”. This clarification could make the practical implementation of the transparency obligation less burdensome for AI developers, given the huge masses of data used for mining (training) of AI algorithms. The transparency obligation under Article 53(1)(d) of the Act would allow rightholders to determine whether their works have been used in training datasets and, if needed, to opt out of TDM. Therefore, the provision would effectively operationalise the “opt-out” mechanism of Article 4(3) of CDSM.


However, the “commercial” TDM exception may not be a proper solution for AI developers, as their ability to train (and thus develop) their systems would depend on the discretion of rightholders. What exactly does this mean? Put simply, there are several ways in which rightholders could restrict or even prohibit the application of TDM techniques. First, the exception can be overridden by contract under Article 7 of the CDSM Directive. Second, rightholders may restrict access to their works for TDM by not issuing licenses or by raising licensing/subscription fees. Moreover, even if users are lucky enough to obtain “lawful access” to protected works, rightholders can prohibit TDM in contracts, in the terms and conditions of their websites, or by employing technological protection measures. Third, rightholders may employ the “opt-out” mechanism to reserve the use of their works for TDM, thereby obliging TDM users to pay twice: first to acquire “lawful access” to data and a second time to mine (analyse) it (see Manteghi). In this sense, rightholders would effectively control innovation and technological progress in the EU, as the development of AI technologies heavily relies on TDM tools.


Concluding thoughts


To sum up, the copyright-related obligations of the AI Act could alleviate, to some extent, the conflict of interests between copyright holders and providers of GenAI models: confirming that the training of AI models is covered by a specific copyright exception and subject to a transparency obligation brings more clarity to the regulation of AI development. However, major concerns remain regarding the excessive power granted to rightholders under the “lawful access” requirement and the right of reservation of Article 4 of CDSM. The author of this blog does not support the idea of making copyright-protected works freely available for everyone, but rather wants to emphasise the risks of the deceptively broad “commercial” TDM exception. The future of AI development, innovation and research should not be left to the discretion of copyright holders. The purpose of AI training is not to directly infringe copyright holders' exclusive rights but to extract new knowledge for developing advanced AI systems that would benefit many areas of our lives. Therefore, the specific TDM exceptions should balance the competing interests in practice and not tip the scales in favour of a particular stakeholder, which would only create more tension in the rapidly evolving algorithmic society.
