http://ipkitten.blogspot.com/2023/08/data-science-needs-law-reflections-on.html

Discussions around artificial intelligence (AI), data science, machine learning, text and data mining (TDM), etc. have made things so that everyone most are aware that access to data is imperative for these activities. In this regard, the role of legal frameworks such as copyright law, data protection laws and contract law in regulating and structuring data access is significant. The Data Science for Social Impact, a research lab at the University of Pretoria recently invited me to speak at its seminar series and I used the opportunity to explore, amongst other things, the role that these legal frameworks play in facilitating or restricting data access within the African continent.
 
For starters, a cursory exploration of the research priorities and interests of data science research in Africa shows a multi-faceted range of activities: automatic speech recognition (ASR) in the context of African languages and dialects, machine translation in African languages, diverse and accessible text and voice datasets, crime detection, public health awareness, understanding elections and democratic processes, etc. In many cases, these require access to datasets in African languages.

Need for data: A survey
From a legal perspective, a taxonomy of data used in AI research and development shows that there is copyright data (i.e., copyright-protected materials as data), personal data, contract-based data, etc. To get a sense of the kind of data that is prevalent in AI research within Africa and also to pinpoint the prevalent data sources, I had a set of survey questions as follows: Nature/ kind of data researchers have used?  Source of data; Did you have to pay to access and use this data? 

What kind of data HAVE you used in your research? Please be as specific (e.g. English language text from journal articles; social media posts in X language; etc.) or as brief as you’d like. 

Online Reviews

African languages text data for a variety of text-based NLP projects

English and another SA official language data from government sites and documents

Social media data, online dataset which are publicly available 

English text f4om journal articles

social media posts in English 

English language text from journal articles, own survey data

Affective recognition data

English language text from journal articles

SDG Research Metadata Repositories

YouTube, open-source data

 

Source of data

Where did you get your data from? (e.g. Facebook; corporate website)

Review Website 

Many times we had to curate the data ourselves. For the text-based projects, we got most of the data from news sources

Government website

Online and publicly available data

Database 

Twitter 

University library, own research

Kaggle

UP Library website

Partner Organisation

Yotube, social media

  

Did you pay to access this data?
 
 
Some legal-based “pain points”: A collection of short stories
I then shared 5 different stories to illustrate the (various) roles that law/legal frameworks play (“pain points”?) in data science research. There was another survey after sharing the stories. This new survey sought to identify which of the stories resonated the most with the experiences of the attendees.

Story 1 (Remove infringing and/or offending content but respect the right to freedom of expression): This story was about the interaction between the legal counsel of a new social media platform and the data science team in the same company on striking the right balance between the need to remove inciting content and freedom of speech. This story highlights the need to bridge the knowledge gaps that data scientists and lawyers have with each other’s work and shows how multi-disciplinary dialogue between the two fields may result in the design of better AI systems.

Story 2 (Housing policy could benefit from Airbnb data): This story related to a study that explored the impact of Airbnb on the housing market and housing policy. The researchers grappled with the question of whether licences were needed to scrap data from the Airbnb platform. Given the uncertainties around the UK text and data mining (TDM) exception, UBDC has decided to cooperate with CREATe so that they can proceed with more legal certainty – which led to the commissioning of the paper by Dr Sheona Burro.

Story 3 (Scrap our website for your training data for free but…): Here, some South African data science researchers have shared that they found a web platform that has tons of text and audio files in African languages. The platform owner agreed to grant access to the files for free to use the data to train NLP models. However, in exchange, the platform owner wanted to solely or at least co-own the NLP model created as a result of the access.

Story 4: Some South African researchers needed public health data on the Covid-19 pandemic in a more accessible format but there were concerns from government health departments about who takes responsibility if that data was wrongly labelled.

Story 5 (Of FAIR, CC0 and release of datasets): Many data science projects result in the creation of labelled datasets which in turn necessitates thinking around the management, administration and use of such datasets. A good number of funders in Africa require data scientists to license datasets using Creative Commons (CC) licences. To comply with the FAIR data principles, AI researchers increasingly use the Creative Commons zero (CC0) licence to release their trained datasets for public use and reuse. There are debates around the extent to which this practice could lead to ‘AI data laundering’.

Stories of data science-law interaction that resonate with data scientists in Africa

The second survey sought evidence of the most prevalent of these stories and/or their kind in the experiences of data scientists in Africa.

Conclusion
While the number of survey responses is too small to make conclusive statements, it was interesting to see the prevalent data sources and how copyright protection may apply to some of those data (e.g., journal articles, databases). It was also impactful to see that in over 90% of the cases, researchers did not have to pay to access that data. That said, it would be interesting to understand the extent to which financial constraints dictated the kind of data that researchers use.

Regarding the experiences of researchers on the interface between data science and law, Story 1 resonated the most with respondents’ experiences. Of course, broadly construed ‘inciting or offending content’ could be copyright infringing content. South Africa’s Electronic Communications and Transactions Act refers to ‘unlawful activity’. Story 3 was the second prevalent experience and correlates with global questions on the same issue. If as some scholars have suggested, “extracting information, patterns and correlations from large sets of copyrighted works should not be subject to the authorization of right holders”, then it would not be the place of rightholders to seek (copyright) ownership of AI tools created/developed as a result of such information extraction. But there is the question of the copyright status of labelled datasets which are essentially (in the case of ‘copyright data’) annotated and tagged copyright data in machine-readable formats. These are just a few of the legal uncertainties that researchers have to grapple with.

To respond to the two surveys, which are still open, see here and here. The seminar recording is available here.

Content reproduced from The IPKat as permitted under the Creative Commons Licence (UK).