Intellectual property rules for artificial intelligence are in the middle of a transformation. What was once a matter of speculation has become an intense mix of hopes and fears about the advances AI has brought. How to govern these systems, which already outperform humans in some domains, remains deeply uncertain. The path we choose for protecting and managing this technology will help determine whether AI’s promise in science, medicine, and the betterment of society wins out over the prevailing doomsday fears.
The arrival of AI chatbots like OpenAI’s ChatGPT over the past year has raised considerable alarm. The reactions range from Senate Majority Leader Chuck Schumer of New York, who has said that AI will transform not only workplaces and classrooms but also the very fabric of our lives, to Russian President Vladimir Putin, who has declared that mastery of this field amounts to global dominance. Industry leaders have echoed these concerns, warning of dire consequences if AI goes unchecked.
Legislative efforts to address these concerns are already underway. On June 14, the European Parliament endorsed a new Artificial Intelligence Act after adopting 771 amendments to a 69-page proposal from the European Commission. The Act requires “generative” AI systems like ChatGPT to implement a range of safeguards and disclosures. These include restrictions on subliminal techniques and on techniques that exploit specific groups based on factors such as age or disability. Such systems must also avoid risks to health, safety, fundamental rights, the environment, and democratic principles.
A pressing question worldwide is whether using data to train AI systems requires permission from the authors and performers who created it, many of whom also want credit and compensation for their work.
Several governments have created special exceptions in copyright law to facilitate the collection and use of data for AI training. These exceptions allow certain systems to be trained on online content, including text and images, that is owned by others. The exceptions have met opposition, however, particularly from copyright holders and from critics with broader objections who want to curtail or scale back these services. This friction adds to ongoing debates over AI’s potential for bias, social manipulation, economic disruption, misinformation, fraud, and other harms. Some have even predicted the destruction of humanity.
A common refrain among authors, artists, and performers at recent copyright hearings in the United States is that AI training data should be subject to the “three C’s”: consent, credit, and compensation. Each of these poses practical challenges and runs counter to the permissive text and data mining exceptions that some countries have adopted.
Approaches to intellectual property in training data vary around the world and are still evolving. In the United States, multiple lawsuits are underway to determine how far the fair use exception to copyright applies in this context. The European Union’s 2019 Directive on Copyright in the Digital Single Market included text and data mining exceptions, with a mandatory exception for research and cultural institutions, while giving copyright holders the ability to block the use of their works by commercial services. The United Kingdom proposed a broad exception in 2022 that would have covered commercial uses, although the proposal was put on hold earlier this year. Singapore created an exception in its copyright law in 2021 for computational data analysis, which covers text and data mining, data analytics, and machine learning. Notably, Singapore’s exception requires lawful access to the data and cannot be overridden by contract. China has signaled that it intends to exclude from training data any content that infringes intellectual property rights. An April article from Stanford University’s DigiChina project described this position as “somewhat opaque,” given that the copyright status of data gathered at massive scale from diverse online sources is often unclear. Many countries have no specific exception for text and data mining and have yet to take a position. Indian authorities have indicated that they are not ready to regulate AI at present, but India, like many other countries, is eager to develop its domestic AI sector.
As laws and regulations are drafted, caution is needed to avoid a one-size-fits-all framework. Past legislative efforts on databases show why. In the 1990s, proposals circulated to grant automatic rights in information extracted from databases, including non-copyrighted elements such as statistics. A treaty proposed at the World Intellectual Property Organization in 1996 was one example. In the United States, a diverse coalition of academics, libraries, amateur genealogists, and public interest groups opposed the treaty. But opposition from U.S. companies such as Bloomberg, Dun & Bradstreet, and STATS proved more decisive. They considered the database treaty unnecessary and burdensome, and expected it to complicate the licensing of data they needed to acquire and supply to customers, potentially creating harmful monopolies. The WIPO database treaty failed at a 1996 diplomatic conference, as did subsequent efforts in the United States. The European Union, however, went ahead with a directive on the legal protection of databases. In the years that followed, database investment surged in the United States, while the European Union sought to weaken its directive through court decisions. Internal evaluations in 2005 found that the directive had little proven impact on database production.
Practical considerations add another caveat. The sheer volume of data behind large language models can be daunting. The first version of Stable Diffusion, which generates images from text prompts, was trained on 2.3 billion images. GPT-2, an early predecessor of the model behind ChatGPT, was trained on 40 gigabytes of data. GPT-3 was then trained on 45 terabytes of data, more than a thousand times as much as its predecessor. OpenAI, facing lawsuits over its use of data, has not disclosed the exact size of the dataset used to train its latest model, GPT-4. Even for simple projects, clearing rights to copyrighted content can be complicated. For large projects or platforms, identifying the rightful copyright holders can be nearly impossible, given the practical difficulty of tracing metadata and assessing the agreements among authors, performers, and publishers. In scientific research, a requirement to obtain consent to use copyrighted material could hand significant bargaining power to the publishers of articles, even though the authors often receive no payment.
Distinctions among works and their owners matter. The consequences of the copyright holder of a popular music recording withholding it from a database are different from those of an essential scientific paper being left out because of a licensing dispute. In settings such as hospitals and gene therapy, the exclusion of relevant information from training databases is a particular concern.
Beyond consent, the other two “C’s,” credit and compensation, pose challenges of their own, as ongoing litigation over copyright and patent infringement shows. Still, in fields such as art and biomedical research, there is potential for datasets and applications in which a well-managed AI program could support equitable benefit sharing. One example is the proposal for an open-source dividend to reward the creation of successful biomedical products.
In some cases, the data used for AI training can be decentralized and protected by a range of safeguards. These include strong privacy protections, measures to avoid monopoly control, and the “dataspaces” approaches now being developed for scientific data.
The overarching challenge for intellectual property rights in training data is plain: these rights are inherently national, while the race to develop AI services is global. An AI program can run anywhere with access to electricity and the internet; it needs no large staff or specialized facilities. Entities operating in jurisdictions that impose burdensome or impractical data obligations will inevitably compete against rivals operating in more permissive environments.