On June 28, 2023, the first representative ChatGPT copyright infringement lawsuit finally appeared in the public eye. Two authors filed a copyright class action in the U.S. District Court for the Northern District of California against OpenAI, Inc., alleging that it used their copyrighted books without authorization to train ChatGPT for commercial gain.
Plaintiffs Paul Tremblay and Mona Awad, who reside in Massachusetts, own the copyrights to the works at issue: The Cabin at the End of the World (Tremblay), and 13 Ways of Looking at a Fat Girl and Bunny (Awad). Defendant OpenAI creates and operates the generative AI product ChatGPT, which is currently powered by two underlying large language models, GPT-3.5 and GPT-4.
The Complaint states that although the Plaintiffs did not authorize OpenAI to use their copyrighted books for model training, ChatGPT was able to output summaries of the books in response to prompts, which Plaintiffs say was only possible if the Defendants had included the books in the training corpus.
01 "Caught" outputting book summaries
The plaintiffs claim that a large amount of the content in OpenAI's training dataset consists of copyrighted works, including the plaintiffs' own books. However, OpenAI did not obtain the plaintiffs' consent, nor did it credit the source of the content or pay the necessary fees. Plaintiffs' books carry clear copyright management information, including the publication number, copyright number, name of the copyright holder, and terms of use.
Plaintiffs infer from the facts and information available to them that the only explanation for ChatGPT's ability to accurately generate summaries of specific books is that OpenAI acquired and copied the books at issue and used them to train its large language models (GPT-3.5 or GPT-4).
Plaintiffs' testing found that ChatGPT could produce fairly accurate summaries (although with a few errors) when prompted to summarize the books in question. According to the plaintiffs, this indicates that ChatGPT retains the content of specific works from its training dataset and is able to output the corresponding text. At the same time, because of how a large language model generates content, ChatGPT's output does not contain the original copyright management information.
02 "ChatGPT, how you run!"
What is interesting about this case is that Plaintiffs build their proof of OpenAI's infringement on an introduction to how ChatGPT fundamentally works. The details are summarized below.
OpenAI has so far made public a series of large language models, including GPT-1 (June 2018), GPT-2 (February 2019), GPT-3 (May 2020), GPT-3.5 (March 2022), and the latest, GPT-4 (March 2023). Generally speaking, AI software aims to simulate human logic and reasoning through algorithms with the help of statistical methods. Large language models are a specialized class of AI software used to parse and output natural language.
OpenAI makes ChatGPT available to users via a web page for $20 per month. Users can choose between two versions of ChatGPT, the GPT-3.5 model or the newer GPT-4 model. ChatGPT is also available to software developers in the form of an API, which allows developers to write programs that exchange data with ChatGPT; in that case they are billed on a per-use basis.
Whether the service is provided through the web page or the API, ChatGPT responds to user prompts. If a user asks ChatGPT a question, it will give an answer; if a user gives ChatGPT a command, it will execute it; and if a user asks ChatGPT to summarize a book, it will do that too.
03 Books are the core corpus for large model training
Plaintiffs focus on the fact that, unlike traditional software, which is developed by engineers writing code, large language models are developed by "training": collecting a huge corpus of content from different sources, called a training dataset, and "feeding" it to the model.
During training, the large language model constantly adjusts its output to match as closely as possible the sequences of text in the works it is trained on. It is worth noting that although many kinds of content have been used to train large language models, books have been the core material in the training dataset because they provide the best examples of high-quality long-form writing.
In its paper "Improving Language Understanding by Generative Pre-Training", published in June 2018, OpenAI disclosed that the training of GPT-1 relied on the "BookCorpus" dataset. "BookCorpus" contains 7,000 books in various genres such as adventure, fantasy, and romance. OpenAI pointed out that books are particularly important as a training corpus because they contain long stretches of continuous text, which allow generative models to learn how to process long-range textual information.
A number of AI research and development companies, including OpenAI, Google, and Amazon, have used "BookCorpus" for model training. A team of AI researchers created the dataset in 2015 from books sourced from the Smashwords.com website, but the data is not available to the public. "BookCorpus" did not obtain authorization from the copyright holders for the inclusion of these books.
04 Uncovering the Book Corpus Behind GPT
By searching OpenAI's own voluntary public disclosures (its papers), the plaintiffs hope to show that the training of the GPT series of models is based on, among other things, the unauthorized and infringing use of massive amounts of book content. In its July 2020 paper, "Language Models are Few-Shot Learners," OpenAI disclosed that 15% of the content in the GPT-3 training dataset was derived from two e-book corpora named "Books1" and "Books2".
Although OpenAI did not explain what "Books1" and "Books2" contain, some inferences can be drawn from the available clues: first, both corpora come from the web; second, both corpora are significantly larger than "BookCorpus". According to OpenAI's disclosure, "Books1" is 9 times larger than "BookCorpus" (about 63,000 books) and "Books2" is 42 times larger (about 294,000 books). In reality, very few databases are able to provide a corpus of books of this size.
On the one hand, it is highly probable that "Books1" is derived from Project Gutenberg or the Standardized Project Gutenberg Corpus. Project Gutenberg is an online library of e-books that are "beyond the term of copyright protection"; in September 2020, Project Gutenberg announced that it had included more than 60,000 books. Because its books are no longer under copyright, Project Gutenberg has been widely used for AI model training, and in 2018 a team of AI researchers built on Project Gutenberg to create the Standardized Project Gutenberg Corpus of over 50,000 books.
On the other hand, it is very likely that "Books2" originates from "shadow libraries" on the Internet. The "Books2" dataset contains about 294,000 books, and only the much-criticized "shadow libraries" are able to provide a book corpus of this size. Examples include Library Genesis, Z-Library, Sci-Hub, and Bibliotik. The term "shadow library" was coined by the U.S. Social Science Research Council, in its 2011 article "Media Piracy in Emerging Economies," to refer to websites that infringingly include large numbers of books and make them available to the public for free.
In March 2023, OpenAI released the GPT-4 technical report, but stated that, in view of the competitive landscape and the safety implications of the product, it would no longer disclose the structure and content of the training dataset.
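The size estimates for "Books1" and "Books2" cited in this section can be reproduced with simple multiplication. The following is a minimal sketch, assuming that book counts scale in proportion to corpus size (that is, that the average book length is roughly the same across corpora); that proportionality is an assumption of the estimate, not something OpenAI has disclosed, and the function name is purely illustrative.

```python
# Back-of-the-envelope check of the figures cited in the article.
# Assumption: book count scales linearly with corpus size.
BOOKCORPUS_BOOKS = 7_000  # size of BookCorpus as cited in the article


def estimate_books(multiplier: int, baseline: int = BOOKCORPUS_BOOKS) -> int:
    """Scale the baseline book count by a corpus-size multiplier."""
    return baseline * multiplier


books1 = estimate_books(9)   # "Books1" is described as 9 times larger
books2 = estimate_books(42)  # "Books2" is described as 42 times larger

print(f"Books1: about {books1:,} books")
print(f"Books2: about {books2:,} books")
```

On these assumptions, "Books1" is close to the catalogue size Project Gutenberg reported in 2020, while "Books2" is several times larger than that, which is the gap the plaintiffs' inference relies on.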
05 Six Allegations of Infringement Facing Open AI
Plaintiffs have brought a total of six counts against OpenAI: the first three involve copyright infringement, the fourth involves unfair competition, and the fifth and sixth involve two basic categories of civil liability, negligence (breach of the duty of care) and unjust enrichment.
First, direct copyright infringement. Plaintiffs did not authorize OpenAI to make copies of their books, to make derivative works, or to publicly display or distribute those copies or derivative works.
In addition, Plaintiffs emphasize that because OpenAI's large language models cannot function without the expressive information extracted and retained from Plaintiffs' books, the large language models themselves constitute infringing derivative works in the absence of Plaintiffs' authorization.
Second, vicarious copyright infringement. Plaintiffs emphasize that, in the absence of authorization, each output of the large language model constitutes an infringing derivative work. Because OpenAI has the right and ability to control the output of the large language model and derives a financial benefit from it, OpenAI is liable for vicarious copyright infringement.
Under U.S. case law, "vicarious infringement" and "contributory infringement" together make up the system of indirect copyright infringement. Indirect infringement, as opposed to direct infringement, means that although the infringer does not itself engage in conduct covered by the exclusive rights of copyright (i.e., direct copyright infringement), it provides certain conditions for direct infringement by others.
Third, violation of the copyright management information provisions of the DMCA. By design, ChatGPT's output does not retain the "copyright management information" (CMI) of the works, so Plaintiffs allege that Defendants' intentional removal of the CMI of Plaintiffs' works violates the Digital Millennium Copyright Act (DMCA). In addition, Defendants' unauthorized distribution of infringing derivative works that do not contain CMI also violates the DMCA.
"Copyright management information" is information that identifies the rights holder of a work, the ownership of the rights, and the conditions of use. Whether in the United States or in China, it is a violation of the law to remove or alter copyright management information, or to make available to the public a work from which copyright management information has been removed or altered.
Fourth, unfair competition. OpenAI's unauthorized use of Plaintiffs' copyrighted works for model training violates the California Business and Professions Code because it is improper, unethical, oppressive, and injurious to consumers.
Defendants intentionally designed ChatGPT to output snippets and excerpts of Plaintiffs' works without attribution. By concealing authorship and reproducing the content and ideas of the infringed works, OpenAI developed ChatGPT into a commercial product and gained unfair advantage and renown.
Fifth, negligence, that is, breach of the duty of care. OpenAI is subject to a duty of care under the California Civil Code: a party should act in a reasonable manner with respect to others. This duty is based on industry custom, business practice, the information available to the defendant, and the ability to control based on that information.
Once a defendant collects a plaintiff's copyrighted works for the purpose of training a GPT model, it owes a duty of care not to make infringing use of the works when it can foresee that unauthorized use of the works for model training will cause harm to the plaintiff.
Sixth, unjust enrichment. Plaintiffs invested substantial time and energy in creating the books in question. Because their works were used without authorization to train GPT models, Plaintiffs were deprived of the profit they would otherwise have earned from their works. It would be unjust for Defendants to retain the commercial benefits of using Plaintiffs' works to train GPT models. Unless enjoined or restrained, Defendants' conduct will cause Plaintiffs irreparable harm.
In conclusion: three issues to be explored in this case.
Because this is the first representative action for ChatGPT copyright infringement, a formal judgment from the Northern District of California will be a long time coming. Until then, however, a number of issues raised by the specifics of the plaintiffs' complaint deserve attention and consideration.
Concern #1: Discovering model infringement is not easy.
Training a large language model is essentially an internal, non-visible use of works by a machine, so copyright owners face a practical difficulty in discovering that their works have been infringed. Generally speaking, a copyright owner can only find out that a work was exploited without authorization during the training stage by comparing the content generated by the model with the work itself and finding the two substantially similar. In this case, the plaintiffs were able to allege that their books were infringed by OpenAI's large language models only by inference from the discovery that ChatGPT had output summaries of their works.
However, whether this claim is valid remains to be explored. If ChatGPT's summaries are based on public descriptions of the Plaintiffs' books that it collected from the Internet, rather than on direct copying of and training on the books themselves, the infringement allegation would be undermined. Plaintiffs also admit that ChatGPT's summaries of their books contain a small number of factual errors, which suggests to some extent that the large model may not have learned the books in their entirety.
Concern #2: What rights are being infringed is up for debate.
At present, although "the act of storing work data" can formally fall within the scope of the "reproduction right" under copyright law, there is no consensus on whether the core "act of training on work data" infringes copyright, or on which right it would infringe. In this case, the plaintiffs emphasize that the normal operation and content output of the large language model depend on training on the corpus of works, so that training the large model constitutes copyright infringement and the large model itself constitutes an infringing derivative work.
This claim also remains to be explored. Apart from a few scenarios like the one in this case, in which a prompt asks the model to summarize, condense, or translate a specific copyrighted work, large models in the vast majority of cases receive open-ended content generation commands (not limited to specific works or specific authors' styles). In those cases they basically do not output a specific work, or even fragments of one, and so the output does not constitute copyright infringement.
Concern #3: Upstream and downstream responsibilities need to be clarified.
In the field of large model copyright, the model developer enjoys the relevant rights in the model itself and therefore bears the copyright responsibility arising from model training. For content output by the model, the prevailing industry practice is to use contracts to make clear that the rights and responsibilities belong to the users. The Interim Measures for the Administration of Generative Artificial Intelligence Services, issued on July 10, 2023 by the Cyberspace Administration of China, also explicitly endorse that "the provider should sign a service agreement with the user to clarify the rights and obligations of both parties."
It is worth paying attention to, from the plaintiff's lawsuit request, also follow the