Authors Accuse OpenAI of Using Pirate Sites to Train ChatGPT

June 30, 2023

Generative AI models such as ChatGPT have captured the imagination of millions of people, offering a glimpse of what an AI-assisted future might look like.

The new technology also brings up novel copyright questions. Several rightsholders are worried, for example, that their work is being used to train AI without any form of compensation.

How these and other copyright questions will be dealt with is not entirely clear. Governments around the world are taking different approaches, with the U.S. Congress recently stating that it doesn’t plan to overreact. Meanwhile, rightsholders don’t intend to stand idly by.

Authors Sue OpenAI for Copyright Infringement

This week, authors Paul Tremblay and Mona Awad filed a class action lawsuit against OpenAI, accusing ChatGPT’s parent company of copyright infringement and violating the DMCA, among other things. According to the authors, ChatGPT was partly trained on their copyrighted works, without permission.

The proof for this claim is seemingly simple. The authors never gave OpenAI permission to use their works, yet ChatGPT can provide accurate summaries of their writings. This information must have come from somewhere.

“Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works—something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works,” the complaint reads.

Pirate Training

While these types of claims are not new, this week’s lawsuit alleges that OpenAI used pirate websites as training input. This potentially includes Z-Library, a shadow library of millions of pirated books that’s at the center of a criminal prosecution by the U.S. Department of Justice.

OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but an older paper references two databases: “Books1” and “Books2”. The first contains roughly 63,000 titles and the second around 294,000 titles.

These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.

“The only ‘internet-based books corpora’ that have ever offered that much material are notorious ‘shadow library’ websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems.”

Based on these data points, the complaint concludes that OpenAI committed copyright infringement. As compensation, the plaintiffs demand statutory damages, which can reach $150,000 per work. Additional damages for the alleged removal of copyright management information, in violation of the DMCA, are also on the table.

AI, Piracy and Copyright

There is no direct evidence that OpenAI used pirate sites to train ChatGPT. That said, it is no secret that some AI projects have trained on pirated material in the past, as an excellent summary from Search Engine Journal highlights.

The mainstream media has picked up this issue too. The Washington Post previously reported that the “C4 data set,” which Google and Facebook used to train their AI models, included Z-Library and various other pirate sites.

“At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set,” the article added.

The present lawsuit will be closely watched by AI enthusiasts and rightsholders. It may result in OpenAI having to disclose some of its training data, which would be interesting in its own right.

Even if it transpires that ChatGPT was trained with pirated books, the court would still have to decide whether that amounted to copyright infringement. Some experts believe that this type of AI training can be considered fair use.

Fair use protects transformative uses of copyrighted works that don’t compete with the original content. According to several experts, that defense could well apply to AI training cases.

—

A copy of the complaint filed against OpenAI at the federal court for the Northern District of California is available here (pdf)

From: TF, for the latest news on copyright battles, piracy and more.

TorrentFreak
