NVIDIA: Copyrighted Books Are Just Statistical Correlations to Our AI Models

August 17, 2024

Over the past two years, AI developments have progressed at a rapid pace.

This includes large language models, which are typically trained on broad datasets of text; the more, the better.

When AI hit the mainstream, it became apparent that rightsholders are not always pleased that their works were used to train AI. This applies to photographers, artists, music companies, journalists, and authors, some of whom formed groups to file copyright infringement lawsuits to protect their rights.

Book authors, in particular, complained about the use of pirated books as training material. In various lawsuits, companies including OpenAI, Microsoft, Meta, and NVIDIA are accused of using the ‘Books3’ dataset, which was scraped from the library of ‘pirate’ site Bibliotik.

After the Books3 accusations hit mainstream news, many AI companies stopped using this source. Meanwhile, anti-piracy companies helped publishers to take the alleged rogue libraries offline to prevent further damage.

These enforcement efforts aren’t limited to Books3, or to the English language for that matter; earlier this week, anti-piracy group BREIN reported that it helped to remove a Dutch-language dataset.

Authors Sued NVIDIA

Earlier this year, several authors sued NVIDIA over alleged copyright infringement. The class action lawsuit alleged that the company’s AI models were trained on copyrighted works and specifically mentioned Books3 data. Since this happened without permission, the rightsholders demand compensation.

The lawsuit was followed up by a near-identical case a few weeks later, and NVIDIA plans to challenge both in court by denying the copyright infringement allegations.

In its initial response, filed a few weeks ago, NVIDIA did not deny that it used the Books3 dataset. Like many other AI companies, it believes that the use of copyrighted data for AI training is a prime example of fair use; especially when the output of the model doesn’t reproduce copyrighted works.

The authors clearly have a different take. They allege that NVIDIA willingly copied an archive of pirated books to train its commercial AI model, and are demanding damages for direct copyright infringement.

Trial in Two Years…?

This week, the authors and NVIDIA filed a joint case management statement at a California court, laying out a preliminary timeline. This shows that both parties intend to take their time to properly litigate the matter.

The authors expect that the parties will need until October next year to gather facts and evidence during the discovery phase. An eventual jury trial is penciled in for a full year after that, in November 2026.

NVIDIA doesn’t have a hard trial deadline in mind but stresses that the fair use issue is key, and should be addressed early and efficiently. For starters, the company intends to file a motion for summary judgment within a year, after which both parties should have more clarity.

Facts, Figures, and Statistical Correlations

Aside from the timeline, NVIDIA also shared its early outlook on the case. The company believes that AI companies should be allowed to use copyrighted books to train their AI models, as these books are made up of “uncopyrightable facts and ideas” that are already in the public domain.

The argument may seem surprising at first; the authors own copyrights and, as far as they’re concerned, use of pirated copies leads to liability as a direct infringer. However, NVIDIA goes on to explain that its AI models don’t see these works that way.

AI training doesn’t involve any book reading skills, or even a basic understanding of a storyline. Instead, it simply measures statistical correlations and adds these to the model.

“Training measures statistical correlations in the aggregate, across a vast body of data, and encodes them into the parameters of a model. Plaintiffs do not try to claim a copyright over those statistical correlations, asserting instead that the training data itself is ‘copied’ for the purposes of infringement,” NVIDIA writes.

Put differently, NVIDIA argues that its AI models don’t use the books the way humans do; neither do they reproduce them. The models simply examine the ‘facts and ideas’ in the books, ‘transforming’ their original purpose to build a complex AI model. That qualifies as fair use, the company states.

“Plaintiffs cannot use copyright to preclude access to facts and ideas, and the highly transformative training process is protected entirely by the well-established fair-use doctrine.

“Indeed, to accept Plaintiffs’ theory would mean that an author could copyright the rules of grammar or basic facts about the world. That has never been the law, for good reason,” NVIDIA adds.

Fair Use Battle

According to NVIDIA, the lawsuit boils down to two related questions. First, whether the authors’ direct infringement claim is essentially an attempt to claim copyright on facts and grammar. Second, whether making copies of the books is fair use.

The chip company believes that it didn’t do anything wrong and cites several cases that will likely appear in its future filings. They include the Authors Guild v. Google lawsuit, where the court of appeals concluded that copying books to create a searchable database was fair use. As a result, Google Books still exists today.

NVIDIA is not the only company that will rely on a fair use defense in response to AI-related copyright infringement claims. Many other companies are taking the same approach, so whether it succeeds will prove key for the future of AI model development.

What makes these matters more complex is that AI models and technologies have different applications, so what may be fair use in one case could be copyright infringement in another.

For example, earlier this week, a California federal court ruled that a copyright lawsuit filed by visual artists against DeviantArt, Midjourney, Runway AI, and Stability AI can move forward. These defendants are also accused of copyright infringement, but that lawsuit deals with images and image outputs instead.

Given the parties involved and the potential damages at stake, these lawsuits will keep the courts busy for years to come. Even after the first ‘final’ verdicts come in, there will be appeals, and some questions may eventually end up at the Supreme Court.

Meanwhile, the actions of NVIDIA and other AI companies will be closely monitored by copyright watchers. This includes recent press reports accusing NVIDIA, among others, of scraping both videos and transcripts from YouTube to train their respective models.

—

A copy of the joint case management statement in Nazemian vs. Nvidia is available here (pdf)

From: TF, for the latest news on copyright battles, piracy and more.

TorrentFreak
