The New York Times’ AI copyright lawsuit shows that forgiveness might not be better than permission

The New York Times’ (NYT) legal proceedings against OpenAI and Microsoft have opened a new frontier in the ongoing legal challenges brought on by the use of copyrighted data to “train”, or improve generative AI.

There are already a variety of lawsuits against AI companies, including one brought by Getty Images against StabilityAI, which makes the Stable Diffusion online text-to-image generator. Authors George R.R. Martin and John Grisham have also brought legal cases against ChatGPT owner OpenAI over copyright claims. But the NYT case is not “more of the same” because it throws interesting new arguments into the mix.

The legal action focuses in on the value of the training data and a new question relating to reputational damage. It is a potent mix of trade marks and copyright and one which may test the fair use defences typically relied upon.

It will, no doubt, be watched closely by media organisations looking to challenge the usual “let’s ask for forgiveness, not permission” approach to training data. Training data is used to improve the performance of AI systems and generally consists of real-world information, often drawn from the internet.

The lawsuit also presents a novel argument – not advanced by other, similar cases – that’s related to something called “hallucinations”, where AI systems generate false or misleading information but present it as fact. This argument could in fact be one of the most potent in the case.

The NYT case in particular raises three interesting takes on the usual approach. First, due to its reputation for trustworthy news and information, NYT content has enhanced value and desirability as training data for use in AI.

Second, due to its paywall, the reproduction of articles on request is commercially damaging. Third, ChatGPT’s “hallucinations” are causing reputational damage to the New York Times through, effectively, false attribution.

This is not just another generative AI copyright dispute. The first argument presented by the NYT is that the training data used by OpenAI is protected by copyright, and so they claim the training phase of ChatGPT infringed copyright. We have seen this type of argument run before in other disputes.

Fair use?

The challenge for this type of attack is the fair use shield. In the US, fair use is a doctrine in law that permits the use of copyrighted material under certain circumstances, such as in news reporting, academic work and commentary.

OpenAI’s response so far has been very cautious, but a key tenet in a statement released by the company is that their use of online data does indeed fall under the principle of “fair use”.

Anticipating some of the difficulties that such a fair use defence could potentially cause, the NYT has adopted a slightly different angle. In particular, it seeks to differentiate its data from standard data. The NYT intends to use what it claims to be the accuracy, trustworthiness and prestige of its reporting. It claims that this creates a particularly desirable dataset.

It argues that as a reputable and trusted source, its articles have additional weight and reliability in training generative AI and are part of a data subset that is given additional weighting in that training.

It argues that by largely reproducing articles upon prompting, ChatGPT is able to deny the NYT, which is paywalled, visitors and revenue it would otherwise receive. This introduction of some aspect of commercial competition and commercial advantage seems intended to head off the usual fair use defence common to these claims.

It will be interesting to see whether the assertion of special weighting in the training data has an impact. If it does, it sets a path for other media organisations to challenge the use of their reporting in the training data without permission.

The final element of the NYT’s claim presents a novel angle to the challenge. It suggests that damage is being done to the NYT brand through the material that ChatGPT produces. While almost presented as an afterthought in the complaint, it may yet be the claim that causes Open AI the most difficulty.

This is the argument related to AI “hallucinations”. The NYT argues that this is compounded because ChatGPT presents the information as having come from the NYT.