Simon Willison’s Weblog

Subscribe
Atom feed for copyright

13 items tagged “copyright”

2025

Baroness Kidron’s speech regarding UK AI legislation (via) Barnstormer of a speech by UK film director and member of the House of Lords Baroness Kidron. This is the Hansard transcript but you can also watch the video on parliamentlive.tv. She presents a strong argument against the UK's proposed copyright and AI reform legislation, which would provide a copyright exemption for AI training with a weak-toothed opt-out mechanism.

The Government are doing this not because the current law does not protect intellectual property rights, nor because they do not understand the devastation it will cause, but because they are hooked on the delusion that the UK's best interests and economic future align with those of Silicon Valley.

She throws in some cleverly selected numbers:

The Prime Minister cited an IMF report that claimed that, if fully realised, the gains from AI could be worth up to an average of £47 billion to the UK each year over a decade. He did not say that the very same report suggested that unemployment would increase by 5.5% over the same period. This is a big number—a lot of jobs and a very significant cost to the taxpayer. Nor does that £47 billion account for the transfer of funds from one sector to another. The creative industries contribute £126 billion per year to the economy. I do not understand the excitement about £47 billion when you are giving up £126 billion.

Mentions DeepSeek:

Before I sit down, I will quickly mention DeepSeek, a Chinese bot that is perhaps as good as any from the US—we will see—but which will certainly be a potential beneficiary of the proposed AI scraping exemption. Who cares that it does not recognise Taiwan or know what happened in Tiananmen Square? It was built for $5 million and wiped $1 trillion off the value of the US AI sector. The uncertainty that the Government claim is not an uncertainty about how copyright works; it is uncertainty about who will be the winners and losers in the race for AI.

And finishes with this superb closing line:

The spectre of AI does nothing for growth if it gives away what we own so that we can rent from it what it makes.

According to Ed Newton-Rex the speech was effective:

She managed to get the House of Lords to approve her amendments to the Data (Use and Access) Bill, which among other things requires overseas gen AI companies to respect UK copyright law if they sell their products in the UK. (As a reminder, it is illegal to train commercial gen AI models on ©️ work without a licence in the UK.)

What's astonishing is that her amendments passed despite @UKLabour reportedly being whipped to vote against them, and the Conservatives largely abstaining. Essentially, Labour voted against the amendments, and everyone else who voted voted to protect copyright holders.

(Is it true that in the UK it's currently "illegal to train commercial gen AI models on ©️ work"? From points 44, 45 and 46 of this Copyright and AI: Consultation document it seems to me that the official answer is "it's complicated".)

I'm trying to understand if this amendment could make existing products such as ChatGPT, Claude and Gemini illegal to sell in the UK. How about usage of open weight models?

# 29th January 2025, 5:25 pm / politics, ethics, generative-ai, training-data, ai, copyright, deepseek

2024

Releasing Common Corpus: the largest public domain dataset for training LLMs (via) Released today. 500 billion words from "a wide diversity of cultural heritage initiatives". 180 billion words of English, 110 billion of French, 30 billion of German, then Dutch, Spanish and Italian.

Includes quite a lot of US public domain data - 21 million digitized out-of-copyright newspapers (or do they mean newspaper articles?)

This is only an initial part of what we have collected so far, in part due to the lengthy process of copyright duration verification. In the following weeks and months, we’ll continue to publish many additional datasets also coming from other open sources, such as open data or open science.

Coordinated by French AI startup Pleias and supported by the French Ministry of Culture, among others.

I can't wait to try a model that's been trained on this.

# 20th March 2024, 7:34 pm / llms, ai, ethics, generative-ai, copyright, training-data, pleias

OpenAI and journalism. Bit of a misleading title here: this is OpenAI’s first public response to the lawsuit filed by the New York Times concerning their use of unlicensed NYT content to train their models.

# 8th January 2024, 6:33 pm / llms, generative-ai, openai, new-york-times, ai, copyright

We believe that AI tools are at their best when they incorporate and represent the full diversity and breadth of human intelligence and experience. [...] Because copyright today covers virtually every sort of human expression– including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

OpenAI to the Lords Select Committee on LLMs

# 8th January 2024, 5:33 pm / copyright, generative-ai, openai, ai, llms, politics, training-data

2023

I’ve resigned from my role leading the Audio team at Stability AI, because I don’t agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use’.

[...] I disagree because one of the factors affecting whether the act of copying is fair use, according to Congress, is “the effect of the use upon the potential market for or value of the copyrighted work”. Today’s generative AI models can clearly be used to create works that compete with the copyrighted works they are trained on. So I don’t see how using copyrighted works to train generative AI models of this nature can be considered fair use.

But setting aside the fair use argument for a moment — since ‘fair use’ wasn’t designed with generative AI in mind — training generative AI models in this way is, to me, wrong. Companies worth billions of dollars are, without permission, training generative AI models on creators’ works, which are then being used to create new content that in many cases can compete with the original works.

Ed Newton-Rex

# 15th November 2023, 9:31 pm / stable-diffusion, ethics, generative-ai, ai, copyright, training-data, text-to-image

Beyond these specific legal arguments, Stability AI may find it has a “vibes” problem. The legal criteria for fair use are subjective and give judges some latitude in how to interpret them. And one factor that likely influences the thinking of judges is whether a defendant seems like a “good actor.” Google is a widely respected technology company that tends to win its copyright lawsuits. Edgier companies like Napster tend not to.

Timothy B. Lee

# 3rd April 2023, 3:38 pm / generative-ai, ai, copyright

Stable Diffusion copyright lawsuits could be a legal earthquake for AI. Timothy B. Lee provides a thorough discussion of the copyright lawsuits currently targeting Stable Diffusion and GitHub Copilot, including subtle points about how the interpretation of “fair use” might be applied to the new field of generative AI.

# 3rd April 2023, 3:34 pm / stable-diffusion, generative-ai, github-copilot, ai, copyright, text-to-image

mitsua-diffusion-one (via) “Mitsua Diffusion One is a latent text-to-image diffusion model, which is a successor of Mitsua Diffusion CC0. This model is trained from scratch using only public domain/CC0 or copyright images with permission for use.” I’ve been talking about how much I’d like to try out a “vegan” AI model trained entirely on out-of-copyright images for ages, and here one is! It looks like the training data mainly came from CC0 art gallery collections such as the Metropolitan Museum of Art Open Access.

# 23rd March 2023, 2:56 pm / generative-ai, ethics, copyright, ai, training-data

2010

Copyright is hard. Dan Catt spots Adam Liversage (Director of Communications for the BPI, and hence right in the middle of the DEBill) suggesting his wife violate copyright on Twitter.

# 11th April 2010, 2:26 pm / dan-catt, copyright, debill, bpi

2008

Free licenses upheld by US “IP” court. Free software and CC licenses which dictate conditions that, when violated, turn you in to a copyright infringer now have precedence in US law.

# 14th August 2008, 9:33 am / law, uslaw, creativecommons, freesoftware, open-source, licenses, copyright, lawrence-lessig

The Open Web Foundation. Launched today at OSCON, an independent, non-profit organisation dedicated to incubating and protecting new specifications like OAuth and oEmbed. The focus is incubation, licensing, copyright and community.

# 24th July 2008, 5:40 pm / oauth, oembed, copyright, oscon, oscon08, openwebfoundation, openweb

2007

Open Rights Group: Our first two years. ORG’s review of the past two years shows just how worthwhile a cause they have become—highlights include their hugely successful campaign against copyright term extension and their involvement in this year’s e-voting trials.

# 25th November 2007, 10:05 pm / org, openrightsgroup, digitalrights, evoting, elections, copyright

Please, fanboys, don't send me dumb notes averring that Apple's failure to police this use of its mark will lead to the end of its ability to stop manufacturers from producing rival MP3 players and calling them iPods. That's a fairy tale that trademark lawyers tell their kids when they want to reassure them that they'll have a healthy college fund.

Cory Doctorow

# 12th February 2007, 2:05 pm / copyright, boingboing, corydoctorow, apple