LAION-5B, Stable Diffusion 1.5, and the Original Sin of Generative AI
When It Comes to AI Infrastructure, We Aren't Listening to the Right People
I’m writing mid-week to share a link to my latest piece over at Tech Policy Press, which I think covers an important topic. Here’s the introduction:
In The Ones Who Walk Away From Omelas, the fiction writer Ursula K. Le Guin describes a fantastic city wherein technological advancement has ensured a life of abundance for all who live there. Hidden beneath the city, where nobody needs to confront her or acknowledge her existence, is a human child living in pain and filth, a cruel necessity of Omelas’ strange infrastructure.
In the past, Omelas served as a warning about technology. Today, it has become an apt description of generative AI systems. Stanford Internet Observatory’s David Thiel — building on crucial prior work by researchers including Dr. Abeba Birhane — recently confirmed that more than 1,000 URLs containing verified Child Sexual Abuse Material (CSAM) are buried within LAION-5B, the training dataset for Stable Diffusion 1.5, an AI image tool that transformed photography and illustration in 2023. Stable Diffusion is an open-source model, and it is a foundational component of thousands of the image-generating tools found across apps and websites.
I’m also quoted in a Wired article about Mickey Mouse and generative AI; more on that subject on Sunday. Cheers!
Interesting as always. It is very important to monitor datasets, and consequently it is crucial that they be open (and preferably open source). However, I disagree on the copyright issue, for various reasons, including that it's important to be able to input data freely into training; otherwise AI will inherit the biases of whatever data companies acquire rights to (from other companies). Regarding the influence of the dataset on outputs, as you rightly say, we must keep in mind that it is very dangerous for material to end up in unrelated categories. A scene from World War II, if placed in the right category, 'teaches' correctly; the problem arises when it is labeled as 'justice'. Of course, illegal material should not exist in the dataset at all (and more generally, I'm concerned that those images, even if removed from the dataset, remain online somewhere).