Notes prepared for Wikimedia NYC’s “Wikiday 2024,” co-hosted by Columbia University’s Brown Institute for Media Innovation.
You can watch the 40-minute talk, including a tangent or two that isn’t in the following remarks, thanks to the Internet Society’s archive of the stream here (my talk starts at the very beginning of the recording).
I am also grateful for feedback from the Open Future Foundation, where I was invited to share this work!
Part 1: Art
It's great to be here today. As an introduction, I'm an artist who works with AI from a critical position. I'm a digital artist, but I care deeply about the impacts of technology on our social world. I write and teach about AI — specifically, generative AI, and what we sometimes call "AI images."
As an example, I want to show a new piece, or a quick snippet of it, and then I want to move on to some of the things I've learned about art, archives, and data. SWIM is a meditation on visual archives, the datafication of cultural memory, and training data in the age of generative artificial intelligence. It's a visualization of the AI training process, applied to a digital archival film. The film is slowly diffused as its subject frolics underwater. The actress dissolves into 'training data,' amidst a backdrop made from visualizations of a generative AI system searching for patterns in the absence of any training data at all. I hope this film encourages us to reflect on the deep meaning within our visual archives, and to think about what it means to "diffuse" this cultural heritage into the vast latent space of artificial intelligence models.
That’s an example of the work I make, which gives me opportunities to think about these tools, how they work, and our relationships to them. Including what they might obscure. Today, I want to lay out a bit of a landscape as I’ve seen it through the lens of working with these tools.
When we make an image with an AI system, we are creating a data visualization. These images don't copy and paste aspects of images into new arrangements. Take a model like Stable Diffusion, or Midjourney. These tools are the result of billions of images that have been assembled from a wide variety of sources, Wikimedia Commons and Flickr Commons images certainly among them.
To build these models, the images are stripped of information in stages, and each time information is removed, the model learns how noise is distributed through the image. That noise appears in certain clusters, and the model learns how to repair that damage. Eventually the model can reconstruct the source image from nothing but static. The paths and structure of this noise become an algorithm for repairing noise. When we generate an image with the resulting tools, we type a prompt, and that prompt connects to the captions of these images.
If we ask for flowers, the model will examine a new, random jpeg full of static. It will then add and remove noise, shaping the clusters in that static to most closely resemble the clusters in the training data associated with "flower."
This is why I call these images data visualizations, or infographics. They present the most statistically likely representation of the words you give them. Without data, these images don't exist. The images in the training data come to shape and structure the randomly generated noise that the model starts with. In a sense, the contents of our archives haunt these images. The archives influence which pixel is predicted next. And embedded in that are ideas of collected memory, which can lead to the reproduction of stereotypes.
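To make that concrete, here is a toy sketch in Python of the two halves of the process described above: images degraded into static in stages, and a "model" that repairs static step by step. This is not how Stable Diffusion or Midjourney are actually implemented (the denoiser below is a placeholder where a real system would use a neural network conditioned on the prompt); it is only meant to show the shape of the pipeline.

```python
# Toy sketch only: illustrates the diffusion idea, not any real model.
import numpy as np

rng = np.random.default_rng(0)
STEPS = 10

def add_noise(image, step, total=STEPS):
    # Forward process: blend the image toward pure static as `step` grows.
    mix = step / total
    static = rng.normal(size=image.shape)
    return (1 - mix) * image + mix * static

def toy_denoise(noisy, step):
    # Placeholder for the learned model: a real system trains a neural network,
    # conditioned on the text prompt, to predict the noise to remove at this step.
    return noisy * 0.9

flower = rng.random((8, 8))  # pretend training image: an 8x8 grayscale "flower"

# Training view: the model sees the image at every noise level and learns the repair.
noisy_versions = [add_noise(flower, s) for s in range(STEPS + 1)]

# Generation view: start from nothing but static and repair it step by step.
sample = rng.normal(size=(8, 8))
for s in reversed(range(STEPS)):
    sample = toy_denoise(sample, s)

print("final sample mean/std:", round(sample.mean(), 3), round(sample.std(), 3))
```

Every pixel of the output, in other words, is steered by patterns learned from the training data.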
Which is why, if we care about the future, we need to think carefully about the content of our archives. How our digital archives, like Wikimedia Commons, come to be created and labeled. And because these images reflect the content of our archives, it matters that the archives reflect cultural diversity. We have seen evidence that images produced by AI models reproduce stereotypes; it should be clear from the technical description of how they work that this is what these models do. An apple is created from the features that overlap descriptions of apples on the Internet. But so too is a description of a doctor, or a Mexican, or an American. The result, unsurprisingly, contains mostly stereotypes: Rest of World published a report showing that, across 3,000 requests for the word “Mexican,” Midjourney generated images of men in sombreros. Three thousand times.
So, let’s look at these archives.
Part 2: Archives
When we talk about data in generative AI models, we are talking about images. When we talk about datasets, we are talking about vast collections of images. It would not be a mistake to say that a vast collection of images is an archive.
But we don't say archives, we say datasets. An archive, I think, proposes a greater degree of responsibility. Archives are curated. Wikimedia collections, for example, rely on humans to examine and assess uploaded images. In artificial intelligence circles, we say that "humans are in the loop." Wikimedia Commons is a community-curated archive. It's also a dataset, if we look at it from an AI perspective. It's a dataset of 45 million images.
By contrast, there are datasets where humans are not in the loop. That would include the massive dataset used to train Stable Diffusion, called LAION-5B. LAION-5B is a dataset, but I would argue it is also an archive. LAION contains over 5 billion data points: each one is a link to an image somewhere on the web, paired with a caption describing that image, with roughly 2.3 billion of those captions in English.
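For a sense of what one of those billions of data points looks like, here is an illustrative sketch. The field names are hypothetical rather than LAION's exact schema; the point is that the dataset stores a pointer and a caption, not the image itself, and not much else.

```python
# Illustrative only: these field names are hypothetical, not LAION's actual schema.
# The key point is structural: the dataset holds a URL and a caption, not the image itself.
laion_style_entry = {
    "url": "https://example.org/photos/red-flower.jpg",  # pointer to an image hosted elsewhere
    "caption": "a red flower in a garden",               # text scraped alongside the image
    "language": "en",
}

# A "5 billion data point" dataset is, structurally, billions of rows like this one,
# collected with no human looking at any individual row.
dataset = [laion_style_entry]
for row in dataset:
    print(row["caption"], "->", row["url"])
```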
As datasets go, it’s made some amazing technologies possible. Whether you agree with how they’re deployed or built, I do think, as a digital artist at least, the technological achievement that is generative AI is pretty remarkable. But by archival standards, it is a disaster. Now, I don't want to blame LAION folks for this. In fact, there are a lot of overlaps between Wikimedia folks and LAION folks: both are volunteers, both have an interest in open and freely accessible knowledge. But hear me out, because there are some very important distinctions, too.
The LAION-5B dataset is a product of web scraping from Common Crawl, with very little filtering of what was collected and stored. It contains captions and links to images, and it has been proven to contain copyrighted material. More troubling, however, is a recent study from the Stanford Internet Observatory, which found LAION contains over 1,000 documented instances of child sexual abuse material. Roughly 3,000 more images were suspected, but the URLs in the dataset pointed to material that had already been removed. The result is that this dataset, which was the centerpiece not only of open generative AI tools but also of significant academic research into the world wide web, is now offline.
Wikimedia Commons is certainly part of that LAION training data. But there is a night-and-day difference between the way LAION was assembled and the way Wikimedia Commons has been built. I think Wikimedia Commons serves as an excellent example of how we can build open, transparent, and consent-based datasets for AI training.
The secret formula, the thing that sets Wikimedia Commons apart from LAION, is that it was not designed to be a dataset at all. It was designed to be an archive.
This is a semantic distinction with real meaning. Wikimedia Commons is a source of constant conversation and consensus building. The debates over deletion and inclusion among the volunteers who curate it, the nitpicking about licensing — this is evidence of humans in the loop.
As an example, my own upload of a Syrian postage stamp from the 1980s was flagged as a copyright violation a few years ago, because the Syrian government doesn’t put its stamps in the public domain. Somebody checked that and intervened in my upload. As annoying as I found that, I also respected it: this is what it looks like when humans decide the future of datasets. There is an effort to determine consensus and fact check everything from captions on the Commons to the context in which images appear on Wikipedia.
Commons is the ideal structure for building equitable AI systems, though I would never say it is perfect. Representation gaps remain on Wikimedia Commons. It is still largely oriented toward white, English-speaking countries with internet access. Bias hangs out on Wikipedia talk pages and Commons deletion requests. But it is an archive built by voluntary contributions, and the best of Wikipedia comes about through the efforts of volunteers committed to closing those gaps and broadening the scope of what’s included. Because this is a collective project, you can have things like WikiProject Women in Red, which raises the alarm on bias and seeks out ways to correct it. I think this is a powerful starting point.
AI image tools, on the other hand, were built on images grabbed without clear consent or regard for copyright, oftentimes including not just copyright violations but records of personal and historical atrocities: images that would immediately be flagged for deletion on Commons are never reviewed at all. Wikimedia Commons is an archive with room to be concerned about balance, even if that balance is imperfect and rests on a complicated, always-changing concept of neutrality. By contrast, AI image tools were built on a dataset that is literally used to study hate speech online.
Too often I am told that there must be open data policies for use in AI, and that having this open data means that companies should be able to train on anything they want. But that isn't what open data means to me. I come back to this distinction, between datasets and archives. If we want to build a more responsible AI, if we want to build tools that benefit others rather than exploit them, we can start by thinking about training data as a cultural archive. An archive that requires stewardship, allows for debate and discussion, and which makes space for people who might not be represented under the current status quo. An archive requires curation. If we want to build responsible, beneficial AI, we need curation, too.
Part 3: Data
Let’s go back to this idea of data in the dataset. What does it mean for something to be data? The Oxford dictionary — a fancy way of saying the first result on Google — tells us what data is. Data is, or data are, "facts and statistics collected together for reference or analysis."
This strikes me as a weird way to think about art. I have never gone to a museum to be moved by the facts they've assembled for my analysis. On a certain level of abstraction, humans may do this, but I find that level of abstraction to be so wide as to be useless.
What are the facts assembled in a Rothko painting? There are none. It is expressive.
The data about that image is something else. It's the color values associated with pixels in a digital copy of the painting. It may be the date the painting was made, the measurements of the canvas. These are the facts of an image. When a computer looks at Rothko's work, that is what it sees. This is what ends up abstracted and reproduced. As a person with an interest in the digital humanities, I find this data incredibly useful, as is the ability to analyze millions of lines of text in novels and code them for themes. There is value in this data for learning about the art that people make and the things we do online, and it’s important to value that research and protect it.
But in the process of measuring a work of art in countless ways, we document its least meaningful aspects. The problem with framing the collection of art as a dataset is that it captures nothing about art that I care to reproduce electronically. I do not express myself through the reduction of Rothko and Batman into pixel assemblages.
I don't even mean to say that AI art cannot be expressive. I think that argument is tired: we express ourselves in all kinds of ways. If Nam June Paik can make something interesting with a broken TV, certainly art can be made from AI. Let's not be distracted by that argument. It's not what I mean to say.
So perhaps let's move away from whether it is art and think about how we want it to be made. What are its recipes and formulas? Let's look at other things that have become training data: archival images of victims of the Holocaust. The images of torture that took place at Abu Ghraib. Images of child abuse. Racist images, misogynistic images.
If we want to say that we have no responsibility to affirm the contextual and social meaning of the images in our datasets, then perhaps you might agree that we do have a responsibility to reject the exploitation of others in that data. That no benefit can come from making images that contain even the slightest residue of, or derive the slightest benefit from, the further exploitation of a child who has been tortured. Because these images, deep within the datasets, do haunt the structure and the shape of our images. Future models may be fine-tuned, and already we have seen datasets that focus on these kinds of images explicitly to generate similar content.
If you're feeling repulsed, I apologize, but this is data. These are facts being brought together for analysis. And my analysis concludes with the realization that the way we are structuring AI is as if images have no meaning. To treat these images as nothing but data ignores the reasons we make images, or express ideas. It ignores what it is that images document and describe. In other words, it severs the culture from what is meant to be cultural.
I know many artists who hate AI. I'm a digital artist, and I distrust big tech. But I am also driven by a hopeful vision of AI. I want an AI that humans shape. I want people to have a say in how their data is used, and I want an AI that was built with consent, possibly even with joy. Can you imagine it? I know for many of you — especially if you are in the midst of an editing war — the idea that the Wikimedia ecosystem is a source of joy and community may not always feel accurate. But look at the folks in the room today, look at the conversations and friendships that emerge from this building of a collective project.
It's my weird hope that if we pivot to build AI from those values, we might just build better systems. Systems that can be governed and steered by people, exerting their agency over data, instead of data exerting its power over us.
Part 4: Tensions
This talk is about swimming in the tensions. There are many tensions in AI that merit meaningful discussion, and I’m always surprised by the rush I see in our attempts to resolve them. Part of having humans in the loop — whether building a dataset or a democracy — is about friction, dialogue, and trade-offs. We may never firmly resolve these questions. That isn’t how culture, or societies, or democracies, work.
There are powerful arguments for open data: datasets built and distributed freely, with full transparency. It helps us to learn about the world. On the other hand, these datasets may collect material we don’t want them to have, and be used toward purposes we don’t want to see. Should our personal data be part of these datasets? Should companies lock up their training data? If so, how do we build transparency into those systems — to know what they’re built on?
I see tensions around copyright law and whether training on this data should count as fair use. We argue that the substance of what’s collected from these archives of images is just “facts,” ratios of pixel distribution and word correspondences. Part of me says: that’s fair, and we ought to have the tools to examine the vast sums of text and images we’re swimming in. We should have new tools to do research and to learn.
But I ask you: is that what Generative AI is doing when we make an image? It feels different. I’m not paying a company to learn anything. I pay them to play with this information, but I don’t want to play with people’s personal expressions and painful experiences. To say that this is all data is a bit like telling me that my dog is just atoms, or that my memories are just chemicals.
Yes, under one microscope, that’s all true. But the validity of emotion, of being human — the principle of dignity itself — is that we resist reducing the world to pure mechanics. Symbols are the building blocks of meaning. If we strip images and words of their context, they lose symbolic value. If they lose that connection to our social and cultural fabric, they lose that meaning, and so do we. Wikimedia projects are always a network: links to links. The context is always there. On Wikipedia, a red link points to a page that doesn’t exist yet: a name with nothing behind it, a snippet of text without deeper reference. That’s what our AI training data is: 5 billion data points devoid of connections. 5 billion red links.
For me, as an artist who works with images and archives, there is a great care required when we consider what is inside these archives. It is my hope that Wikimedia might serve as a model for thinking about governance and norms in the collection of these images. Rather than drifting into the mindset of data brokers, it is critical that we approach this from the position of an archivist, a historian, a humanitarian. We need to see knowledge as precious, as a collective project; to push for more people to be involved, not fewer; for humans to insist that meaning and context matter — and to preserve, and contest, those contexts in all their complexity.
That is to say, I hope you all continue to build an archive first, and to value the meaning of what you’ve all built, and all that it contains.
Thank you!
Thanks for sharing this talk here. I enjoy your point of view a lot and appreciate how, as an artist, you consider AI critically without dismissing it entirely.
The point about images of child abuse is definitely one I need to mull over. There's a fair bit of discussion in writing spheres about depictions of trauma/abuse that reminds me of this. One of the points is that agency and ownership are extremely important in preventing exploitation of stories of abuse. For example, questioning whether there is a meaningful difference between a survivor sharing their story to raise awareness, and depictions of trauma for shock entertainment value by someone who hasn't experienced it (the discussion around the treatment of women in Game of Thrones, for example). This ties in neatly to what you're suggesting about the human element and context in archives. What does it mean for these images to be included in databases? What does it mean to generate this kind of content from an AI database? How do we feel about this as a society?
I sadly don't really have an argument to make here, just wanted to say it's an extremely difficult topic I try to be mindful of. It's great when these tricky ethical questions are brought up, so thanks for your open mind and honesty.
Well said. Only caveat is that the people loop needs to be balanced. We do not need pictures of George Washington as a black person or refusal to generate true images of people without bias embedded in the algorithms. This is why even the human loop needs to be vetted.