New models mean new things to grapple with!
I wasn’t going to write anything this week, as I’m in Australia. But I am also me, and so when OpenAI releases a new video model I am going to go “check out the tech specs” and see what it’s all about. These are rough notes — think of it as live-blogging my reading of the OpenAI Technical document. It’s rough and chatty and most of this post was drafted on a beach.
Calling it a technical document is weird, though, because this document is named “Video generation models as world simulators.” This is about as accurate as calling cinema a world simulator: it’s like, sort of true, but it’s also not exactly precise. It is also contradictory, because the paper admits that the model isn’t great at modeling worlds.
Less hyperbolically, it’s a text-to-video model that claims to generate up to 46 seconds in a single pass. I believe them, but it’s worth noting that all we have are the handful of samples OpenAI has chosen to share: the model isn’t open to the public yet. Assuming the claim holds, that’s a big difference from Gen-2, Runway’s model, which generates about 4 seconds at a time and can only be extended to roughly 16 seconds total.
So let’s look at the paper and see if we can read between the lines a bit to figure out what’s going on.
The first technical line has a bit of hype baked in: “We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.”
(This idea of “generalist” is arguable; LLMs have not really proven to be equally good at everything. The model still depends on, and is constrained by, its training data; any suggestion that it could transcend that data should be taken with a grain of salt.)
What seems straightforward is how this gets done. Basically, they train on video that has been compressed in both time and space into a much smaller representation. It’s still a diffusion model, and the compression is reversible: a decoder can map that small representation back into high-resolution video. Tellingly, it’s trained on widescreen as well as vertical (cell-phone-style) videos. It can work quickly because it studies the compressed video as patches, which do for a transformer what text tokens do for an LLM, only for clusters of pixels, so it can generate new compressed video and then walk it back up to much higher resolutions.
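If it helps to see the shape of that, here’s a toy sketch of the compress-then-restore idea in PyTorch. None of this is OpenAI’s actual architecture (the report doesn’t publish one); the layers, channel counts, and downsampling factors are my assumptions, just to make the shapes concrete.

```python
import torch
import torch.nn as nn

class VideoCompressor(nn.Module):
    """Toy latent-video autoencoder: a strided 3D conv shrinks the clip in both
    time and space, and a transposed conv maps the latent back to pixels.
    Purely illustrative; not OpenAI's published design."""
    def __init__(self, channels=16):
        super().__init__()
        self.encode = nn.Conv3d(3, channels, kernel_size=(2, 8, 8), stride=(2, 8, 8))
        self.decode = nn.ConvTranspose3d(channels, 3, kernel_size=(2, 8, 8), stride=(2, 8, 8))

    def forward(self, video):                  # video: (batch, rgb, frames, height, width)
        latent = self.encode(video)            # much smaller space-time tensor
        return latent, self.decode(latent)     # ...and the reconstruction at full size

clip = torch.randn(1, 3, 16, 352, 640)         # 16 frames of 352x640 RGB
latent, recon = VideoCompressor()(clip)
print(latent.shape)                            # torch.Size([1, 16, 8, 44, 80])
print(recon.shape)                             # torch.Size([1, 3, 16, 352, 640])
```

The point is only that the model never has to push full-resolution pixels around while generating; the heavy lifting happens in the small latent, and the decoder restores the resolution at the end.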
(Rough aside, very technical, skip if you want!) For movement, space and time are reduced into something called patches. Patches have been around for a while. They’re essentially low-res clusters of pixels that act like a map of the movement inside a video space, rendered as a set of slices of time that can be referenced, rather than needing to do something as intensive as tracking an object’s trajectory through space and time. What stays still, what moves: it’s all in there. Likewise, images are treated as single-frame videos, so you can get images out of Sora too.
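Here’s the patch trick from that aside as a hypothetical sketch: cut the compressed latent into little space-time blocks, flatten each block into a token, and note that an image is just a video whose time axis has length one. The patch sizes are arbitrary choices of mine.

```python
import torch
from einops import rearrange

def spacetime_patches(latent, pt=2, ps=2):
    """Cut a latent video (B, C, T, H, W) into spacetime patches: one flat
    token per small block of latent space-time, the video analogue of a
    text token. Patch sizes here are arbitrary."""
    return rearrange(latent, "b c (t pt) (h ph) (w pw) -> b (t h w) (c pt ph pw)",
                     pt=pt, ph=ps, pw=ps)

latent = torch.randn(1, 16, 8, 44, 80)                    # e.g. the compressed clip above
tokens = spacetime_patches(latent)
print(tokens.shape)                                       # torch.Size([1, 3520, 128])

# An image is just a single-frame video: give it a time axis of length 1
# and use a temporal patch size of 1.
image_latent = torch.randn(1, 16, 44, 80).unsqueeze(2)    # (1, 16, 1, 44, 80)
print(spacetime_patches(image_latent, pt=1).shape)        # torch.Size([1, 880, 64])
```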
Sora can go up to basic high-definition resolution because it can be trained on higher-res video. Because it is trained on widescreen, vertical, and square video at their native shapes, there’s less weirdness at the edges of the frame; many video models today are biased toward square, 256x256 video. That isn’t how people watch video, though. Much of the “awe” of Sora is simply that it makes use of the whole widescreen field, and much of that is simply the affordance of training on more of those types of videos, an efficiency made possible by new techniques for compressing and analyzing the training data.
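To make the aspect-ratio point concrete: with some assumed downsampling factors (mine, not OpenAI’s), different native shapes just become different sequence lengths for the transformer, with no cropping down to a square required.

```python
def patch_count(width, height, frames, spatial_down=8, temporal_down=2, patch=2):
    """Back-of-the-envelope token count for one clip. The downsampling factors
    and patch size are assumptions, not published numbers."""
    lat_w, lat_h, lat_t = width // spatial_down, height // spatial_down, frames // temporal_down
    return (lat_w // patch) * (lat_h // patch) * (lat_t // patch)

# Same clip length, three native shapes -> three different sequence lengths.
for w, h in [(1920, 1080), (1080, 1920), (256, 256)]:
    print(f"{w}x{h}: {patch_count(w, h, frames=64)} patch tokens")
```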
Miscellanea
A few things:
Seems the video is initially generated very small and very sped-up, as noise in that compressed space; then the noise is iteratively refined, slowed back down, and scaled up. Like the Big Bang: noisy, tiny, and expanding from no-time to… time.
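If you want the flavor of that loop, here is a bare-bones deterministic sampler over the compressed latent. The noise schedule, step count, and `eps_model` are all stand-ins; this is generic diffusion machinery (a DDIM-style loop), not Sora’s actual sampler, which isn’t published.

```python
import torch

def sample_latent_video(eps_model, shape, steps=50):
    """Start from pure noise in the compressed latent space and iteratively
    refine it into a clean latent video (deterministic DDIM-style updates).
    `eps_model(x, t)` is a hypothetical trained network that predicts the
    noise present in x at timestep t."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # how much "signal" survives at each step

    x = torch.randn(shape)                           # the Big Bang: pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, t)                        # predicted noise at this step
        x0 = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        if t > 0:                                    # move to the next, less noisy level
            x = alpha_bar[t - 1].sqrt() * x0 + (1 - alpha_bar[t - 1]).sqrt() * eps
        else:
            x = x0
    return x                                         # clean latent; decode it to pixels afterwards
```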
As far as the length of the videos goes, it can interpolate between the last frame and first frame of same-seed videos, which is maybe why we’ve seen rigid panning or freeze-framing in some of the demos. Longer videos are possibly a stitching together of shorter ones with a “morph” effect? In any case, there are some cool uses of interpolation in the tech specs, where elements of one video “drift” into the other, creating smooth shifts. We’ll see how reliable those transitions are in the actual app, and how much control we actually have over them.
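The report only shows the interpolation results, not the method, so here is just one plausible (and much simpler) version of the idea: a crossfade in latent space, with a blend weight that ramps across the time axis. A guess at the concept, not Sora’s technique.

```python
import torch

def latent_crossfade(lat_a, lat_b):
    """Blend two latent videos of the same shape (B, C, T, H, W) with a weight
    that ramps from 0 to 1 across the time axis, so elements of clip A 'drift'
    into clip B. A guess at the idea, not OpenAI's method."""
    frames = lat_a.shape[2]
    weights = torch.linspace(0.0, 1.0, frames).view(1, 1, frames, 1, 1)
    return (1 - weights) * lat_a + weights * lat_b

blend = latent_crossfade(torch.randn(1, 16, 8, 44, 80), torch.randn(1, 16, 8, 44, 80))
```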
As for modeling real-world physics: the paper plainly says that Sora can’t render glass shattering, or even world-state changes like a bite taken out of food. They then simply assert that larger models would eventually fix that, but no evidence is offered. It’s in the discussion section, so it’s essentially wishful thinking.
Misinfo Risks?
Sora’s most concerning ability, from the tech specs, is that it can depict multiple different scenarios that all *conclude* at the same given moment:
“Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.”
That is gonna be a disinfo topic at a few conferences in the near future, I’d guess.
Let’s say you have a social media video of police that starts at the moment the police begin using unwarranted force against a person on the street. The report says Sora can seamlessly create up to 46 seconds of synthetic video that ends exactly where that violence clip starts.
What happens in that 46 seconds is guided by your prompt, whether it’s “teenager throws hand grenade at smiling policeman” or “friendly man offers flowers to angry police.”
I have no idea how convincing that false footage would be. But, mix the result with a bit of camera jostle and you’ve got as many versions of history as you want, culminating in an actual event.
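For what it’s worth, the report doesn’t explain how that backward extension works under the hood. One common way diffusion models are made to generate footage that must land on known frames (which may or may not be what Sora does) is “replacement” conditioning: at every denoising step you overwrite the known ending with a re-noised copy of the real thing, so the invented lead-up is always denoised in its company. A heavily hypothetical sketch, reusing the sampler machinery from earlier:

```python
import torch

def extend_backward(eps_model, known_latent, new_frames, steps=50):
    """Hypothetical 'replacement' conditioning sketch: generate new latent
    frames that lead into a known ending. Not taken from the Sora report,
    which does not describe its mechanism."""
    b, c, t_known, h, w = known_latent.shape
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    x = torch.randn(b, c, new_frames + t_known, h, w)    # invented prefix + known ending
    for t in reversed(range(steps)):
        # Keep the ending pinned: replace it with the real latent, noised to
        # the current level, before every denoising step.
        x[:, :, -t_known:] = (alpha_bar[t].sqrt() * known_latent
                              + (1 - alpha_bar[t]).sqrt() * torch.randn_like(known_latent))
        eps = eps_model(x, t)
        x0 = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x = x0 if t == 0 else alpha_bar[t - 1].sqrt() * x0 + (1 - alpha_bar[t - 1]).sqrt() * eps

    x[:, :, -t_known:] = known_latent                    # exact ending, invented lead-up
    return x
```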
Australia Notes
I’ll post more about the trip to Australia next Sunday (or the one after, depending on my recovery time). But I had a wonderful time at ACMI, and until then you can read about my remarks (and those of my co-presenter, Katrina Sluis) in an ArtsHub article called “AI FOMO: Why Cooler Heads are Needed in the Arts.”
Till then!