Note: This is a reposted entry from S*bstack. If you read it there, it’s the same.
On the occasion of the release of my new AI-assisted short film, Oasis, and the emergence of Runway’s Gen-2 (not to mention Adobe Firefly), I’ve once again been thinking about AI video and its implications for the future of film. I originally started this essay two months ago, when Runway’s Gen-1 had just been announced, but after playing around with Gen-1 a bit, and after my own experience using AI for filmmaking on Oasis, it felt like the right moment to finally finish it. I’ll begin by getting up to speed on the current state of AI video tools, then detail how AI fit into my process for Oasis, and end with some speculation on how AI may fit into the future of filmmaking.

AI-generated text-to-video has been around in some form since at least the days of CLIP+VQGAN, along with various Stable Diffusion options, but these tools have always required specialist knowledge and a lot of fine-tuning, making them effectively inaccessible to the layperson. In the same way that it took the user-friendly interface of ChatGPT to get a critical mass of everyday people using AI chatbots, text-to-video still has a ways to go in both quality of output and UX before it reaches a mass audience. Some promising contenders are the big players’ text-to-video models announced last fall, which remain in the development stage: Google’s Imagen Video and Meta’s Make-A-Video both promise to generate a few seconds of HD video from prompts. There are other efforts using Stable Diffusion as well, in some cases generating even longer and more complex videos, but as far as I’m aware, none of these are available to the general public yet, and they’re certainly not in the user-friendly state of something like DALL-E or ChatGPT.
The most recent development in consumer-friendly text-to-video is Runway’s Gen-2. When I first started writing this post a couple of months ago, Gen-1 had just been announced. Now we’re on the cusp of Gen-2, currently in closed beta, and some people with early access have been posting tests that look noticeably more cohesive and stylish than Gen-1’s output. The key difference is that Gen-1 was purely a “video-to-video” model, modifying existing input video with text prompts (or images), while Gen-2 offers true text-to-video: generating video clips entirely from scratch with just text inputs, the same way Midjourney or DALL-E generate images.
I messed around with Gen-1 a bit but was mostly unimpressed. The limitation on clip length (initially three seconds, since expanded) and the relatively primitive quality felt like a step down after getting used to the high-fidelity text-to-image models. It basically operates like a highly limited video filter, a weak version of rotoscoping, a tech demo with little to no practical artistic application. At best, some of the Gen-1 video experiments could be called “cool,” but it never felt like a tool someone could genuinely use in a video workflow.
Gen-2’s text-to-video option is more interesting, at least on paper. Again, I haven’t used it yet, but a video generator on par with Midjourney or DALL-E has the potential to shake up traditional camera-based workflows. The oft-cited fantasy, spoken breathlessly by AI speculators, of creating customized Hollywood-level feature films at the push of a button is a vision of endless entertainment that feels like the natural apex of the current algorithm-driven streaming landscape. But I think this vision of AI Hollywood, utopian to some, dystopian to others, is misguided; it misses an important point by fixating on the content at the expense of the medium: the difference between images and video.
When thinking about the text-to-image generators that have become popular over the last year or so, and the text-to-video generators in their nascent state, it’s tempting to compare them 1-to-1 because of their superficial similarity. But this approach obscures the significant difference between the two mediums. Still images have a constrained number of variables with which to engage a viewer: they have to make their entire impression within the bounds of a rectangle, in one moment in time, silently. There is still infinite room for innovation within these boundaries, but the somewhat limited nature of the medium makes them ideal for AI. It’s easier to learn the rules of what makes a still image good, because “looking good” is one of the main things an image needs to do to be successful. Certain iconic images, captured or created by skilled artists, can impart meaning that goes beyond “looking good,” but for many purposes, looking good is enough. Because AI text-to-image generators have that one narrow variable to focus on, there’s been a fairly straightforward path to training and improving the image models: a clear goal with noticeable results.
Video isn’t so simple. Moving images have a much higher number of variables (editing, camera movement, camera and lens choice, performance, casting, sound, genre, and the infinite web of real-world signifiers a camera can capture, to name a few), and “looking good” isn’t very high on the list; film is a language with a much larger vocabulary than static images, and therefore more complex criteria for quality. Think about how many popular TikTok creators with millions of followers are just filming themselves in their bedrooms with their iPhones. How a great film can move you emotionally whether you watch it on a crisp 35mm print in a theater or a washed-out 240p VHS rip you found dubiously online. Or how some films (e.g. Derek Jarman’s Blue, or any number of experimental works) don’t even use images at all. Good video, good film, operates on exponentially more levels than the visual; visuals are just one layer among many, and Gen-1 merely added (lo-fi) filters to that layer. Gen-2 goes one step further by generating those visuals whole cloth, and slightly better visuals at that, but it still barely scratches the surface of what actually attracts people to video as a medium.
I see this as a fundamental limitation that won’t change anytime soon. Even as more and more aspects of the video workflow become automated (more photorealistic and consistent video, AI-generated sound, editing, etc.), a good video still relies on the confluence of so many factors that are incredibly difficult to automate. I don’t see a magic Hollywood movie-generating button replacing the human element of filmmaking, but rather a gradual incorporation of AI tools into the still largely human process of moving-image storytelling.

I’ve always been drawn to “mysterious island” narratives. The Invention of Morel, Treasure Island, The Wild Boys, The Strange Tale of Panorama Island, The Witness, and of course, Myst, all captured my imagination at pivotal times, and the balance between spatial constraints and imaginative expansiveness felt like a good sandbox to play around with. So the idea of a Myst-inspired short film had been floating around in my head for a long time—I initially started writing notes and storyboarding in 2019—but I never had the means to practically bring the vision to life. I always intended for it to be animated, to recreate the feeling of a lost 90s video game, but it became clear that it would be difficult to achieve by any means other than 3D software. I tried teaching myself Blender, but it would have taken me years to learn it well enough to be able to create the visuals I wanted, and probably more years to actually make it. So I shelved the idea, thinking I might return to it someday, but with skepticism that day would ever come.
When I started getting deep into Midjourney, it occurred to me that I might be able to finally take a stab at the island film. Software that can create whatever image you imagine, in any style? It naturally seemed like the perfect fit. But my initial experiments with Midjourney v3 were disappointing; it wasn’t consistent, the images weren’t quite right stylistically, I couldn’t cultivate a persistence of vision strong enough to make a cohesive film. When v4 came out last November, I tested the idea again, and it quickly became clear that it might actually be feasible.

The images looked convincingly like the old 3D graphics of a first-person POV game, and the prompts worked consistently: an illusion of continuity could be created. And as soon as it became clear it was possible, I immediately got to work. The process was pretty smooth and intuitive. I began with a loose idea for the script, took it to Midjourney to create a flurry of images, tweaked the script based on those images, then wrote some more, made some more images, and so on. It was a nice feedback loop that felt like spinning a film directly out of my brain, unfurling from the conversation between me and the AI’s latent image space. It’s either a happy accident or cosmic destiny that the film also happened to incorporate this interplay between reality and fantasy, dream and technology, into the core of its thematic fabric. Either the process of using AI unconsciously shaped my storytelling instincts in that direction, or the original seed of the idea was always looking towards a future that could only be realized when this magical technology happened to emerge. In any case, the film itself is probably a better reckoning with my thoughts on the past and future of artificial images than I could ever articulate in words.
I made the first “episode” of the initial version of Oasis (around 1.5 minutes) in one marathon 12-hour session, give or take some polishing. The gap between how long it would have taken me to create the film manually and how long it took with AI is pretty astonishing: collapsed from years to hours.
But on the other hand, the images are just one part of the film. Midjourney gave me the practical means to create the film’s images, but I still needed the overarching vision (and years of experience making, watching, and studying films) to create something good. Aside from the images, I basically approached the other technical aspects the same way I would any other film: I wrote a script, I edited in Premiere, I downloaded sound effects from the internet and made the music in GarageBand. Holistically, it’s a human-created film that happens to use AI-generated images as its baseline material, not all that different from a found-footage project. So if it is a good film (and I like to think it is), the AI makes up only a small percentage of the reason it’s successful. I don’t say this to be ungrateful. I’m happy that Midjourney gave me the means to finally make the film, but it’s ultimately more of a means to an end than an end in itself. And I don’t see AI videos being more than a means to an end for a long time, no matter how photorealistic they become.

An important caveat: Oasis is a La Jetée-style film edited together from still images, so it doesn’t translate 1-to-1 to the kind of full-motion video generated by Gen-2. In some ways its visual language is closer to a graphic novel, or an animatic, or a video game stream. But these distinctions are ultimately superficial; it’s still a film that tells a story through a succession of images and sounds. All of film is just creating the illusion of motion through still images; mine just happens to run at less than one frame per second rather than 24. The technical specs of any video are always just a vehicle for something else, a means to evoke a reaction between a hypothetical viewer and the screen. Although Oasis doesn’t necessarily resemble a typical film on the surface, it uses AI to evoke a reaction in the audience the same way any other film would.
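To make that frame-rate point concrete, here’s a minimal sketch of how a sequence of stills could be assembled into a standard video file. To be clear, my actual edit was done in Premiere, not code; this assumes the Python moviepy library, and the file names and timings are hypothetical.

    # A minimal sketch, not my actual workflow: hold each still on screen
    # for a few seconds, so new images appear at well under 1 fps, while
    # the output file still plays back at the standard 24 fps.
    from moviepy.editor import ImageClip, concatenate_videoclips

    stills = ["shot01.png", "shot02.png", "shot03.png"]  # hypothetical frames
    hold = 4  # seconds per still, i.e. 0.25 new images per second

    clips = [ImageClip(path, duration=hold) for path in stills]
    film = concatenate_videoclips(clips, method="compose")
    film.write_videofile("oasis_sketch.mp4", fps=24)  # plays back at 24 fps

The container’s 24 fps and the rate at which new images appear are independent, which is the whole trick of the La Jetée approach: the format is standard video, but the rhythm belongs to still photography.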
Perhaps this is the sort of video best suited to the strengths of AI: rather than hoping text-to-video AI will someday generate Hollywood-style feature films at the push of a button, AI can be used in targeted ways within a workflow, to speed things up or to allow for the creation of new visual and narrative forms. Runway’s video tools beyond Gen-2, like their AI-assisted green screen rotoscoping, transcription, and so on, might ultimately prove more liberating to filmmakers than text-to-video. I see it as analogous to special effects: in the recent documentary on Industrial Light & Magic, I was struck by how the transition from practical to digital effects mirrors the discourse around AI, the same fear of job loss, of the loss of soul, of an era of authenticity falling to a new regime of cold, calculating machines. But the reality didn’t turn out so dire. Films are still a mixture of practical and digital effects, jobs were changed more than lost, and digital tools have made filmmaking more accessible and creatively liberating for new generations. The history of film, not to mention of images in general, is the history of technology, and technological sea changes have always undergirded and guided the shifting tides of moving-image culture. Yet the real reasons people continue to be drawn to movies (story, emotion, humanity, communication) remain the same through every seismic shift.
Of course, all of this should be taken with a grain of salt. Maybe it’s more than a coincidence that I happen to see my film as the model for future AI-generated filmmaking. Maybe later iterations of Runway’s Gen-X will completely rewrite the book on AI filmmaking, opening up new possibilities that we could never imagine. Or maybe film will remain a medium that is hard to get right, takes a lot of thought and planning and skill, and relies on distinctly human proclivities to resonate with a human audience. Time, as always, will tell.

