And the world just shifted on its axis (again): SORA video AI

OpenAI has just unveiled SORA, its video equivalent of DALL·E. It is a game changer – ‘yeah, the (VFX) world just shifted on its axis again!’

According to OpenAI, SORA is a diffusion model that generates a video by starting off with one that looks like static noise and gradually transforms it by removing the noise over many steps. SORA can generate entire videos all at once or extend generated videos to make them longer. By giving the model foresight of many frames at a time, they have solved the challenging problem of making sure a subject stays consistent even when it goes out of view temporarily.
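To make that description a little more concrete, here is a minimal, hypothetical sketch of a generic DDPM-style denoising loop in Python/PyTorch. The `model`, the noise schedule `betas` and the shapes are all assumptions for illustration – this is a textbook diffusion sampler, not OpenAI’s SORA code.

```python
import torch

# Illustrative only: a generic DDPM-style sampling loop, NOT SORA's actual code.
# `model(x, t)` is assumed to be a trained network that predicts the noise in x at step t.
def sample(model, shape, betas):
    """Start from pure static-like noise and remove it over many steps."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                      # begin with pure noise
    for t in reversed(range(len(betas))):       # walk the noise schedule backwards
        pred_noise = model(x, t)                # estimate the noise present at step t
        # Subtract the estimated noise contribution (standard DDPM mean update).
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * pred_noise) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject a little noise
    return x  # a sample that statistically resembles the training data
```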

A diffusion model is one of the key technologies in deep learning and generative modelling. It refers to a type of generative model that can produce high-quality, diverse ‘samples’ that mimic a given data distribution. In this case, the data distribution is photoreal images in a range of cinematic styles. But don’t be confused; it does not model in 3D, it does not understand physics or light, nor does it understand cameras, digital or otherwise. It produces statistically plausible images, and now it can do that in a way that seems temporally consistent. In other words, the videos do not appear to jitter between frames, and they (sort of) make sense over time. We have seen many prior attempts at this, but nothing as good or for as long as SORA. The new OpenAI SORA is producing plausible (for the most part) 60-second videos – much longer than was previously possible.

In the rapidly evolving landscape of digital content creation, this newest development has emerged to revolutionize how we think about generative art and synthetic media. At its core, a diffusion model is a sophisticated algorithm that learns to generate complex data, such as images, by gradually refining random noise into coherent patterns that resemble the training data – the high-quality clips it was shown to imitate. It does not directly use the source training data, but it learns what sort of pixel values appear in what sort of shot with what sort of scene. This simplified explanation makes it sound like magic, but in fact, it can only do this because of the VAST amount of material it trains on.

The journey of a diffusion model begins with the concept of “diffusing” the data, akin to adding layers of noise to an image until it becomes indistinguishable from random static. This process is then methodically reversed during generation. The model learns to carefully remove the noise, step by step, eventually revealing a new piece of content that mirrors the characteristics of its training set. This intricate dance between adding and removing noise is where the ‘magic’ happens, allowing the model to explore a vast landscape of possibilities before settling on a final output. As you can imagine, doing this for a still image is vastly simpler than for a moving clip.
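For readers who like to see that ‘adding and removing noise’ dance in code, the sketch below shows the standard noise-prediction training objective used by common diffusion models such as DDPM. Everything here – the `model`, the schedule, the loss – is a generic textbook formulation offered as an assumption, not a description of how SORA itself is trained.

```python
import torch
import torch.nn.functional as F

# Forward ("diffusing") direction: bury clean data x0 under Gaussian noise at level t.
def add_noise(x0, t, alpha_bars):
    noise = torch.randn_like(x0)
    xt = torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

# The reverse direction is learned: the network is trained to predict the injected
# noise, which is what lets it remove noise step by step at generation time.
def training_loss(model, x0, alpha_bars):
    t = torch.randint(0, len(alpha_bars), (1,)).item()    # pick a random noise level
    xt, noise = add_noise(x0, t, alpha_bars)
    predicted_noise = model(xt, t)
    return F.mse_loss(predicted_noise, noise)              # simple noise-prediction loss
```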

SORA videos are made from collections of smaller units of data called patches, each of which is akin to a token in GPT. “By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios,” OpenAI has posted. SORA clearly “builds on the groundbreaking research of DALL·E and ChatGPT models that OpenAI has developed.” SORA uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.
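To illustrate what a ‘patch’ means here, the hypothetical PyTorch snippet below slices a short clip into flat spacetime-patch vectors, the visual analogue of text tokens. The patch sizes (4 frames x 16 x 16 pixels) and the use of PyTorch are our own assumptions for illustration; OpenAI has not published SORA’s internal dimensions.

```python
import torch

# Hypothetical illustration of cutting a video into "spacetime patches".
# Patch sizes are invented for this example; SORA's real internals are unpublished.
def video_to_patches(video, pt=4, ph=16, pw=16):
    """video: (frames, channels, height, width) -> (num_patches, patch_dim)"""
    f, c, h, w = video.shape
    video = video[: f - f % pt, :, : h - h % ph, : w - w % pw]   # trim to whole patches
    f, c, h, w = video.shape
    blocks = video.reshape(f // pt, pt, c, h // ph, ph, w // pw, pw)
    blocks = blocks.permute(0, 3, 5, 1, 2, 4, 6)                 # group by patch position
    return blocks.reshape(-1, pt * c * ph * pw)                  # one flat vector per patch

clip = torch.randn(16, 3, 256, 256)      # a dummy 16-frame RGB clip
patches = video_to_patches(clip)
print(patches.shape)                     # torch.Size([1024, 3072])
```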

What sets diffusion models apart, and what should be of particular interest to the VFX community, is their unparalleled ability to produce images of stunning quality and diversity. This makes them especially appealing for applications in VFX, concept art and even virtual production, where the ability to quickly generate varied, high-quality assets can significantly streamline the production process.

In the realm of VFX, imagine the potential of using diffusion models to generate photorealistic textures, backgrounds or even complex character designs. The iterative nature of the model’s processing allows for an unprecedented level of control and fine-tuning, but it does not yet enable artists to guide the generation process towards an exact outcome, the way a director may direct live action. But – and it is a huge ‘but’ – SORA is remarkable for how quickly it has appeared. Even experts such as ourselves thought this level of video diffusion was still a year to 18 months away. SORA is a mic-dropping, jaw-dropping, drop-everything wake-up call to artists and TDs worldwide to get fluent in machine learning and modern AI tools.

The flexibility and adaptability of diffusion models mean they can be trained on specific styles or artistic genres, offering filmmakers and content creators a powerful tool to realize their unique visions. Whether generating eerie landscapes for a sci-fi thriller or crafting intricate costumes for a period drama, diffusion models stand ready to redefine the boundaries of what can be produced creatively in an incredibly short amount of time.

As we stand on the brink of this new era, it’s clear that diffusion models are not just another tool in the digital artist’s toolkit—they represent a paradigm shift in content creation. With their ability to learn from vast datasets and produce results that can astonish, inspire, and even blur the line between the real and the virtual, diffusion models promise to be a cornerstone of the next generation of digital storytelling and VFX.

As amazing as SORA is (and as promising as diffusion models are), they are not one-button-press feature film generators. It is easy to extrapolate that all jobs will be blown away and everything will be AI-generated, but that is unlikely. When digital effects broke into the world of visual effects, it is true that a lot of traditional visual effects jobs went away: there was quickly no need for optical printers, traditional matte painters and many other talented crafts. However, many more people are employed in visual effects today than ever before – not just a few more, but thousands more.

Note the ‘micro problem’: the dog is both in front of the shutters and on the ledge, back behind the shutters.

There are two levels at which to think about what SORA can’t do. At the micro level, as impressive as the videos are, they contain many minor problems: misformed hands, animals melting into one another or simply disappearing. These are problems that one might reasonably expect to be fixed relatively quickly. If we look at where this technology was a year ago, it is unbelievable how far the realism has jumped in just 12 months.

However, if we think at a higher level of abstraction, these tools are not telling stories; they are mimicking styles. They are insanely impressive but a long way off from delivering a complex performance with subtext over a long character arc. They cannot produce a season of Better Call Saul or an episode of The Late Show with Stephen Colbert. People care about people; they care about stories about people, which is why there are no blockbuster timelapse feature films. People care about what people have to say and who made the art they are watching. SORA and the generation of similar tools that will follow will have profound effects on VFX, but they won’t replace all filmmakers. The flip side of that logic is that if you have been putting off getting up to speed with ML tools like diffusion models, you might want to revisit that strategy.

As we delve deeper into the capabilities and applications of diffusion models, it’s evident that their impact on the visual effects industry and beyond will be profound. From revolutionizing the way we produce digital assets to opening new avenues for artistic expression, diffusion models are poised to become an indispensable tool in VFX.

 

  • And yes, ChatGPT aided in the writing of this story.