Actually Using SORA

In February, we published our first story on SORA. OpenAI had just released the first clips, which we described at the time as the video equivalent of DALL·E. SORA is a diffusion model that generates videos significantly longer and more cohesive than those of any of its rivals. By giving the model foresight of many frames at a time, OpenAI has solved the challenging problem of keeping a subject consistent even when it temporarily goes out of view. SORA can generate entire videos all at once, up to a minute in length. At the time, OpenAI also published technical notes indicating that it could (in the future) extend generated videos to make them longer or blend two videos seamlessly.

Several select production teams have been given limited access to SORA in the last few weeks. One of the most high-profile was the Shy Kids team, who produced the SORA short film Air Head. Sidney Leeder produced the film. Walter Woodman was the writer and director, while Patrick Cederberg was responsible for the post-production. The Toronto team have been nicknamed “punk-rock Pixar”, and their work has garnered Emmy nominations and been long-listed for the Oscars. We sat down this week with Patrick for a long chat about the current state of SORA.

Shy Kids is a Canadian production company renowned for its eclectic and innovative approach to media production. Originating as a collective of creatives from various disciplines, including film, music, and television, Shy Kids has gained recognition for its unique narrative styles and engaging content. The company often explores adolescence, social anxiety, and the complexities of modern life while maintaining a distinctively whimsical and heartfelt tone. Their work showcases a keen eye for visual storytelling and often features a strong integration of original music, making their productions resonant and memorable. Shy Kids has successfully carved out a niche by embracing new AI technology and creativity, pushing what is possible.

SORA : Mid-April ’24.

SORA is in development and is actively being improved through feedback from teams such as Shy Kids, but here is how it currently works. It is important to appreciate that SORA is effectively almost pre-alpha. It has not been released, nor is it in beta.

“Getting to play with it was very interesting,” Patrick comments. “It’s a very, very powerful tool that we’re already dreaming up all the ways it can slot into our existing process. But I think with any generative AI tool, control is still the thing that is the most desirable and also the most elusive at this point.”

UI

The user interface allows an artist to input a text prompt; OpenAI’s ChatGPT then converts this into a longer string, which triggers the clip generation. At the moment, there is no other input; it is yet to be multimodal. This is significant because, while SORA is correctly applauded for its object consistency during a shot, there is nothing to help make anything from the first shot match in a second shot. The results would be different even if you ran the same prompt a second time. “The closest we could get was just being hyper-descriptive in our prompts,” Patrick explains. “Explaining wardrobe for characters, as well as the type of balloon, was our way around consistency because shot to shot / generation to generation, there isn’t the feature set in place yet for full control over consistency.”
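The “hyper-descriptive” workaround can be pictured as a simple prompt template: every shot repeats the same character and style description, and only the action changes. The sketch below is purely illustrative. The wardrobe wording, the shot actions, and the `build_prompt` helper are all hypothetical, not the team's actual prompts or any SORA interface, and since the same prompt can still produce different results, this only narrows the variation rather than guaranteeing a match.

```python
# Hypothetical sketch: reuse one fixed character description in every shot prompt.
# The description and actions below are invented for illustration only.
CHARACTER = (
    "a man with a plain yellow latex balloon for a head, "
    "no face on the balloon, no string, "
    "wearing a grey wool blazer and blue jeans"
)
STYLE = "35mm film"  # style keyword the team reported using in their prompts


def build_prompt(action: str, character: str = CHARACTER, style: str = STYLE) -> str:
    """Prefix each shot with the same style keyword and character description."""
    return f"{style}, {character}, {action}"


shot_actions = [
    "walking through a cactus shop",
    "riding a bicycle down a suburban street",
]
prompts = [build_prompt(action) for action in shot_actions]

for prompt in prompts:
    print(prompt)
```

Keeping the shared description in one constant means a wardrobe change is made once and carried into every subsequent generation.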

The individual clips are remarkable and jaw-dropping for the technology they represent, but the use of the clips depends on your understanding of implicit or explicit shot generation. If you ask SORA for a long tracking shot in a kitchen with a banana on a table, it will rely on its implicit understanding of ‘banana-ness’ to generate a video showing a banana. Through training data, it has ‘learnt’ the implicit aspects of banana-ness, such as ‘yellow’, ‘bent’, ‘has dark ends’, etc. It has no actual recorded images of bananas. It has no ‘banana stock library’ database; it has a much smaller, compressed hidden or ‘latent space’ of what a banana is. Every time it runs, it shows another interpretation of that latent space. Your prompt relies on an implicit understanding of banana-ness.

Prompting the right thing to make Sonny

For Air Head, the scenes were made by generating multiple clips to an approximate script, but there was no explicit way to keep the actual yellow balloon head the same from shot to shot. Sometimes, when the team prompted for a yellow balloon, it wouldn’t even be yellow. Other times, it had a face embedded in it or a face seemingly drawn on the front of the balloon. Because many balloons have strings, the Air Head character, nicknamed Sonny, the balloon guy, would often have a string running down the front of his shirt. SORA implicitly links string with balloons, so these strings had to be removed in post.

An unwanted face on the balloon, from a raw SORA output.

Resolution

Air Head uses only SORA-generated footage, but much of it was graded, treated, and stabilised, and all of it was upscaled or upresed. The clips the team worked with were generated at a lower resolution and then upresed using AI tools outside SORA or OpenAI. “You can do up to 720p (resolution),” Patrick explains. “I believe there’s a 1080 feature that’s out, but it takes a while (to render). We did all of Air Head at 480 for speed and then upresed using Topaz.”

Prompting ‘time’: A slot machine.

The original prompt is automatically expanded but also displayed along a timeline. “You can go into those larger keyframes and start adjusting information based on changes you want generated,” Patrick explains. “There’s a little bit of temporal control about where these different actions happen in the actual generation, but it’s not precise… it’s kind of a shot in the dark – like a slot machine – as to whether or not it actually accomplishes those things at this point.” Of course, Shy Kids were working with the earliest of prototypes, and SORA is still constantly being worked on.

In addition to choosing a resolution, SORA allows the user to pick the aspect ratio, such as portrait or landscape (or square). This came in handy on the shot that pans up from Sonny’s jeans to his balloon head. Unfortunately, SORA would not render such a move natively, always wanting the main focus of the shot—the balloon head—to be in the shot. So the team rendered the shot in portrait mode and then manually, via cropping, created the pan-up in post.
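As a rough illustration of that post-production pan-up, the sketch below computes a landscape crop window that slides from the bottom of a portrait frame to the top over a number of frames. The frame sizes, the linear move, and the `pan_up_crops` helper are assumptions made for the example; the article does not say which tool or easing the team used.

```python
def pan_up_crops(src_w, src_h, out_w, out_h, n_frames):
    """Return one (x, y, w, h) crop box per frame, sliding bottom to top.

    Uses image coordinates, where y = 0 is the top row of the source frame.
    """
    if out_w > src_w or out_h > src_h:
        raise ValueError("crop window must fit inside the source frame")
    if n_frames < 2:
        raise ValueError("need at least two frames to describe a move")

    x = (src_w - out_w) // 2   # keep the window horizontally centred
    max_y = src_h - out_h      # lowest position the window's top edge can take

    boxes = []
    for i in range(n_frames):
        t = i / (n_frames - 1)         # 0.0 on the first frame, 1.0 on the last
        y = round(max_y * (1 - t))     # start at the bottom, end at the top
        boxes.append((x, y, out_w, out_h))
    return boxes


# Assumed sizes: a 480x854 portrait source cropped to a 480x270 landscape window.
boxes = pan_up_crops(src_w=480, src_h=854, out_w=480, out_h=270, n_frames=5)
for box in boxes:
    print(box)
```

A production move would normally ease in and out rather than travel linearly, but the principle is the same: the generated frame is taller than the delivered frame, so the camera move becomes a crop animation.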

Prompting camera directions

For many genAI tools, a valuable source of information is the metadata that comes with the training data, such as camera metadata. For example, if you train on still photos, the camera metadata will provide the lens size, the f-stop and many other critical pieces of information for the model to train on. With cinematic shots, however, ‘tracking’, ‘panning’, ‘tilting’ or ‘pushing in’ are not terms or concepts captured by metadata. As much as object permanency is critical for shot production, so is being able to describe a shot, which Patrick noted was not initially in SORA. “Nine different people will have nine different ideas of how to describe a shot on a film set. And the (OpenAI) researchers, before they approached artists to play with the tool, hadn’t really been thinking like filmmakers.” Shy Kids knew that their access was very early, but “the initial version about camera angles was kind of random.” Whether or not SORA was actually going to register the prompt request or understand it was unknown, as the researchers had just been focused on image generation. Shy Kids were almost shocked by how surprised OpenAI was by this request. “But I guess when you’re in the silo of just being researchers, and not thinking about how storytellers are going to use it… SORA is improving, but I would still say the control is not quite there. You can put in a ‘Camera Pan’ and I think you’d get it six out of 10 times.” This is not a unique problem; nearly all the major video genAI companies are facing the same issue. Runway AI is perhaps the most advanced in providing a UI for describing the camera’s motion, but Runway’s quality and length of rendered clips are inferior to SORA’s.

Render times

Clips can be rendered in varying lengths, such as 3, 5, 10 or 20 seconds, up to a minute. Render times vary depending on the time of day and the demand for cloud usage. “Generally, you’re looking at about 10 to 20 minutes per render,” Patrick recalls. “From my experience, the duration that I choose to render has a small effect on the render time. If it’s 3 to 20 seconds, the render time tends not to vary too much from between a 10 to 20-minute range. We would generally do that because if you get the full 20 seconds, you hope you have more opportunities to slice/edit stuff out and increase your chances of getting something that looks good.”

Roto

While all the imagery was generated in SORA, the balloon still required a lot of post-work. In addition to being isolated so it could be re-coloured, the balloon would sometimes appear with a face on it, as if Sonny’s face had been drawn on with a marker, and this would be removed in After Effects. Other similar artifacts were often removed as well.

Editing a 300:1 shooting ratio

The Shy Kids methodology was to approach post-production and editing like a documentary, where there is a lot of footage, and you weave a story from that material rather than strictly shooting to a script. There was a script for the short film, but the team needed to be agile and adapt. “It was just getting a whole bunch of shots and trying to cut it up in an interesting way to the VO,” Patrick recalls.

For the minute and a half of footage that ended up in the film, Patrick estimated that they generated “hundreds of generations at 10 to 20 seconds apiece”, adding, “My math is bad, but I would guess probably 300:1 in terms of the amount of source material to what ended up in the final.”
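To put those numbers in perspective, the short sketch below converts between a shooting ratio, the length of the final cut, and the number of generated clips. The 90-second final length and the 10 to 20 second clip lengths come from Patrick’s description; the 300-generation figure in the second half is an assumed stand-in for “hundreds”, and the helper names are hypothetical.

```python
import math


def source_seconds(final_seconds, ratio):
    """Total seconds of source footage implied by a ratio of source to final."""
    return final_seconds * ratio


def clips_needed(total_seconds, clip_seconds):
    """Number of clips of a given length needed to reach the total footage."""
    return math.ceil(total_seconds / clip_seconds)


def shooting_ratio(n_clips, clip_seconds, final_seconds):
    """Ratio of generated footage to final footage."""
    return (n_clips * clip_seconds) / final_seconds


FINAL_SECONDS = 90  # "the minute and a half of footage"

# What a 300:1 ratio would imply.
total = source_seconds(FINAL_SECONDS, 300)
print(total / 3600, "hours of source footage")          # 7.5 hours
print(clips_needed(total, 20), "clips at 20 seconds")   # 1350
print(clips_needed(total, 10), "clips at 10 seconds")   # 2700

# What an assumed 300 generations would imply instead.
print(shooting_ratio(300, 10, FINAL_SECONDS))
print(shooting_ratio(300, 20, FINAL_SECONDS))
```

Under these assumptions, 300:1 corresponds to well over a thousand generations, while a few hundred generations would land somewhere between roughly 30:1 and 70:1. Either way, the figure should be read as the rough guess Patrick says it is: a very large amount of material was generated for each second on screen.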

Comping multiple takes and retiming

On Air Head, the team did not comp multiple takes together. For example, the shots of the balloon drifting over the motor racing were each generated as a single shot, pretty much as seen. However, they are working on a new film that mixes and composites multiple takes into one clip.

Interestingly, many of the Air Head clips were generated as if shot in slow motion, even though this was not requested in the prompt. This happened for unknown reasons, so many of the clips had to be retimed to appear to have been shot in real time. Clearly, this is easier to do than the reverse of slowing down rapid motion, but still, it seems like an odd aspect to have been inferred from the training data. “I don’t know why, but it does seem like a lot of clips at 50 to 75% speed,” he adds. “So there was quite a bit of adjusting timing to keep it all from feeling like a big slowmo project.”
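The retiming arithmetic is simple, and it shows why slow-motion output also shortens the usable footage. A minimal sketch, assuming a clip plays at a constant fraction of real-time speed (the `retime` helper is hypothetical):

```python
def retime(clip_seconds, apparent_speed):
    """Return (speed_up_factor, real_time_seconds) for a slow-motion clip.

    apparent_speed is the fraction of real-time speed the clip appears to
    play at, for example 0.5 for half speed.
    """
    if not 0 < apparent_speed <= 1:
        raise ValueError("apparent_speed must be in the range (0, 1]")
    speed_up_factor = 1 / apparent_speed
    real_time_seconds = clip_seconds * apparent_speed
    return speed_up_factor, real_time_seconds


# A 20-second generation at the two ends of the range Patrick describes.
print(retime(20, 0.5))    # (2.0, 10.0)
print(retime(20, 0.75))
```

A 20-second generation that looks like 50% speed has to be played twice as fast and yields only 10 seconds of real-time footage; at 75% speed, it yields 15 seconds.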

Lighting and grading

Shy Kids used the term ‘35mm film’ in their prompts as a keyword and generally found that it gave them the level of consistency they sought. “If we needed a high contrast, we could say high contrast, and say key lighting would generally give us something that was close,” says Patrick. “We still had to take it through a full color grade, and we did our own digital filmic look, where we applied grain and flicker to just sort of meld it all together.” There is no option for additional passes such as mattes or depth passes.

Copyright

OpenAI is trying to be respectful and not allow material to be generated that violates copyright or that appears to come from someone who did not make it. For example, if you prompt something such as ‘35mm film in a futuristic spaceship, a man walks forward with a light sword’, SORA will not allow the clip to be generated as it is too close to Star Wars. Shy Kids accidentally bumped into this during early testing. Patrick recalls that when they initially sat down and just wanted to test SORA, “We had that one shot behind the character’s back; it’s kind of that Aronofsky following shot. And I think it was just my dumb brain, as I was tired, but I put ‘Aronofsky type shot’ in and got hit with a can’t do that.” ‘Hitchcock Zoom’ was another example that came up: it has become, by osmosis, a technical term, but SORA would reject the prompt for copyright reasons.

Sound

Shy Kids are known for their audio skills in addition to their visual skills. The music in the short film is their own. “It was a song we had in the back catalogue that we almost immediately decided on because the song’s called The Wind,” says Patrick. “We all just liked it.”

Patrick himself is the voice of Sonny. “Sometimes we’d feel pacing-wise the film needed another beat. So I would write another line, record it, and come up with some more SORA generations, which is another powerful use of the tool in the post: when you’re in a corner, and you need to fill a gap, it’s a great way to start brainstorming and just spit clips out to see what you can use to fill the pacing problem.”

Summary

SORA is remarkable; the Shy Kids team produced Air Head with a team of just three people in around 1.5 to 2 weeks. The team is already working on a wonderful, self-aware, and perhaps ironic sequel. “The follow-up is a journalistic approach to Sonny, the balloon guy, and his reaction to fame and subsequent sort of falling out with the world,” says Patrick. “And we’re exploring new techniques!” The team is looking to be a bit more technical in their experimentation, incorporating AE compositing of SORA elements into real live-action footage and using SORA as a supplementary VFX tool.

SORA is very new, and even the basic framework that OpenAI has sketched out and demonstrated for SORA has yet to be made available for early testers to use. It is doubtful that SORA in its current form will be released anytime soon, but it is an incredible advance in a particular type of implicit image generation. For high-end projects, it may be a while before it allows the level of specificity that a director requires. For many others, it will be more than ‘close enough’ while delivering stunning imagery. Air Head still needed a large amount of editorial and human direction to produce this engaging and funny story film. “I just feel like people have to SORA as an authentic part of their process; however, if they don’t want to engage with anything like that, that’s fine too.”