1st SORA Music Video, How SORA is Evolving & Guessing Possible Pricing

SORA, as we discussed in our last fxguide story on Air Head, is now being used by selected artists and directors in a very limited alpha testing phase. Seen very much as the video equivalent of DALL·E, SORA is a diffusion model that generates videos significantly longer and more cohesive than those of any of its rivals. Given the dramatic jump in image quality SORA has exhibited, there is naturally an enormous amount of interest in how it will affect VFX professionals. In our continuing series, we speak to filmmakers using this important new tool and document how their feedback is being incorporated into the quickly developing SORA UI and capabilities.

The Hardest Part

Paul Trillo is a multidisciplinary artist, writer, and director who has been using SORA for about two months now. His diverse body of work spans various genres and formats, and he is constantly pushing the boundaries of what’s possible in filmmaking. He has just released a music video made with SORA for indie chillwave artist Ernest Greene Jr., known musically as Washed Out.

“The Hardest Part” music video is from Washed Out’s new album “Notes From a Quiet Life”, out June 28th. The roughly four-minute video, directed by Paul Trillo, zooms or tracks the camera through various scenes of a couple’s life together. The viewer sees the characters age and transition from 1980s high school kids to becoming a married couple and having a child.

We have been covering SORA since it was made public. Thus, we were excited to see this new video, as SORA has been regularly updating its capabilities in line with its published February technical roadmap. In particular, this video seems to be delivering on SORA’s stated aim of being able to “gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions.” Or so we thought… To find out the current state of play for using SORA as a creative tool, we sat down with director Paul Trillo.

The Hardest Part video was made before the SORA multi-modal blending transition feature, which was outlined in the original SORA technical roadmap, became available. This blending feature is now in private testing with key lighthouse creatives such as Paul Trillo, but it was released inside SORA only after the video was completed. “The transitions were created with long-form prompts akin to writing a scene description,” Paul explained. “A few of the transitions were handled in After Effects.”

As we discussed in our previous fxguide story on Air Head, rather than being a one-button-press solution, as some people (outside OpenAI) have speculated, there is normally a very high shooting ratio for SORA projects right now and a lot of work to craft a sequence. (But OpenAI is responding rapidly, and this is changing—more on that below.)

For The Hardest Part video, which is almost four minutes long, Paul estimates that he generated around 700 clips. Most of these were not the full one minute but closer to twenty seconds, so roughly “about 230 minutes of video was generated in total,” he estimates, “and I used about 55 of those clips.” These were all rendered at 720p resolution and then, as with Air Head, upscaled to 2K using Topaz.
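As a rough check on those figures, the arithmetic can be sketched in a few lines of Python. The flat 20-second average is our assumption, taken from Paul's "closer to twenty seconds" remark, and the variable names are ours and purely illustrative.

```python
# Rough check of the footage figures quoted above.
# ASSUMPTION: every clip averages 20 seconds ("closer to twenty seconds").
clips_generated = 700
clips_used = 55
avg_clip_seconds = 20

# Total footage generated, in minutes.
total_minutes = clips_generated * avg_clip_seconds / 60

# How many clips were generated for each clip that made the final cut.
clips_per_used_clip = clips_generated / clips_used

print(f"Generated footage: about {total_minutes:.0f} minutes")
print(f"Clips generated per clip used: about {clips_per_used_clip:.1f}")
```

Under that assumption the total comes out within a few minutes of Paul's estimate of 230 minutes, and roughly thirteen clips were generated for every one that was used.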

The Cost of SORA?

Paul Trillo is on a private alpha, and we have no visibility into that relationship. But doing research away from this specific video, there is an accepted working estimate that SORA can produce about 5 minutes of video per hour per NVIDIA H100. The site Factorial Funds has an excellent breakdown of what that might mean for a possible SORA cost model.

There is a vast difference between training and inference configurations. A SORA user only needs inference, which is naturally much faster and cheaper than the enormous amount of computing power required to train a GenAI model like SORA. We should also stress that these numbers are in no way from OpenAI, nor are they in any way official (or off-the-record unofficial) estimates. That said, away from OpenAI-specific cloud rendering, any ML professional would typically budget in the range of $13-$15 an hour for 8x L4 GPU (we have seen quotes such as $14/hr). Based on that estimate alone, and on the working figure of about 5 minutes of video per H100 hour, the pure compute cost of inferring 230 minutes of SORA would be: 230 minutes of SORA = 2,760 minutes of H100 = 46 hours @ $14 per hour = US$644.00. Plus, there would be upload and download costs as well as storage costs. This also ignores that OpenAI may add a margin to, or alternatively subsidise, any actual SORA release pricing. That being established, as one post house TD pointed out to fxguide, “That sort of pricing would be cheap for a professional, but expensive for a non-professional”.
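The same back-of-envelope calculation can be written as a short Python sketch, which makes it easy to try other clip totals or hourly rates. It uses only the working estimates quoted above, none of which come from OpenAI, and like the estimate above it applies the quoted US$14 hourly rate to each H100 hour. The function and constant names are hypothetical.

```python
# Back-of-envelope compute cost, using only the figures quoted above.
# ASSUMPTIONS (not from OpenAI): 5 minutes of video per H100 hour,
# and US$14 per GPU hour, the mid-point of the quoted $13-$15 range.
VIDEO_MINUTES_PER_GPU_HOUR = 5
COST_PER_GPU_HOUR_USD = 14.0


def inference_cost_usd(video_minutes: float) -> float:
    """Return the estimated pure compute cost of generating the footage."""
    gpu_hours = video_minutes / VIDEO_MINUTES_PER_GPU_HOUR
    return gpu_hours * COST_PER_GPU_HOUR_USD


# Everything Paul estimates he generated for the video.
generated_cost = inference_cost_usd(230)
print(f"230 minutes generated: US${generated_cost:,.2f}")
```

Under these assumptions, each generated minute costs about US$2.80 in raw compute, before any storage, transfer, or margin is added.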

It is also an insight into why OpenAI’s CEO Sam Altman was reported in the WSJ as being in talks to explore raising up to $7 trillion to develop silicon-chip manufacturing capacity, possibly involving the United Arab Emirates government, to power a new generation of artificial intelligence. To put that in perspective, the entire US GDP is estimated in 2024 to be $28.78 trillion. Altman’s plan would be equal to a quarter of America’s current GDP, but it is clearly not to be invested by OpenAI in just one year. Balance that with the fact that SORA and programs like it will inevitably be used to make social media posts: TikTok currently has 17 million minutes of video uploaded per day, and YouTube a staggering 43 million minutes per day. If SORA were used by only a modest percentage of such social media users, the compute and power requirements would be nothing short of jaw-dropping.

Naturally, these are just the base compute expenses and in no way capture the creative invention, human time, and contributions, such as direction, plus editing, colour grading, and VFX, that are also required to make a clip as creative as The Hardest Part.

New MiniClips

One huge innovation that had also not been released when The Hardest Part was made is MiniClips. It is now in user testing but has only been available for the last couple of weeks. Paul points out that it only became available after he completed his music video project.

MiniClips is a simple yet invaluable new SORA UI tool that allows a director like Paul to see the first four frames of either 8 or 32 mini-inferred clips. Previously, once a text prompt was input, the user had to wait until the whole clip, some 20 seconds to a minute of video, had been fully rendered before judging anything about it. Now, a director can ask to see the first four frames of a series of clips and judge if any are worth continuing to a full inference. Imagine your text prompt is for a car travelling on a country highway. Being able to see the first four frames of any inference quickly allows the basics of the shot to be checked: is it the right car? Is the ‘country’ the correct type of location? And so on. This can save much trial and error and rapidly improve prompt engineering.

Prompt Engineering

Working out the right prompt to get what one creatively seeks is challenging. Currently, there is no full implementation of multi-modal input, so any shot starts with just a text description. This is both SORA’s most remarkable achievement and one of its limitations. It is easier to describe a scene if you are able to provide visual reference. This is why multi-modal input is expected to be the next vast step up in SORA’s ease of use, and why prompt engineering is so critical right now.

We asked Paul if he changed his prompts much along the timeline of an individual shot. In SORA, after an initial prompt is submitted, you can vary the prompt over the timeline of the clip so that different things are seen at various points in the clip. Without this, things wouldn’t leave the frame, and the creative focus could not change during the shot. “Yes, the prompt for each shot is quite long,” he explained. “Allowing for multiple beats and scenes to fly through within a single generation. Once I had a formula to get the camera move, speed, and character, I would continue to re-edit to create new scenes.”

Here is an example of one of the actual prompts used in The Hardest Part:

continuous shot moving forward zooming through time, with a view of 1980s highschool hall corridor with checkered tiled floor, buzzing with students walking around. the scene is captured from a low angle front perspective, showing a door at the end of the corridor getting bigger and closer. the scene is blurred, indicating a high speed movement. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. • One point perspective FPV, continuous shot moving forward zooming through a time and through the doorway, with a view of a open classroom of students dressed in 80s attire. we zoom through students looking to the front of the class room rushing in front of the lens. the classroom has a distinct 80s feel. the scene is captured from a front perspective, showing the students getting bigger and bigger we see two students, a male student with dark hair and jean jacket making eye contact with a female student also in a jean jacket. the female student is chewing bubblegum and make a bubble from pink bubble gum. the scene is blurred, indicating a high speed movement. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. • One point perspective FPV, continuous shot moving forward zooming through the classroom, with a 18 year old boy with dark hair and jean jacket making eye contact with a female student also in a jean jacket. the female makes a bubble with pink bubblegum in front of the lens. we zoom through the bubble it pops and we zoom through the bubblegum and enter an open football field. the scene is moving rapidly, showing a front perspective, showing the students getting bigger and faster. 
the scene is blurred, indicating a high speed movement. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. • One point perspective FPV, continuous shot moving forward zooming through an open football field overcast, from the 1980s, with the bleachers in the background distance. in the center of the shot is the same guy and girl in jean jackets with their back to camera walking in the field. we see they are holding hands the camera narrows in zooming in toward their hands clutching. the scene is moving rapidly, showing a front perspective of their hands getting bigger and closer. we zoom toward the bleachers in the background, the scene is blurred, indicating a high speed movement. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. • One point perspective FPV, continuous shot moving forward zooming through the couple’s hands holding, we zoom through the bleachers in background of the football field and through a moody forest of trees at night with the neon glow of the city in the background is out of focus with bokeh. the city is out of focus behind the trees at night. the scene is captured by the camera in a fast and smooth movement. the scene is blurred, indicating a high speed movement. the trees have an opening a tunnel at the center that we enter. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. 
• One point perspective FPV, continuous shot moving forward zooming through the opening between the dark moody forest trees and we enter to a look out point at the top of a hill with a view of the out of focus city lights shimmering in the background. we zoom in toward an 80s car parked a the top of the hill with it’s red taillights illuminated the grassy hill, the the lookout point and car scene is quaint and peaceful. the scene is moving rapidly, showing a front perspective of the town getting smaller and further at night. the scene is blurred, indicating a high speed movement. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. • One point perspective FPVcontinuous shot moving forward zooming through the nightime lookout point zooming through the back window of an 80s car and into the interior of the 80s car where the young couple are seating in the front seat and are leaning in toward each other, with a view of a out of focus city in the background through the car windshield, the scene is moving rapidly, showing a top view of the city. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. • One point perspective FPV, continuous shot moving forward zooming through the interior of the 80s backsetat car where the couple are seating in the front seat and lean in to each other, with a view of a out of focus city in the background through the car windshield. the scene is moving rapidly, showing a straight view of the out of focus city outside the car windshield. we zoom between the faces of the young couple as they lean in toward each other. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. 
the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. • One point perspective FPV,continuous shot moving forward zooming through the front seat of the car toward the young couple leaning in toward each other and we zoom out the windshield into the city at night repeating new york library with large aisles, with a counter, shelves, and products. the library is large and crowded, is in a new york city we zoom into a woman reading a book looking over their shoulder she is holiding a book up, the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. •One point perspective FPV, continuous shot moving forward zooming through infinitely through the windshield into the out of focus city at night, we zoom in and drop down to the city at night zooming through the street, through the street lamps, we zoom into the young couple walking down the middle of the street at night, the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. motion blur as we zoom continuous shot, analog film. •One point perspective FPV, continuous shot moving forward zooming through an infinitely down the street at night and we see the couple again laughing and running under the lights at night in a suburban street, looking over their shoulder we land in a close up shot of the book. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is moody and cinematic, with a slight vignette and a warm, vintage tone. the shot is captured on 35mm film, fuji film stock from the 90s with an anamorphic 24mm lens. 
motion blur as we zoom continuous shot, analog film. One point perspective FPV
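The prompt above follows a visible formula: each beat opens with the same camera description, closes with the same look-and-film-stock description, and beats are separated by a “•”. As a minimal sketch of that structure, and not Paul's actual tooling, the Python below assembles such a prompt from a list of beats. SORA has no public API at the time of writing, so the code only builds a text string. The function and variable names are hypothetical, and the two beats are shortened paraphrases of the prompt above.

```python
# Illustrative only: this builds a text prompt and does not call SORA.
# The style text and the "•" separator are copied from the prompt above;
# all function and variable names are hypothetical.
CAMERA_PREFIX = (
    "One point perspective FPV, continuous shot moving forward zooming through"
)
STYLE_SUFFIX = (
    "the shot is moody and cinematic, with a slight vignette and a warm, "
    "vintage tone. the shot is captured on 35mm film, fuji film stock from "
    "the 90s with an anamorphic 24mm lens. motion blur as we zoom "
    "continuous shot, analog film."
)


def build_prompt(beats: list[str]) -> str:
    """Join scene beats into one long prompt with a shared camera and look."""
    segments = [
        f"{CAMERA_PREFIX} {beat.strip()} {STYLE_SUFFIX}" for beat in beats
    ]
    return " • ".join(segments)


# Shortened paraphrases of the first two beats of the prompt above.
beats = [
    "a 1980s highschool hall corridor with checkered tiled floor.",
    "the doorway, with a view of an open classroom of students in 80s attire.",
]
prompt = build_prompt(beats)
print(prompt)
```

Keeping the camera and style text in one place mirrors what Paul describes as having “a formula”: only the beat descriptions need re-editing to create a new scene.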

SORA is trained on LLMs but not specifically on cinematic terms, which has led other SORA filmmakers to express their prompts not in cinematic language but in a more traditional set of terms. For example, some people have expressed difficulty using terms like ‘crane up’ or communicating the difference between a ‘panning shot’, a ‘tracking shot’ and a ‘dolly shot’. Paul comments that SORA understood specific terms like zooming and FPV perspective, and he used these to create his kinetic dolly move. “Technically, the video is not zooming. It’s a dolly push, but that language doesn’t land as well,” he explains. But SORA does understand ‘motion blur’, ‘35mm film stock’, and ‘80s and retro’, and these terms certainly played a crucial part in creating this music video’s more organic, filmic look. “My initial challenge was to see if I could create something that felt less like a video game and more like a strange feed into memories from another dimension. There were a lot of terms it simply ignores, so you have to learn to speak its language to some degree.”

We asked if he had any advice on prompt engineering for filmmakers. “Experiment, throw weird things at it, fail, fail, and try again. Use your mind’s eye to envision exactly what you want to see and try to break it down like you’re talking to a child,” he responded.

Lessons Learnt from Previous Projects

Comparing it to Paul’s previous non-SORA AI project, Notes To My Future Self, we asked how much post-production he did in terms of colour grading and compositing on the SORA-generated The Hardest Part. “I did a color pass and two transitions in After Effects, but everything else is pretty much raw out of SORA, including the overall aesthetic. I wanted to lean into the strange, surreal hallucinations from SORA and not cover it up.”

Notes To My Future Self took a very different approach, in that the earlier project embraced how AI might be used alongside a host of standard VFX tools, but the two projects naturally share a similar creative aesthetic and vision. “They’re pretty different projects conceptually, however they share a common ground in that they explore a dream like view of the world,” he adds. “I find the best use cases for AI are using it to represent memories and dreams.”