Believable and immersive audio

 

We recently had two chances to think and hear about advanced, professional immersive audio, something we confess we don’t cover enough here at fxguide. The first was at FMX, in a talk on the clever recording of audio, and the second was about understanding ASAF for Apple Immersive Video, one of the most technically interesting ways to experience believable audio. They were not connected in terms of their sources or even their intended audiences, but maybe we should all consider audio more when thinking about how to achieve believable and immersive storytelling.

FMX: how sound comes alive

John Montgomery and I wrapped another great week at FMX in Stuttgart, Germany, and while the conference was packed with the expected wealth of AI, visual effects, and emerging technology talks, one of the unexpected highlights for me was a session focused entirely on sound.

The talk was by Jens Rosenlund Petersen, Sound Editor at Cinphonic, and it was genuinely one of the standout presentations I saw this year.

Jens Rosenlund Petersen

Petersen is an Emmy, Golden Reel, and C.A.S. award-winning sound editor and mixer whose work spans feature films, documentaries, theatre, installation art, and immersive media. His credits include Bohemian Rhapsody, Moonage Daydream, I Wanna Dance With Somebody, Enola Holmes 1 & 2, and Saltburn. He has also recently worked as dialogue editor on Emerald Fennell’s 2026 film Wuthering Heights, starring Margot Robbie – and most interestingly for his FMX presentation, he also just completed the new hit Michael Jackson biopic Michael.

I confess, I more or less stumbled into this talk rather than deliberately seeking it out, but it was, as we say, “a cracker” of a session.

At fxguide, we are very familiar with the tricks, sleights of hand, and deeply technical workflows of visual effects. What was so refreshing was hearing the equally inventive and often wonderfully practical tricks used by sound recordists, editors, and mixers. Petersen’s presentation explored how to add texture, scale, and emotional depth to a soundtrack in post-production, using examples from dialogue, crowd recordings, music, and performance-based films.

One of the most striking examples came from I Wanna Dance With Somebody, the Whitney Houston biopic. For one of the concert sequences, the team had the original music recording, but they wanted it to feel as if it had genuinely been played in a large concert venue. Rather than simply applying a digital reverb preset, they hired a venue, played the track back into that space, and recorded it again.

But the clever part was that they played the track at double speed…

At first, that detail seems counterintuitive. Why play the track at twice the speed if you are trying to capture a natural concert ambience? Petersen explained that when the resulting recording was slowed back down, the reverb and reflections of the space also stretched in time. In effect, the echoes took twice as long to return. The music still played back at the correct speed, but the acoustic signature now felt as though it had been captured in a venue roughly twice the size. It is a wonderfully elegant trick: not relying on a plug-in to simulate scale, but physically recording a real acoustic space and then manipulating time to extend its perceived dimensions.
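
The timing logic behind the trick can be sketched in a few lines of Python. This is only an illustrative model, not the team's actual workflow: the room is reduced to a couple of discrete reflections, and the function name `rerecord_in_room` and the 50 ms and 120 ms delays are made-up assumptions.

```python
# Illustrative model only: the room is reduced to a few discrete reflections,
# and every number below is a made-up assumption, not data from the film.

def rerecord_in_room(onsets_s, reflection_delays_s, playback_speed):
    """Return {onset: [echo arrival times]} on the final, normal-speed timeline.

    onsets_s            -- times of musical events in the original track (seconds)
    reflection_delays_s -- how long after the direct sound each room echo arrives
    playback_speed      -- speed used in the venue (2.0 = double speed)
    """
    result = {}
    for onset in onsets_s:
        # In the venue the event happens early because the track is sped up.
        venue_time = onset / playback_speed
        # The room adds its echoes in real time; it knows nothing about the speed-up.
        venue_echoes = [venue_time + delay for delay in reflection_delays_s]
        # Slowing the recording back down stretches every timestamp.
        result[onset] = [t * playback_speed for t in venue_echoes]
    return result


if __name__ == "__main__":
    hall = [0.05, 0.12]  # hypothetical reflections at 50 ms and 120 ms
    for speed in (1.0, 2.0):
        echoes = rerecord_in_room([1.0], hall, speed)[1.0]
        gaps_ms = [round((t - 1.0) * 1000) for t in echoes]
        print(f"speed x{speed}: echoes arrive {gaps_ms} ms after the note")
```

In this toy model the note itself lands back at its original time, while the echoes that arrived 50 ms and 120 ms after it in the real room now trail it by 100 ms and 240 ms, which is what the ear would expect from reflecting surfaces roughly twice as far away.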

Petersen’s work is driven by a search for authenticity. He approaches sound not merely as an audio layer added after the fact, but as a physical experience. His background in theatre and installation art clearly informs this approach. He is interested in the way sound occupies space, how bodies move through that space, and how recordings can carry the weight, breath, and texture of real human presence.

That sense of physicality was also central to his work on Bohemian Rhapsody, particularly in recreating Queen’s famous 1985 Live Aid performance at Wembley Stadium. On screen, the sequence appears to take place in front of a vast, roaring crowd. In reality, much of that crowd energy had to be built after principal photography, with the performance staged against what was, acoustically, a far more controlled and much emptier environment. Petersen and the sound team had to construct the illusion of Wembley as a living, breathing mass of people.

This was not simply a matter of adding a generic stadium crowd track. Specific crowd voices, reactions, cheers, claps, and moments of human detail had to be placed carefully within the soundfield so that the crowd felt enormous, but also real. The result is one of those invisible achievements of sound post-production: the audience believes in the scale of the event because the soundtrack gives the stadium a body, a geography, and a pulse.

Another terrific example came from Michael. For the film, the team needed to recreate the overwhelming energy of a live crowd at a massive outdoor stadium concert. Rather than relying only on library material or generic crowd beds, they went to the actual football arena associated with the original performance and recorded crowd material there.

Petersen showed footage of crowds literally hopping past the microphones while chanting and screaming. In the film, the camera tracks from left to right, so he had the fans move in the opposite direction relative to the microphones, giving the sound the correct sense of motion across the frame.

Even better, the crowd was asked to hop as they moved past. This was not just for fun. The physical exertion changed their breathing, making them sound more like frenzied concertgoers rather than people standing still and pretending to be excited. It gave the recording a kind of bodily intensity that would be very hard to fake. This example perfectly captured the spirit of the talk. These were not necessarily expensive or technologically complex solutions. They were clever, practical, and deeply cinematic. They showed a kind of craft intelligence that is easy to overlook in a world increasingly focused on digital workflows and automated tools.

In the trailer, the first thing one hears is the crowd chanting “Michael.” It sounds like the front row of a massive concert, but Petersen explained that the chant began with only a dozen or so extras, recorded in an open field. Through careful recording, layering, positioning, and post-production treatment, those voices were transformed into the immediate, intimate pressure of fans pressed up against the stage. It is a perfect example of how sound post can manufacture scale without losing human specificity.

Petersen is part of the team at Cinphonic, an international audio post-production company specializing in cinematic sound for film, television, streaming, trailers, games, immersive media, and branded entertainment. The company works across sound editorial, sound design, dialogue editing, ADR, re-recording mixing, and immersive audio, including advanced formats such as Dolby Atmos. Its work reflects the same combination of creative storytelling and technical sophistication that Petersen demonstrated in his FMX presentation.

What made the session so memorable was not just the technical content, but the reminder that great sound post-production is often about imagination as much as technology. It is about understanding how a sound should feel, and finding the most inventive way to make that experience real.

FMX is always valuable for the talks you plan to attend. But sometimes, the real magic comes from the sessions you wander into almost by accident. Petersen’s talk was exactly that kind of discovery, and it was a brilliant reminder that cinema does not only come alive through what we see. Just as often, it comes alive through what we hear.

ASAF: Apple’s new immersive audio format beyond the cinema speaker model

Apple Immersive Video has already made a strong case for the importance of picture quality, scale, and presence in spatial media. But as anyone who has worked in immersive production knows, the image is only half the experience. The sound field is what often convinces the brain that the viewer is really somewhere else.

With Apple Spatial Audio Format, or ASAF, Apple is now providing a more precise production and delivery framework for immersive audio. Unlike a cinema speaker system, ASAF not only places sound spatially, it does so knowing how your head is oriented. To the quality of the recording, it adds a quality of personalised listening.

ASAF is designed to work with Apple Immersive Video and is delivered using APAC, the Apple Positional Audio Codec. Apple describes APAC as the codec developed to deliver high-resolution ASAF content efficiently, keeping bitrates low while supporting playback across Apple platforms (except watchOS!).

AirPods Pro & Max with the Fairlight console showing the demo mix

The ASAF demonstration file below, produced for fxguide by Ben Allan, is an explanatory mix, and it’s particularly useful because it strips the format back to its essentials. Rather than presenting a finished entertainment piece, the demo is designed to let you hear the basic building blocks of the format: head-locked sounds, world-locked sounds, externalised binaural sound, internalised sound, ASAF reverb, ambisonics, and positional objects.

The demo file itself is a sound mix encoded into APAC and wrapped in an MP4 file. In practical terms, it can be played in QuickTime Player on a Mac, and when monitored through compatible AirPods, the listener can hear the spatial effect much as it would be experienced inside the much more expensive Apple Vision Pro. Apple’s own documentation notes that Apple Immersive Video Utility expects audio to be imported as an MP4 containing ASAF, with fifth-order ambisonics and object-based audio recommended for immersive productions.

What makes the format interesting is not simply that sound can be placed around the listener. That has long been the promise of surround, Atmos, ambisonics, and binaural rendering. The key shift is that ASAF is designed for a viewing environment in which the listener is not merely seated in the centre of a virtual theatre. In Apple Immersive Video, the viewer’s head movement, gaze, and spatial relationship to the image all matter. Audio, therefore, needs to be stable where appropriate, responsive where necessary, and perceptually coherent with the immersive picture.

One of the simplest ideas shown in our demo is also one of the most important: the distinction between head-locked and world-locked sound. A head-locked sound remains fixed relative to the listener. It moves with you. This can be useful for narration, interface-like elements, or sounds that are intentionally internal or subjective. A world-locked sound, by contrast, remains fixed in the environment. If a voice is positioned high and to the left in the spherical panner, it should stay there as the listener turns their head.
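
The difference comes down to whether head rotation is subtracted from the source direction before the sound is rendered. The sketch below is a conceptual illustration only, not ASAF's or Fairlight's API: the function names and the angle convention (0 degrees straight ahead, positive to the listener's left) are assumptions made for the example, and only horizontal rotation (yaw) is modelled.

```python
def wrap_degrees(angle):
    """Wrap an angle into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def azimuth_at_ears(source_azimuth_deg, head_yaw_deg, head_locked):
    """Direction the listener hears the source from, relative to their nose.

    Convention (an assumption for this sketch): 0 = straight ahead,
    positive angles = to the listener's left, positive yaw = turning left.
    """
    if head_locked:
        # Head-locked: the sound rides along with the head, so yaw is ignored.
        return wrap_degrees(source_azimuth_deg)
    # World-locked: the sound stays put in the environment, so turning the
    # head shifts it the opposite way relative to the ears.
    return wrap_degrees(source_azimuth_deg - head_yaw_deg)


if __name__ == "__main__":
    voice_azimuth = 45.0  # hypothetical voice placed 45 degrees to the left
    for yaw in (0.0, 45.0, 90.0):
        world = azimuth_at_ears(voice_azimuth, yaw, head_locked=False)
        head = azimuth_at_ears(voice_azimuth, yaw, head_locked=True)
        print(f"head yaw {yaw:5.1f}: world-locked {world:6.1f}, head-locked {head:6.1f}")
```

With the head turned 45 degrees to the left, the world-locked voice sits straight ahead, while the head-locked voice is still heard 45 degrees to the left, exactly as it was before the listener moved.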

Download ASAF Spatial Audio Demo with Ben Allan

The demo makes this spatial language clear by placing a voice at different distances and locations. One example places the voice two metres away in the spherical panner. Another moves it to five metres, and another to ten metres. Other examples position the voice high-left or low-right. These are not complex scenes, but they are highly effective in demonstrating the perceptual vocabulary that production teams will need when planning an ASAF mix.
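
For a rough sense of what those distances mean physically, the sketch below applies the textbook free-field inverse distance law (about 6 dB quieter per doubling of distance) and the approximate speed of sound. It is not a description of how the ASAF renderer models distance, which this article does not cover; in practice the balance of direct sound to reverb is also a strong distance cue. The function name `distance_cues` is hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate value for dry air at about 20 °C


def distance_cues(distance_m, reference_m=1.0):
    """Free-field cues for a point source.

    Returns (level drop in dB relative to the reference distance,
             arrival delay of the direct sound in milliseconds).
    """
    if distance_m <= 0 or reference_m <= 0:
        raise ValueError("distances must be positive")
    # Inverse distance law: sound pressure falls in proportion to 1/distance.
    level_drop_db = 20.0 * math.log10(distance_m / reference_m)
    # Time for the direct sound to travel from the source to the listener.
    delay_ms = distance_m / SPEED_OF_SOUND_M_S * 1000.0
    return level_drop_db, delay_ms


if __name__ == "__main__":
    # The three voice distances used in the demo, compared with the 2 m case.
    for d in (2.0, 5.0, 10.0):
        drop, delay = distance_cues(d, reference_m=2.0)
        print(f"{d:4.0f} m: {drop:4.1f} dB quieter than at 2 m, arrives after {delay:4.1f} ms")
```

Under these assumptions, the voice at ten metres is roughly 14 dB quieter than the same voice at two metres, and its direct sound arrives about 29 ms after it was emitted.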

 

Voice set to 2m in the Spherical Panner
Voice set to 10m in the Spherical Panner

 

This matters because immersive audio is still a relatively new territory for many picture-led production teams. In conventional film and television workflows, a director may speak in terms of front, rear, left, right, centre, music, dialogue, and effects. In ASAF, the conversation becomes more spatially nuanced. Is the sound attached to the viewer, to the world, or to the frame? Should it feel externalised in the environment or internalised in the listener’s head? Should the music sit in a theatre-like frontal position, or should it occupy a more spatial field?

The demo also highlights stereo music placed in a front “theatre” position, with identifiable left and right speakers. This is a useful reminder that immersive does not mean everything must surround the viewer. In fact, one of the challenges of immersive sound design is restraint. A mix can be spatially sophisticated while still preserving familiar cinematic conventions when they are appropriate. Music may sometimes work best as a frontal presentation, while environmental effects, point-source sounds, and spatial reverb provide the immersive cues.

Stereo music in the front “theatre” position with identifiable left & right speakers

ASAF brings together object-based audio and ambisonic approaches. Apple’s developer material describes the workflow as combining the Apple Spatial Audio Format with APAC delivery, and Apple has also released tools such as the ASAF Production Suite for Pro Tools through its developer portal. Blackmagic Design has also demonstrated ASAF workflows in Fairlight, including how to set up, mix, spatialise, and deliver audio using APAC codecs.

For creators, this raises an important production point: ASAF is not just a delivery codec or a consumer playback trick. It is a format that asks sound teams to think spatially from the start. Much like stereo photography, high-frame-rate capture, or immersive camera staging, the best results are unlikely to come from treating the format as an afterthought at the end of post. The sound field needs to be authored with the same attention to viewer comfort, spatial continuity, and dramatic focus as the picture.

The demo notes also point to the role of ASAF reverb. Reverb is especially important in immersive work because it helps define the perceived space. A dry sound object can be precisely located, but without a convincing spatial acoustic context it may feel disconnected from the image. Conversely, too much reverb or badly matched environmental response can collapse the illusion. In immersive media, reverb is not simply a polish pass; it is part of the spatial scene construction.

This is why ASAF may be particularly important for Apple Immersive Video. AIV is not simply “video on a headset.” It is a production format built around presence, scale, and perceptual fidelity. Apple’s own sessions on Apple Immersive Video emphasise the engineering required to preserve fidelity, world scale, and viewing comfort. The audio format has to carry the same burden. If the viewer turns their head and the picture behaves spatially but the sound does not, the illusion weakens immediately.

The most useful aspect of the ASAF demo is that it provides a shared language. Producers and directors do not need to become immersive audio engineers, but they do need to understand the creative implications of the format. A director should be able to ask whether a voice should be world-locked or head-locked. A producer should understand why additional time may be needed to test a mix on actual playback hardware. An editor should understand that cutting immersive picture may also involve preserving spatial audio continuity.

Head-locked Internal voice over displayed in the Fairlight Space View

Many of the concepts are technically sophisticated, but they are immediately understandable when heard. That is perhaps the most encouraging aspect of the format. Once you, as a listener, experience the difference between a head-locked sound, a world-locked object, and a spatially placed voice at different distances, the terminology starts to become intuitive.

Settings for an ASAF Object in Fairlight

ASAF is still part of a developing production ecosystem, but it is already clear that it represents more than a new file type. It is part of the broader move from screen-based media toward spatial media, where image, sound, viewer position, and embodied perception all interact. For Apple Immersive Video to fully work, the audio cannot merely accompany the picture. It has to inhabit the same world.

 
