The President of the People’s Republic of China, Xi Jinping, singing ‘Imagine’.

In April, fxguide reported on a new company, Canny AI, which was just starting to use Machine Learning in its VDR (Video Dialogue Replacement) process for replacing dialogue. To demonstrate the approach, they released the video (right), which shows world leaders singing John Lennon’s ‘Imagine’.

The company is now offering this service online via its website. This is particularly interesting during the COVID-19 crisis, when producing new original material is extremely difficult. With Canny AI’s VDR, professional, original scripted dialogue can be produced, to a point, without any need to mount a full shoot.

VDR differs from ‘Deepfakes’ in that it replaces someone’s face, say that of fxguide’s Mike Seymour, with a new version of the same face saying something different. It is not putting a different face on a body, which is the hallmark of a deepfake.

VDR is part of a cluster of neural rendering technologies that use deep generative models. It works differently from traditional computer graphics, which recreate a face in 3D using surface geometry, material definitions, light sources, and animation. These light transport solutions use the rendering equation to produce photoreal results. In contrast, VDR synthesizes content rather than trying to simulate the physics of the real world and sample it. There are many neural rendering approaches whose primary goal is generating controllable photo-realistic imagery via machine learning. VDR is also different from compositing or image processing approaches that manipulate real-world captured footage to mimic different results, such as morphing, camera mapping, and direct face replacement compositing. Like many deep learning approaches, the VDR process requires training data.
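For reference, the rendering equation that traditional light transport renderers solve is usually written in the following form. This is the standard textbook formulation (due to Kajiya); the symbol names are conventional and are not taken from Canny AI's material.

```latex
L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\,
    L_i(\mathbf{x}, \omega_i)\,
    (\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i
```

Here $L_o$ is the light leaving surface point $\mathbf{x}$ in direction $\omega_o$, $L_e$ is emitted light, $f_r$ is the material's reflectance function (BRDF), $L_i$ is incoming light from direction $\omega_i$, and $\mathbf{n}$ is the surface normal. A neural rendering approach such as VDR does not evaluate this integral; it learns to produce plausible pixels from example footage instead.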

A typical neural rendering approach takes input images corresponding to certain scene conditions and builds a “neural” scene representation. From this, it can ‘hallucinate’ or render a new scene with synthesized novel images. The VDR process is greatly helped by placing a new, lip-synced version of the same face onto the original source face, which naturally matches in fundamental shape and form, even if it is seen from a different angle and was originally saying something completely different. A great summary of the state of the art in neural rendering, including many complex technical aspects, can be found here, in a paper just published at Eurographics 2020.

We test drove the Canny AI service to work out the tips and tricks of this specialist form of visual effects, and we spoke to the founders about their additional new service offering a range of avatar actors as part of their AI Talent Agency.

Here is our Mike vs. Mike test VDR project:

Canny AI has built its VDR program so that one can just upload clips and have them processed remotely. Their offering is an end-to-end solution that allows for:

  • Dubbing of TV shows
  • Reuse of existing footage
  • Conversion of training videos to different languages

The process is semi-robust: it can handle most VDR challenges, but it is not designed for complex action sequences, dramatic actor interactions with multiple facial occlusions, extreme lighting, or other more dramatic cinematic sequences. It is currently ideal for dialogue to camera of the kind seen in commercials, training, or corporate communications.

Tips and Tricks

In our clip above fxguide’s Mike Seymour talks to himself in two different languages, with his face ‘controlled’ by two other people. While this is simple in theory, in practice there are some interesting new aspects about how one should film and edit the material.

VFX: To make our test video we used standard VFX techniques. Mike was filmed twice, with C-stands in the shot for eye line, and the clips were then rotoscoped using standard Nuke tools to produce one base edit.

What to film: For a sequence such as this, one needs to provide footage of the hero subject (Mike) and the new driving dialogue from our two additional actors. Additionally, one needs to provide two minutes of training data for Mike and for each of the two extra actors.

Training Data: The system trains on all the principals, but exactly what they are filmed saying is irrelevant. The only consideration is having clean footage with, ideally, similar lighting and head angles. The footage should avoid motion blur, odd lighting, or shadowing, and should have just one face in frame at a time or be cropped to achieve this. We shot the material at 4K, allowing for a reframing blow-up for a cutaway single shot. We also anticipated that the Machine Learning image processing could slightly soften the output face.

The edit: It is very tempting to edit the dialogue of the performance or face that will take over, but this is pointless. The editor needs to edit for the body language and not for what is being said. In practice, clips can be partially reused in the edit, as the process will change the hero actor’s face and thus each clip will seem unique. Similarly, the head movements of the actors driving the new VDR are immaterial, as the head movements in the final shot are those of the original performance. What is also kept is the eye line of the original performance. In other words, Mike’s eyes and head movements were blended with the new puppet version of Mike’s head saying the new lines.
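The footage guidelines above can be turned into a simple pre-upload checklist. The following is a minimal sketch only: the `TrainingClip` fields, the `check_training_clips` function, and the idea of checking clips before upload are all hypothetical and are not part of Canny AI's service. The two-minute threshold comes from the requirement described above.

```python
from dataclasses import dataclass

# The article asks for roughly two minutes of training footage per person.
MIN_TRAINING_SECONDS = 120.0


@dataclass
class TrainingClip:
    """Hypothetical description of one training clip (not a Canny AI format)."""
    person: str
    duration_seconds: float
    max_faces_in_frame: int
    has_motion_blur: bool = False


def check_training_clips(clips):
    """Return a list of problems; an empty list means these basic checks pass."""
    problems = []
    total_per_person = {}
    for clip in clips:
        total_per_person[clip.person] = (
            total_per_person.get(clip.person, 0.0) + clip.duration_seconds
        )
        # Only one face should be in frame, or the clip should be cropped.
        if clip.max_faces_in_frame != 1:
            problems.append(f"{clip.person}: crop the clip to a single face")
        if clip.has_motion_blur:
            problems.append(f"{clip.person}: avoid motion blur")
    for person, total in total_per_person.items():
        if total < MIN_TRAINING_SECONDS:
            problems.append(f"{person}: only {total:.0f}s of training footage")
    return problems


clips = [
    TrainingClip("Mike", 130.0, 1),
    TrainingClip("Driving actor A", 60.0, 2),
]
for problem in check_training_clips(clips):
    print(problem)
```

Lighting and head-angle similarity are left out of the sketch because they are judgment calls that are easier to make by eye than to encode as a threshold.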

Head-turns: Most Machine Learning or deepfake examples online use footage where the actor is facing the camera directly, perhaps delivering a message at a press conference, but there is no requirement for this. While head turns are more difficult, they are possible so long as there is adequate training material. What is more difficult is occlusion, from someone touching their face, or putting on or taking off glasses.

It is important to recall that the process does not understand the words that are being spoken. The way the team designed the process means that the VDR uses the audio files directly. The audio is not first converted to text, as is commonly done by speech recognition systems such as Alexa or Siri. This is important since it is the actual sounds that drive the process, which means that strong regional accents or other languages are not a problem.
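To illustrate what driving a process from the sound itself, rather than from recognized words, can look like, here is a minimal sketch that reduces raw audio samples to one loudness value per video frame. It is not Canny AI's method: the function name `frame_energies`, the RMS feature, and the 25 fps frame rate are assumptions chosen for illustration, and a real system would presumably use much richer audio features. The point is only that nothing in it depends on language or accent.

```python
import math


def frame_energies(samples, sample_rate, fps=25.0):
    """Return the RMS loudness of the audio under each video frame.

    samples: mono audio as a list of floats in [-1, 1]
    sample_rate: audio samples per second
    fps: video frames per second (assumed value, for illustration)
    """
    window = int(round(sample_rate / fps))  # audio samples per video frame
    if window <= 0:
        raise ValueError("sample_rate must be at least fps")
    energies = []
    # Step through the audio one video frame at a time, dropping any tail
    # that is shorter than a full frame.
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        energies.append(math.sqrt(sum(s * s for s in chunk) / window))
    return energies


# Synthetic test signal: one second of silence, then one second of a 220 Hz tone.
sample_rate = 8000
silence = [0.0] * sample_rate
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate)
        for n in range(sample_rate)]
features = frame_energies(silence + tone, sample_rate)
print(len(features), "frames of audio features")
```

No speech-to-text step appears anywhere in the sketch: the features are computed from the waveform alone, so speech in any language, or no words at all, produces a usable per-frame signal.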

Canny AI Talent Agency

Canny AI is also launching a virtual talent agency. The principle is a cross between the VDR work above and stock footage, except that each clip would be unique. With stock footage, the price per clip is much lower since the footage is already shot, but of course it is always the same content. With the Canny AI technology, each ‘stock footage’ clip of a hero actor can deliver unique dialogue specific to one’s project.

“To create high-quality productions you need a studio with professional cameras, color grading, etc,” says co-founder Omer Ben Ami. “This can be quite expensive but you could just reuse the same content over and over again, like any stock footage, but with the actual dialogue you want”. Canny AI is now working with commercial partners to create content that they can sell like stock footage, but with the correct rights management and quality control.

Quality Control works on three levels:

  • Canny AI has image process quality control that ensures the material looks correct, and the performer does not look visually wrong or odd, as this could both disappoint the client and damage the reputation of the actor.
  • Rights management ensures that the client obtains the rights to use the newly altered video commercially.
  • Finally, the talent in the clip has the right to opt out of content, for example, if a vegetarian did not want to promote a hamburger product.

“We monitor that the videos follow the guidelines, we check the person is not a celebrity, or if they are then the footage is cleared to be used. Also that the material is not, say, malicious and that people can opt-out of their faces being used for certain types of content,” Ben Ami explains.

The Talent Agency initially provides the newly altered clip with a watermark. If approved, the master clip can then be downloaded. The whole process is handled remotely.

The process can also be used with talent no longer alive. Below is Richard Nixon delivering a speech that was never filmed. It is part of a Nixon VDR documentary project. In Event of Moon Disaster premiered on 22nd November 2019 at the International Documentary Film Festival Amsterdam (IDFA) as part of the IDFA DocLab program. The piece aimed to illustrate the possibilities of Neural Rendering technology (sometimes referred to as deepfake technologies). This project reimagined what would have happened if the 1969 Apollo 11 mission had gone wrong and the astronauts had not been able to return home. A contingency speech for this possibility was prepared but never delivered. Using the VDR process, President Nixon is now recreated delivering this previously unseen speech.

While the Mike vs Mike project above did not require a voice actor or voice synthesis, the Nixon project did require synthesized speech. Audio is naturally a key aspect of this area of Neural Rendering and VDR applications.

“We did the Nixon piece with MIT. It was basically a speech of Nixon that he never really said. We worked with another company called Respeecher who did the speech conversion of Nixon’s voice,” recalls Ben Ami. “I think that’s a really good example of the cutting edge both for video and audio for that type of application.”


Respeecher applies deep learning to speech processing for a spectrum of markets. “Our prototype first product allows you to speak with the voice of someone else e.g., a famous person,” CTO and co-founder Dmytro Bielievtsov commented online. Respeecher is a VC-backed start-up creating a system focused on very high-quality output and usability for demanding applications such as TV, movies, and gaming. The team of twelve is based in Kyiv in northern Ukraine and is expanding. Their deep learning models are implemented in PyTorch, an open-source machine learning library based on the Torch library that is used for applications such as computer vision and natural language processing and has been actively developed since 2016. Respeecher uses CUDA-capable Nvidia GPUs.

All that the Respeecher process needs to train their system are samples of the source voice (the voice of the actor who will perform the role) and the target voice (the voice that will be replicated). Not just the content, but all of the emotional content, the flow speed, peculiarities of word pronunciation, and part of the accent are transferred from the source voice. The neural network combines the voice of the target speaker with the performance of the source speaker.

The technology is not able to synthesize voice in real-time yet. The Respeecher process currently requires a lot of time to synthesize the new voice, but Alex Serdiuk, the CEO, stated publicly last year that real-time processing, “is a clear and predictable engineering problem and we plan to solve it within 6 months.”

Looking Ahead

Canny AI is also working on using similar Machine Learning approaches to produce plausible footage of someone saying arbitrary dialogue from just an audio file. “Obviously, for now, we use video clip inputs. We think the results, at least for now, are the most accurate,” explains Ben Ami. “We are working on doing something similar from just audio, and doing it directly from audio basically also means that eventually, we could go directly from text to speech and generate a video from a written script.” For Canny AI the issue is less whether this will be possible than when the quality might be high enough to use. “We are working on this and as the technology improves we think we’re going to be able to incorporate it.”


Thanks: Special thanks to Wing Yiu and Nina Harding for their help in being a part of this test project.