Canny AI: Imagine world leaders singing

Deep Learning is really starting to establish itself as a major new tool in visual effects. The tools are still in their infancy, but they are already changing the way visual effects can be approached. Instead of a pipeline consisting of modelling, texturing, lighting and rendering, these new approaches hallucinate, or plausibly create, imagery based on training data sets.

Machine Learning, the superset of Deep Learning and similar approaches, has had great success in image classification, image recognition and image synthesis. At fxguide we covered Synthesia in the UK, a company born out of research first published as Face2Face. Synthesia are seeking to address existing production problems in language dubbing and ADR. ‘Native dubbing’ is their new method of translating video content that utilises AI or Machine Learning to synchronise the lip movements of an actor to a new dialogue track by a different actor.

The President of the People’s Republic of China, Xi Jinping, singing ‘Imagine’.

Now a new company has emerged aimed at a similar market but with a different technical implementation. Canny AI is launching their VDR™ (Video Dialogue Replacement) process to replace the dialogue in any footage. To demonstrate the approach, they released the video above, which shows world leaders singing John Lennon’s ‘Imagine’.

Canny AI is an early stage startup in Tel Aviv, Israel. The company’s two founders are both ex-Israeli Army. Omer Ben-Ami explained that he was a “software developer in the army unit here and then the intelligence unit, until I did a PhD in Theoretical Physics.” He was also a developer and worked in Israeli start-ups. His co-founder is Jonathan Heimann, who explained that he “studied computer science in Tel Aviv University and then I joined the army for over six years.”

3% is a Brazilian dystopian thriller web television series created by Pedro Aguilera. In Israel, the show is dubbed from the original Portuguese and is very well received. “I was watching this TV show, and it’s very popular in Israel. But the experience was very bad,” recalled Ben-Ami. “Then we started to see what people were publishing in this area and at that time, there wasn’t too much. Just mainly the Obama paper by the University of Washington. That is when we basically started seeing what we could do, and how we could create the perfect lip sync experience.”

University of Washington Inspiration

Above is the original University of Washington video, which demonstrates the tool developed by computer vision researchers, led by Supasorn Suwajanakorn, to create realistic video from audio files alone. In this example, the team created realistic videos of Obama speaking in the White House, using audio from a television talk show and from an interview recorded decades ago. The paper was presented at SIGGRAPH 2017. The team chose Obama because the machine learning technique needs available video of the person to learn from, and there were hours of presidential videos in the public domain. For this video, Suwajanakorn used 14 hours of Obama footage.

This approach used a recurrent neural network to convert audio into key mouth shapes (a sparse set of shape coefficients). The team then synthesized the texture, enhanced details such as the teeth, and composited the new mouth onto the head and background of the source video. This is a highly complex problem, as often your mouth moves before you actually say a word, so it is not enough to condition the mouth shape on ‘past’ audio input – the neural network needs to look into the future.
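For readers curious how that audio-to-mouth mapping might look in code, below is a minimal sketch, in PyTorch, of the general idea only: a small recurrent network maps per-frame audio features to a sparse set of mouth-shape coefficients, with a short time delay so each prediction can also ‘see’ a few future audio frames. The class name, feature counts and layer sizes here are illustrative assumptions, not the UW implementation.

import torch
import torch.nn as nn

class AudioToMouthShape(nn.Module):
    # Illustrative sketch: per-frame audio features in, sparse mouth-shape coefficients out.
    def __init__(self, n_audio_feats=28, n_shape_coeffs=18, hidden=60, time_delay=20):
        super().__init__()
        self.time_delay = time_delay  # how many future audio frames each output may use
        self.lstm = nn.LSTM(n_audio_feats, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_shape_coeffs)

    def forward(self, audio_feats):
        # audio_feats: (batch, T, n_audio_feats), e.g. MFCC-like features per video frame
        h, _ = self.lstm(audio_feats)
        coeffs = self.out(h)  # (batch, T, n_shape_coeffs)
        # Dropping the first 'time_delay' outputs aligns the coefficient for frame t with
        # a hidden state computed after reading audio up to frame t + time_delay,
        # mirroring the need to 'look into the future'.
        return coeffs[:, self.time_delay:, :]

model = AudioToMouthShape()
dummy_audio = torch.randn(1, 300, 28)   # ~300 frames of audio features
mouth_coeffs = model(dummy_audio)       # shape (1, 280, 18)

A later stage would turn those coefficients into a mouth texture and composite it back onto the target footage, which is where the layer blending described below comes in.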

The UW 2017 paper focused on the “mouth, chin, cheeks, and area surrounding the nose and mouth”, with some jaw correction. The rest of Obama’s appearance (eyes, head, torso, background) comes from the stock footage of Obama. The UW final solution was a composite produced by pyramid blending of four layers that are blended in the following order from front to back:

  1. Lower face texture (excluding the neck),
  2. Torso (shirt and jacket),
  3. Neck, and
  4. The rest.

Parts 1 and 3 come from the synthesized texture, while parts 2 and 4 come from the target frame.
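As a rough illustration of what such pyramid blending looks like in practice, here is a minimal sketch in Python using OpenCV and NumPy that blends one layer over another with a soft mask via Laplacian pyramids. Repeating it back to front over the four layers listed above would give a composite in the spirit of the UW approach; the function name and pyramid depth are assumptions for illustration, not the UW code.

import cv2
import numpy as np

def pyramid_blend(fg, bg, mask, levels=5):
    # Blend 'fg' over 'bg' using a soft mask, via Laplacian pyramid blending.
    # All three inputs must share the same height and width.
    fg = fg.astype(np.float32)
    bg = bg.astype(np.float32)
    mask = mask.astype(np.float32)
    if mask.ndim == 2:
        mask = cv2.merge([mask] * 3)  # match the three colour channels

    # Gaussian pyramids of the two layers and the mask.
    gp_fg, gp_bg, gp_mask = [fg], [bg], [mask]
    for _ in range(levels):
        gp_fg.append(cv2.pyrDown(gp_fg[-1]))
        gp_bg.append(cv2.pyrDown(gp_bg[-1]))
        gp_mask.append(cv2.pyrDown(gp_mask[-1]))

    # Laplacian pyramids: each level minus the upsampled next level.
    def laplacian(gp):
        lp = []
        for i in range(levels):
            size = (gp[i].shape[1], gp[i].shape[0])
            lp.append(gp[i] - cv2.pyrUp(gp[i + 1], dstsize=size))
        lp.append(gp[levels])
        return lp

    lp_fg, lp_bg = laplacian(gp_fg), laplacian(gp_bg)

    # Blend each level with the mask at that level, then collapse the pyramid.
    blended = [m * f + (1 - m) * b for f, b, m in zip(lp_fg, lp_bg, gp_mask)]
    out = blended[-1]
    for i in range(levels - 1, -1, -1):
        size = (blended[i].shape[1], blended[i].shape[0])
        out = cv2.pyrUp(out, dstsize=size) + blended[i]
    return np.clip(out, 0, 255).astype(np.uint8)

The benefit over a straight alpha blend is that low frequencies (skin tone, lighting) mix over a wide area while fine detail keeps a crisp edge, which helps hide the seam between synthesized and original texture.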

Face2Face

The Canny AI team also saw Face2Face: Real-time Face Capture and Reenactment of RGB Videos, which was another landmark development in face replacement technology. This was first shown in the Emerging Tech Hall at SIGGRAPH 2016, but at that time, “Face2Face wasn’t very accurate in terms of the lip sync. So we saw potential in improving that area,” commented Ben-Ami. Face2Face‘s Prof. Matthias Niessner would go on to form Synthesia.

The Imagine Video

To promote Canny AI’s VDR approach, the team have released a video (top of the story). The video was made purely as a demonstration, but with the hope of conveying a positive message: this general style of AI processing need not be associated only with the ethical issues surrounding ‘Deep Fakes’.

“There’s a lot of hype on that, around the fake news with this technology, and we wanted to do something with a strong unifying message, to show some positive uses for this technology.” – Co-founder Omer Ben-Ami

VIDEO: The source that drove the lips of Kim Jong-un and Justin Trudeau? “That would be me,” exclaims Jonathan Heimann. “That would be me singing or lip syncing ‘Imagine’, very badly!” (see above).

The Kim Jong-un lip sync was composited traditionally into a stock library clip of an iPhone.

The ‘Imagine’ video was made using stock library clips for the ‘global audiences’, into which the treated video was composited using traditional methods. Israeli studio The Hive donated their time and resources to do the compositing and editing of the final song.

The Canny AI team produced long clips of the various world leaders singing most of the song, and then the editors decided which leader would work for which part of the edit. In reality, a lot more material was generated than was needed. This reflects the fact that training the network takes time, but once it is working, the actual process is very quick.

In fact, speaking in rough terms, if a 30-minute training video needed to be converted to another language, the team estimate they could easily turn it around in a couple of days, assuming the same presenter is used throughout.

Canny AI

As a result of their interest in accurate lip sync, for the last year Canny AI have been working on developing their VDR program. Their offering is an end-to-end solution that would allow for the:

  • Dubbing of TV shows
  • Reuse of existing footage
  • Conversion of training videos to different languages

The company is now actively taking on projects, bidding on jobs and doing key tests for some major potential clients.

Unlike some earlier methods, the process does not require hours and hours of training material, and the team have focused on the key complex problems that occur in any such setup:

  1. Temporal flicker
  2. Problems with perspective during head turning
  3. Lighting changes and matching

At the moment they are leaving aside the issue of occlusion, assuming more traditional methods will provide the solution to these special cases.

The training data the team requires is much less than for some other methods. In the case of a shot of President Trump singing, filmed from the side (which did not make the edit), Ben-Ami commented that, “in that case I think there was only 60 frames but that was enough for us to be able to recreate his lips.” This is because the team already had enough training footage of other lips from that angle, and so “the idea is if you have enough samples of lips from that angle, and you want new lips, as long as they are in the same space as the AI saw prior – that should work.”

To solve the lighting, the process works scene by scene, adjusting for each. For temporal flicker, the team worked long and hard to develop special IP that addresses the issue. Their results don’t flicker, and in fact, in the sample video, no special post production or visual effects were done on the imagery of the leaders themselves. While the whole frame may have been composited into various stock footage shots, the actual singing clip is as it came out of the process and was not separately treated in AE, Flame or Nuke. For Canny AI, the target is very high quality lip sync with no post processing to hide flicker or glitches.
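Canny AI’s flicker fix is proprietary IP, so purely as an assumed illustration of the kind of frame-to-frame consistency the problem demands, here is the simplest temporal smoothing one might try on a generated mouth-region sequence: an exponential moving average over successive frames. This is not their method; it just makes the problem concrete, and it also shows why naive smoothing is a poor substitute, since too much of it blurs the very lip motion you are trying to preserve.

import numpy as np

def smooth_sequence(frames, alpha=0.6):
    # Exponentially blend each generated frame with the smoothed previous frame.
    # Higher alpha keeps more of the current frame (less smoothing, more flicker);
    # lower alpha smooths more but starts to lag and blur the lip motion.
    smoothed = [frames[0].astype(np.float32)]
    for frame in frames[1:]:
        smoothed.append(alpha * frame.astype(np.float32) + (1 - alpha) * smoothed[-1])
    return [np.clip(f, 0, 255).astype(np.uint8) for f in smoothed]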

The company is only two people right now, Omer Ben-Ami and Jonathan Heimann, but they have two key advisors: Dr. Uri Shaham and Michael Hamilton. “It’s the two of us and we have two advisors. One of them did their PhD at Yale in Statistics and the other one is from the movie industry. He works in business development in audio post production,” explains Ben-Ami.

LEFT (NEW): re-animated with the voice of the South Korean president. RIGHT (SOURCE): the original footage.

The technology, of course, works both ways. Above is the South Korean president, Moon Jae-in. What is significant in this test is the use of footage from different angles, and the robustness of the system to the angle of the source video.