NVIDIA team wins Real-Time Live

Best in Show

NVIDIA’s “I am AI: AI-driven Digital Avatar Made Easy” won Best in Show at Real-Time Live.

NVIDIA researchers won the Best in Show award at SIGGRAPH 2021 with real-time digital avatar technology. Real-Time Live, one of the most anticipated events at the world’s largest computer graphics conference, was held virtually this year. Each year the show celebrates cutting-edge real-time projects spanning game technology, augmented reality and scientific visualization.

In the demo, two NVIDIA research scientists played the parts of an interviewer and a prospective hire speaking over video conference. During the call, the interviewee showed off the capabilities of AI-driven digital avatar technology to communicate with the interviewer. The demo used an NVIDIA RTX laptop and a desktop workstation powered by RTX A6000 GPUs. The entire pipeline can also be run on GPUs in the cloud.

We previously previewed their entry and also spoke to Ming-Yu Liu, Distinguished Research Scientist, and Koki Nagano, Senior Research Scientist. Also taking part in the Real-Time Live demo were Arun Mallya, Senior Research Scientist; Kevin Shih, Research Scientist; and Bryan Catanzaro, Vice President.

The Real-Time Live demo was an end-to-end demonstration of research projects combined to show the creation of a conversational digital human for video conferencing. What starts as a demonstration of very fast, plausible face creation builds to a final setup in which someone simply types their responses while a digital version of themselves speaks the words in their own voice. “There are a couple of key technologies that are combined together to enable the demo,” comments Liu. “There is a technology that lets you take a photo of a target person and a video and transfer the motion in the video: Vid2Vid Cameo. Another technology is Audio2Face, which lets you create a video of a digital person talking, and the motion is created from the audio.” The team went further and had the final animated face driven by text alone. The text was used to create a plausible version of the presenter’s voice: using NVIDIA’s RAD-TTS model, the typed messages replaced the audio fed into Audio2Face. In fact, the team even had the face singing!
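To make that flow concrete, here is a minimal sketch of how the three stages might be chained together. The function names and interfaces are illustrative assumptions standing in for RAD-TTS, Audio2Face and Vid2Vid Cameo; they are not NVIDIA’s actual APIs.

```python
# Minimal sketch of the demo's data flow, not NVIDIA's actual code.
# text_to_speech, audio_to_face and transfer_motion are illustrative
# stand-ins for RAD-TTS, Audio2Face and Vid2Vid Cameo respectively.
import numpy as np


def text_to_speech(text: str, voice_embedding: np.ndarray) -> np.ndarray:
    """Assumed RAD-TTS-style stage: typed text -> a waveform in the speaker's voice."""
    raise NotImplementedError  # placeholder for the trained TTS model


def audio_to_face(waveform: np.ndarray) -> np.ndarray:
    """Assumed Audio2Face-style stage: waveform -> per-frame facial motion."""
    raise NotImplementedError  # placeholder for the audio-driven animation model


def transfer_motion(source_photo: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Assumed Vid2Vid Cameo-style stage: one photo + motion -> rendered video frames."""
    raise NotImplementedError  # placeholder for the neural renderer


def type_to_talking_head(text: str, photo: np.ndarray, voice: np.ndarray) -> np.ndarray:
    """Chain the stages: typed reply -> synthesized speech -> animated face video."""
    audio = text_to_speech(text, voice)
    motion = audio_to_face(audio)
    return transfer_motion(photo, motion)
```

The point of the chain is that each stage only needs the output of the previous one: a single photo, plus whatever the user types, is enough to produce the final video.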

“I think one of the goals was to combine all these different pieces to achieve something that no one could have imagined,” says Liu, “and so one end application is to make end-to-end digital character creation easier… but various parts have their own interesting applications in things like video conferencing.”

Real-Time Live Chair Chris Evans (left) and, broadcasting live from NVIDIA’s Silicon Valley headquarters (right), the NVIDIA Research team presenting a demo of lifelike virtual characters for projects such as bandwidth-efficient video conferencing and storytelling.

The final face that was manipulated live was a neural render; it was not using the Omniverse renderer. While the demo was not a direct Omniverse demo, the Real-Time Live event did overlap with R&D that is part of NVIDIA’s big push with Omniverse. Another aspect of the technology the team showed was using a version of StyleGAN to allow the pipeline to work on something other than a real photo. The end-to-end conversational digital human can be built from an image of a completely fabricated digital human or a stylized, cartoon-style rendering. This is important because faces generated from statistical training data are normally not temporally consistent and would jitter if animated. The new approach does not have this problem: once the new face is inferred, it is treated as a still photograph and animated seamlessly.
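A short sketch of how a generated face could slot into the pipeline above; the generator interface here is an assumption standing in for a StyleGAN-style model.

```python
# Sketch of feeding a generated (non-real) face into the same pipeline.
# generate_face is an assumed stand-in for a StyleGAN-style generator.
import numpy as np


def generate_face(latent: np.ndarray) -> np.ndarray:
    """Assumed generator interface: latent code -> one synthetic face image."""
    raise NotImplementedError  # placeholder for the trained generator


def make_avatar_photo(seed: int = 0) -> np.ndarray:
    # Sample the face exactly once; downstream animation treats the result as an
    # ordinary still photograph, which is why the avatar stays temporally stable.
    latent = np.random.default_rng(seed).standard_normal(512)
    return generate_face(latent)
```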

Part of the technical success of the demo was that the process did not require specialist training data or range-of-motion (ROM) footage to be filmed before the face could be inferred and animated. A StyleGAN face is just a single frame, so there is no stock library footage or any moving footage to use as training data.

The reason the demo works particularly well for video conferencing, says Nagano, is that the system does not just render out inferred frames and then transmit them as video. Instead, it encodes a vastly smaller data stream and infers the results further down the pipeline. This means the video quality is much higher than simply streaming video: while normal video may appear blocky and stutter, the Real-Time Live demo looks remarkably normal.
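One way to picture this is as a sender/receiver pair, under the assumption that the compact stream is a set of per-frame facial keypoints and that a single reference photo is shared once at call setup. The Packet type and all function names below are illustrative, not NVIDIA’s API.

```python
# Illustrative sender/receiver pair for keypoint-based video compression as
# described above; the Packet type, the keypoint representation and all
# function names are assumptions, not NVIDIA's API.
from dataclasses import dataclass

import numpy as np


@dataclass
class Packet:
    keypoints: np.ndarray  # a few dozen floats per frame instead of a full image


def extract_keypoints(frame: np.ndarray) -> np.ndarray:
    """Sender side: compress a camera frame to a compact pose/expression code."""
    raise NotImplementedError  # placeholder for the keypoint encoder


def reenact(reference_photo: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """Receiver side: regenerate a full frame from the reference photo and the code."""
    raise NotImplementedError  # placeholder for the generative decoder


def send(frame: np.ndarray) -> Packet:
    # Only the keypoints travel over the network each frame; the reference photo
    # is shared once at call setup, which is where the bandwidth saving comes from.
    return Packet(keypoints=extract_keypoints(frame))


def receive(packet: Packet, reference_photo: np.ndarray) -> np.ndarray:
    return reenact(reference_photo, packet.keypoints)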

The NVIDIA team of researchers combined four separate AI models into one impressive demo, showing state-of-the-art streaming digital avatar technology, to win 2021’s Real-Time Live.

Audience Award

Broadcasting live from the Animatrik Film Design mocap stage, the winner of the audience vote was LiViCi Music: a real-time digital circus performance combining the complex, death-defying motion of live acrobats with real-time performance capture and rendering, all choreographed to a brilliant original poem. The presentation was by Athomas Goldberg, Creative Director of Shocap Entertainment Ltd, with lifelike and believable animation design by Samuel Tetreault, Artistic Director of Les 7 Doigts / The 7 Fingers.

Congrats to all the teams and to Chris Evans for chairing such a successful event.