The COVID-19 pandemic has pushed companies to accelerate advances in video conferencing and the use of AI and machine learning. Two companies at the forefront of these advances are Pinscreen and NVIDIA. Both are using advanced machine learning and generative adversarial networks (GANs) to push what is possible, while approaching the problem from very different points of view.
At SIGGRAPH, Pinscreen presented two advances in real-time graphics at Real-Time Live (RTL). The first, Monoport, a system for virtual body scanning from a single camera, jointly won Best of Show (see our story here). The second demo was of their fully digital agent, Frances, who was interviewed live during the event.
At fxguide, we were keen to test drive a video chat with Digital Frances ourselves, but we figured it would be more fun to have Digital Michael interview Digital Frances. In this exclusive video below, we speak to Founder Hao Li, and a neural rendered Digital Michael interviews Digital Frances.
Digital Michael is a real-time UE4-rendered digital avatar with an additional GAN-generated face layered on top, driven by the underlying UE4 facial animation. Digital Frances is a fully autonomous virtual assistant, also rendered in real time in UE4, who answers questions live. This includes her simulated digital hair and her spontaneous, unscripted responses.
Pinscreen uses their own in-house technology, built around their paGAN software, to power their UE4 characters and avatars. Their AI technology is independent of NVIDIA’s new Maxine software covered below, though it does run on NVIDIA GPU hardware.
New AI breakthroughs have been shown as part of NVIDIA Maxine. This is a suite of new software developed to enhance video conferencing.
The announcement covers a set of new AI SDKs and innovations around video conferencing. For example, a cloud-native video streaming AI SDK dramatically reduces bandwidth use while making it possible to re-animate faces, correct gaze, and animate characters for more immersive and engaging meetings.
Face Reanimation: Gaze correction
One of the biggest problems with video conferencing is that one looks at the screen, not the camera, so no one is making eye contact. By streaming only neural network key-point data and audio from the speaker, rather than full video, NVIDIA can simulate much higher quality video conferences and also solve the issue of eyeline. It does this by inferring where you should be looking and using neural rendering to rebuild your face, rather than just streaming video. This AI video compression reduces the bandwidth by as much as 90%, to just one-tenth of H.264.
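To see why streaming key points instead of pixels saves so much bandwidth, a rough back-of-the-envelope calculation helps. All figures below (130 landmarks, float16 coordinates, a 1500 kbps H.264 call) are illustrative assumptions, not NVIDIA's published numbers:

```python
def keypoint_stream_kbps(num_points=130, coords=2, bytes_per_coord=2, fps=30):
    """Bandwidth needed to stream facial landmarks: each frame sends
    num_points (x, y) pairs as float16 values (assumed figures)."""
    bytes_per_frame = num_points * coords * bytes_per_coord
    return bytes_per_frame * fps * 8 / 1000  # kilobits per second

# Typical bitrate for a 720p H.264 video call (assumption, for comparison)
H264_720P_KBPS = 1500

kp = keypoint_stream_kbps()       # 124.8 kbps
ratio = H264_720P_KBPS / kp       # ≈ 12x less bandwidth
```

Even before adding audio and occasional reference frames, the key-point stream is roughly an order of magnitude smaller than the video it replaces, consistent with the "one-tenth of H.264" claim.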
With new AI research, NVIDIA can identify key facial points of each person on a video call and then use these points, together with a still image, to infer a person’s face on the other side of the call using GANs. These key points can be used for face alignment, where faces are rotated so that people appear to be facing each other during a call, as well as gaze correction to help simulate eye contact, even if a person’s camera isn’t aligned with their screen. NVIDIA hopes that developers will also add features that allow call participants to choose their own avatars, realistically animated in real time by their voice and emotional tone.
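NVIDIA has not published the details of its alignment step, but the underlying geometric idea is well established: once 2D landmarks are detected, a similarity transform (rotation, uniform scale, and translation) can be estimated between two landmark sets, so a face can be re-posed toward the viewer. A minimal sketch using the classical Procrustes/Umeyama method, with illustrative toy data:

```python
import numpy as np

def align_landmarks(src, dst):
    """Estimate the similarity transform (scale, rotation R, translation t)
    that best maps landmark set src onto dst (Umeyama/Procrustes method)."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)          # cross-covariance of centred sets
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt                            # 2x2 rotation
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    return scale, R, t

# Toy example: rotate a few landmarks by 20 degrees, then recover the pose
theta = np.deg2rad(20)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
pts = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 2.]])
rotated = pts @ R_true.T
scale, R, t = align_landmarks(pts, rotated)   # recovers R_true, scale 1, t 0
```

In a real pipeline the recovered transform (or its inverse) would feed the neural renderer so the reconstructed face appears frontal; the GAN synthesis itself is far beyond this sketch.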
This work by NVIDIA is part of the bigger Maxine program.
NVIDIA Maxine is a fully accelerated platform for developers to build and deploy AI-powered features in video conferencing services using state-of-the-art models that run in the cloud. Maxine includes the latest innovations from NVIDIA research such as automatic real-time translation, face alignment, gaze correction, and face re-lighting, in addition to capabilities such as super-resolution, noise removal, closed captioning, and virtual assistants. These capabilities are fully accelerated on NVIDIA GPUs to run in real-time video streaming applications in the cloud.

As Maxine-based applications run in the cloud, the same features can be offered to every user on any device, including computers, tablets, and phones. And because NVIDIA Maxine is cloud-native, applications can easily be deployed as microservices that scale to hundreds of thousands of streams in a Kubernetes environment.
Maxine-based applications can use NVIDIA Jarvis, a fully accelerated conversational AI framework with state-of-the-art models optimized for real-time performance. Using Jarvis, developers can integrate virtual assistants to take notes, set action items, and answer questions in human-like voices. Additional conversational AI services such as translations, closed captioning and transcriptions are all possible.