Omniverse Marble Madness with Jarvis

NVIDIA CEO Jensen Huang announced during his GTC 2020 keynote a set of major new technological advances for the company, including three that are very relevant to the Media and Entertainment (M&E) space.

  1. NVIDIA raytracing benefiting from Deep Learning Super Sampling (DLSS) 2.0
  2. An expanded Omniverse
  3. Remarkable speech to character animation, as part of the Jarvis conversational AI framework.
CEO Jensen Huang’s ‘Kitchen Keynote’

Real-Time Ray Tracing

DLSS 2.0 is a new and improved deep learning neural network that boosts frame rates and resolutions while generating beautiful, crisp game images. It gives artists the performance headroom to maximize ray tracing settings and increase output resolutions.

In his talk, Huang highlighted a simulation and real-time GPU rendering demo called Marbles RTX. The demo is a playable game environment, displaying real-time physics with dynamic lighting and rich, physically based materials. Huang also profiled the Omniverse platform, which was used to make Marbles RTX.

During the virtual keynote, Huang showcased the demo piece, created remotely by the NVIDIA creative team, to illustrate the power of RTX ray tracing and the Omniverse platform. Marbles RTX was created by a distributed team of artists and engineers using Omniverse. They assembled the VFX+ quality assets into a fully physically simulated game. The demo required none of the sacrifice in quality and fidelity typically associated with “gamifying” art assets to run in real time. Marbles RTX runs on a single Quadro RTX 8000, simulating complex physics in a real-time ray-traced interior set. The demo cannot be downloaded “as it relies on the Omniverse server,” explained Richard Kerris, GM of NVIDIA’s M&E, to fxguide when we spoke to him separately.

The Marbles demo is impressive but it also has a host of CG easter eggs and humour

Creating visual effects, architectural visualizations, or manufacturing designs typically requires multiple people collaborating across teams, remote work locations, and various customer sites for reviews. 3D assets are developed using an assortment of software tools. Transferring data between applications has long been a challenge for millions of artists, designers, architects, engineers, and developers globally. Using Pixar’s Universal Scene Description (USD) and NVIDIA RTX technology, Omniverse offers a way for people to easily work with applications and collaborate simultaneously with colleagues and customers, wherever they may be. We highlighted such an example with a GauGAN demo a couple of weeks ago here at fxguide.

This technology is also very applicable to virtual production, especially with many companies currently in lockdown and working with distributed teams. Kerris remarked that “companies such as ILM have had five times the level of interest in virtual stages in the last 90 days.” NVIDIA provided the GPU cards that were used in ILM’s The Mandalorian virtual stage project. Omniverse has been in development for some time, and it is well suited to remote collaboration.

While the Marbles demo ran on a single RTX 8000 card, NVIDIA’s Omniverse has been expanded to include a new type of rendering with Omniverse View. This module is accelerated by multiple NVIDIA RTX GPUs and built for extreme scalability on arrays of GPUs to provide high-quality, real-time output, even with huge 3D models. Omniverse View displays the 3D content aggregated from different applications inside Omniverse, or directly in the 3D application being used. It’s also designed to support commercial game engines and offline renderers.

ML Ray Tracing Scaling

DLSS 2.0 offers image quality comparable to native resolution while rendering only one quarter to about one half of the pixels. It employs new temporal feedback techniques for sharper image details and improved stability from frame to frame. The original DLSS 1.0 required training the Machine Learning (ML) network for each new environment or game. DLSS 2.0 was trained using non-game-specific content, delivering a generalized network that works across different visual environments. It is not, however, a general up-res tool for imagery. It uses a convolutional autoencoder, which takes the low-resolution current frame and the high-resolution previous frame to determine, on a pixel-by-pixel basis, how to generate a higher quality current frame. DLSS 2.0, therefore, has two primary inputs into the ML network:

  1. Low-resolution, aliased images rendered by the render engine
  2. Low-resolution motion vectors from the same images, also generated by the render engine.

Whereas historically NVIDIA spoke of CPU and GPU, the focus of this year’s GTC presentation was cloud/server-based computing combined with DPUs (Data Processing Units). Previously the NVIDIA narrative was primarily ray tracing with AI, but those results were more a reflection of noise reduction algorithms. This year, Huang showed incredible up-resing, with ML providing dramatic results in inferring higher resolution ray tracing renders. The process couples ray tracing with ML to produce high quality renders above the native resolution being ray-traced. In fact, in the demo, the 1920×1080 AI-rendered images, which were upconverted from 720p, seemed to be more detailed than the matching render that had been natively rendered at 1920×1080.
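To put the 720p-to-1080p example in context, the fraction of output pixels that is actually ray-traced can be worked out directly. The short Python sketch below is only illustrative arithmetic: it assumes 720p means 1280×720, and the 1080p-to-4K pair is an added example rather than a figure from the keynote.

```python
def rendered_fraction(render_res, output_res):
    """Fraction of output pixels that are rendered before ML upscaling."""
    render_w, render_h = render_res
    output_w, output_h = output_res
    return (render_w * render_h) / (output_w * output_h)


# 720p (assumed 1280x720) upscaled to 1920x1080, as in the keynote demo
frac_1080 = rendered_fraction((1280, 720), (1920, 1080))

# 1080p upscaled to 4K UHD (illustrative assumption, not from the keynote)
frac_4k = rendered_fraction((1920, 1080), (3840, 2160))

print(f"720p -> 1080p: {frac_1080:.1%} of output pixels rendered")
print(f"1080p -> 4K:   {frac_4k:.1%} of output pixels rendered")
```

Both cases fall inside the "one quarter to about half" range quoted for DLSS 2.0: roughly 44% of the pixels for the 720p source, and exactly 25% for the 1080p-to-4K case.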

Another key demo showed ray tracing in Minecraft, which had already been released: in April, a beta version of Minecraft with RTX became available. Mojang Studios and NVIDIA made a Windows 10 edition of the game which offered top-to-bottom path-traced ray tracing.

Minecraft Ray-Traced



The second demo involving Omniverse was the Jarvis demo section of the keynote. Jarvis is the new NVIDIA conversational agent system.

Conversational AI is one of the most difficult inference problems, requiring a large amount of ML that spans everything from complex speech recognition to Natural Language Processing (NLP). The inference needs to be extremely fast, or the conversation lags and the illusion of a conversation is broken. In addition to the impressive AI, the demo also showed the agent output being converted into plausible human speech. Two characters were shown being driven by the Jarvis pipeline, one of which was an interactive water drop character called Misty. But perhaps the most impressive and relevant M&E demo was the lipsync rap demo, in which an NVIDIA employee provided only the audio of a rap, and that audio was then interpreted into extremely robust lipsync on a base model of a real face.

Conversational AI and Lipsync

“The rap demo (featuring our employee, John Della Bona, ‘JohnnyD’) demonstrates the speed and accuracy of Jarvis to power conversational AI and real-time facial animation as well as the real-time animation and graphics capabilities of Omniverse Kit, a true breakthrough end-to-end solution,” Kerris told fxguide.

John Della Bona’s voice was used to drive a 3D model (from audio alone)

Jarvis is a fully accelerated application framework for building multimodal conversational AI services that use an end-to-end deep learning pipeline. Developers can easily fine-tune state-of-the-art models on their data to achieve a deeper understanding of their specific context and to optimize for inference to offer real-time services that run in 150 milliseconds (ms), versus the 25 seconds required on CPU-only platforms. The Jarvis framework includes pre-trained conversational AI models, tools in the NVIDIA AI Toolkit, and optimized end-to-end services for speech, vision, and natural language understanding (NLU) tasks. Jarvis ships with pre-trained models that represent more than 100,000 hours of training; a developer then extends these with the NeMo module, which blends in domain-specific terms and additional localised training. This addition of special terms and training data, combined with the base Jarvis training, produces incredible results.
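The gap between the two quoted latency figures is easier to appreciate as a ratio. The snippet below simply divides one by the other; the variable names are illustrative, and the numbers are the 150 ms and 25 seconds quoted above, not independent measurements.

```python
# Latency figures as quoted for Jarvis (not independently measured)
gpu_latency_s = 0.150   # 150 milliseconds on GPU
cpu_latency_s = 25.0    # 25 seconds on CPU-only platforms

# How many times faster the GPU-accelerated pipeline responds
speedup = cpu_latency_s / gpu_latency_s

print(f"Approximate speedup: {speedup:.0f}x")
```

That works out to a speedup of roughly 167 times, the difference between a reply that feels conversational and one that arrives long after the speaker has moved on.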

Fusing vision, audio, and other inputs simultaneously provides capabilities such as multi-user, multi-context conversations in applications such as virtual assistants, multi-user diarization, and call center assistants.

Jarvis-based applications have been optimized to maximize performance on the NVIDIA EGX platform in the cloud, in the data center, and at the edge. A major theme of CEO Jensen Huang’s kitchen keynote was NVIDIA’s advances in Data Centre Scale. The scope and cost savings of NVIDIA’s AI server farms are incredibly impressive.