Last Saturday night, at the Beverly Wilshire in Los Angeles, Luca Fascione, J.P. Lewis and Iain Matthews were honored for the design, engineering, and development of the FACETS facial performance capture and solving system at Weta Digital.
FACETS was one of the first reliable systems to demonstrate accurate facial tracking from an actor-mounted camera, combined with rig-based solving, in large-scale productions. This system enables animators to bring the nuance of the original live performances to a new level of fidelity for animated characters.
FACETS was developed primarily to serve the needs of Avatar at Weta Digital, and it represented a major rewrite of the approach used for King Kong. The FACETS system is aimed squarely at facial performance, not generalised motion capture. Avatar was the first time Weta used a head-mounted camera rig for facial motion capture. Previously, for Weta's facial work on King Kong, actor Andy Serkis had been filmed with a camera not attached to his body. Furthermore, Serkis was not acting in the scene with the other actors (in part due to the scale issues), "so his performance was captured more on par with an ADR process", commented Fascione.
The three recipients of the Scientific and Engineering Award (Academy Plaque) focused on different parts of the system, while all working together.
The FACETS system captured actors on set with mono head mounted cameras, interacting with fellow actors and then as a post-process produced extremely high quality facial animation for virtual characters. It was developed for the film Avatar at Weta Digital in New Zealand.
Iain Matthews built on his academic research into Active Appearance Models (AAMs). This work focused on the high-quality micro-detection and face-tracking work done offline. While some real-time tracking was done on set, there was also a slightly slower, non-real-time, but high-quality facial processing pass, so that the next stage of the process had the most accurate data possible.
Luca Fascione worked on the solver that takes the facial tracking data and ‘solves’ it into a set of blendshapes, which are then presented to the Weta artists as a set of interpolated and combined facial expressions. “I was doing the solver part – the part that goes from the micro positions on the face to driving blendshape channels on the virtual characters” he explained.
J.P. Lewis built on Matthews’ work with complex filtering that could remove noise without removing valuable high frequency information, and most specifically the development of the original tracking approach the team used.
Key to the success of FACETS was the team’s work on editability and ‘salient point detection’.
On-set face rigs and high-frequency filtering.
As noted, this was the first Weta Digital project with a head-mounted camera. While today facial capture teams often use stereo or quad cameras, on Avatar the team was limited to one camera, at a much lower framerate and resolution than today's high-quality computer vision cameras.
Key to any system like this is the ability to capture the micro-expressions on the actor's face without confusing them with noise and error in the pipeline. “What one used to do to address this was have far more markers or dimensionality than the problem space would require”, explains Fascione. On the film King Kong, Andy Serkis had some 119 markers on his face. “This means several hundred dimensions, but you are mapping this onto some 50 channels – so you have a 6:1 redundancy space – which was great. All the wiggles or noise that isn’t really there is removed by the solver”. Lewis recalls that “there were just so many problems to be solved, and while it did not seem impossible it was definitely challenging with so many shots to be solved”. For Lewis the capture approach was to avoid any smoothing of the footage, so as not to lose important high-frequency subtle movement. He remembers that by 2008 no smoothing was done at all in Weta's face capture pipeline, and prior to that the team worked hard to avoid any need for data curve fitting or smoothing.
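The benefit of that redundancy can be illustrated with a small least-squares sketch. The numbers loosely follow the quote (several hundred tracked dimensions mapped onto roughly 50 channels), but the random linear basis and the solve below are purely illustrative, not Weta's solver:

```python
import numpy as np

rng = np.random.default_rng(0)

n_markers = 119          # marker count quoted for King Kong
n_dims = n_markers * 2   # each marker tracked in 2D: several hundred dimensions
n_channels = 50          # blendshape channels being solved for

# Hypothetical linear model: marker displacements = B @ weights, where
# B's columns are the displacements produced by each basis expression.
B = rng.standard_normal((n_dims, n_channels))

true_w = rng.uniform(0.0, 1.0, n_channels)
clean = B @ true_w
noisy = clean + 0.05 * rng.standard_normal(n_dims)   # per-marker tracking noise

# Overdetermined least squares: the ~6:1 redundancy averages noise out,
# so the recovered weights sit far below the per-marker noise level.
w_est, *_ = np.linalg.lstsq(B, noisy, rcond=None)
print("max weight error:", np.abs(w_est - true_w).max())
```

With more measurements than unknowns, the solve effectively averages many noisy observations per channel, which is the "wiggles are removed by the solver" effect Fascione describes.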
For Avatar the team moved to painted markers on the face with a head rig (applied, ideally, in exactly the same place each day). Even if the team could have accurately painted a vast set of dots, the video quality on the head rig was just not good enough to resolve such fine detail. “So we went down to a rig of 50 to 60 markers, which is a lot closer to a 2:1 redundancy, … and it worked fairly well”. Having good filtering and processing was therefore particularly important on Avatar. Lewis recalls the cameras used on Avatar's head rigs were low resolution (720 lines at best) and extremely low quality by current standards (interlaced 60i, effectively 30Hz). These early head rigs also did not have an attached light, so some of the markers in the footage would often fall into shadow.
Salient point detection.
One of the secrets to the success of FACETS is something called “salient points”: a system for simplifying the result of the facial mocap solve into a set of keyframes, similar to those that might have been produced by keyframing the motion from scratch, with residual detail curves capturing the difference between the keyframed approximation and the original.
“This algorithm works far better than the curve simplify tool in Maya” says Lewis, “and in fact is provably optimal, given a definition of the error between a curve and its approximation”.
The tool allows animators to do further editing just as if they were working on keyframed animation. The real significance and widespread adoption of ‘salient points’ came after Avatar, as the team came to understand the importance of the approach.
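The flavour of the idea can be sketched as a greedy keyframe selection: repeatedly add the frame where the piecewise-linear reconstruction deviates most, then keep the leftover as a residual detail curve. This is a simple illustrative stand-in, not Weta's provably optimal algorithm:

```python
import numpy as np

def salient_keyframes(curve, tol):
    # Greedy simplification (illustrative only): add the worst-fitting
    # frame as a new keyframe until the piecewise-linear approximation
    # is within tol everywhere; return keyframes plus the residual
    # detail curve (original minus keyframed reconstruction).
    n = len(curve)
    keys = [0, n - 1]
    t = np.arange(n)
    while True:
        ks = sorted(keys)
        approx = np.interp(t, ks, curve[ks])
        err = np.abs(curve - approx)
        worst = int(err.argmax())
        if err[worst] <= tol:
            return ks, curve - approx
        keys.append(worst)

# A facial-style channel: a slow arc plus a quick blink-like dip.
x = np.linspace(0.0, 1.0, 200)
curve = np.sin(np.pi * x) - 0.8 * np.exp(-((x - 0.5) / 0.02) ** 2)

keys, residual = salient_keyframes(curve, tol=0.02)
print(len(keys), "keyframes; max residual:", np.abs(residual).max())
```

An animator can then edit the sparse keyframes directly, while the residual curve preserves the subtle high-frequency detail of the original performance.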
An interesting theme in the development of the FACETS system was speed vs. quality, and the artist's role in the process. For example, in the tracking component, Weta's Shane Cooper produced the first face tracker, which was importantly real-time but not of sufficient quality for final use. “The next version was toward the opposite extreme, high quality and automatic (or nearly so), but slow. That in turn was followed with a 3rd version that was nearly as high quality as the second version, not real time but fast, but required some per-shot artist input” explained Lewis. Around the time of Avatar, this third version was the one most artists preferred: a method requiring some extra work on their part, coupled with fast feedback, rather than a more automatic launch-and-check-later approach.
The whole system emphasized artist-editability in the pipeline. Fascione developed the solver and, unlike others at the time, it had the ability for the artist to insert new basis shapes into the solver. When an actor does their Facial Action Coding System (FACS) poses away from the set, they may not provide enough poses to cover the full range of expressions they emote on set. Fascione's innovation was to allow additional poses to be added to the mix, which fixes not just that one shot but also adds to the core solver's solution space. From then on, the system will include this new expression in future solves. “The system is so fast you can add on the fly, you can just say I want to use this one frame as one more basis element for the solving space,.. and you can decide how that maps into your virtual character, and this is all live”, explains Fascione. Lewis believes this approach turned out to be critical not only to the system working, but to artist acceptance and adoption of the system. “I think Luca's solver was distinguished from previous academic work (e.g. Pighin from around 1997) in terms of its ‘artist-directability’. I think artists accepted the system because of this” he comments.
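A minimal sketch of the on-the-fly basis extension, assuming a simple linear blendshape solve (the real FACETS solver is far richer): a tracked frame the FACS basis cannot span is promoted to a new basis element, after which every future solve can reach that expression.

```python
import numpy as np

rng = np.random.default_rng(1)
n_dims, n_channels = 120, 20

# Hypothetical FACS-derived basis: columns are marker displacements
# for each captured expression.
B = rng.standard_normal((n_dims, n_channels))

# A tracked frame containing an expression the FACS session never covered:
frame = rng.standard_normal(n_dims)

w, *_ = np.linalg.lstsq(B, frame, rcond=None)
before = np.linalg.norm(frame - B @ w)        # large residual: basis can't reach it

# Promote this very frame to one more basis element ("use this one frame
# as one more basis element for the solving space"):
B2 = np.column_stack([B, frame / np.linalg.norm(frame)])
w2, *_ = np.linalg.lstsq(B2, frame, rcond=None)
after = np.linalg.norm(frame - B2 @ w2)       # now essentially zero

print(f"residual before: {before:.3f}  after adding shape: {after:.2e}")
```

Because the extension is just one extra column in a linear solve, it is cheap enough to do live, which matches Fascione's description of adding shapes on the fly.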
Expression combination problems
One aspect of a FACS system that is both complex and limiting is the way expressions are combined. Each part of an expression is called an action unit or AU. In simple terms, if we call an eyebrow-raise AU ‘A’, and a mouth-smirk AU ‘B’, then any face pipeline system around the world will allow A+B = A and B happening at once. The problem is that this assumes what is known as linear combination of expressions. It assumes that the way an actor raises an eyebrow (AU: A) when not smirking is the same as how they would raise it if they were smirking. This is at the heart of the assumptions of a FACS approach with blendshapes: namely that you can combine two AUs to get a more complex expression, i.e. you can build up expressions by adding AUs together. In reality, these two versions of the ‘eyebrow raise’ we labeled as ‘AU: A’ may be slightly different. Since one cannot capture all the combinatorial variations of every AU with every other AU permutation, the problem is fundamental to face capture. FACETS' additional system of adding new basis shapes allows artists to address this by adding intermediate poses that compensate. And, as before, once this has been done the system will thereafter correctly handle any eyebrow raise, with or without the AU of a mouth smirk.
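One common way to express such a compensating pose is a corrective combination shape driven by the product of the two AU weights, so it only engages when both AUs are active together. The toy numbers below are hypothetical, and the source does not detail FACETS' exact mechanism, so this is a sketch of the general blendshape trick rather than Weta's implementation:

```python
import numpy as np

neutral = np.zeros(8)                                        # toy 8-vertex face, 1-D offsets
brow_raise = np.array([1.0, 1.0, 0.5, 0, 0, 0, 0, 0.0])     # AU 'A' (hypothetical)
smirk = np.array([0, 0, 0, 0, 0, 0.3, 0.6, 0.4])            # AU 'B' (hypothetical)

# The actually captured "brow raise while smirking" pose is NOT the
# linear sum of the two AUs:
captured_combo = brow_raise + smirk + np.array([0, -0.2, -0.1, 0, 0, 0, 0, 0.0])

# Corrective combination shape: the difference between the captured
# combined pose and the naive linear sum.
corrective = captured_combo - (brow_raise + smirk)

def pose(wA, wB):
    # The corrective term is gated by wA * wB, so either AU alone is
    # untouched, while the pair together reproduces the captured pose.
    return neutral + wA * brow_raise + wB * smirk + (wA * wB) * corrective
```

With this in place, `pose(1, 0)` is still the plain brow raise, while `pose(1, 1)` now matches the captured combined expression instead of the naive sum.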
Key to Weta's solution is that the artist can edit animation curves and performances. Editing here should not be seen as an artist fixing up a ‘mistake’ in the process. If a human performance such as Andy Serkis's is being retargeted to an ape such as Caesar, then the film's final performance only works because Andy's emotional performance is interpreted onto the different skull and muscle set of Caesar. The aim is not to have Caesar look like Andy Serkis; it is to have the audience understand the motivation and subtext of the character of Caesar. The editing, rather than being a fix, is the very essence of what allowed such resonant performances on screen from Neytiri, Jake Sully, Eytukan, Caesar, or Koba. This is a point the popular press sometimes fails to grasp: a perfect, completely automated, one-to-one motion capture retargeted onto a character would look wrong and creepy. The audience has to believe the inhabitants of Pandora are their own species, or that Maurice is an altered orangutan; in Maurice's case, that means not just having actress Karin Konoval's face translated onto an orangutan's body.
Weta Digital has an interesting step in their face pipeline. In between solving for the actor and retargeting, they use a generic face as a quality check. In other words, they solve from the actor to the actor's own digital double, but they also have a generic human double which they solve onto, as validation that the solving is working as expected. “That is to say we solve to a generic face, between solving for the actor and going from the actor to the virtual character”, explains Fascione. This allows the animators to really understand how the solver is working for this particular actor, both in terms of matching a performance and how that would map to a standard generic face, before they see how it maps to a retargeted character which might have a very different facial design.
The solver is trying to estimate how to ‘fire’ the particular digital muscles on the Weta rig. But what can defeat some solvers is a frame where the actor is not exactly hitting any simple combination of AUs. Imagine an actor giving a performance where their expression could be a smile, or it could be something else. Picking, say, a partial smile activation is fine for any single frame, and the result may be great, but one does not want the solver to pick the other muscle solution on the next frame and effectively ‘flicker’ between solutions. FACETS is very good at addressing this problem. It may seem like a small improvement, but Weta Digital operates at the very cutting edge of facial animation. Weta's Oscar-winning VFX Supervisor Joe Letteri is known for wanting Weta's work to be as correct as possible, with as few behind-the-scenes hacks or manual error corrections as possible. FACETS commits strongly to its choices and thus avoids this solver indecision. “This solver is particularly good at not freaking out, and when you have two faces that are very similar to each other it doesn't get confused”, adds Fascione. This provides a very consistent temporal solution.
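One standard way to get this kind of temporal consistency is to damp the solve toward the previous frame's weights, so near-ambiguous frames cannot flip between solutions. This is a hedged sketch of the general technique, not the actual FACETS method:

```python
import numpy as np

rng = np.random.default_rng(2)
n_dims, n_channels = 60, 10
B = rng.standard_normal((n_dims, n_channels))   # hypothetical blendshape basis

def solve(frame, w_prev, lam):
    # Damped least squares: fit the tracked frame while penalising any
    # jump away from the previous frame's weights (strength lam).
    A = np.vstack([B, np.sqrt(lam) * np.eye(n_channels)])
    b = np.concatenate([frame, np.sqrt(lam) * w_prev])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

frames = 0.1 * rng.standard_normal((6, n_dims))  # jittery, ambiguous tracked input

def total_jump(lam):
    # Sum of frame-to-frame weight changes across the sequence.
    w_prev, jump = np.zeros(n_channels), 0.0
    for f in frames:
        w = solve(f, w_prev, lam)
        jump += np.abs(w - w_prev).sum()
        w_prev = w
    return jump

print("undamped:", total_jump(0.0), " damped:", total_jump(100.0))
```

The damped run produces markedly smaller frame-to-frame weight changes on the same jittery input, which is the temporal behaviour being described: the solver stays committed rather than flickering.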
What has made Weta Digital famous in this area is their flesh simulation work. Flesh simulation is incredibly computationally expensive, but on characters such as Smaug, Weta has been known to simulate the airflow while the dragon is talking, making tiny cheek movements occur which dramatically enhance realism. In the case of Avatar, the character of Jake Sully rides an Ikran (humans call them mountain banshees) at huge speed down from the floating Hallelujah Mountains. Actor Sam Worthington's FACS or facial ROM session would have been done with him seated quietly, and on set he was recorded on a moving gimbal, but neither would reproduce the G-force or wind effects on the face that Worthington's Avatar, Jake Sully, would have experienced.
Weta allows animated physics to change the ballistics, which in turn activates the muscle and flesh systems to augment the basic solve. This sits on top of the solve and runs additional variables such as mass (gravity effects), wind, acceleration, etc., for localised activation, “so you get all the drivers on the soft tissue of the face to make say the cheeks look realistic” explains Fascione.
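As a toy illustration of layering ballistic drivers on top of a solve (all values, masks, and function names here are hypothetical, a crude stand-in for Weta's full muscle and flesh simulation):

```python
import numpy as np

# Hypothetical per-vertex softness mask: bone-anchored points barely
# move, cheek soft tissue moves a lot.
softness = np.array([0.0, 0.1, 0.9, 1.0, 0.9, 0.1, 0.0])
solved_pose = np.array([0.0, 0.2, 0.5, 0.6, 0.5, 0.2, 0.0])  # output of the facial solve

def add_ballistics(pose, softness, accel, wind, k=0.02):
    # Layer a simple inertial/wind offset onto the soft tissue only,
    # on top of the basic solve; rigid (bony) vertices are unaffected.
    return pose + softness * k * (wind - accel)

# High-G banshee dive: cheeks get pulled, bony landmarks stay put.
dive_pose = add_ballistics(solved_pose, softness, accel=9.0, wind=4.0)
```

The point of the sketch is the layering: the extra drivers (mass, wind, acceleration) modify only the soft-tissue regions, leaving the underlying performance solve intact.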
Not everyone who contributes to the development of a pipeline solution can be formally recognised by the Academy, but the team all stressed how critical a component Shane Cooper's work was in his development of the real-time (lower quality) tracker.
The team also worked closely with Dejan Momcilovic, Head of Mocap at Weta Digital. Momcilovic had some suggestions on Luca's solver, specifically related to adding new basis shapes to guide the solve. “He was also on the Avatar set and responsible for obtaining all the facecam data that we used” explained Lewis. Fascione also points out that on King Kong they only had one actor representing one virtual character; on Avatar “we would have to set up 6 or 8 facial actors each morning, and all the actors needed to come out on set at the same time, plus when you come off set you have ten times as much data for the motion editors. Rethinking the Kong pipeline for Avatar was a major contribution”. To add to the complexity of running the pipeline on set, King Kong had a much smaller facial shot count, “especially when one allows for the fact that in any effects shot in Avatar, there could be many digital faces needing to be managed at once.. the size of Avatar's face pipeline was so huge in comparison – we had to completely rethink how we handled everything, and Dejan was a huge part of that”.
The earlier key work in the faces pipeline at Weta Digital was already recognised by the Academy with a previous Sci-Tech award. Mark Sagar, when he was at Weta, had pioneered the original use of the FACS system for virtual characters. His work was based on the non-CGI research of psychologist Paul Ekman, originally devised to identify and classify human facial expressions. The Scientific and Engineering Award (Academy Plaque) went to Mark Sagar “for his early and continuing development of influential facial motion retargeting solutions.”
Similarly, Simon Baker played a key role in the development of FACETS. As researchers at Carnegie Mellon University (CMU), Simon Baker (now at Nvidia) and Iain Matthews (now at Oculus Research) developed a real-time implementation of Active Appearance Models (AAMs) and licensed their work to Weta Digital, where it formed the basis of the facial motion capture for FACETS.
The novelty in the CMU approach was the “inverse compositional” framework, which according to Matthews “is a general algorithm for efficient gradient descent based image alignment that we applied to Active Appearance Models”, referring to the work of Gareth Edwards, the original inventor of AAMs.
“The tracking approach is similar to the Manchester folks and their AAM work, but based on the inverse compositional optimization framework for solving it. That means you can just as accurately consider it a shape-constrained image alignment, as it is closely related to the Lucas-Kanade tracking approach” explains Matthews.
The mouth and eye tracking components were inverse-compositional AAMs (i.e. deforming mesh and texture). “The face tracking component used a similar approach, but was specialized to tracking facial markers and is closer to a simultaneous point tracker centered on each of the face markers” he adds.
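A toy one-dimensional cousin of this alignment idea, reduced to estimating a single shift by Gauss-Newton on the sum-of-squared-differences error, shows the Lucas-Kanade flavour. The inverse compositional AAM generalises this to a deforming 2D face mesh with appearance variation; nothing below is Weta's code:

```python
import numpy as np

def lk_align_1d(template, signal, iters=50):
    # Minimal 1-D Lucas-Kanade: find the shift t minimising the SSD
    # between signal(x + t) and template(x) via Gauss-Newton.
    x = np.arange(len(template), dtype=float)
    grad = np.gradient(template)      # template gradient, fixed across iterations
    h = 1.0 / (grad @ grad)           # 1x1 Gauss-Newton "Hessian", precomputed
    t = 0.0
    for _ in range(iters):
        warped = np.interp(x + t, x, signal)   # signal resampled at shifted positions
        t -= h * (grad @ (warped - template))
    return t

x = np.arange(200, dtype=float)
template = np.exp(-((x - 100.0) / 8.0) ** 2)   # a bump centred at 100
signal = np.exp(-((x - 103.4) / 8.0) ** 2)     # the same bump shifted by 3.4

shift = lk_align_1d(template, signal)          # recovers roughly 3.4
```

The efficiency trick Matthews mentions is visible even here: the gradient and Hessian depend only on the template, so they are computed once outside the iteration loop, which is the core idea the inverse compositional framework exploits.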
Nebojsa Dragosavac wrote some of the important infrastructure components of the system, including camera calibration, projection, dewarping, and MotionBuilder plugins. Goran Milic and Teresa Barsali were two key artists who pioneered the use of the system. Since the days of Avatar, Milic has progressed to become Weta's Head of Facial Motion Editing.
Today the research continues
Yeongho Seol and Alex Ma have since gone on to author a second-generation solver, and fxguide has previously discussed Seol and Ma's DigiPro 2016 paper, Creating an Actor-specific Facial Rig from Performance Capture, co-authored by J.P. Lewis.
J.P. Lewis himself has recently gone on to work on Artificial Intelligence and Deep Learning in California, after many years at Weta Digital in New Zealand. “John Lewis has an enormous amount of knowledge in this area” says Fascione of his former workmate.
This second-generation solver is still a constrained FACS solver overall, but adds nonlinear remapping, the ability to solve a 3D rig from a single camera view, and other improvements.
“Today our system is more accurate, it has more constraints, it has a better provision for everyday practicalities. For example one of the great improvements in the new solver is that it can cope really well with the camera not being exactly in the same position as when the FACS or facial ROM was captured” explains Fascione. “Whereas my system was not as robust for these issues” he adds. Lewis expands on this point: “this is a big problem, the head camera is not meant to move around but it just does; one of the big advances they made was a hybrid 2D/3D solution”. The current face team at Weta invented an approach where they solve for where the camera actually is in relation to the actor's head by “finding a camera position that gives the best match between tracked points and the projections of corresponding points from the 3D rig.” Using this, the team was able to do a much better job of removing the camera shake, which gives a better solve.
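The flavour of that 2D/3D matching step can be sketched under a drastic simplification: an orthographic camera and only in-plane scale and translation to recover. The real system solves a full camera position against a perspective projection; all names and numbers below are illustrative:

```python
import numpy as np

def fit_camera_drift(tracked_2d, rig_3d):
    # Orthographic stand-in for the pose search: project the rig points
    # by dropping depth, then solve in closed form for the image-plane
    # scale s and translation t minimising ||tracked - (s*proj + t)||^2.
    proj = rig_3d[:, :2]
    pc, tc = proj.mean(axis=0), tracked_2d.mean(axis=0)
    p0, t0 = proj - pc, tracked_2d - tc
    s = (p0 * t0).sum() / (p0 * p0).sum()
    return s, tc - s * pc

rng = np.random.default_rng(3)
rig = rng.standard_normal((50, 3))            # 3D rig points in head space
true_s, true_t = 1.3, np.array([0.2, -0.1])   # simulated camera drift
tracked = true_s * rig[:, :2] + true_t + 0.001 * rng.standard_normal((50, 2))

s, t = fit_camera_drift(tracked, rig)
```

Once the drift parameters are recovered, the tracked points can be mapped back into the rig's frame before solving, which is the "removing the camera shake" step Lewis describes, here reduced to its simplest linear form.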
Luca Fascione, at Weta Digital, has since made huge contributions to the craft in a variety of areas, most recently in the development of the advanced Manuka Renderer that Weta now uses as standard on all its productions.