A team comprising USC ICT and Pinscreen researchers recently published a paper that has been getting a lot of attention. Entitled “Photorealistic Facial Texture Inference Using Deep Neural Networks”, its accompanying video shows a single image being used to construct a detailed, textured face.
The approach uses deep neural networks to synthesize photorealistic facial texture and matching 3D geometry. To demonstrate the process, the team digitized a set of historic figures from simple still images, producing high-fidelity facial texture maps with mesoscopic skin detail.
How does this approach work?
The full explanation is contained in the official technical paper, but here is the (considerably) less technical version. If you are well versed in the literature, please just read their excellent paper, but if you’d like the ‘VFX artist version’, suitable for a solid background understanding, then please read on.
The whole process is data driven; that is to say, it relies on having a large body of data already, and it accesses this data to estimate, guess and recreate detail that is not in the original image. This is the same basic approach as much of the current AI that has been appearing with such interesting results in recent times. The AI used here is not so much super-clever direct algorithmic programming, although it is clever; it solves the problem by learning from examples with attached metadata. If one understands enough examples of something, then one can infer a lot when seeing a new example. Imagine you have tasted many foods and you know you don’t like curry. It is therefore not magic to know you won’t like a menu item labelled Hot Curry, even if you have never seen it, let alone smelt or tasted it.
After an initial estimation of the shape, the computer turns its attention to the shaded texture of the skin. If the face were lit with no directional light, just ambient illumination, it would be easy for the computer to get a natural, neutral texture. But no normal photo is completely flatly lit, so the computer needs to estimate the texture map without shading. This is the low-frequency albedo. The problem is made worse because a single image provides only a partial view of the subject’s face, so not only are there shadows, there are also obscured sections of the face texture map.
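The idea of dividing shading out of a photo to recover the unshaded albedo can be illustrated with a toy example. This is a deliberately simplified Lambertian sketch for intuition only, not the paper’s actual method, and all the array values are made up:

```python
import numpy as np

# Toy model: under simple Lambertian shading, each observed pixel is
# (albedo x shading). If the shading could be estimated, dividing it
# out would recover the unshaded texture -- the low-frequency albedo.
albedo = np.array([0.8, 0.8, 0.8, 0.6])    # true skin colour per pixel
shading = np.array([1.0, 0.7, 0.4, 0.9])   # directional light falloff
observed = albedo * shading                # what the photo records

estimated_shading = shading                # pretend we estimated lighting perfectly
recovered_albedo = observed / estimated_shading

print(np.allclose(recovered_albedo, albedo))  # True
```

In a real photo the shading is unknown and must itself be estimated, which is exactly why a strong prior (the database and the neural network) is needed.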
SIDE BAR: Background on Computer Vision
Before going on, it is important to understand how computer vision has been transformed in recent years; this background is also very helpful for understanding a wider class of AI approaches. You may be aware that computers have become much better at ‘seeing’ images. A computer can now be given a couple of images such as these below and tell you that one is a Manta Ray and the other contains Stingrays.
Between 2005 and 2012 there was an open competition in the computer vision community called the PASCAL VOC Challenge. This was a challenge to see how many images a computer could correctly identify. To run the competition, there was a publicly accessible set of 20,000 ‘correctly’ identified images. For example, there might be an image of a Manta Ray and some associated metadata stating that it is a Manta Ray. In the annual competition, computer programs would be given these images and rated on how many they could correctly identify. The images fell into 20 groups of ‘object classes’. In 2010, this competition was expanded to an even larger set called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), but the nature of the challenge was the same.
The reason the set of images needed to be expanded is just how clever the computers suddenly got. Up until approximately 2011, the best a computer could do was identify around 75% of the images correctly. Using what we might think of as the old, normal programming approaches, the machines would be wrong about 25% of the time. Then, in 2012, a deep convolutional neural network (CNN) achieved an error rate of just 16%! Within two years, virtually every team from every major research institution had switched to using a CNN approach, and the error rates fell to just a few percentage points. This means that by last year (2015), computer vision software identification exceeded a human’s ability to name things in these vast image sets. People can still recognize a larger number of different categories, and we understand more about what we are seeing, but computers can now ‘see’ and name objects better than we can – in their training spaces.
The last four words in the previous paragraph are key: in their training spaces. The way this form of AI works is both insanely clever and very simple. The computer is given a framework to build its own solution. It is not programmed to ‘think’ (sorry, it is not HAL). This framework is best thought of as a series of tiny tests, each one super simple. The computer is then given labelled training data. It effectively runs the data through the series of tests, which are arranged in layers, and sees how well the tests, in that order and in those layers, did in matching the right answer. If we simplify this into an English-language example, it is much like the children’s game 20 Questions (although this is never actually how it is done, and the ‘tests’ are simpler than in this example):
Look at the image:
- Is it lighter at the top? YES.
- Has it got gradations (shading) in the image? YES.
- Is it REDish? YES.
- Is there green at the bottom of frame? YES.
- Is there blue at the top of frame? YES.
- Is the gradated thing also a round thing? YES.
- Is there a sharp transition from the green stuff to the blue stuff? YES.
I think it is a cricket ball on grass. Look up the metadata and see if it is a cricket ball. Now vary the questions and try again.
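The question chain above can be sketched as a chain of simple tests in code. The feature names below are invented for illustration; a real CNN learns millions of numeric filters rather than being handed seven readable rules:

```python
# A hand-written stand-in for the 'series of tiny tests' idea.
# Each feature is a crude, pre-computed observation about the image.
def classify(features):
    # Branch left or right on each simple test, like 20 Questions.
    if not features.get("gradated_round_thing"):
        return "unknown"
    if features.get("reddish") and features.get("green_at_bottom"):
        return "cricket ball on grass"
    return "unknown"

image_features = {
    "reddish": True,
    "green_at_bottom": True,
    "blue_at_top": True,
    "gradated_round_thing": True,
}
print(classify(image_features))  # cricket ball on grass
```

The point is not these particular questions but that the final classifier is just a fast cascade of cheap tests, discovered rather than hand-written.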
This idea of 20 Questions is just that: our way of conceptualising the process. The computer could also be thought of as a black box, since we never see or know what series of tests and permutations it runs. We just give it the tools and say “work it out for yourself”; there is never a human-readable, or perhaps even understandable, version of what the computer does inside the black box. The main advantage of a learning approach is that the specific test details need not be specified.
The computer does this process many times – a huge number of times, actually. Each time, it tries to see if a change does anything differently: does it score a lower error rate? If it does, then it switches to using that new model; if it does not, then it stays with the best set of questions it has tested so far.
This process does what computers do well – a simple task, done a billion times over. It looks for any slightly better combination or approach that will yield a more accurate result. The second thing this process does is build a CNN or decision tree that, once worked out, runs really quickly. The training iteration takes forever, but once you have a ‘build’ it is very fast. Once training is over, the whole system runs by asking its computer version of 20 questions of any new image it gets, and at each point the computer branches left or right, yes or no. This is something computers do really quickly.
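The “try a variation, keep it if the error drops” loop described above can be sketched as a simple random search. This is a toy stand-in (real training uses gradient descent over millions of parameters), and the data points are invented:

```python
import random

# Toy 'training': find a brightness threshold that separates two
# labelled groups. Each iteration proposes a candidate model and
# keeps it only if the error rate drops -- the keep-if-better loop.
data = [(0.2, "dark"), (0.3, "dark"), (0.7, "light"), (0.9, "light")]

def error(threshold):
    wrong = sum(1 for value, label in data
                if ("dark" if value < threshold else "light") != label)
    return wrong / len(data)

random.seed(0)
best = 0.0                          # a deliberately bad starting model
for _ in range(1000):
    candidate = random.uniform(0.0, 1.0)
    if error(candidate) < error(best):
        best = candidate            # switch to the better model

print(error(best))  # 0.0
```

The expensive part is the search; once a good `best` is found, applying it to new data is a single cheap comparison, which mirrors the slow-to-train, fast-to-run behaviour of a trained network.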
The actual AI is much more complex (and it is often not just yes/no), but the important thing to note is that it learns from a training data set. If you give it a bunch of things it has never seen, it will not work as well. Humans can guess; computers can’t, and they rely on good training data sets. This data set defines the problem space the computer is good at solving.
BACK to the Paper
For the Photorealistic Facial Texture paper, the team used an AI approach from a group who published Very Deep Convolutional Networks for Large-Scale Image Recognition. This approach and its training data set won that team first and second place, respectively, in the localisation and classification tracks of the 2014 ImageNet Challenge. One of the great joys of academic publishing is that people publish their work and the next team does not need to reinvent the wheel. (This is why the Internet was so happy when Apple recently decided to start publishing its team’s AI research.)
To extract the fine appearance details from the incomplete image of Trump’s face below, the team introduced a multi-scale detail analysis technique based on mid-layer features found by their deep convolutional neural network (the CNN above).
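One common way to summarise the texture statistics held in mid-layer CNN features is the correlation (Gram) matrix between feature channels, a technique popularised by neural style transfer. The sketch below illustrates the idea only: random arrays stand in for real CNN activations, and the comparison shown is an assumption about how such features could be matched, not the paper’s exact pipeline:

```python
import numpy as np

def gram_matrix(features):
    """Correlations between feature channels.

    features: array of shape (channels, height, width), e.g. the
    activations of one mid layer of a CNN.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

# Stand-in feature maps; a real pipeline would take these from a CNN.
rng = np.random.default_rng(0)
input_features = rng.random((8, 16, 16))
candidate_features = rng.random((8, 16, 16))

# A smaller distance means more similar fine-scale texture statistics,
# so database textures can be ranked against the partial input.
distance = np.linalg.norm(gram_matrix(input_features) -
                          gram_matrix(candidate_features))
print(gram_matrix(input_features).shape)  # (8, 8)
```

Because the Gram matrix discards spatial position and keeps only channel co-activation statistics, it can compare skin texture even when the input image only shows part of the face.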
Accessing a high-resolution face database in this way can yield a plausible facial texture for the entire face, which means a complete and photorealistic texture map can be made.
The next step is making geometry for the face. Here the team reference a classic 1999 SIGGRAPH paper: A Morphable Model for the Synthesis of 3D Faces by Volker Blanz and Thomas Vetter. When this original paper was published, it stated quite clearly that matching points on a face so they align across different faces required “human knowledge and experience” to compensate for all the variations between individual faces. Today, AI can do that task for the user.
The 1999 paper offered “the morphable face model” as a multidimensional 3D morphing function “that is based on the linear combination of a large number of 3D face scans.” In 1999 a database of “Laser scans (CyberwareTM) of 200 heads of young adults (100 male and 100 female)” was used.
Their seminal work on morphable face models provided a synthesis framework for textured 3D face modeling. The 1999 work used a Principal Component Analysis (PCA)-based approach, built from their database of 3D face scans. What is PCA? It is a mathematical technique that finds the directions in which faces vary most: it captures the aspect of the face with the largest variance first, then moves to the next, in such a way that each new component does not undo the previous ones. Fitting the model to a target then makes smaller and smaller tweaks to these components until the model lines up as closely as it can.
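The PCA idea behind the morphable model can be sketched in a few lines of numpy. Random vectors stand in here for the 1999 database of 200 scans; a real morphable model works on thousands of 3D vertex coordinates per face:

```python
import numpy as np

# Each row is one 'face' flattened into a vector (stand-in data only;
# real scans would contribute thousands of vertex coordinates each).
rng = np.random.default_rng(0)
faces = rng.random((200, 30))

# PCA: centre the data, then find the directions of largest variance.
mean_face = faces.mean(axis=0)
centred = faces - mean_face
_, _, components = np.linalg.svd(centred, full_matrices=False)

# Any face is then approximated as the mean face plus a weighted sum
# of the leading components -- 'morphing' means varying these weights.
weights = centred[0] @ components[:10].T        # project face 0 onto 10 components
reconstruction = mean_face + weights @ components[:10]

print(reconstruction.shape)  # (30,)
```

This is why the model generalises: a new face is not stored, it is expressed as a small set of weights over directions learned from the whole database.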
Many state-of-the-art face modeling algorithms that work from monocular or single stills are still based on Blanz and Vetter’s original morphable face model paper.
Today, the team at USC and Pinscreen are expanding on the Chicago Face Database. This 2015 database provides 158 high-resolution, standardized photographs of Black and White males and females between the ages of 18 and 40. The images are all 2444 pixels wide × 1718 pixels high, with balanced color temperature, on a plain white background. Most significantly, there is extensive data about all of these target people, from the distance between their eyes to measurements of cheekbone prominence. For each face there is a large metadata set of face measurements.
To this already impressive data set, the USC / Pinscreen team are making available an additional new set of 3D face models with high-fidelity texture maps based on the Chicago Face Database high-resolution photographs (this will be freely available to the research community).
If the PCA approach is so effective, why not produce the texture that way as well? The answer is that previous methods did exactly that, but PCA does not give high-frequency results for the texture component. It works for building the face model, but the new approach produces a much better texture, as you can see in the image below:
How good are the results? The team compared them to a range of options, including their own data from their own Light Stage. The team got a set of “turkers” from Amazon’s Mechanical Turk (MTurk) crowdsourcing marketplace to rate a set of faces, which let the team compare how real people viewed the Light Stage faces against the new method’s faces. The results are published in the paper, which states that “the results are visually comparable to those from the Light Stage and that the level of photorealism is challenging to judge by a non-technical audience”.
The actual texture is doing a lot of the heavy lifting in the final images seen in the video. There are no eyes, nor any skin micro-hair, eyebrows or facial hair. As the paper works from a still image, the output face is not a rigged animation face. There would be nothing stopping an animator taking this face and rigging it, but clearly there would be no FACS or range of facial motion provided by the single image. This is not to take anything away from the results. The main contributions of this paper are:
- The team introduce a new way to generate high-resolution albedo texture maps with plausible fine details from a single unconstrained image.
- They show that plausible fine details can be made by blending high-resolution textures using feature matching obtained from a deep neural net.
- They also showed that their approach holds up well against USC ICT’s own Light Stage capture.
- They are offering back to the community a new dataset of 3D face models with high quality texture maps based on the Chicago Face Database.
Using the high-resolution texture created, along with an adapted face geometry, the face can be rendered using a program such as Solid Angle’s Arnold. The final, highly detailed 3D outputs are visually believable to a level comparable with a face made using scanning or photography with multiple cameras.
This system will not outperform scanning or photography in situations where it is possible to get access to an actor or subject. This approach captures only one expression with no FACS data, and while the textures are believable, they may not be accurate. However, for any job that requires producing a face of someone who is no longer available, or from a time when they were much younger, this technology is remarkable and would be invaluable as part of a face pipeline. It is also very likely that this technology will dovetail into wider, more democratised avatar applications for mobile phones and other similar non-professional uses. There is little doubt, reading between the lines of this great paper, that this technology will appear in Pinscreen’s as-yet-unannounced integrated avatar application. The desire to produce high-quality tools and results for people without specialist rigs or experience has been a trademark of this team’s research.
Check out Part 2 tomorrow: Focusing on the work being done in this area in Northern California.