Paul Lacroix and Hiroto Kuwahara have just created a video using ground-breaking facial mapping technology, which has never been used before in a music video. The project is a collaboration between musician Empara Mi, Paul Lacroix (face mapping technology) and digital make-up artist Hiroto Kuwahara.
Facial mapping, or facial projection mapping, is a technique that uses sensors to detect and track the position and orientation of a human face and then projects a real-time animation onto it. When the face moves, the projection is adjusted accordingly in real time.
Everything you see in the video below has been filmed in real time, with no post effects added to the visuals.
Paul Lacroix and Hiroto Kuwahara had previously worked together on Project Omote / Face Tracking & Projection Mapping (which got more than 7 million views on Vimeo) and Face Hacking (on Fuji TV, Japan). For this new project they were in charge of the technical direction and the artistic direction / make-up design, respectively.
On the previous project, Lacroix designed a facial mapping system based on the motion capture system OptiTrack, which he had experience with. In the new video for The Come Down, the small markers on the face are detected by a new and completely original system created and customized for facial mapping. The face tracking and rendering software has also been upgraded.
Original motion capture system
After working on Project Omote and Face Hacking, there was a feeling that, in order to go further with facial mapping, both the software and the hardware would need to be optimized. Lacroix has spent the last two years, in his own time, creating an original motion capture system (hardware and software) customized for face projection mapping. It uses industrial-grade computer vision components that provide low latency while maintaining high accuracy. Because it is homemade, it is not a black box, which gave the team much more flexibility on this project.
“The system configuration for The Come Down was composed of 4 ‘sensors’ placed 1.5 meters away in front of Empara’s face. The number of markers (ten) as well as their size have been reduced compared to previous projects. As a result, they are much less visible in the video,” commented Paul Lacroix.
The process of facial mapping is composed of these steps:
- Detection of small markers placed on the face
- Estimation of the 3D position of these markers
- Estimation of the face position and orientation (from the marker positions)
- Rendering of the 3D model of the face on which an animated texture or real-time effect is applied
- Projection onto the face
To keep the projection looking natural while the face is moving, all these steps must be performed as fast as possible.
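The second step in the list, estimating a marker's 3D position, is commonly done by triangulating the same marker as seen from two calibrated cameras. The article does not describe how Lacroix's sensor manager does this, so the following is only a minimal sketch of one textbook approach (the midpoint of the shortest segment between two viewing rays). The function name `triangulate_midpoint`, the camera positions and the marker position are all hypothetical values chosen for illustration.

```python
def sub(a, b):
    """Component-wise difference of two 3D vectors."""
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(o1, d1, o2, d2):
    """Return the point halfway between the closest points of two rays.

    o1, o2: camera centres; d1, d2: ray directions towards the marker
    (they do not need to be unit length).
    """
    w = sub(o1, o2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique closest point")
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    p1 = tuple(o + s * k for o, k in zip(o1, d1))
    p2 = tuple(o + t * k for o, k in zip(o2, d2))
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


# Hypothetical layout: two cameras 1 m apart, marker about 1.5 m away.
cam_a = (-0.5, 0.0, 0.0)
cam_b = (0.5, 0.0, 0.0)
marker = (0.1, 0.2, 1.5)
estimate = triangulate_midpoint(cam_a, sub(marker, cam_a),
                                cam_b, sub(marker, cam_b))
```

With noise-free rays the estimate coincides with the marker. With real, noisy detections the two rays no longer intersect, and the midpoint gives a reasonable compromise. A system with four sensors, as used here, could combine more than two rays, which this sketch does not attempt.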
The facial mapping process relies on two pieces of software running simultaneously on a single computer:
- The sensor manager software in charge of controlling the sensors and extracting 3D information from the sensor data
- The face tracking and projection mapping software (called Live Mapper) that estimates the face position and orientation, then renders (i.e. draws) the 3D model onto which a texture animation video is mapped
For the new project these are written in C++ and use OpenGL for the rendering. Lacroix commented that a future improvement may consist of merging these two pieces of software into a single piece of code to get an even tighter loop and thus more speed.
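To give an idea of what "estimating the face position and orientation from the marker positions" involves, here is a minimal sketch that recovers a rigid rotation and translation from three non-collinear markers by building an orthonormal frame on them. This is not Live Mapper's algorithm, which the article does not detail. With ten markers, a real system would more likely use a least-squares fit over all of them. The names `frame_from_markers` and `pose_between` and the sample coordinates are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)


def frame_from_markers(p0, p1, p2):
    """Return (origin, (x, y, z)): centroid and orthonormal axes of 3 markers."""
    origin = tuple((a + b + c) / 3.0 for a, b, c in zip(p0, p1, p2))
    x = normalize(sub(p1, p0))
    z = normalize(cross(x, sub(p2, p0)))
    y = cross(z, x)
    return origin, (x, y, z)


def pose_between(ref_markers, cur_markers):
    """Return (R, t) such that cur = R * ref + t for a rigid motion."""
    o_ref, a_ref = frame_from_markers(*ref_markers)
    o_cur, a_cur = frame_from_markers(*cur_markers)
    # R = A_cur * A_ref^T, where each A has the frame axes as columns.
    R = [[sum(a_cur[k][i] * a_ref[k][j] for k in range(3))
          for j in range(3)] for i in range(3)]
    t = tuple(o_cur[i] - sum(R[i][j] * o_ref[j] for j in range(3))
              for i in range(3))
    return R, t


# Hypothetical markers: reference pose, then the same markers after a
# 90 degree turn about the z axis followed by a shift of (1, 2, 3).
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cur = [(1.0, 2.0, 3.0), (1.0, 3.0, 3.0), (0.0, 2.0, 3.0)]
R, t = pose_between(ref, cur)
```

The resulting `R` and `t` are what a renderer would apply to the 3D face model so that the drawn model lines up with the real face before projection.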
“I avoided using toolkits like vvvv or openFrameworks to keep control over the program as much as possible (to avoid unnecessary operations and keep a low latency). I just used a little bit of OpenCV for the initialization part of the camera calibration,” explained Lacroix. The renderer uses the OpenGL graphics library directly, with GLSL shaders for the reflective effect (mirror) and the water effect. “The reason that I did not use a game engine like UE4 or Unity is to avoid unnecessary operations and to be able to optimize for latency.”
Besides the latency of the cameras, the processing of the marker data and the rendering also take time, “but the biggest latency is coming from the projector itself. I did not measure the latency yet but it should be less than 100ms,” he explains.
The cameras were not RGBD cameras but monochrome infrared cameras (computer vision cameras). Another important point the team came across was the native frame rate. In Japan, the TV frame rate is 30 FPS, so the computer rendering was set at 60 FPS. As the TV frame rate in the UK is 25 FPS, “I had to slow down the rendering a little bit to 50 FPS,” Lacroix added.
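The frame rate figures above follow a simple pattern: in both cases the render rate is twice the broadcast rate. The short sketch below works through that arithmetic and the per-frame time budget it implies. The factor of two is inferred from the numbers quoted, not stated by Lacroix, and the function names are hypothetical.

```python
def render_rate(broadcast_fps, multiple=2):
    """Render rate chosen as an integer multiple of the broadcast frame rate."""
    return broadcast_fps * multiple


def frame_budget_ms(fps):
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


japan_fps = render_rate(30)  # 60 FPS
uk_fps = render_rate(25)     # 50 FPS
japan_budget = frame_budget_ms(japan_fps)
uk_budget = frame_budget_ms(uk_fps)
```

Rendering at 50 FPS leaves 20 ms per frame, compared with roughly 16.7 ms at 60 FPS.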
3D model and Texture
To make the projection fit the artist’s real face, Empara came to Tokyo and had her face scanned with a 3D scanner.
“For testing purposes, we ordered a 1:1 scale print of the head on which we could make face mapping tests before the real video shooting,” commented Lacroix.
The base 3D model does not contain any colorimetric information, so the next step was for a professional photographer to take pictures of Empara’s face. Taken from several angles, these pictures were mapped onto the 3D model and merged into one single reference texture image (containing all the details of the face).
When creating the tracked effects and animation, this reference texture was used as a base in order to keep a proper correspondence with the real geometry of the face. “Though our process for generating the reference texture from real pictures takes time, it is an important step from our point of view. It increases the realism of the projection compared to what we would get if working only with a computer generated design,” he adds.
Passive optical motion capture systems often use sphere or half-sphere markers. But as the team were putting the markers on a face, they wanted to keep them as small and invisible as possible. The final markers were flat and just 1.5mm in diameter. “That makes the detection of the marker position sensitive to noise. On the other hand, capture conditions are quite stable (distance to sensing cameras, size of the markers, lighting of the room…). The marker detection algorithm was optimized insofar as it was designed to provide the best quality in these particular (limited) conditions,” he explained.
Some of the content, such as the fire or the dropping gold, consists of texture animations that were generated beforehand and just remapped live. But other content, such as the reflective effect (mirror) and the deep water effect, is a fully real-time simulation: “The projected content is changing depending on the position and the orientation of the head. These effects are called ‘Shader contents’,” he explained.
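A common way to locate a small bright marker in a monochrome infrared image is to threshold the image and take the intensity-weighted centroid of the remaining pixels, which gives a sub-pixel position. The sketch below illustrates that general idea only. Lacroix's optimized detection algorithm is not described in the article, and the function name `marker_centroid`, the threshold and the tiny sample image are hypothetical.

```python
def marker_centroid(image, threshold):
    """Sub-pixel (x, y) centroid of pixels brighter than threshold.

    image: list of rows of grayscale values, assumed to contain at most
    one marker (for example a small window around its last known position).
    Returns None when no pixel exceeds the threshold.
    """
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value > threshold:
                weight = value - threshold  # brighter pixels count more
                total += weight
                sum_x += weight * x
                sum_y += weight * y
    if total == 0.0:
        return None
    return (sum_x / total, sum_y / total)


# Hypothetical 4x4 window: the marker is brighter on its right-hand side,
# so the centroid lands between pixel columns 1 and 2, closer to column 2.
window = [
    [0, 0, 0, 0],
    [0, 100, 200, 0],
    [0, 100, 200, 0],
    [0, 0, 0, 0],
]
centre = marker_centroid(window, threshold=50)
```

With a marker only a few pixels wide, a single noisy pixel shifts the centroid noticeably, which is one way to read the remark that flat 1.5mm markers make detection sensitive to noise.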
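To illustrate why a mirror-like effect depends on head orientation, the sketch below computes a reflected direction from a viewing direction and a surface normal, using the same formula as GLSL's built-in `reflect` (I - 2 * dot(N, I) * N). It is written in Python for readability and is not the shader used in the video, which the article does not show. The helper `rotate_y` and the chosen vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reflect(incident, normal):
    """Reflect an incident direction about a unit-length surface normal."""
    d = dot(incident, normal)
    return tuple(i - 2.0 * d * n for i, n in zip(incident, normal))


def rotate_y(v, angle_rad):
    """Rotate a vector about the vertical (y) axis, as when the head turns."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    x, y, z = v
    return (c * x + s * z, y, -s * x + c * z)


view_dir = (0.0, 0.0, -1.0)     # looking straight at the face
face_normal = (0.0, 0.0, 1.0)   # surface normal when the face points at the viewer

facing = reflect(view_dir, face_normal)
turned = reflect(view_dir, rotate_y(face_normal, math.radians(45)))
```

When the face points at the viewer, the reflected direction comes straight back. After a 45 degree turn of the head it points sideways, so a mirror effect has to sample a different part of its environment for every head pose and cannot be pre-rendered like the fire or gold animations.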