Real-time face landmarking using decision trees and NN autoencoders.
I started working on this project while still at university and hadn't touched it since. So now, after two years, I've decided to revisit it and improve the parts that I didn't like. I wasn't sure if those changes would improve the overall quality of the proposed solution, as that's never guaranteed in ML projects, but they did and I'm finally happy with how everything works. So here comes a short overview of the techniques I used while designing and implementing this piece of software.
As usual, the project is open source, so every bit of code (with instructions on how to run it) can be found on my GitHub.
About the problem
Face landmarking is a really interesting problem from the computer vision domain. The goal is to find key points on an image of a face that help us identify where specific face parts – like the eyes, lips and nose – are located. Furthermore, for more advanced applications, we might also like to detect the exact outline of those parts, in particular the lips. And the final challenge is to perform all of that not on a single image, but on a video input in real time.
And what's a potential use case for such technology? Well, there are multiple. For example: speech recognition from lip movement, emotion detection, people identification, or preprocessing of a video for further modifications (like with deepfakes). And of course much more.
What about existing solutions? There are plenty. Probably as many as there are developers who ever needed them. But if you are looking for nice, mature projects, I would recommend taking a look at this blog post by Satya Mallick.
So why have I decided to create my own software? As I mentioned at the beginning, it started simply as a project at the university. Of course I could have picked a simpler topic, but it was just fun trying to overcome the challenges this one posed.
Data
It's hard to talk about algorithms before taking a look at the data, especially when we are not the ones creating a dataset but only using an existing one. In that case we usually have to adjust our approach to what is provided. But don't get me wrong, that's not always a bad thing. It almost always speeds up the process drastically, not only because we don't have to collect the data ourselves, but also because we can assume that whoever created the dataset has already spent some time thinking about the best representation for it.
For this project I picked the Helen dataset. It consists of 2330 images of human faces expressing different emotions, taken at slightly different angles and in different lighting conditions. Each image is annotated with 194 points placed on the chin, lips, nose, eyes and eyebrows. Thankfully, the order of those points is always the same. For example, the 114th point is always the left corner of the left eye. This will become very helpful later.
The overall quality of this set is very good, considering that all of those images have been annotated manually. Unfortunately, it's not perfect. It's ambiguous where the jaw starts, not to mention the eyebrows, which are usually outlined well off their exact position (the same goes for the nose). Also, the people in those photos (which of course makes sense :D) almost always have their eyes open. Because of this, the created model is not able to properly mimic blinking while analyzing a video input (as can be seen in the example above). Finally, the mouth shapes in those images are rather limited, with most of them closed or just slightly open – which also imposes some limitations on the capabilities of the final solution.
But even so, this set is more than enough to create a basic face landmarker. It's worth noting that it doesn't matter that those examples are just images and not video files. We will take advantage of having sequential data on the input in a different way, so we don't need that kind of learning data. Of course, if we had it, we might think of using a different learning process, but let's stick to what we've got.
The final question is whether 2330 examples are enough – it doesn't sound like much from a computer vision perspective. But we have to remember that we are not doing any deep learning here. Our models will be very basic. From this point I can assure you that it's indeed enough, especially since, based on those images, we will generate over 200,000 learning examples per model (a decision tree) associated with each point. The power of data multiplication.
The core idea
Now that we've got familiar with our data, let's start thinking about how to use it. I believe it won't be a surprise that my final solution tries to map the same number of points (194) in the process of face landmarking. Well, actually, at some point I reduced the set of points to achieve better performance, but after applying some improvements this was no longer necessary and I came back to the full mask description. The key is that instead of operating on some undefined cloud of points, we can reuse those predefined labels and treat each point as a well defined entity. In other words, the 114th point of the created mask will always be the left corner of the left eye, the same as it is in the learning data. This can be pretty useful for anyone who wants to use the outcome of the processing for further analysis.
Another important premise is that we are going to run this software on a video input. What does this bring us? Continuity. After all, it's not like each frame of a video stream is a completely different image. If there is a face on one frame, it's most likely that it will be there on the next frame as well. And most importantly, it's highly probable that the face is going to be located in the same place, maybe just slightly shifted. That means we don't have to search for our landmarks from scratch each time. All we have to do is adjust them after we receive a new frame, starting from the previously generated mask. If we need to do some heavy lifting (like finding the initial position of the face), we can do it just once, at the beginning.
There was also a technical challenge to this project – how to get everything working in real time. And it's not only about optimizing the C++ code (which actually should always be the last step), but primarily about picking the right algorithms and the overall design.
Data preprocessing
Data normalization is a very important step in machine learning, but it's not too exciting. Still, it's worth mentioning, as I spent some time getting it right in this project.
First of all, I analyse images in the HSV (hue, saturation, value) representation instead of the standard RGB. That's because knowing that a given pixel is pink but relatively dark is more descriptive than some specific levels of the red, green and blue components.
Secondly, before analysis, an image (a frame) is scaled so that the size of the face is always roughly the same (200 x 200 px). But how do we know the size of the face before detecting it? Well, we can safely assume that it's roughly the same as the size of the face detected on the previous frame. Even if that size changes slightly (and due to the nature of the video input it can't change too much between frames), it's still close enough.
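As a rough sketch of what this step could look like (a minimal version using OpenCV in Python; the 200 px target comes from the text, while the function and parameter names are made up):

import cv2

TARGET_FACE_SIZE = 200  # px, the rough face size everything downstream expects

def rescale_frame(frame, previous_face_size):
    # scale the whole frame so that the face detected on the previous frame
    # ends up roughly TARGET_FACE_SIZE x TARGET_FACE_SIZE pixels
    scale = TARGET_FACE_SIZE / float(previous_face_size)
    return cv2.resize(frame, None, fx=scale, fy=scale), scale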
Finally, all of the HSV components of an image are normalized so that their median values are roughly the same for all learning examples and for the actual input data. Without getting into unnecessary details, I will just say that I calculate median values within a subset of weighted pixels of a subarea of what I assume to be the face location, and then shift all of the pixels accordingly.
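A heavily simplified sketch of the idea (the real implementation uses weighted pixels from a subarea of the assumed face location; here the whole face crop and arbitrary reference medians are used, and the circular nature of hue is ignored):

import numpy as np

def normalize_hsv(face_hsv, reference_medians=(90, 128, 128)):
    # shift each HSV channel so that its median matches a reference value
    shifted = face_hsv.astype(np.int32)
    for c in range(3):
        shift = reference_medians[c] - int(np.median(shifted[:, :, c]))
        shifted[:, :, c] = np.clip(shifted[:, :, c] + shift, 0, 255)
    return shifted.astype(np.uint8)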
Mask regression
Let’s assume that we already have a perfectly aligned mask mapped on our current frame of a video. Now the frame changes so that the person opens his/her mouth a little bit more and turns his/her head slightly.
How do we adjust a mask that is not aligned anymore? One possible approach would be to look at the mask as a whole and, combined with information obtained from the new image, do some magic that would come up with a new, correct mask. That magic is the scary bit, and it sounds like a very complex model. Let's not go this way.
On the other hand, we could look at each point independently, and this is my approach. The program looks at the color of a small number of pixels closest to the current position of a given point and, based on those, decides whether it has to be moved. This movement – by its nature – should be very small, as we have very limited information to base it on. So my idea is that during a single iteration the position of a point can change only by one pixel. Of course it might be (and usually is) the case that the head movement is greater, let's say 3 pixels. That's not a problem. Nothing stops us from performing multiple iterations per frame. In my case it's 45.
What features are used when deciding whether a point has to be moved? Just the values of some very basic filters applied at the current position of the point. Actually, those filters are so simple that I can easily list them here to give you a better understanding of this approach:
static const std::array<cv::Mat, 5> BasicFilters {
    // 1 x 1 identity
    cv::Mat{
        +1,
    },
    // 3 x 3 horizontal gradient
    cv::Mat{
        +1, 0, -1,
        +2, 0, -2,
        +1, 0, -1,
    },
    // 3 x 3 vertical gradient
    cv::Mat{
        +1, +2, +1,
         0,  0,  0,
        -1, -2, -1,
    },
    // 5 x 5 horizontal gradient
    cv::Mat{
        +1,  0,  0,  0, -1,
        +2, +1,  0, -1, -2,
        +4, +2,  0, -2, -4,
        +2, +1,  0, -1, -2,
        +1,  0,  0,  0, -1,
    },
    // 5 x 5 vertical gradient
    cv::Mat{
        +1, +2, +4, +2, +1,
         0, +1, +2, +1,  0,
         0,  0,  0,  0,  0,
         0, -1, -2, -1,  0,
        -1, -2, -4, -2, -1,
    }
};
So in total we get 15 values. Why 15 if we only have 5 filters? Because, as discussed, we have to apply these filters on three separate layers of the frame: hue, saturation and value (HSV).
As you can see, the first filter is just an identity. The rest are simple gradient kernels.
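To make this concrete, here is how extracting the 15 local features at a single point could look in Python (the kernels mirror the C++ listing above; the sketch assumes the point is not right at the image border):

import numpy as np

# the same five kernels as above: a 1x1 identity plus 3x3 and 5x5 gradients
BASIC_FILTERS = [
    np.array([[1]]),
    np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]]),
    np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]]),
    np.array([[1, 0, 0, 0, -1], [2, 1, 0, -1, -2], [4, 2, 0, -2, -4],
              [2, 1, 0, -1, -2], [1, 0, 0, 0, -1]]),
    np.array([[1, 2, 4, 2, 1], [0, 1, 2, 1, 0], [0, 0, 0, 0, 0],
              [0, -1, -2, -1, 0], [-1, -2, -4, -2, -1]]),
]

def extract_features(hsv_frame, x, y):
    # 5 filters applied to each of the 3 HSV layers -> 15 values
    features = []
    for channel in range(3):
        layer = hsv_frame[:, :, channel].astype(np.float32)
        for kernel in BASIC_FILTERS:
            r = kernel.shape[0] // 2
            patch = layer[y - r:y + r + 1, x - r:x + r + 1]
            features.append(float(np.sum(patch * kernel)))
    return features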
After applying those filters, the obtained values are fed into a decision tree regressor that runs some checks on them. For example, if a pixel under a point assigned to the left side of the chin is suddenly green, it's probably because this point is currently over the background and has to be moved somewhere to the right. It's that simple.
Of course, each of the 194 points has its own decision tree. What's more, each point actually has two separate decision trees – one for calculating the x offset and one for calculating the y offset.
An important fact about those points is that their coordinates have floating point precision. The filters are applied aligned to whole pixels, but the decision can be partial, which indicates that the model is not entirely sure what to do. It can propose a movement of just 0.1 pixel along an axis. And that's ok. It's still possible to move to the next pixel, it will just require 10 iterations. At the same time, it prevents the model from exploding.
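A minimal sketch of a single regression pass over all points, assuming sklearn-style regressors and the extract_features helper from the previous sketch (trees_x and trees_y are hypothetical lists holding one regressor per point per axis):

import numpy as np

def regression_iteration(hsv_frame, mask, trees_x, trees_y):
    # mask is a (194, 2) float array of (x, y) coordinates
    for i, (x, y) in enumerate(mask):
        features = np.array(extract_features(hsv_frame, int(round(x)), int(round(y)))).reshape(1, -1)
        dx = float(trees_x[i].predict(features)[0])
        dy = float(trees_y[i].predict(features)[0])
        # a single iteration moves a point by at most one pixel along each axis
        mask[i, 0] += np.clip(dx, -1.0, 1.0)
        mask[i, 1] += np.clip(dy, -1.0, 1.0)
    return mask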
And what about performance? Decision trees are very fast by themselves, especially since in this project their max depth is 5. So essentially we just have to do 5 simple comparisons to get a single decision. Of course, first we have to extract the local features. What's important is that we don't actually have to extract them for the whole image. We can calculate them ad hoc, only at the coordinates where our points are currently located. This also guarantees memory access locality. First we calculate all 45 decisions for a single point and only then move to the next one. So we use a single decision tree description multiple times in a row, operating on a very limited fragment of the image.
What about caching the features extracted from the image before making a decision? After all, a single mask point can stay within a single pixel for multiple consecutive iterations. Well, there is an even better approach. We can cache the decisions themselves. If a mask point visits only 5 different pixels during the whole regression process, only 5 decisions will have to be calculated. The rest will simply be looked up in the cache. And the nice thing about this cache is that it can be shared between the points as they are processed one by one. So it's extremely cheap, both from the complexity and the memory usage perspective.
Another very important thing when talking about optimization is that the whole regression process doesn't allocate (except for allocating the cache while processing the first frame). This is possible because the number of features (aka filters) is very well defined. Thanks to this, all processing can be done in place using some common buffers.
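One possible way to organize such a cache (just a sketch; the actual C++ cache is preallocated, and I'm guessing at the exact key here: the point index plus the pixel it currently sits on):

decision_cache = {}  # (point_index, pixel_x, pixel_y) -> (dx, dy)

def cached_decision(point_index, px, py, compute_decision):
    # compute the decision only on the first visit to a given pixel,
    # afterwards just look it up
    key = (point_index, px, py)
    if key not in decision_cache:
        decision_cache[key] = compute_decision(point_index, px, py)
    return decision_cache[key]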
So far I have presented all of the advantages of using decision trees for regression. Are there any disadvantages? Or can we base the whole project just on them?
I will be honest with you. Decision trees are very stupid. And I mean it. They live in their small bubbles, seeing only a very limited context and having just one simple output. They don't even know anything (store any data) about the other points, not to mention their role in the bigger plan (that they are part of some face).
Some points will get their decisions right. If the learning process is done correctly, most of them should. But there will always be some number of points that drift away because the parameters were slightly off their expectations. Furthermore, there is barely any difference between the local features collected from neighboring points on the upper lip. From our filters' perspective it's just a horizontal line, so one gradient is positive and the others are equal to 0. So how would those points know their correct order in that region? We need to add some supervisor that will keep everything in line.
Mask fixing
The previous section started with the assumption that our mask was perfectly aligned, but then we had to update it because of a new frame coming in. Now let's talk about the next step. After applying our decision trees to individual points, we've got a mask that, generally speaking, has drifted in the correct direction. Unfortunately, this process also introduced some noise, caused by the decision trees not being aware of the broader context of the analysis. And now we have to fix it.
This is the part of the project that I've redesigned completely during the last month. Previously, denoising was quite a complicated process based on some semi-automatically crafted rules. In short, each connection between two points was represented as a spring with a specific length and stiffness. Also, selected angles were annotated with acceptable ranges. If the given conditions weren't satisfied, the algorithm would attempt to move the connected points in a way that reduces the tension. I admit that I overcomplicated that. But at the same time I'm still surprised that I managed to get it right, as it actually worked (at least for the lips – adding more points did not end up well). On top of that, the significant connections between points had to be selected manually. Their limits were calculated based on learning examples, but it's no secret that I wasn't fully satisfied with this approach.
The nice thing about dissatisfaction is that it can be seen as a trigger for change. So I've decided to replace this hideous model with a more elegant solution.
Enough with the backstory, especially since you already know what I picked as the replacement: neural network autoencoders.
Autoencoders are crafty bastards, but the idea behind them is rather simple. They try to reproduce on their output what they get on the input. For example, I could (try to) create a model that takes an image and attempts to reproduce it.
In my case I've created an autoencoder that reads 388 floating point values (2 * 194 – one for the x and one for the y coordinate of each point) and tries to produce the same values on the output (which can then be merged back into a mask).
Some people will probably kill me for saying this (as it's usually far from the truth), but the learning process for autoencoders is fairly simple. We just have to put the same vector of values on the input and on the output and run our optimizer.
So what's the point of having a model that attempts to replicate its input on the output, if we can simply copy the input? The clue lies in the word "attempts". There are different types of autoencoders (like sparse autoencoders) and multiple applications for them (denoising is just one), but the important thing is what lies within them. The general rule is that during the learning process an autoencoder tries to find an alternative, usually reduced, representation for what it sees. For our example we can assume that the mask could be described by its width, height, rotation, the distance between the eyes, the degree of mouth openness and maybe a few more values (but far fewer than the original 388). During the encoding process the autoencoder looks at the provided examples and tries to derive such core features. Then, based on the extracted features of a new mask, a similar mask can be reproduced.
The nice thing about this process is that we don't have to come up with the reduced representation ourselves. For example, we don't have to say that a given feature will describe the width of the face. On the other hand, we won't know what a specific value in the feature vector actually means, even if we wanted to. But in our case it doesn't matter, as we will treat the autoencoder as a black box. We just want it to take our mask and reproduce it.
The key thing is that the internal representation used by the autoencoder is limited. The autoencoder will be forced to omit information regarding noise on individual points. It will just focus on the basic characteristics.
The funny thing about this is that even if we try to run this software on an empty canvas the mask will still keep its shape. It simply doesn’t know any better – and that’s good.
As I've mentioned, we don't have to select any features manually. Actually, the whole learning process is as simple as this (it's the entire content of autoencoder.py):
import numpy as np
import shutil
import pathlib
from sklearn.neural_network import MLPRegressor
from readers.autoencoder_example_reader import read_autoencoder_examples
from writers.nn_writer import write_nn

data = read_autoencoder_examples('../Data/autoencoder/examples')

hidden_layer_sizes = (200, 200)
nn = MLPRegressor(hidden_layer_sizes=hidden_layer_sizes, activation='relu')

# the example file interleaves input and target vectors
x = data[0::2]
y = data[1::2]

set_size = len(x)
train_set_size = int(set_size * 9 / 10)

x_train = x[0:train_set_size]
y_train = y[0:train_set_size]
nn.fit(x_train, y_train)

x_test = x[train_set_size:]
y_test = y[train_set_size:]
score = nn.score(x_test, y_test)
print(score)

directory = '../Data/regressors/nn'
shutil.rmtree(directory, ignore_errors=True)
write_nn(directory, 'autoencoder', nn)
And of course applying this autoencoder to a new mask (denoising it) is even simpler.
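In sklearn terms it could be reduced to a single predict call (a sketch assuming the MLPRegressor trained above, or its exported counterpart loaded back in):

import numpy as np

def denoise_mask(autoencoder, mask):
    # flatten the 194 (x, y) points into 388 values, run them through the
    # autoencoder and fold the output back into a mask
    flat = np.asarray(mask, dtype=np.float64).reshape(1, -1)  # shape (1, 388)
    return autoencoder.predict(flat).reshape(-1, 2)           # back to (194, 2)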
During the whole discussion about autoencoders I've mentioned neural networks just once, at the beginning. That's because it's just an implementation detail. But indeed, I use a NN with 2 hidden layers, 200 nodes each, and a ReLU activation function. Wait a second, did I just say 200 nodes per hidden layer? What happened to the short representation I mentioned before? Well, it seems that the learning process I used wasn't powerful enough to find as short a description of the mask as I initially assumed. But still, the goal has been achieved. The created autoencoder is capable of denoising the provided masks (in a reasonable time) – and that's enough.
Complete process
And that's it when it comes to face landmarking. First, each point of the mask is shifted based on some local features of the current frame. After that, the mask is denoised. The process is repeated over and over again for each incoming frame.
You might remember that I said each point is adjusted 45 times per frame. As a detail, I will add that the denoising process takes place every 15 iterations, so 3 times per frame.
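Putting the pieces together, the per-frame processing could be sketched like this (reusing the regression_iteration and denoise_mask helpers from the earlier sketches; the 45 and 15 come straight from the text):

ITERATIONS_PER_FRAME = 45
DENOISE_EVERY = 15

def process_frame(hsv_frame, mask, trees_x, trees_y, autoencoder):
    for iteration in range(1, ITERATIONS_PER_FRAME + 1):
        mask = regression_iteration(hsv_frame, mask, trees_x, trees_y)
        if iteration % DENOISE_EVERY == 0:  # 3 times per frame
            mask = denoise_mask(autoencoder, mask)
    return mask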
Also, the initial face location is determined using the Haar cascades that come pre-trained with OpenCV. They are only run once, on the first frame, as after that the face landmarking algorithm takes over. The initial shape of the face is just an average mask derived from the learning examples (it gets adjusted pretty fast).
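The one-off detection itself can be done with OpenCV's bundled cascade, roughly like this (a sketch; the averaging of the learning masks is not shown):

import cv2

def detect_initial_face(gray_frame):
    # returns (x, y, w, h) of the largest detected face, or None
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    faces = cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    return max(faces, key=lambda f: f[2] * f[3])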
Learning examples
One more thing that I promised to cover is how I generated learning examples from the dataset. For the autoencoder it's quite simple, we just use the raw data – well, maybe after standardizing the size of the faces. But what about the decision trees?
Well, it's not complicated either. All we have to do is take the original image with correctly annotated points and move those points around. A little bit to the left, a little bit to the right and so on. At each step we calculate the feature vector at the given location and associate it with the correct decision – the one that would move the point back towards its true position.
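A simplified sketch of that example generation for a single point (reusing the extract_features helper from before; the offset range is a made-up parameter, not the one used in the project):

import numpy as np

def generate_examples_for_point(hsv_image, true_x, true_y, max_offset=10):
    # perturb the annotated point and record (features, correct decision) pairs
    features, targets_x, targets_y = [], [], []
    for dx in range(-max_offset, max_offset + 1):
        for dy in range(-max_offset, max_offset + 1):
            features.append(extract_features(hsv_image, true_x + dx, true_y + dy))
            # the correct decision points back towards the annotated position,
            # clamped to the one-pixel-per-iteration limit
            targets_x.append(float(np.clip(-dx, -1, 1)))
            targets_y.append(float(np.clip(-dy, -1, 1)))
    return np.array(features), np.array(targets_x), np.array(targets_y)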
Another neat trick is to mirror all examples. As faces are usually slightly rotated and illuminated more on one side than on the other, it effectively doubles our dataset. One important thing to remember here is that the list of points of a mask has to be properly reordered, so that, for example, the 114th point still refers to the left corner of the left eye and not to the right corner of the right eye.
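The mirroring itself could look like this (a sketch; mirror_map is a hypothetical precomputed array where mirror_map[i] is the index of the original point that takes role i after flipping, e.g. a left eye corner becomes a right eye corner):

import cv2
import numpy as np

def mirror_example(image, points, mirror_map):
    # flip the image around the vertical axis and remap the annotated points
    flipped = cv2.flip(image, 1)
    mirrored = points.copy()
    mirrored[:, 0] = image.shape[1] - 1 - points[:, 0]  # mirror the x coordinate
    return flipped, mirrored[mirror_map]  # reorder so that point 114 stays the left eye corner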
Conclusion
The conclusion is simple: I had fun implementing this project.
But jokes aside, there is a lesson in it. The most complicated models are not always the best and sometimes simple decision trees might be enough. Model simplicity is especially important when we want to achieve high performance. But at the same time it’s harder and harder to live without (at least shallow) neural networks 😀
And now the most difficult part: I have to decide what the new project is going to be about. See you after I'm done implementing it.
Cheers.
Thanks for putting together this detailed write-up, Tomasz! I've got a few projects of my own that I need to continue as well. Hit me up if you ever need some of your own images annotated for your future projects ;). Anyways, great stuff, keep it up!
Thanks!
Yeah, it's usually hard to come back to old projects, especially since there are always so many new ideas just waiting to be implemented.
I think that my next project will be related to sound synthesis. I’m done with face recognition for now 😀