Face Anti-Spoofing, or How We Technologically Spot a Cheater Among a Thousand by the Face


Biometric identification is one of the oldest ideas for recognizing people, and one that engineers have long tried to implement. Passwords can be stolen, spied on, or forgotten, and keys can be forged. But a person's unique characteristics are much harder to fake or to lose. These can be fingerprints, voice, the pattern of the retinal vessels, gait, and so on.



Of course, people try to fool biometric systems! That is what we will talk about today: how attackers try to circumvent face recognition systems by posing as someone else, and how this can be detected.


According to Hollywood directors and science fiction writers, deceiving biometric identification is quite simple. All you need to do is present the "required parts" of the legitimate user to the system, either individually or by taking the person hostage entirely. Or you can "put on" someone else's identity, for example with a physical mask, a transplant, or, in general, by presenting false genetic traits.


In real life, attackers also try to pass themselves off as someone else, for example, robbing a bank while wearing a mask of another person, as in the picture below.



Face recognition looks like a very promising direction for the mobile sector. While everyone has long been accustomed to fingerprints, and voice technologies are developing gradually and fairly predictably, identification by face has turned out quite unusual and deserves a brief look at the history of the issue.


How it all began, or from fiction to reality


Today's recognition systems demonstrate tremendous accuracy. With the advent of large datasets and complex architectures, face recognition error rates as low as 0.000001 (one error per million!) became achievable, and the models are already suitable for mobile platforms. Their vulnerability to spoofing has become the bottleneck.


In our technical reality, as opposed to the movies, masks are most often used to impersonate another person. Attackers try to fool the computer system by presenting it with someone else's face instead of their own. Masks come in wildly different quality: from a photo of another person printed on a printer and held in front of the face, to very complex three-dimensional heated masks. A mask can be presented separately, as a sheet or a screen, or worn on the head.


A great deal of attention was drawn to a successful attempt to trick the Face ID system on the iPhone X with a rather elaborate stone-powder mask with special eye cutouts that simulate the warmth of a living person using infrared radiation.



It is alleged that this mask managed to trick Face ID on the iPhone X. The video and some details can be found here


Such vulnerabilities are very dangerous for banking or government systems that authenticate users by face, where an intruder's infiltration entails significant losses.


Terminology


The field of face anti-spoofing is quite new and cannot yet boast even an established terminology.


Let us agree to call an attempt to deceive the identification system by presenting it with a fake biometric parameter (in this case, a face) a spoofing attack.


Accordingly, the set of defensive measures countering such deception will be called anti-spoofing. It can be implemented as a variety of technologies and algorithms embedded in the pipeline of the identification system.


ISO proposes a somewhat extended terminology: presentation attack - an attempt to force the system to misidentify a user, or to let him evade identification, by showing a photo, a recorded video, and so on; normal (bona fide) use - the regular operation of the system, that is, everything that is NOT an attack; presentation attack instrument - the means of carrying out an attack, for example an artificially made body part; and, finally, presentation attack detection - automated means of detecting such attacks. However, the standards themselves are still in development, so one cannot speak of any established concepts yet. There is almost no terminology in Russian at all.


To measure the quality of a system, the HTER metric (Half-Total Error Rate) is often used. It is calculated as the sum of the false acceptance rate (FAR, the share of attackers mistakenly accepted) and the false rejection rate (FRR, the share of genuine users mistakenly rejected), divided in half:

HTER = (FAR + FRR) / 2
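As a quick sanity check, here is a minimal Python illustration of the metric (the FAR and FRR values are made up for the example):

```python
def hter(far: float, frr: float) -> float:
    """Half-Total Error Rate: the mean of the two error rates."""
    return (far + frr) / 2.0

# A hypothetical system that wrongly accepts 2% of attacks (FAR)
# and wrongly rejects 4% of genuine users (FRR):
print(hter(0.02, 0.04))  # 0.03, i.e. HTER = 3%
```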


It should be said that in biometric systems the greatest attention is usually paid to FAR, in order to do everything possible to keep intruders out of the system. And the progress here is good (remember the one-in-a-million figure from the beginning of the article?). The flip side is an inevitable increase in FRR, the number of ordinary users mistakenly classified as intruders. While government, defense, and other similar systems can afford to sacrifice these users, mobile technologies, with their enormous scale, variety of subscriber devices and general user orientation, are very sensitive to any factor that may drive users away from the service. If you want to reduce the number of phones smashed against the wall after the tenth consecutive identification failure, pay attention to FRR!


Types of attacks. Cheating the system



Let's finally find out exactly how attackers cheat recognition systems, and how this can be countered.


The most popular means of deception are masks. There is nothing more obvious than putting on a mask of another person and presenting that face to the identification system (this is often called a mask attack).



You can also print a photo of yourself or someone else on a piece of paper and hold it up to the camera (let's call this type of attack a printed attack).



A slightly more complicated option is the replay attack, when the system is shown the screen of another device playing a pre-recorded video of another person. The complexity of execution is compensated by the high effectiveness of such an attack, since anti-spoofing systems often rely on features based on the analysis of temporal sequences, for example tracking blinks, micro-movements of the head, facial expressions, breathing, and so on. All of this is easily reproduced on video.



Both types of attacks have a number of features that allow them to be detected, and thus the screen of a tablet or a piece of paper can be distinguished from a real person.


Let us summarize the characteristic signs of these two types of attacks:


Printed attack:
  • Reduced image quality when printing
  • Halftoning artifacts from the printer
  • Mechanical printing artifacts (horizontal lines)
  • Lack of local movements (for example, blinks)
  • Image borders may be visible

Replay attack:
  • Moiré patterns
  • Reflections (highlights)
  • Flat picture (no depth)
  • Image borders may be visible

Attack detection algorithms. Good old classics



One of the oldest approaches (works from 2007 and 2008) is based on detecting human blinks by analyzing the image. The point is to build a binary classifier that distinguishes frames with open and closed eyes in a sequence. This can be done by analyzing the video stream with facial landmark detection, or with a simple neural network. This method is still the most widely used today: the user is prompted to perform some sequence of actions - turn the head, wink, smile, and so on. If the sequence is random, it is hard for an attacker to prepare for it. Unfortunately, an honest user does not always pass this quest either, and engagement drops sharply.
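As a rough illustration of the blink-counting idea, here is a minimal sketch built on the well-known eye-aspect-ratio trick (not necessarily what the 2007-2008 papers did; it assumes eye landmarks are already extracted, e.g. with dlib, and the threshold values are illustrative):

```python
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """eye: six (x, y) landmarks around one eye in the usual 68-point
    order. The ratio drops sharply when the eyelid closes."""
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return (v1 + v2) / (2.0 * h)

def count_blinks(ears, closed_thresh=0.2, min_frames=2):
    """Count blinks as runs of consecutive low-EAR frames."""
    blinks, run = 0, 0
    for ear in ears:
        if ear < closed_thresh:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    return blinks
```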



You can also exploit the deterioration of picture quality when an image is printed or played back on a screen. Characteristic local patterns will most likely appear in the image, even if they are subtle. They can be detected, for example, by computing local binary patterns (LBP) for different face zones after cropping the face from the frame (PDF). The described system can be considered the founder of the whole family of face anti-spoofing algorithms based on image analysis. In a nutshell: for each pixel of the image, its eight neighbors are taken in turn and their intensities are compared with the center. If a neighbor's intensity is greater than the central pixel's, it is assigned a one; if less, a zero. Thus an 8-bit code is obtained for each pixel. From these codes a histogram is built, which is fed to the input of an SVM classifier.
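To make the description concrete, here is a minimal NumPy sketch of the basic 8-neighbour LBP and its histogram (a real pipeline, as in the paper, would compute such histograms per face zone, concatenate them, and train an SVM on the result, e.g. with sklearn.svm.SVC):

```python
import numpy as np

def lbp_image(gray: np.ndarray) -> np.ndarray:
    """Each inner pixel becomes an 8-bit code: one bit per neighbour
    whose intensity is >= the centre pixel's intensity."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise order
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.int32) << bit
    return code.astype(np.uint8)

def lbp_histogram(gray: np.ndarray) -> np.ndarray:
    """256-bin normalised histogram of LBP codes, the feature vector
    that goes to the SVM classifier."""
    hist, _ = np.histogram(lbp_image(gray), bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)
```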



Local binary patterns, histogramming, and an SVM. You can join the timeless classics via the link


The HTER of "as much as" 15% means that a significant share of intruders overcome the defense without much effort, although admittedly many are weeded out. The algorithm was tested on the Replay-Attack dataset from IDIAP, which consists of 1200 short videos of 50 respondents and three types of attacks: printed attack, mobile attack, and high-definition attack.


The ideas of image texture analysis were developed further. In 2015, Boulkenafet proposed an alternative algorithm that splits the image into color channels beyond the traditional RGB, computes local binary patterns on the results, and, as in the previous method, feeds them to an SVM classifier. The HTER, calculated on the CASIA and Replay-Attack datasets, was an impressive (for that time) 3%.



At the same time, work was being done on moiré detection. Patel published an article suggesting to look for artifacts of a periodic pattern in the image caused by the overlap of two rasters. The approach turned out to be workable, showing an HTER of about 6% on the IDIAP, CASIA and RAFS datasets. This was also the first attempt to compare the performance of an algorithm across different datasets.
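A crude way to hunt for such periodic patterns (only the general idea, not Patel's actual algorithm) is to look for strong isolated peaks in the 2-D amplitude spectrum away from the low-frequency core:

```python
import numpy as np

def moire_score(gray: np.ndarray) -> float:
    """Ratio of the strongest off-centre spectral peak to the mean
    spectrum energy; recaptured screens tend to score higher."""
    f = np.abs(np.fft.fftshift(np.fft.fft2(gray.astype(float))))
    cy, cx = f.shape[0] // 2, f.shape[1] // 2
    f[cy - 8:cy + 8, cx - 8:cx + 8] = 0  # suppress DC / low frequencies
    return float(f.max() / (f.mean() + 1e-9))
```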



A periodic pattern in the image caused by overlapping rasters


To detect photo presentation attempts, the logical move was to analyze not a single image but a sequence of them taken from the video stream. For example, Anjos and colleagues suggested extracting features from the optical flow on adjacent pairs of frames and feeding them to a binary classifier. The approach proved quite effective, demonstrating an HTER of 1.52% on their own dataset.
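A toy version of such frame-pair features might look like this (OpenCV's Farneback flow; the summary statistics are illustrative, not the features Anjos actually used):

```python
import cv2
import numpy as np

def flow_features(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Dense optical flow between two consecutive frames: a flat sheet
    or screen tends to move as a rigid whole, a live face does not."""
    ga = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gb = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(ga, gb, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Crude summary statistics for a downstream binary classifier.
    return np.array([mag.mean(), mag.std(), ang.std()])
```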



An interesting motion-tracking method stands somewhat apart from the generally accepted approaches. Since in 2013 the principle of "feed the raw image to a convolutional network and tune the layers until you get the result", now usual for deep learning projects, was not yet standard, Bharadwaj applied successively more complex preliminary transformations. In particular, he used Eulerian video magnification, known from the work of MIT scientists, which had been successfully applied to analyzing skin color changes driven by the pulse. He replaced LBP with HOOF (histograms of oriented optical flow), correctly noting that if we want to track motion, we need motion features, not just texture analysis. The classifier was the SVM, traditional at the time. The algorithm showed extremely impressive results on Print Attack (0%) and Replay Attack (1.25%).
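The HOOF descriptor itself is simple; one common formulation (the paper's exact binning and normalisation may differ) is a magnitude-weighted histogram of flow directions:

```python
import numpy as np

def hoof(flow: np.ndarray, bins: int = 8) -> np.ndarray:
    """flow: H x W x 2 array of (dx, dy) vectors, e.g. from Farneback.
    Returns a normalised, magnitude-weighted histogram of directions."""
    mag = np.hypot(flow[..., 0], flow[..., 1])
    ang = np.arctan2(flow[..., 1], flow[..., 0])  # range -pi..pi
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi),
                           weights=mag)
    return hist / max(hist.sum(), 1e-9)
```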



Let's train the networks already!



From a certain point it became obvious that the time for deep learning had come. The notorious "deep learning revolution" reached face anti-spoofing as well.


The "first swallow" can be considered the method of analyzing depth maps for individual areas ("patches") of an image. Obviously, a depth map is a very good feature for determining whether the image lies in a plane, if only because an image on a sheet of paper has no "depth" by definition. In Atoum's 2017 work, many small patches were extracted from the image, depth maps were computed for them, and then merged with the depth map of the whole image. It was pointed out that ten random patches of the face image are enough to reliably detect a printed attack. Additionally, the authors fused the outputs of two convolutional neural networks, the first computing depth maps for the patches and the second for the image as a whole. During training, the printed attack class was assigned a depth map of all zeros, while live samples were associated with a 3D face model. By and large, the depth map itself was not that important; it served as an indicator function characterizing the "depth of the patch". The algorithm showed an HTER of 3.78%. Three public datasets were used for training: CASIA-MFSD, MSU-USSA and Replay-Attack.
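The data-preparation side of this idea is easy to sketch (a toy approximation: patch sampling plus the all-zero depth target for spoofs; producing live depth targets requires fitting a 3D face model and is omitted here):

```python
import numpy as np

def random_patches(face: np.ndarray, n: int = 10, size: int = 96,
                   seed: int = 0) -> list:
    """Sample n random square patches from an aligned face crop
    (assumed larger than the patch size); the paper reports ~10
    patches suffice to flag a printed attack."""
    rng = np.random.default_rng(seed)
    h, w = face.shape[:2]
    ys = rng.integers(0, h - size, n)
    xs = rng.integers(0, w - size, n)
    return [face[y:y + size, x:x + size] for y, x in zip(ys, xs)]

# Training target for the spoof class: a depth map of all zeros,
# because a sheet of paper or a screen has no depth by definition.
spoof_depth_target = np.zeros((32, 32), dtype=np.float32)
```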



Unfortunately, the availability of a large number of excellent deep learning frameworks has produced a huge number of developers trying to solve face anti-spoofing in the familiar way of stacking neural networks. It usually looks like a stack of feature maps from the outputs of several networks, pre-trained on some widely used dataset, fed into a binary classifier.



In general, it is worth stating that quite a lot of works have been published by now which show generally good results and which are united by one small "but": all these results are demonstrated within one specific dataset! The situation is aggravated by the limitations of the available datasets; on the notorious Replay-Attack, for example, nobody is surprised by an HTER of 0% anymore. All this leads to very complex architectures, such as this one, with assorted tricky features, stacked auxiliary algorithms, several classifiers whose outputs are averaged, and so on... And at the output the authors get HTER = 0.04%!



This suggests that the face anti-spoofing task has been solved within individual datasets. Let us tabulate the various modern neural-network-based methods. As is easy to see, the "reference results" were achieved by very diverse methods, born in the inquisitive minds of developers.



Comparative results of various algorithms. The table is taken from here.


Unfortunately, one "small" factor spoils this idyllic picture of the fight for tenths of a percent. If you train a neural network on one dataset and apply it to another, the results will be... not so optimistic. Worse, attempts to apply these classifiers in real life leave no hope at all.
For example, take a 2015 work that used the true positive rate as its quality metric for determining the authenticity of the presented image. See for yourself:



In other words, an algorithm trained on the Idiap data and applied to MSU gives a true positive rate of 90.5%, but if you do the opposite (train on MSU and test on Idiap), you will correctly identify only 47.2% (!). For other combinations the situation is even worse: for example, train the algorithm on MSU and test on CASIA, and the TPR drops to 10.8%! This means that a huge share of honest users are mistakenly labeled as attackers, which cannot but be depressing. Even cross-database training, which seems a quite reasonable way out, could not turn the situation around.
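Checking this generalisation gap is straightforward; a skeleton of such a cross-dataset evaluation (the dataset names and the train/eval callables are placeholders) might look like this:

```python
from itertools import permutations

def cross_dataset_matrix(datasets: dict, train_fn, tpr_fn) -> dict:
    """Train on each dataset and test on every other one; the gap
    between same-dataset and cross-dataset TPR is exactly the
    problem discussed above."""
    results = {}
    for src, dst in permutations(datasets, 2):
        model = train_fn(datasets[src])
        results[(src, dst)] = tpr_fn(model, datasets[dst])
    return results

# Usage sketch:
# cross_dataset_matrix({"Idiap": idiap, "MSU": msu, "CASIA": casia},
#                      train, evaluate_tpr)
```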


Let's look further. The results given in Patel's 2016 article show that even with fairly complex processing pipelines and the selection of such reliable features as blinking and texture, the results on unfamiliar datasets cannot be considered satisfactory. So at some point it became quite obvious that the proposed methods were desperately lacking in generalization ability.



And what if we hold a competition...


Of course, the field of face anti-spoofing has not gone without competitions either. In 2017, the University of Oulu in Finland held a competition on its own new dataset, with quite interesting protocols oriented, in fact, toward use in mobile applications.


- Protocol 1: Variation in lighting and background. The recordings are made in different places and differ in background and lighting.


- Protocol 2: Different models of printers and screens are used for the attacks, so the test set contains hardware that does not appear in the training set.


- Protocol 3: Sensor interchangeability. Videos of the genuine user and of the attacks are recorded on five different smartphones and used in the training set. For testing, video from another smartphone, not included in the training set, is used.


- Protocol 4: Combines all of the above factors.


The results were quite unexpected. As in any competition, there was no time for brilliant new ideas, so almost all participants took familiar architectures and refined them with fine-tuning, feature engineering and attempts to somehow use other datasets for training. The winning solution showed an error of about 10% on the fourth, most difficult protocol. A brief description of the winners' algorithms is given below:


  1. GRADIANT


    • Fusion of color (using the HSV and YCbCr color spaces), texture and motion features is performed.
    • Dynamics information is extracted from the video sequence and from maps of temporal change within a frame.
    • This processing is applied separately to all channels in the HSV and YCbCr color spaces, together giving a pair of three-channel images. For each image, an ROI (region of interest) is cropped based on the eye positions in the frame sequence and scaled to 160 × 160 pixels.
    • Each ROI is divided into 3 × 3 and 5 × 5 rectangular grids, which are used to extract uniform LBP histograms, combined into two feature vectors of dimension 6018.
    • Using Recursive Feature Elimination, the dimension is reduced from 6018 to 1000.
    • For each feature vector, an SVM-based classification is performed, followed by averaging.

  2. SZCVI


    • A sample of frames is extracted from each video, taking every sixth frame
    • Frames are scaled down to 216 × 384
    • Five VGG-like layers
    • The results of individual frames within the sample are averaged

  3. Recod


    • SqueezeNet pre-trained on ImageNet
    • Transfer learning on two datasets: CASIA and UVAD
    • First, the face is detected and scaled to 224 × 224 pixels. Every seventh frame is extracted from each video of the training dataset and sent to ten CNNs.
    • To get the final result, the scores of individual frames are averaged.
    • To improve performance, the obtained scores are fused with the result of a baseline method

  4. CPqD


    • Inception-v3 network pre-trained on ImageNet
    • Sigmoid activation function
    • Based on the eye positions, image regions containing the face are cropped and then scaled to 224 × 224 RGB frames.


It is clear that not many new ideas appeared: the same LBP, pre-trained networks, texture and color analysis, pairwise frame analysis, and so on. GRADIANT looks the most competently engineered from a systems point of view: it mixes various features, works in different color spaces, and performs feature selection. It won the competition.
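To make the GRADIANT recipe more tangible, here is a heavily simplified sketch of its core descriptor (uniform LBP histograms in HSV and YCbCr; the real solution also tiles the ROI into 3 × 3 and 5 × 5 grids, builds 6018-dimensional vectors and prunes them with recursive feature elimination, all omitted here):

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def color_lbp_features(bgr_roi: np.ndarray) -> np.ndarray:
    """Uniform-LBP histograms per channel in two colour spaces,
    concatenated into one feature vector for an SVM."""
    feats = []
    for space in (cv2.COLOR_BGR2HSV, cv2.COLOR_BGR2YCrCb):
        for ch in cv2.split(cv2.cvtColor(bgr_roi, space)):
            codes = local_binary_pattern(ch, P=8, R=1, method="uniform")
            hist, _ = np.histogram(codes, bins=10, range=(0, 10))
            feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)  # 6 channels x 10 bins = 60 dims
```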


The competition very clearly exposed the existing limitations. Above all, the available training datasets are limited and unbalanced. First, they contain a fairly small number of subjects (from 15 people in NUAA to 1140 in MSU-USSA) and sessions, and limited variation in external lighting, facial expressions, recording devices, shooting angles and attack types. Meanwhile, in real-world conditions, the camera model, sensor quality, shooting conditions, focal length and shutter speed, the background and the setting are often decisive for image analysis. Second, the analysis methods themselves focus on individual regions of the image without any significant processing of the scene itself. For example, in the CASIA set many attack examples show a person holding a photo in front of himself. Obviously, the characteristic position of the hands, the borders of the photo sheet, the hair, neck and head could all be exploited... But there were no solutions analyzing the whole scene and the person's pose; all the algorithms worked only with the face region cropped from the scene.



Another promising competition was recently announced on a new, 30 GB, purpose-built dataset. According to the competition's conditions, one must detect a mask, a printed photo, and a video replayed on a screen instead of a real face. Quite possibly its results will bring a conceptually new solution.


Of course, there are also solutions based on "non-standard approaches", to which we turn in the hope of improving the current state of affairs. For example, it was suggested (Siqi Liu et al., ECCV 2018) to use remote photoplethysmography (rPPG). The idea is that when light hits the living face of a person, part of it is reflected, part is scattered, and part is absorbed by the skin and facial tissues, and the picture differs depending on how much the tissue is filled with blood. Thus it is possible to track the pulsation of blood in the vessels of the face and, accordingly, detect the pulse. Naturally, if the face is covered with a mask, or a phone screen is shown instead, no pulsation will be detected. Based on this principle, Liu and co-authors suggested splitting the face image into regions, detecting the pulse in each via remote photoplethysmography, comparing the pulse estimates of different regions pairwise, and building correspondence maps to detect the presence or absence of a mask.
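The core of the rPPG idea fits in a few lines: average the green channel of a skin region over time and look for a spectral peak in the physiological band (a deliberately simplified sketch; a mask or a screen produces no coherent peak there):

```python
import numpy as np

def pulse_bpm(green_means: np.ndarray, fps: float) -> float:
    """green_means: mean green-channel intensity of a face ROI, one
    value per frame. Blood volume changes modulate skin colour."""
    sig = green_means - green_means.mean()
    spectrum = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fps)
    band = (freqs > 0.7) & (freqs < 3.0)  # ~40-180 beats per minute
    return float(freqs[band][np.argmax(spectrum[band])] * 60.0)
```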




The work showed an HTER of about 10%, confirming the fundamental applicability of the method. There are several more works confirming the viability of this approach:
(CVPR Workshops 2018) J. Hernandez-Ortega et al. Time Analysis of Pulse-based Face Anti-Spoofing in Visible and NIR
(ICPR 2016) X. Li et al. Generalized Face Anti-Spoofing by Detecting Pulse from Face Videos
(2016) J. Chen et al. RealSense = Real Heart Rate: Illumination Invariant Heart Rate Estimation from Videos
(2014) H. E. Tasli et al. Remote PPG Based Vital Sign Measurement Using Adaptive Facial Regions


In 2018, Liu and colleagues from Michigan State University proposed rejecting binary classification in favor of an approach they called auxiliary supervision: a richer training signal based on the depth map and remote photoplethysmography. For each real face image, a three-dimensional model was reconstructed with neural networks and matched with a depth map. Fake images were assigned a depth map of all zeros; in the end, they are just a sheet of paper or a device screen! These targets were taken as ground truth, and the networks were trained on the authors' own SiW dataset. Then, for an input image, a three-dimensional face mesh was fitted, a depth map and a pulse signal were estimated for it, and all of this was tied together in a rather complex pipeline. As a result, the method showed an error of about 10% on the OULU competition dataset. Interestingly, the winner of the University of Oulu competition built its algorithm on binary classification, blink tracking, and other "hand-crafted" features, and its solution also had an error of about 10%; the gain was only about half a percent! The benefit of the new combined technique is that the algorithm was trained on the authors' own dataset and tested on OULU, improving on the winner's result. That indicates some portability of results from dataset to dataset and, who knows, maybe even to real life. However, when training on other datasets, CASIA and Replay-Attack, the error was again about 28%. This is still better than other algorithms achieve across datasets, but with such accuracy there can be no talk of any industrial use!



Another approach was proposed by Wang and colleagues in a fresh 2019 work. They noticed that the micro-movements of the face include noticeable turns and displacements of the head, leading to characteristic changes in the angles and relative distances between facial landmarks. For example, when the face shifts horizontally, the angle between the nose and the ear increases. But if you shift a sheet of paper with a picture in the same way, the angle decreases! For illustration, a drawing from the paper is worth quoting.



On this principle, the authors built a learning block for transferring data between layers of a neural network. It accounted for the "wrong displacements" in each pair of adjacent frames, and the results were used in the next block, a long-term dependency analysis based on a GRU (Gated Recurrent Unit). Then all features were concatenated, the loss function was computed, and the final classification was performed. This slightly improved the result on the OULU dataset, but the dependence on the training data remained: the errors on CASIA-MFSD and Replay-Attack were 17.5 and 24 percent, respectively.
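The geometric cue itself is easy to express (a toy check, not the paper's actual network block; the landmark indices follow the common 68-point convention and are an assumption):

```python
import numpy as np

def nose_ear_angle(nose_xy, ear_xy) -> float:
    """Angle of the nose-to-ear vector in degrees."""
    v = np.asarray(ear_xy, dtype=float) - np.asarray(nose_xy, dtype=float)
    return float(np.degrees(np.arctan2(v[1], v[0])))

def angle_change(lm_prev, lm_next, nose_idx=30, ear_idx=16) -> float:
    """How the angle changes between two frames: under a horizontal
    head shift it grows for a 3D head but shrinks for a flat sheet."""
    return (nose_ear_angle(lm_next[nose_idx], lm_next[ear_idx])
            - nose_ear_angle(lm_prev[nose_idx], lm_prev[ear_idx]))
```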


Finally, it is worth noting the work by Tencent, who suggested changing the very way the source video is captured. Instead of passively observing the scene, they proposed dynamically illuminating the face and reading the reflections. The principle of actively irradiating an object has long been used in location systems of various kinds, so its use for studying a face looks quite logical. Evidently the image alone does not carry enough reliable cues, and lighting the face from the screen of a phone or tablet with a sequence of light symbols (a "light CAPTCHA" in the authors' terminology) can help a lot. Next, the difference in scattering and reflection over pairs of frames is computed, and the results are fed to a multi-task neural network for further processing with a depth map and several loss functions. At the end, a regression of the normalized light frames is performed. The authors did not analyze the generalization ability of their algorithm on other datasets and trained it on their own closed dataset. The reported error is about 1%, and the model is said to have already been deployed for real use.
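The intuition behind such a "light CAPTCHA" can be sketched as follows (nothing like Tencent's actual pipeline, just the idea: on a real, light-reflecting face the observed frame-to-frame colour changes should correlate with the colour sequence the screen flashed):

```python
import numpy as np

def light_captcha_score(frames: list, colors: list) -> float:
    """frames: face crops (H x W x 3), one captured per flashed colour;
    colors: the RGB triples the screen showed. Returns the correlation
    between observed and expected reflection changes."""
    obs = [np.abs(frames[i + 1].astype(float)
                  - frames[i].astype(float)).mean(axis=(0, 1))
           for i in range(len(frames) - 1)]
    exp = [np.abs(np.asarray(colors[i + 1], dtype=float)
                  - np.asarray(colors[i], dtype=float))
           for i in range(len(colors) - 1)]
    return float(np.corrcoef(np.ravel(obs), np.ravel(exp))[0, 1])
```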



Until 2017, the face anti-spoofing field was not very active. But 2019 alone has already brought a whole series of works, which is connected with the aggressive promotion of mobile face identification technologies, first of all by Apple. Banks are interested in face recognition as well. Many new people have come into the field, which gives hope for rapid progress. But so far, despite the beautiful titles of publications, the generalization ability of the algorithms remains very weak and does not allow speaking of any fitness for practical use.


Conclusion. And finally, I will say that ...


  • Local binary patterns, blink tracking, breathing, movements, and other hand-designed features have not yet lost their relevance. This is caused, first of all, by the fact that deep learning in face anti-spoofing is still very naive.
  • It is quite obvious that "that very" solution will merge several methods. Analysis of reflection, scattering, and depth maps should be used together. Most likely, an additional data channel, for example a voice recording, will help, as will system-level approaches that assemble several technologies into a single whole.
  • Almost all technologies used for face recognition are also used in face anti-spoofing (thanks, Captain Obvious!). Everything developed for face recognition has been applied in one form or another to attack analysis.
  • Existing datasets have reached saturation: on five of the ten basic datasets, zero error has already been achieved. This speaks to the capability of, for example, methods based on depth maps, but it gives no way to improve generalization. We need new data and new experiments on it.
  • There is a clear imbalance between the maturity of face recognition and of face anti-spoofing. Recognition technologies are significantly ahead of the protection systems. Moreover, it is precisely the lack of reliable protection that holds back the practical use of face recognition systems. It just so happened that the focus was on recognition, and attack detection was left somewhat aside.
  • A systematic approach is sorely needed in face anti-spoofing. The last University of Oulu competition showed that on a non-representative dataset it is quite possible to win by simply tuning well-established solutions, without developing new ones. Perhaps a new competition can turn the tide.
  • With growing interest in the topic and the rollout of face recognition technologies by major players, "windows of opportunity" have opened for new ambitious teams, since a new solution at the architecture level is seriously needed.

Source text: Face Anti-Spoofing or technologically we recognize a cheater from a thousand by the face