FaceNet is a facial recognition system developed by Florian Schroff, Dmitry Kalenichenko and James Philbin, a group of researchers affiliated with Google. The system was first presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition.[1] The system uses a deep convolutional neural network to learn a mapping (also called an embedding) from a set of face images to a 128-dimensional Euclidean space, and assesses the similarity between faces based on the square of the Euclidean distance between the images' corresponding normalized embedding vectors. The system uses the triplet loss as its cost function and introduced a new online triplet mining method. The system achieved an accuracy of 99.63%, which is the highest score to date on the Labeled Faces in the Wild dataset under the unrestricted with labeled outside data protocol.
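For illustration, the following Python sketch shows how such a verification decision can be made from two embeddings; the normalization helper, the threshold value, and the variable names are illustrative assumptions rather than part of the published system.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a raw 128-dimensional embedding to unit Euclidean norm."""
    return v / np.linalg.norm(v)

def same_identity(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 1.1) -> bool:
    """Decide whether two embeddings depict the same person.

    FaceNet-style verification: two faces match when the squared Euclidean
    distance between their normalized embeddings falls below a threshold.
    The threshold value used here (1.1) is an illustrative assumption.
    """
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    squared_distance = np.sum((a - b) ** 2)
    return bool(squared_distance < threshold)

# Toy usage with random vectors standing in for network outputs.
rng = np.random.default_rng(0)
emb1, emb2 = rng.normal(size=128), rng.normal(size=128)
print(same_identity(emb1, emb2))
```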
The structure of FaceNet is represented schematically in Figure 1.
For training, researchers used input batches of about 1800 images. For each identity represented in the input batches, there were 40 similar images of that identity and several randomly selected images of other identities. These batches were fed to a deep convolutional neural network, which was trained using stochastic gradient descent with standard backpropagation and the Adaptive Gradient Optimizer (AdaGrad) algorithm. The learning rate was initially set at 0.05, which was later lowered while finalizing the model.
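The optimizer setup described above can be sketched as follows in PyTorch; the placeholder model and the absence of a concrete learning-rate schedule are assumptions, since only the initial learning rate of 0.05 is specified.

```python
import torch

# Placeholder embedding network standing in for the deep CNN; the actual
# architectures (NN1/NN2) are described below.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(220 * 220 * 3, 128),
)

# AdaGrad with the reported initial learning rate of 0.05. The learning
# rate was later lowered to finalize the model; the exact schedule is not
# specified, so any decay added here would be an assumption.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

def training_step(batch: torch.Tensor, loss_fn) -> float:
    """One stochastic gradient descent step with standard backpropagation."""
    optimizer.zero_grad()
    embeddings = torch.nn.functional.normalize(model(batch), p=2, dim=1)
    loss = loss_fn(embeddings)
    loss.backward()
    optimizer.step()
    return loss.item()
```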
The researchers used two types of architectures, which they called NN1 and NN2, and explored their trade-offs. The practical difference between the models lies in their number of parameters and FLOPS. The details of the NN1 model are presented in the table below.
Layer | Size-in (rows × cols × #filters) | Size-out (rows × cols × #filters) | Kernel (rows × cols, stride) | Parameters | FLOPS
---|---|---|---|---|---
conv1 | 220×220×3 | 110×110×64 | 7×7×3, 2 | 9K | 115M
pool1 | 110×110×64 | 55×55×64 | 3×3×64, 2 | 0 | —
rnorm1 | 55×55×64 | 55×55×64 | — | 0 | —
conv2a | 55×55×64 | 55×55×64 | 1×1×64, 1 | 4K | 13M
conv2 | 55×55×64 | 55×55×192 | 3×3×64, 1 | 111K | 335M
rnorm2 | 55×55×192 | 55×55×192 | — | 0 | —
pool2 | 55×55×192 | 28×28×192 | 3×3×192, 2 | 0 | —
conv3a | 28×28×192 | 28×28×192 | 1×1×192, 1 | 37K | 29M
conv3 | 28×28×192 | 28×28×384 | 3×3×192, 1 | 664K | 521M
pool3 | 28×28×384 | 14×14×384 | 3×3×384, 2 | 0 | —
conv4a | 14×14×384 | 14×14×384 | 1×1×384, 1 | 148K | 29M
conv4 | 14×14×384 | 14×14×256 | 3×3×384, 1 | 885K | 173M
conv5a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M
conv5 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M
conv6a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M
conv6 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M
pool4 | 14×14×256 | 7×7×256 | 3×3×256, 2 | 0 | —
concat | 7×7×256 | 7×7×256 | — | 0 | —
fc1 | 7×7×256 | 1×32×128 | maxout p=2 | 103M | 103M
fc2 | 1×32×128 | 1×32×128 | maxout p=2 | 34M | 34M
fc7128 | 1×32×128 | 1×1×128 | — | 524K | 0.5M
L2 | 1×1×128 | 1×1×128 | — | 0 | —
Total | | | | 140M | 1.6B
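The first rows of the table can be read directly as a stack of convolution, pooling, and local response normalization layers. The following partial PyTorch sketch covers conv1 through pool2; the padding values and the normalization window size are assumptions chosen so that the spatial sizes match the size-out column, and the remaining rows follow the same pattern.

```python
import torch.nn as nn

# Partial reconstruction of the NN1 table (conv1 .. pool2).
# Kernel sizes, strides, and filter counts are taken from the table;
# padding values and the LRN window size are assumptions.
nn1_head = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # conv1: 220×220×3 -> 110×110×64
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # pool1: -> 55×55×64
    nn.LocalResponseNorm(size=5),                             # rnorm1 (window size assumed)
    nn.Conv2d(64, 64, kernel_size=1, stride=1),               # conv2a: 1×1 bottleneck, 55×55×64
    nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),   # conv2: -> 55×55×192
    nn.LocalResponseNorm(size=5),                              # rnorm2
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),          # pool2: -> 28×28×192
)
```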
FaceNet introduced a novel loss function called "triplet loss". This function is defined on triplets of training images of the form $(A, P, N)$. In each triplet, $A$ (the "anchor" image) is an image of a person's face, $P$ (the "positive" image) is another image of the same person as in $A$, and $N$ (the "negative" image) is an image of a person different from the one shown in $A$ and $P$.
Let $x$ be a face image and let $f(x)$ denote its embedding in the 128-dimensional Euclidean space. The embeddings are normalized so that $\Vert f(x)\Vert_2 = 1$, where $\Vert X\Vert_2$ denotes the Euclidean norm of a vector $X$. Suppose $m$ triplets $(A^{(i)}, P^{(i)}, N^{(i)})$, $i = 1, \dots, m$, are chosen from the training set. The network is trained so that, for every triplet,

$\Vert f(A^{(i)}) - f(P^{(i)})\Vert_2^2 + \alpha < \Vert f(A^{(i)}) - f(N^{(i)})\Vert_2^2$
The variable $\alpha$ is a hyperparameter called the margin; its value must be set manually, and in FaceNet it was set to 0.2.
Thus, the full form of the function to be minimized is the following function, which is officially called the triplet loss function:
$L = \sum_{i=1}^{m} \max\!\Big( \Vert f(A^{(i)}) - f(P^{(i)})\Vert_2^2 - \Vert f(A^{(i)}) - f(N^{(i)})\Vert_2^2 + \alpha,\ 0 \Big)$
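The formula can be rendered directly in code. The following PyTorch sketch computes the triplet loss for a batch of anchor, positive, and negative embeddings; the default margin of 0.2 matches the value quoted above, while the function name and tensor shapes are illustrative.

```python
import torch

def triplet_loss(anchor: torch.Tensor,
                 positive: torch.Tensor,
                 negative: torch.Tensor,
                 alpha: float = 0.2) -> torch.Tensor:
    """Triplet loss as defined above.

    anchor, positive, negative: (m, 128) tensors of L2-normalized
    embeddings f(A(i)), f(P(i)), f(N(i)); alpha is the margin.
    """
    pos_dist = torch.sum((anchor - positive) ** 2, dim=1)  # ||f(A) - f(P)||^2
    neg_dist = torch.sum((anchor - negative) ** 2, dim=1)  # ||f(A) - f(N)||^2
    # Sum over triplets of the hinge term max(pos - neg + alpha, 0).
    return torch.sum(torch.clamp(pos_dist - neg_dist + alpha, min=0.0))
```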
In general, the number of triplets of the form $(A^{(i)}, P^{(i)}, N^{(i)})$ that can be formed from the training data is very large, so only triplets that violate the constraint above are used for training. For a given anchor image $A^{(i)}$, the hard positive is the image $P^{(i)}$ of the same person that maximizes $\Vert f(A^{(i)}) - f(P^{(i)})\Vert_2^2$, and the hard negative is the image $N^{(i)}$ of a different person that minimizes $\Vert f(A^{(i)}) - f(N^{(i)})\Vert_2^2$. Because searching the entire training set for hard positives and negatives is computationally infeasible, these triplets are selected online, within each mini-batch.
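A simplified version of this within-batch selection (often called "batch hard" mining) can be sketched as follows; the function name and tie-breaking behaviour are assumptions, and the paper explores several mining variants that differ in detail.

```python
import torch

def mine_hard_triplets(embeddings: torch.Tensor, labels: torch.Tensor):
    """Select hard positives and hard negatives within one mini-batch.

    embeddings: (n, 128) L2-normalized embeddings.
    labels:     (n,) integer identity labels.
    Returns index tensors (anchor_idx, positive_idx, negative_idx).
    """
    # Pairwise squared Euclidean distances between all embeddings.
    dists = torch.cdist(embeddings, embeddings, p=2) ** 2
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool)

    anchors, positives, negatives = [], [], []
    for i in range(len(labels)):
        pos_mask = same[i] & ~eye[i]   # other images of the same person
        neg_mask = ~same[i]            # images of different people
        if not pos_mask.any() or not neg_mask.any():
            continue                   # no valid triplet for this anchor
        # Hard positive: largest distance among same-identity pairs.
        positives.append(torch.where(pos_mask)[0][dists[i][pos_mask].argmax()])
        # Hard negative: smallest distance among different-identity pairs.
        negatives.append(torch.where(neg_mask)[0][dists[i][neg_mask].argmin()])
        anchors.append(torch.tensor(i))
    if not anchors:
        raise ValueError("batch contains no valid (anchor, positive, negative) triplet")
    return torch.stack(anchors), torch.stack(positives), torch.stack(negatives)
```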
On the widely used Labeled Faces in the Wild (LFW) dataset, the FaceNet system achieved an accuracy of 99.63%, which is the highest score on LFW under the unrestricted with labeled outside data protocol.[2] On YouTube Faces DB the system achieved an accuracy of 95.12%.[1]