FaceNet is a facial recognition system developed by Florian Schroff, Dmitry Kalenichenko and James Philbin, a group of researchers affiliated with Google. The system was first presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition.[1] The system uses a deep convolutional neural network to learn a mapping (also called an embedding) from a set of face images to a 128-dimensional Euclidean space, and assesses the similarity between faces based on the square of the Euclidean distance between the images' corresponding normalized embedding vectors. The system uses the triplet loss as its cost function and introduced a new online triplet mining method. The system achieved an accuracy of 99.63%, which is the highest score to date on the Labeled Faces in the Wild dataset under the unrestricted with labeled outside data protocol.
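For illustration, the following Python sketch shows how such a verification decision can be made from two embeddings; the normalization helper, the threshold value, and the variable names are illustrative assumptions rather than part of the published system.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a raw 128-dimensional embedding to unit Euclidean norm."""
    return v / np.linalg.norm(v)

def same_identity(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 1.1) -> bool:
    """Decide whether two embeddings depict the same person.

    FaceNet-style verification: two faces match when the squared Euclidean
    distance between their normalized embeddings falls below a threshold.
    The threshold value used here (1.1) is an illustrative assumption.
    """
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    squared_distance = np.sum((a - b) ** 2)
    return bool(squared_distance < threshold)

# Toy usage with random vectors standing in for network outputs.
rng = np.random.default_rng(0)
emb1, emb2 = rng.normal(size=128), rng.normal(size=128)
print(same_identity(emb1, emb2))
```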
The structure of FaceNet is represented schematically in Figure 1.
For training, researchers used input batches of about 1800 images. For each identity represented in the input batches, there were 40 similar images of that identity and several randomly selected images of other identities. These batches were fed to a deep convolutional neural network, which was trained using stochastic gradient descent with standard backpropagation and the Adaptive Gradient Optimizer (AdaGrad) algorithm. The learning rate was initially set at 0.05, which was later lowered while finalizing the model.
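The optimizer setup described above can be sketched as follows in PyTorch; the placeholder model and the absence of a concrete learning-rate schedule are assumptions, since only the initial learning rate of 0.05 is specified.

```python
import torch

# Placeholder embedding network standing in for the deep CNN; the actual
# architectures (NN1/NN2) are described below.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(220 * 220 * 3, 128),
)

# AdaGrad with the reported initial learning rate of 0.05. The learning
# rate was later lowered to finalize the model; the exact schedule is not
# specified, so any decay added here would be an assumption.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

def training_step(batch: torch.Tensor, loss_fn) -> float:
    """One stochastic gradient descent step with standard backpropagation."""
    optimizer.zero_grad()
    embeddings = torch.nn.functional.normalize(model(batch), p=2, dim=1)
    loss = loss_fn(embeddings)
    loss.backward()
    optimizer.step()
    return loss.item()
```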
The researchers used two types of architectures, which they called NN1 and NN2, and explored their trade-offs. The practical difference between the models lies in their number of parameters and FLOPS. The details of the NN1 model are presented in the table below.
Layer | Size-in (rows × cols × #filters) | Size-out (rows × cols × #filters) | Kernel (rows × cols, stride) | Parameters | FLOPS
---|---|---|---|---|---
conv1 | 220×220×3 | 110×110×64 | 7×7×3, 2 | 9K | 115M
pool1 | 110×110×64 | 55×55×64 | 3×3×64, 2 | 0 | —
rnorm1 | 55×55×64 | 55×55×64 | — | 0 | —
conv2a | 55×55×64 | 55×55×64 | 1×1×64, 1 | 4K | 13M
conv2 | 55×55×64 | 55×55×192 | 3×3×64, 1 | 111K | 335M
rnorm2 | 55×55×192 | 55×55×192 | — | 0 | —
pool2 | 55×55×192 | 28×28×192 | 3×3×192, 2 | 0 | —
conv3a | 28×28×192 | 28×28×192 | 1×1×192, 1 | 37K | 29M
conv3 | 28×28×192 | 28×28×384 | 3×3×192, 1 | 664K | 521M
pool3 | 28×28×384 | 14×14×384 | 3×3×384, 2 | 0 | —
conv4a | 14×14×384 | 14×14×384 | 1×1×384, 1 | 148K | 29M
conv4 | 14×14×384 | 14×14×256 | 3×3×384, 1 | 885K | 173M
conv5a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M
conv5 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M
conv6a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M
conv6 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M
pool4 | 14×14×256 | 7×7×256 | 3×3×256, 2 | 0 | —
concat | 7×7×256 | 7×7×256 | — | 0 | —
fc1 | 7×7×256 | 1×32×128 | maxout p=2 | 103M | 103M
fc2 | 1×32×128 | 1×32×128 | maxout p=2 | 34M | 34M
fc7128 | 1×32×128 | 1×1×128 | — | 524K | 0.5M
L2 | 1×1×128 | 1×1×128 | — | 0 | —
Total | | | | 140M | 1.6B
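The first rows of the table can be read directly as a stack of convolution, pooling, and local response normalization layers. The following partial PyTorch sketch covers conv1 through pool2; the padding values and the normalization window size are assumptions chosen so that the spatial sizes match the size-out column, and the remaining rows follow the same pattern.

```python
import torch.nn as nn

# Partial reconstruction of the NN1 table (conv1 .. pool2).
# Kernel sizes, strides, and filter counts are taken from the table;
# padding values and the LRN window size are assumptions.
nn1_head = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # conv1: 220×220×3 -> 110×110×64
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # pool1: -> 55×55×64
    nn.LocalResponseNorm(size=5),                             # rnorm1 (window size assumed)
    nn.Conv2d(64, 64, kernel_size=1, stride=1),               # conv2a: 1×1 bottleneck, 55×55×64
    nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),   # conv2: -> 55×55×192
    nn.LocalResponseNorm(size=5),                              # rnorm2
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),          # pool2: -> 28×28×192
)
```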
FaceNet introduced a novel loss function called "triplet loss". This function is defined on triplets of training images of the form $(A, P, N)$. In each triplet, $A$ (the "anchor" image) is an image of a person's face, $P$ (the "positive" image) is another image of the same person as in $A$, and $N$ (the "negative" image) is an image of a person different from the one shown in $A$ and $P$.
Let $x$ be a face image and let $f(x)$ denote its embedding in the 128-dimensional Euclidean space. The embeddings are normalized so that $\Vert f(x)\Vert_2 = 1$, where $\Vert X\Vert_2$ denotes the Euclidean norm of a vector $X$. Suppose $m$ triplets $(A^{(i)}, P^{(i)}, N^{(i)})$, $i = 1, \dots, m$, are chosen from the training set. The network is trained so that, for every triplet,

$\Vert f(A^{(i)}) - f(P^{(i)})\Vert_2^2 + \alpha < \Vert f(A^{(i)}) - f(N^{(i)})\Vert_2^2$
The variable $\alpha$ is a hyperparameter called the margin; its value must be set manually, and in FaceNet it was set to 0.2.
Thus, the full form of the function to be minimized is the following function, which is officially called the triplet loss function:
$L = \sum_{i=1}^{m} \max\!\Big( \Vert f(A^{(i)}) - f(P^{(i)})\Vert_2^2 - \Vert f(A^{(i)}) - f(N^{(i)})\Vert_2^2 + \alpha,\ 0 \Big)$
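The formula can be rendered directly in code. The following PyTorch sketch computes the triplet loss for a batch of anchor, positive, and negative embeddings; the default margin of 0.2 matches the value quoted above, while the function name and tensor shapes are illustrative.

```python
import torch

def triplet_loss(anchor: torch.Tensor,
                 positive: torch.Tensor,
                 negative: torch.Tensor,
                 alpha: float = 0.2) -> torch.Tensor:
    """Triplet loss as defined above.

    anchor, positive, negative: (m, 128) tensors of L2-normalized
    embeddings f(A(i)), f(P(i)), f(N(i)); alpha is the margin.
    """
    pos_dist = torch.sum((anchor - positive) ** 2, dim=1)  # ||f(A) - f(P)||^2
    neg_dist = torch.sum((anchor - negative) ** 2, dim=1)  # ||f(A) - f(N)||^2
    # Sum over triplets of the hinge term max(pos - neg + alpha, 0).
    return torch.sum(torch.clamp(pos_dist - neg_dist + alpha, min=0.0))
```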
In general, the number of triplets of the form $(A^{(i)}, P^{(i)}, N^{(i)})$ that can be formed from the training data is very large, so only triplets that violate the constraint above are used for training. For a given anchor image $A^{(i)}$, the hard positive is the image $P^{(i)}$ of the same person that maximizes $\Vert f(A^{(i)}) - f(P^{(i)})\Vert_2^2$, and the hard negative is the image $N^{(i)}$ of a different person that minimizes $\Vert f(A^{(i)}) - f(N^{(i)})\Vert_2^2$. Because searching the entire training set for hard positives and negatives is computationally infeasible, these triplets are selected online, within each mini-batch.
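A simplified version of this within-batch selection (often called "batch hard" mining) can be sketched as follows; the function name and tie-breaking behaviour are assumptions, and the paper explores several mining variants that differ in detail.

```python
import torch

def mine_hard_triplets(embeddings: torch.Tensor, labels: torch.Tensor):
    """Select hard positives and hard negatives within one mini-batch.

    embeddings: (n, 128) L2-normalized embeddings.
    labels:     (n,) integer identity labels.
    Returns index tensors (anchor_idx, positive_idx, negative_idx).
    """
    # Pairwise squared Euclidean distances between all embeddings.
    dists = torch.cdist(embeddings, embeddings, p=2) ** 2
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool)

    anchors, positives, negatives = [], [], []
    for i in range(len(labels)):
        pos_mask = same[i] & ~eye[i]   # other images of the same person
        neg_mask = ~same[i]            # images of different people
        if not pos_mask.any() or not neg_mask.any():
            continue                   # no valid triplet for this anchor
        # Hard positive: largest distance among same-identity pairs.
        positives.append(torch.where(pos_mask)[0][dists[i][pos_mask].argmax()])
        # Hard negative: smallest distance among different-identity pairs.
        negatives.append(torch.where(neg_mask)[0][dists[i][neg_mask].argmin()])
        anchors.append(torch.tensor(i))
    if not anchors:
        raise ValueError("batch contains no valid (anchor, positive, negative) triplet")
    return torch.stack(anchors), torch.stack(positives), torch.stack(negatives)
```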
On the widely used Labeled Faces in the Wild (LFW) dataset, the FaceNet system achieved an accuracy of 99.63%, which is the highest score on LFW under the unrestricted with labeled outside data protocol.[2] On YouTube Faces DB the system achieved an accuracy of 95.12%.[1]