Inception score explained

The Inception Score (IS) is an algorithm used to assess the quality of images created by a generative image model such as a generative adversarial network (GAN). The score is calculated based on the output of a separate, pretrained Inceptionv3 image classification model applied to a sample of (typically around 30,000) images generated by the generative model. The Inception Score is maximized when the following conditions are true:

For each generated image, the entropy of the label distribution predicted by the Inception v3 model is minimized. In other words, the classification model confidently predicts a single label for each image. Intuitively, this corresponds to the desideratum of generated images being "sharp" or "distinct".
Averaged over all generated images, the predictions of the classification model are evenly distributed across all possible labels. This corresponds to the desideratum that the output of the generative model is "diverse".
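
The two conditions can be checked separately on a toy example. The sketch below is illustrative only: the helper names (`entropy`, `mean_conditional_entropy`, `marginal_entropy`) and the three hand-written prediction sets are assumptions made for this example, and each probability row stands in for the classifier's output on one generated image.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return sum(q * math.log(1.0 / q) for q in p if q > 0.0)

def mean_conditional_entropy(probs):
    """Average entropy of the per-image label distributions (low means "sharp")."""
    return sum(entropy(row) for row in probs) / len(probs)

def marginal_entropy(probs):
    """Entropy of the label distribution averaged over images (high means "diverse")."""
    n_labels = len(probs[0])
    marginal = [sum(row[k] for row in probs) / len(probs) for k in range(n_labels)]
    return entropy(marginal)

# Hypothetical classifier outputs: three generated images, three labels.
sharp_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sharp_but_collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3

for name, probs in [
    ("sharp and diverse", sharp_and_diverse),
    ("sharp but collapsed", sharp_but_collapsed),
    ("blurry", blurry),
]:
    print(name, mean_conditional_entropy(probs), marginal_entropy(probs))
```

Only the first set satisfies both conditions: its per-image entropy is zero while its marginal entropy reaches the maximum for three labels. The collapsed set is sharp but not diverse, and the blurry set has a uniform marginal but no confident predictions.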

The Inception Score has been somewhat superseded by the related Fréchet inception distance (FID). While the Inception Score only evaluates the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images ("ground truth").

Definition

Let there be two spaces, the space of images $\Omega_X$ and the space of labels $\Omega_Y$. The space of labels is finite.

Let $p_{gen}$ be a probability distribution over $\Omega_X$ that we wish to judge.

Let a discriminator be a function of type $p_{dis}:\Omega_X \to M(\Omega_Y)$, where $M(\Omega_Y)$ is the set of all probability distributions on $\Omega_Y$. For any image $x$ and any label $y$, let $p_{dis}(y|x)$ be the probability that image $x$ has label $y$, according to the discriminator. The discriminator is usually implemented as an Inception v3 network trained on ImageNet.
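
To make the type $\Omega_X \to M(\Omega_Y)$ concrete: a classifier typically produces one raw score (logit) per label, and a softmax turns those scores into a probability distribution over the finite label set. The sketch below assumes a hypothetical three-label classifier whose logits for one image are already available; `softmax` is written out by hand here and the logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Map raw class scores to a probability distribution over labels."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a single image x over three labels.
logits_for_x = [2.0, 0.5, -1.0]

# This list plays the role of p_dis(. | x): one probability per label.
p_dis_given_x = softmax(logits_for_x)

# A valid element of M(Omega_Y): nonnegative entries that sum to one.
assert all(p >= 0.0 for p in p_dis_given_x)
assert abs(sum(p_dis_given_x) - 1.0) < 1e-12
print(p_dis_given_x)
```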

The Inception Score of $p_{gen}$ relative to $p_{dis}$ is

$$IS(p_{gen}, p_{dis}) := \exp\left(\mathbb E_{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot | x) \,\|\, \int p_{dis}(\cdot | x) p_{gen}(x)\,dx\right)\right]\right)$$

Equivalent rewrites include

$$\ln IS(p_{gen}, p_{dis}) = \mathbb E_{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot | x) \,\|\, \mathbb E_{x\sim p_{gen}}[p_{dis}(\cdot | x)]\right)\right]$$

$$\ln IS(p_{gen}, p_{dis}) = H\left[\mathbb E_{x\sim p_{gen}}[p_{dis}(\cdot | x)]\right] - \mathbb E_{x\sim p_{gen}}\left[H[p_{dis}(\cdot | x)]\right]$$

$\ln IS$ is nonnegative by Jensen's inequality.
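
The step from the KL form to the entropy form can be spelled out as follows. This is a short derivation sketch; the shorthand $\bar p$ for the marginal label distribution is notation introduced only for this block.

```latex
% Shorthand (introduced here): \bar p(y) := \mathbb E_{x\sim p_{gen}}[p_{dis}(y|x)]
\begin{aligned}
\ln IS(p_{gen}, p_{dis})
&= \mathbb E_{x\sim p_{gen}}\Big[\sum_{y} p_{dis}(y|x)\,\ln\frac{p_{dis}(y|x)}{\bar p(y)}\Big] \\
&= \mathbb E_{x\sim p_{gen}}\Big[\sum_{y} p_{dis}(y|x)\,\ln p_{dis}(y|x)\Big]
   - \sum_{y} \bar p(y)\,\ln \bar p(y) \\
&= H[\bar p] - \mathbb E_{x\sim p_{gen}}\big[H[p_{dis}(\cdot|x)]\big].
\end{aligned}
```

The second line uses the fact that $\ln \bar p(y)$ does not depend on $x$, so taking the expectation over $x$ turns $p_{dis}(y|x)$ into $\bar p(y)$. Nonnegativity then follows either because every KL divergence is nonnegative, or because entropy is concave, so Jensen's inequality gives $H[\bar p] \ge \mathbb E_{x\sim p_{gen}}[H[p_{dis}(\cdot|x)]]$.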

Pseudocode:
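
A minimal Python sketch of the computation follows, using only the standard library. It assumes the discriminator's outputs have already been collected as a list of probability rows, one row per generated image; the names `kl_divergence`, `inception_score`, and `toy_probs` are hypothetical. It also uses the same sample both to estimate the marginal and to average the KL terms, which is a simplification.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def inception_score(probs):
    """Estimate the Inception Score from per-image label distributions.

    probs: list of rows, one per generated image x_i; row i is p_dis(. | x_i).
    """
    n_images = len(probs)
    n_labels = len(probs[0])
    # Sample estimate of the marginal E_x[p_dis(. | x)].
    marginal = [sum(row[k] for row in probs) / n_images for k in range(n_labels)]
    # Average KL divergence from each conditional distribution to the marginal.
    mean_kl = sum(kl_divergence(row, marginal) for row in probs) / n_images
    return math.exp(mean_kl)

# Hypothetical outputs for two generated images and two labels.
toy_probs = [[0.9, 0.1], [0.2, 0.8]]
print(inception_score(toy_probs))
```

In practice the rows would come from running a pretrained classifier on a large sample of generated images; that step is omitted here.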

Interpretation

A higher inception score is interpreted as "better", as it means that $p_{gen}$ is a "sharp and distinct" collection of pictures.

$\ln IS(p_{gen}, p_{dis}) \in [0, \ln N]$, where $N$ is the total number of possible labels.

$\ln IS(p_{gen}, p_{dis}) = 0$ iff for almost all $x \sim p_{gen}$

$$p_{dis}(\cdot | x) = \int p_{dis}(\cdot | x) p_{gen}(x)\,dx$$

That means $p_{gen}$ is completely "indistinct". That is, for any image $x$ sampled from $p_{gen}$, the discriminator returns exactly the same label predictions $p_{dis}(\cdot | x)$.

The highest inception score $N$ is achieved if and only if the two conditions are both true:

For almost all $x \sim p_{gen}$, the distribution $p_{dis}(y|x)$ is concentrated on one label. That is, $H_y[p_{dis}(y|x)] = 0$. That is, every image sampled from $p_{gen}$ is exactly classified by the discriminator.

For every label $y$, the proportion of generated images labelled as $y$ is exactly $\mathbb E_{x\sim p_{gen}}[p_{dis}(y|x)] = \frac{1}{N}$. That is, the generated images are equally distributed over all labels.
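
Both extremes of the range can be reproduced numerically with the entropy form of $\ln IS$. The sketch below is a toy check under stated assumptions: `N = 4` labels, eight hand-constructed prediction rows per case, and helper names (`entropy`, `log_inception_score`) that are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return sum(q * math.log(1.0 / q) for q in p if q > 0.0)

def log_inception_score(probs):
    """ln IS computed as H[mean of rows] minus the mean of H[row]."""
    n_images, n_labels = len(probs), len(probs[0])
    marginal = [sum(row[k] for row in probs) / n_images for k in range(n_labels)]
    return entropy(marginal) - sum(entropy(row) for row in probs) / n_images

N = 4  # hypothetical number of labels

# Lower bound: every image receives the same prediction, so ln IS is 0.
indistinct = [[0.1, 0.2, 0.3, 0.4]] * 8

# Upper bound: one-hot predictions spread evenly over the N labels,
# so ln IS equals ln N.
one_hot = [[1.0 if k == i % N else 0.0 for k in range(N)] for i in range(8)]

print(math.exp(log_inception_score(indistinct)))  # close to 1
print(math.exp(log_inception_score(one_hot)))     # close to N
```

Up to floating-point rounding, the first case gives a score of 1 and the second gives a score of `N`, matching the bounds stated above.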