Variation of information

In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.[1] [2]

Definition

Suppose we have two partitions X and Y of a set A into disjoint subsets, namely X = \{X_1, X_2, \ldots, X_k\} and Y = \{Y_1, Y_2, \ldots, Y_l\}.

Let:

n = \sum_i |X_i| = \sum_j |Y_j| = |A|,

p_i = |X_i| / n and q_j = |Y_j| / n,

r_{ij} = |X_i \cap Y_j| / n.

Then the variation of information between the two partitions is:

VI(X; Y) = -\sum_{i,j} r_{ij} \left[ \log(r_{ij}/p_i) + \log(r_{ij}/q_j) \right].

This is equivalent to the shared information distance between the random variables i and j with respect to the uniform probability measure on A defined by \mu(B) := |B|/n for B \subseteq A.
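For illustration, the following short Python sketch computes VI directly from this definition for two partitions given as lists of blocks. The function name variation_of_information and the example partitions are purely illustrative, not from the source, and natural logarithms are used.

```python
import math

def variation_of_information(X, Y):
    """Variation of information between two partitions of the same set.

    X and Y are lists of blocks (collections of elements); every element of
    the underlying set A must appear in exactly one block of X and in exactly
    one block of Y.
    """
    n = float(sum(len(x) for x in X))           # n = |A|
    vi = 0.0
    for x in X:
        p = len(x) / n                          # p_i = |X_i| / n
        for y in Y:
            q = len(y) / n                      # q_j = |Y_j| / n
            r = len(set(x) & set(y)) / n        # r_ij = |X_i ∩ Y_j| / n
            if r > 0.0:
                # contribution -r_ij [ log(r_ij / p_i) + log(r_ij / q_j) ]
                vi -= r * (math.log(r / p) + math.log(r / q))
    return vi

# Two clusterings of A = {1, ..., 5}
X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
print(variation_of_information(X, Y))   # ≈ 0.764 (natural log)
```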

Explicit information content

We can rewrite this definition in terms that explicitly highlight the information content of this metric.

The set of all partitions of a set forms a compact lattice in which the partial order induces two operations, the meet \wedge and the join \vee. The maximum \overline{1} is the partition with only one block, i.e., all elements grouped together, and the minimum is \overline{0}, the partition consisting of all elements as singletons. The meet of two partitions X and Y is easy to understand as the partition formed by all pairwise intersections of one block X_i of X and one block Y_j of Y. It then follows that X \wedge Y \subseteq X and X \wedge Y \subseteq Y.
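A minimal sketch of the meet operation, under the same illustrative conventions as above (partitions as lists of blocks; the helper name meet is not from the source):

```python
def meet(X, Y):
    """Meet X ∧ Y: the partition formed by all non-empty pairwise
    intersections X_i ∩ Y_j of a block of X with a block of Y."""
    return [set(x) & set(y) for x in X for y in Y if set(x) & set(y)]

X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
print(meet(X, Y))   # [{1, 2}, {3}, {4, 5}]
```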

Let us define the entropy of a partition X as

H(X) = -\sum_i p_i \log p_i,

where p_i = |X_i|/n. Clearly, H(\overline{1}) = 0 and H(\overline{0}) = \log n. The entropy of a partition is a monotone function on the lattice of partitions, in the sense that X \subseteq Y \Rightarrow H(X) \geq H(Y). The VI distance between X and Y is then given by

VI(X, Y) = 2 H(X \wedge Y) - H(X) - H(Y).
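Continuing the same illustrative sketch, the partition entropy and this identity can be checked numerically (assuming the variation_of_information and meet helpers defined above):

```python
import math

def partition_entropy(X):
    """H(X) = -sum_i p_i log p_i with p_i = |X_i| / n."""
    n = float(sum(len(x) for x in X))
    return -sum((len(x) / n) * math.log(len(x) / n) for x in X)

X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]

lhs = variation_of_information(X, Y)
rhs = 2 * partition_entropy(meet(X, Y)) - partition_entropy(X) - partition_entropy(Y)
print(abs(lhs - rhs) < 1e-12)   # True
```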

The difference d(X, Y) \equiv |H(X) - H(Y)| is a pseudo-metric, as d(X, Y) = 0 does not necessarily imply that X = Y. From the definition of \overline{1}, it follows that VI(X, \overline{1}) = H(X).

If in the Hasse diagram we draw an edge from every partition to the maximum \overline{1} and assign it a weight equal to the VI distance between the given partition and \overline{1}, we can interpret the VI distance as a sum of differences of edge weights to the maximum:

VI(X, Y) = |VI(X, \overline{1}) - VI(X \wedge Y, \overline{1})| + |VI(Y, \overline{1}) - VI(X \wedge Y, \overline{1})| = d(X, X \wedge Y) + d(Y, X \wedge Y).

For H(X) as defined above, the joint information of two partitions coincides with the entropy of the meet,

H(X, Y) = H(X \wedge Y),

and we also have that d(X, X \wedge Y) = H(X \wedge Y | X), the conditional entropy of the meet (intersection) X \wedge Y relative to X.
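The edge-weight decomposition can be verified with the same illustrative helpers: since X \wedge Y refines X, the monotonicity of entropy gives d(X, X \wedge Y) = H(X \wedge Y) - H(X), and likewise for Y.

```python
X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
M = meet(X, Y)

# d(X, X ∧ Y) = H(X ∧ Y) - H(X), and likewise for Y
d_x = partition_entropy(M) - partition_entropy(X)
d_y = partition_entropy(M) - partition_entropy(Y)

# VI(X, Y) = d(X, X ∧ Y) + d(Y, X ∧ Y)
print(abs(variation_of_information(X, Y) - (d_x + d_y)) < 1e-12)   # True
```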

Identities

The variation of information satisfies

VI(X; Y) = H(X) + H(Y) - 2 I(X, Y),

where H(X) is the entropy of X, and I(X, Y) is the mutual information between X and Y with respect to the uniform probability measure on A. This can be rewritten as

VI(X; Y) = H(X, Y) - I(X, Y),

where H(X, Y) is the joint entropy of X and Y, or

VI(X; Y) = H(X|Y) + H(Y|X),

where H(X|Y) and H(Y|X) are the respective conditional entropies.

The variation of information can also be bounded, either in terms of the number of elements:

VI(X; Y) \leq \log(n),

or with respect to a maximum number of clusters, K^*:

VI(X; Y) \leq 2 \log(K^*).
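These identities and bounds can be checked numerically with the earlier illustrative helpers; the mutual_information function below is again a sketch computed from the cell proportions r_{ij}, not code from the source.

```python
import math

def mutual_information(X, Y):
    """I(X, Y) = sum_ij r_ij log( r_ij / (p_i q_j) )."""
    n = float(sum(len(x) for x in X))
    mi = 0.0
    for x in X:
        for y in Y:
            r = len(set(x) & set(y)) / n
            if r > 0.0:
                # r / (p q) = r * n^2 / (|X_i| |Y_j|)
                mi += r * math.log(r * n * n / (len(x) * len(y)))
    return mi

X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]

vi = variation_of_information(X, Y)
mi = mutual_information(X, Y)
h_x, h_y = partition_entropy(X), partition_entropy(Y)

print(abs(vi - (h_x + h_y - 2 * mi)) < 1e-12)   # VI = H(X) + H(Y) - 2 I(X, Y)
print(vi <= math.log(5) + 1e-12)                # VI <= log(n), here n = 5
print(vi <= 2 * math.log(2) + 1e-12)            # VI <= 2 log(K*), here K* = 2
```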

Triangle inequality

To verify the triangle inequality VI(X; Z) \leq VI(X; Y) + VI(Y; Z), expand using the identity VI(X; Y) = H(X|Y) + H(Y|X). It then suffices to prove

H(X|Z) \leq H(X|Y) + H(Y|Z).

Since conditioning on additional variables cannot increase entropy, the right-hand side satisfies

H(X|Y) + H(Y|Z) \geq H(X|Y, Z) + H(Y|Z) = H(X, Y|Z) \geq H(X|Z),

which gives the required bound.
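As a sanity check rather than a proof, the inequality can also be tested numerically on random partitions, reusing the illustrative variation_of_information helper (random_partition is a made-up utility for this example):

```python
import random

def random_partition(elements, k):
    """Assign each element to one of k labels uniformly at random
    and return the resulting non-empty blocks."""
    blocks = {}
    for e in elements:
        blocks.setdefault(random.randrange(k), set()).add(e)
    return list(blocks.values())

A = range(20)
for _ in range(1000):
    X = random_partition(A, 3)
    Y = random_partition(A, 4)
    Z = random_partition(A, 5)
    assert (variation_of_information(X, Z)
            <= variation_of_information(X, Y)
             + variation_of_information(Y, Z) + 1e-9)
print("triangle inequality held on all sampled triples")
```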

Notes and References

  1. Arabie, P.; Boorman, S. A. (1973). "Multidimensional scaling of measures of distance between partitions". Journal of Mathematical Psychology. 10 (2): 148–203. doi:10.1016/0022-2496(73)90012-6.
  2. W. H. Zurek (1989). Nature. 341, p. 119; W. H. Zurek (1989). Physical Review A. 40, p. 4731.