Variation of information

In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.[1] [2]

Definition

Suppose we have two partitions X and Y of a set A into disjoint subsets, namely X = \{X_1, X_2, \ldots, X_k\} and Y = \{Y_1, Y_2, \ldots, Y_l\}.

Let:

n = \sum_i |X_i| = \sum_j |Y_j| = |A|,

p_i = |X_i| / n and q_j = |Y_j| / n,

r_{ij} = |X_i \cap Y_j| / n.

Then the variation of information between the two partitions is:

VI(X; Y) = -\sum_{i,j} r_{ij} \left[ \log(r_{ij}/p_i) + \log(r_{ij}/q_j) \right].

This is equivalent to the shared information distance between the random variables i and j with respect to the uniform probability measure on A defined by \mu(B) := |B|/n for B \subseteq A.
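For illustration, the following short Python sketch computes VI directly from this definition for two partitions given as lists of blocks. The function name variation_of_information and the example partitions are purely illustrative, not from the source, and natural logarithms are used.

```python
import math

def variation_of_information(X, Y):
    """Variation of information between two partitions of the same set.

    X and Y are lists of blocks (collections of elements); every element of
    the underlying set A must appear in exactly one block of X and in exactly
    one block of Y.
    """
    n = float(sum(len(x) for x in X))           # n = |A|
    vi = 0.0
    for x in X:
        p = len(x) / n                          # p_i = |X_i| / n
        for y in Y:
            q = len(y) / n                      # q_j = |Y_j| / n
            r = len(set(x) & set(y)) / n        # r_ij = |X_i ∩ Y_j| / n
            if r > 0.0:
                # contribution -r_ij [ log(r_ij / p_i) + log(r_ij / q_j) ]
                vi -= r * (math.log(r / p) + math.log(r / q))
    return vi

# Two clusterings of A = {1, ..., 5}
X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
print(variation_of_information(X, Y))   # ≈ 0.764 (natural log)
```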

Explicit information content

We can rewrite this definition in terms that explicitly highlight the information content of this metric.

The set of all partitions of a set forms a compact lattice in which the partial order induces two operations, the meet \wedge and the join \vee. The maximum \overline{1} is the partition with only one block, i.e., all elements grouped together, and the minimum is \overline{0}, the partition consisting of all elements as singletons. The meet of two partitions X and Y is easy to understand as the partition formed by all pairwise intersections of one block X_i of X and one block Y_j of Y. It then follows that X \wedge Y \subseteq X and X \wedge Y \subseteq Y.
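A minimal sketch of the meet operation, under the same illustrative conventions as above (partitions as lists of blocks; the helper name meet is not from the source):

```python
def meet(X, Y):
    """Meet X ∧ Y: the partition formed by all non-empty pairwise
    intersections X_i ∩ Y_j of a block of X with a block of Y."""
    return [set(x) & set(y) for x in X for y in Y if set(x) & set(y)]

X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
print(meet(X, Y))   # [{1, 2}, {3}, {4, 5}]
```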

Let us define the entropy of a partition X as

H(X) = -\sum_i p_i \log p_i,

where p_i = |X_i|/n. Clearly, H(\overline{1}) = 0 and H(\overline{0}) = \log n. The entropy of a partition is a monotone function on the lattice of partitions, in the sense that X \subseteq Y \Rightarrow H(X) \geq H(Y). The VI distance between X and Y is then given by

VI(X, Y) = 2 H(X \wedge Y) - H(X) - H(Y).
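Continuing the same illustrative sketch, the partition entropy and this identity can be checked numerically (assuming the variation_of_information and meet helpers defined above):

```python
import math

def partition_entropy(X):
    """H(X) = -sum_i p_i log p_i with p_i = |X_i| / n."""
    n = float(sum(len(x) for x in X))
    return -sum((len(x) / n) * math.log(len(x) / n) for x in X)

X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]

lhs = variation_of_information(X, Y)
rhs = 2 * partition_entropy(meet(X, Y)) - partition_entropy(X) - partition_entropy(Y)
print(abs(lhs - rhs) < 1e-12)   # True
```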

The difference d(X, Y) \equiv |H(X) - H(Y)| is a pseudo-metric, as d(X, Y) = 0 does not necessarily imply that X = Y. From the definition of \overline{1}, it follows that VI(X, \overline{1}) = H(X).

If in the Hasse diagram we draw an edge from every partition to the maximum \overline{1} and assign it a weight equal to the VI distance between the given partition and \overline{1}, we can interpret the VI distance as a sum of differences of edge weights to the maximum:

VI(X, Y) = |VI(X, \overline{1}) - VI(X \wedge Y, \overline{1})| + |VI(Y, \overline{1}) - VI(X \wedge Y, \overline{1})| = d(X, X \wedge Y) + d(Y, X \wedge Y).

For H(X) as defined above, the joint information of two partitions coincides with the entropy of the meet,

H(X, Y) = H(X \wedge Y),

and we also have that d(X, X \wedge Y) = H(X \wedge Y | X), the conditional entropy of the meet (intersection) X \wedge Y relative to X.
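The edge-weight decomposition can be verified with the same illustrative helpers: since X \wedge Y refines X, the monotonicity of entropy gives d(X, X \wedge Y) = H(X \wedge Y) - H(X), and likewise for Y.

```python
X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]
M = meet(X, Y)

# d(X, X ∧ Y) = H(X ∧ Y) - H(X), and likewise for Y
d_x = partition_entropy(M) - partition_entropy(X)
d_y = partition_entropy(M) - partition_entropy(Y)

# VI(X, Y) = d(X, X ∧ Y) + d(Y, X ∧ Y)
print(abs(variation_of_information(X, Y) - (d_x + d_y)) < 1e-12)   # True
```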

Identities

The variation of information satisfies

VI(X; Y) = H(X) + H(Y) - 2 I(X, Y),

where H(X) is the entropy of X, and I(X, Y) is the mutual information between X and Y with respect to the uniform probability measure on A. This can be rewritten as

VI(X; Y) = H(X, Y) - I(X, Y),

where H(X, Y) is the joint entropy of X and Y, or

VI(X; Y) = H(X|Y) + H(Y|X),

where H(X|Y) and H(Y|X) are the respective conditional entropies.

The variation of information can also be bounded, either in terms of the number of elements:

VI(X; Y) \leq \log(n),

or with respect to a maximum number of clusters, K^*:

VI(X; Y) \leq 2 \log(K^*).
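These identities and bounds can be checked numerically with the earlier illustrative helpers; the mutual_information function below is again a sketch computed from the cell proportions r_{ij}, not code from the source.

```python
import math

def mutual_information(X, Y):
    """I(X, Y) = sum_ij r_ij log( r_ij / (p_i q_j) )."""
    n = float(sum(len(x) for x in X))
    mi = 0.0
    for x in X:
        for y in Y:
            r = len(set(x) & set(y)) / n
            if r > 0.0:
                # r / (p q) = r * n^2 / (|X_i| |Y_j|)
                mi += r * math.log(r * n * n / (len(x) * len(y)))
    return mi

X = [{1, 2, 3}, {4, 5}]
Y = [{1, 2}, {3, 4, 5}]

vi = variation_of_information(X, Y)
mi = mutual_information(X, Y)
h_x, h_y = partition_entropy(X), partition_entropy(Y)

print(abs(vi - (h_x + h_y - 2 * mi)) < 1e-12)   # VI = H(X) + H(Y) - 2 I(X, Y)
print(vi <= math.log(5) + 1e-12)                # VI <= log(n), here n = 5
print(vi <= 2 * math.log(2) + 1e-12)            # VI <= 2 log(K*), here K* = 2
```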

Triangle inequality

To verify the triangle inequality VI(X; Z) \leq VI(X; Y) + VI(Y; Z), expand using the identity VI(X; Y) = H(X|Y) + H(Y|X). It then suffices to prove

H(X|Z) \leq H(X|Y) + H(Y|Z).

Since conditioning on additional variables cannot increase entropy, the right-hand side satisfies

H(X|Y) + H(Y|Z) \geq H(X|Y, Z) + H(Y|Z) = H(X, Y|Z) \geq H(X|Z),

which gives the required bound.
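As a sanity check rather than a proof, the inequality can also be tested numerically on random partitions, reusing the illustrative variation_of_information helper (random_partition is a made-up utility for this example):

```python
import random

def random_partition(elements, k):
    """Assign each element to one of k labels uniformly at random
    and return the resulting non-empty blocks."""
    blocks = {}
    for e in elements:
        blocks.setdefault(random.randrange(k), set()).add(e)
    return list(blocks.values())

A = range(20)
for _ in range(1000):
    X = random_partition(A, 3)
    Y = random_partition(A, 4)
    Z = random_partition(A, 5)
    assert (variation_of_information(X, Z)
            <= variation_of_information(X, Y)
             + variation_of_information(Y, Z) + 1e-9)
print("triangle inequality held on all sampled triples")
```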

Notes and References

  1. Arabie, P.; Boorman, S. A. (1973). "Multidimensional scaling of measures of distance between partitions". Journal of Mathematical Psychology. 10 (2): 148–203. doi:10.1016/0022-2496(73)90012-6.
  2. W. H. Zurek (1989). Nature. 341, p. 119; W. H. Zurek (1989). Physical Review A. 40, p. 4731.