This article discusses how information theory (a branch of mathematics studying the transmission, processing and storage of information) is related to measure theory (a branch of mathematics concerned with integration and probability).
Many of the concepts in information theory have separate definitions and formulas for continuous and discrete cases: the discrete entropy Η(X) is defined as a sum over a probability mass function, while the differential entropy h(X) is defined as an integral over a probability density.
These separate definitions can be more closely related in terms of measure theory. For discrete random variables, probability mass functions can be considered density functions with respect to the counting measure. Thinking of both the integral and the sum as integration on a measure space allows for a unified treatment.
If X is a continuous random variable on the real line \mathbb{R} with probability density function f(x), its differential entropy is

h(X)=-\int_{\mathbb{R}}f(x)\log f(x)\,dx.

This integral can equivalently be written as

h(X)=-\int_{\mathbb{R}}f(x)\log f(x)\,d\mu(x),

where \mu is the Lebesgue measure on \mathbb{R}.
If instead X is a discrete random variable taking values in a countable set \Omega, if f is its probability mass function on \Omega, and if \nu is the counting measure on \Omega, then

Η(X)=-\sum_{x\in\Omega}f(x)\log f(x)=-\int_{\Omega}f(x)\log f(x)\,d\nu(x).

The integral expression and the general concept are identical in the two cases; the only difference is the base measure, with respect to which f is a density.
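For instance, the discrete case really can be computed as an "integral" against the counting measure. The following Python sketch (the function and variable names are illustrative, not from any standard library) reduces to Shannon's sum when every point has measure one:

```python
import math

def entropy_wrt_measure(points, density, weights):
    """-∫ f log2(f) dν, evaluated as a weighted sum over `points`,
    where `weights` are the ν-measures of the individual points.
    For the counting measure every weight is 1, so this is exactly
    Shannon's sum over the probability mass function."""
    return -sum(w * density(x) * math.log2(density(x))
                for x, w in zip(points, weights) if density(x) > 0)

# A fair coin: the pmf is the density w.r.t. the counting measure on {0, 1}.
H_coin = entropy_wrt_measure([0, 1], lambda x: 0.5, [1, 1])  # 1 bit
```

The same function could approximate the continuous case by passing quadrature points and their Lebesgue weights.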
If P is the probability measure induced by X, the entropy can also be written directly in terms of P:

h(X)=-\int_{\Omega}\log\frac{dP}{d\mu}\,dP,

where \frac{dP}{d\mu} is the Radon–Nikodym derivative of P with respect to the base measure \mu.
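Equivalently, the entropy is the expectation of -\log\frac{dP}{d\mu} under P, so it can be estimated by sampling from P. A minimal Python sketch for the standard Gaussian, where the Radon–Nikodym derivative with respect to Lebesgue measure is the usual density (the sample size and tolerance here are arbitrary choices):

```python
import math
import random

random.seed(0)

# dP/dμ for P = N(0, 1) and μ = Lebesgue measure: the Gaussian density.
pdf = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# h(X) = E_P[-log dP/dμ], estimated by Monte Carlo sampling from P.
samples = [random.gauss(0, 1) for _ in range(200_000)]
h_est = sum(-math.log(pdf(x)) for x in samples) / len(samples)

# Known closed form for comparison: (1/2) log(2πe) ≈ 1.4189 nats.
h_true = 0.5 * math.log(2 * math.pi * math.e)
```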
If instead of the underlying measure μ we take another probability measure Q, we are led to the Kullback–Leibler divergence: let P and Q be probability measures over the same space. Then if P is absolutely continuous with respect to Q, written P\ll Q, the Radon–Nikodym derivative \frac{dP}{dQ} exists and the Kullback–Leibler divergence can be expressed in its full generality:

D_{\operatorname{KL}}(P\|Q)=\int_{\operatorname{supp}P}\frac{dP}{dQ}\log\frac{dP}{dQ}\,dQ=\int_{\operatorname{supp}P}\log\frac{dP}{dQ}\,dP,

where the integral runs over the support of P.
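For discrete P and Q the two integrals are finite sums over the support of P. The following Python sketch (the two distributions are chosen arbitrarily for illustration) evaluates both forms and confirms they agree:

```python
import math

# Two probability measures on the same three-point space.
P = {"a": 0.5, "b": 0.25, "c": 0.25}
Q = {"a": 0.25, "b": 0.25, "c": 0.5}

def kl_via_dQ(P, Q):
    """∫ (dP/dQ) log2(dP/dQ) dQ over the support of P."""
    return sum(Q[x] * (P[x] / Q[x]) * math.log2(P[x] / Q[x])
               for x in P if P[x] > 0)

def kl_via_dP(P, Q):
    """∫ log2(dP/dQ) dP over the support of P."""
    return sum(P[x] * math.log2(P[x] / Q[x]) for x in P if P[x] > 0)
```

Here dP/dQ is simply the ratio of the two probability mass functions, which exists because Q assigns positive mass everywhere P does.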
There is an analogy between Shannon's basic "measures" of the information content of random variables and a measure over sets. Namely, the joint entropy, conditional entropy, and mutual information can be considered as the measure of a set union, set difference, and set intersection, respectively (Reza pp. 106–108).
Consider abstract sets \tilde X and \tilde Y associated to arbitrary discrete random variables X and Y, somehow representing the information borne by X and Y, respectively, such that:

\mu(\tilde X\cap\tilde Y)=0 whenever X and Y are unconditionally independent, and

\tilde X=\tilde Y whenever X and Y are such that either one is completely determined by the other (i.e. by a bijection);

where \mu is a signed measure over these sets. If we set

\begin{align} Η(X)&=\mu(\tilde X),\\ Η(Y)&=\mu(\tilde Y),\\ Η(X,Y)&=\mu(\tilde X\cup\tilde Y),\\ Η(X\mid Y)&=\mu(\tilde X\setminus\tilde Y),\\ \operatorname{I}(X;Y)&=\mu(\tilde X\cap\tilde Y); \end{align}
we find that Shannon's "measure" of information content satisfies all the postulates and basic properties of a formal signed measure over sets, as commonly illustrated in an information diagram. This allows the sum of two measures to be written:
\mu(A)+\mu(B)=\mu(A\cup B)+\mu(A\cap B)
and the analog of Bayes' theorem (\mu(A)+\mu(B\setminus A)=\mu(B)+\mu(A\setminus B)) allows the difference of two measures to be written:

\mu(A)-\mu(B)=\mu(A\setminus B)-\mu(B\setminus A)
This can be a handy mnemonic device in some situations, e.g.
\begin{align} Η(X,Y)&=Η(X)+Η(Y\mid X)&\mu(\tilde X\cup\tilde Y)&=\mu(\tilde X)+\mu(\tilde Y\setminus\tilde X)\\ \operatorname{I}(X;Y)&=Η(X)-Η(X\mid Y)&\mu(\tilde X\cap\tilde Y)&=\mu(\tilde X)-\mu(\tilde X\setminus\tilde Y) \end{align}
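These identities can be checked numerically from the definitions. The Python sketch below (the joint distribution is an arbitrary example) verifies the chain rule Η(X,Y)=Η(X)+Η(Y∣X):

```python
import math

def H(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# An arbitrary joint distribution of (X, Y), for illustration only.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
pX = {0: 0.75, 1: 0.25}  # marginal distribution of X

# Η(Y|X) = Σ_x p(x) Η(Y | X = x), computed directly from the definition.
H_Y_given_X = sum(
    pX[x] * H({y: p / pX[x] for (xx, y), p in joint.items() if xx == x})
    for x in pX
)

# Chain rule, i.e. μ(X̃∪Ỹ) = μ(X̃) + μ(Ỹ∖X̃):
assert abs(H(joint) - (H(pX) + H_Y_given_X)) < 1e-12
```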
Note that measures (expectation values of the logarithm) of true probabilities are called "entropy" and generally represented by the letter H, while other measures are often referred to as "information" or "correlation" and generally represented by the letter I. For notational simplicity, the letter I is sometimes used for all measures.
See main article: Multivariate mutual information. Certain extensions to the definitions of Shannon's basic measures of information are necessary to deal with the σ-algebra generated by the sets that would be associated to three or more arbitrary random variables. (See Reza pp. 106–108 for an informal but rather complete discussion.) Namely
the joint entropy Η(X,Y,Z,\ldots) and the multivariate mutual information \operatorname{I}(X;Y;Z;\ldots) must be defined such that

\begin{align} Η(X,Y,Z,\ldots)&=\mu(\tilde X\cup\tilde Y\cup\tilde Z\cup\cdots),\\ \operatorname{I}(X;Y;Z;\ldots)&=\mu(\tilde X\cap\tilde Y\cap\tilde Z\cap\cdots); \end{align}
in order to define the (signed) measure over the whole σ-algebra. There is no single universally accepted definition for the multivariate mutual information, but the one that corresponds here to the measure of a set intersection is due to Fano (1966: pp. 57–59). The definition is recursive. As a base case the mutual information of a single random variable is defined to be its entropy:
\operatorname{I}(X)=Η(X).

Then for n\geq 2 we set

\operatorname{I}(X_1;\ldots;X_n)=\operatorname{I}(X_1;\ldots;X_{n-1})-\operatorname{I}(X_1;\ldots;X_{n-1}\mid X_n),

where the conditional mutual information is defined as

\operatorname{I}(X_1;\ldots;X_{n-1}\mid X_n)=\operatorname{E}_{X_n}\bigl(\operatorname{I}(X_1;\ldots;X_{n-1})\mid X_n\bigr).
For two variables this reduces to the ordinary mutual information, \operatorname{I}(X_1;X_2)=Η(X_1)-Η(X_1\mid X_2). Note that, unlike entropy, the multivariate mutual information can be negative: for example, if X and Y are independent fair bits and Z is their exclusive or, then \operatorname{I}(X;Y;Z)=-1 bit.
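As a numerical check on the sign, the measure analogy gives the inclusion–exclusion expansion \operatorname{I}(X;Y;Z)=Η(X)+Η(Y)+Η(Z)-Η(X,Y)-Η(X,Z)-Η(Y,Z)+Η(X,Y,Z), which the following Python sketch (helper names are our own) evaluates for the exclusive-or example:

```python
import math
from itertools import product

def H(pmf):
    """Shannon entropy in bits of {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(pmf, keep):
    """Marginal pmf over the coordinate indices listed in `keep`."""
    out = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# X, Y independent fair bits, Z = X XOR Y: four equally likely triples.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

# Inclusion–exclusion for μ(X̃∩Ỹ∩Z̃): singles − pairs + triple.
mmi = (H(marginal(joint, (0,))) + H(marginal(joint, (1,)))
       + H(marginal(joint, (2,)))
       - H(marginal(joint, (0, 1))) - H(marginal(joint, (0, 2)))
       - H(marginal(joint, (1, 2)))
       + H(joint))
# mmi evaluates to -1.0: the three-way "intersection" has negative measure.
```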
Many other variations are possible for three or more random variables: for example, \operatorname{I}(X,Y;Z) is the mutual information of the joint distribution of X and Y relative to Z, and can be interpreted as \mu((\tilde X\cup\tilde Y)\cap\tilde Z). Many more complicated expressions can be built this way and still have meaning, e.g. \operatorname{I}(X,Y;Z\mid W) or Η(X,Z\mid W,Y).