This article discusses how information theory (a branch of mathematics studying the transmission, processing and storage of information) is related to measure theory (a branch of mathematics concerned with integration and probability).
Many of the concepts in information theory have separate definitions and formulas for continuous and discrete cases: the discrete entropy Η(X) is defined as a sum over a probability mass function, while the differential entropy h(X) is defined as an integral over a probability density.
These separate definitions can be more closely related in terms of measure theory. For discrete random variables, probability mass functions can be considered density functions with respect to the counting measure. Thinking of both the integral and the sum as integration on a measure space allows for a unified treatment.
If X is a continuous random variable on the real line \mathbb{R} with probability density function f(x), its differential entropy is

h(X)=-\int_{\mathbb{R}}f(x)\log f(x)\,dx.

This integral can equivalently be written as

h(X)=-\int_{\mathbb{R}}f(x)\log f(x)\,d\mu(x),

where \mu is the Lebesgue measure on \mathbb{R}.
If instead X is a discrete random variable taking values in a countable set \Omega, if f is its probability mass function on \Omega, and if \nu is the counting measure on \Omega, then

Η(X)=-\sum_{x\in\Omega}f(x)\log f(x)=-\int_{\Omega}f(x)\log f(x)\,d\nu(x).

The integral expression and the general concept are identical in the two cases; the only difference is the base measure, with respect to which f is a density.
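For instance, the discrete case really can be computed as an "integral" against the counting measure. The following Python sketch (the function and variable names are illustrative, not from any standard library) reduces to Shannon's sum when every point has measure one:

```python
import math

def entropy_wrt_measure(points, density, weights):
    """-∫ f log2(f) dν, evaluated as a weighted sum over `points`,
    where `weights` are the ν-measures of the individual points.
    For the counting measure every weight is 1, so this is exactly
    Shannon's sum over the probability mass function."""
    return -sum(w * density(x) * math.log2(density(x))
                for x, w in zip(points, weights) if density(x) > 0)

# A fair coin: the pmf is the density w.r.t. the counting measure on {0, 1}.
H_coin = entropy_wrt_measure([0, 1], lambda x: 0.5, [1, 1])  # 1 bit
```

The same function could approximate the continuous case by passing quadrature points and their Lebesgue weights.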
If P is the probability measure induced by X, the entropy can also be written directly in terms of P:

h(X)=-\int_{\Omega}\log\frac{dP}{d\mu}\,dP,

where \frac{dP}{d\mu} is the Radon–Nikodym derivative of P with respect to the base measure \mu.
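Equivalently, the entropy is the expectation of -\log\frac{dP}{d\mu} under P, so it can be estimated by sampling from P. A minimal Python sketch for the standard Gaussian, where the Radon–Nikodym derivative with respect to Lebesgue measure is the usual density (the sample size and tolerance here are arbitrary choices):

```python
import math
import random

random.seed(0)

# dP/dμ for P = N(0, 1) and μ = Lebesgue measure: the Gaussian density.
pdf = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# h(X) = E_P[-log dP/dμ], estimated by Monte Carlo sampling from P.
samples = [random.gauss(0, 1) for _ in range(200_000)]
h_est = sum(-math.log(pdf(x)) for x in samples) / len(samples)

# Known closed form for comparison: (1/2) log(2πe) ≈ 1.4189 nats.
h_true = 0.5 * math.log(2 * math.pi * math.e)
```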
If instead of the underlying measure μ we take another probability measure Q, we are led to the Kullback–Leibler divergence: let P and Q be probability measures over the same space. Then if P is absolutely continuous with respect to Q, written P\ll Q, the Radon–Nikodym derivative \frac{dP}{dQ} exists and the Kullback–Leibler divergence can be expressed in its full generality:

D_{\operatorname{KL}}(P\|Q)=\int_{\operatorname{supp}P}\frac{dP}{dQ}\log\frac{dP}{dQ}\,dQ=\int_{\operatorname{supp}P}\log\frac{dP}{dQ}\,dP,

where the integral runs over the support of P.
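For discrete P and Q the two integrals are finite sums over the support of P. The following Python sketch (the two distributions are chosen arbitrarily for illustration) evaluates both forms and confirms they agree:

```python
import math

# Two probability measures on the same three-point space.
P = {"a": 0.5, "b": 0.25, "c": 0.25}
Q = {"a": 0.25, "b": 0.25, "c": 0.5}

def kl_via_dQ(P, Q):
    """∫ (dP/dQ) log2(dP/dQ) dQ over the support of P."""
    return sum(Q[x] * (P[x] / Q[x]) * math.log2(P[x] / Q[x])
               for x in P if P[x] > 0)

def kl_via_dP(P, Q):
    """∫ log2(dP/dQ) dP over the support of P."""
    return sum(P[x] * math.log2(P[x] / Q[x]) for x in P if P[x] > 0)
```

Here dP/dQ is simply the ratio of the two probability mass functions, which exists because Q assigns positive mass everywhere P does.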
There is an analogy between Shannon's basic "measures" of the information content of random variables and a measure over sets. Namely, the joint entropy, conditional entropy, and mutual information can be considered as the measure of a set union, set difference, and set intersection, respectively (Reza pp. 106–108).
Consider abstract sets \tilde X and \tilde Y associated to arbitrary discrete random variables X and Y, somehow representing the information borne by X and Y, respectively, such that:

\mu(\tilde X\cap\tilde Y)=0 whenever X and Y are unconditionally independent, and

\tilde X=\tilde Y whenever X and Y are such that either one is completely determined by the other (i.e. by a bijection);

where \mu is a signed measure over these sets. If we set

\begin{align} Η(X)&=\mu(\tilde X),\\ Η(Y)&=\mu(\tilde Y),\\ Η(X,Y)&=\mu(\tilde X\cup\tilde Y),\\ Η(X\mid Y)&=\mu(\tilde X\setminus\tilde Y),\\ \operatorname{I}(X;Y)&=\mu(\tilde X\cap\tilde Y); \end{align}
we find that Shannon's "measure" of information content satisfies all the postulates and basic properties of a formal signed measure over sets, as commonly illustrated in an information diagram. This allows the sum of two measures to be written:
\mu(A)+\mu(B)=\mu(A\cup B)+\mu(A\cap B)
and the analog of Bayes' theorem (\mu(A)+\mu(B\setminus A)=\mu(B)+\mu(A\setminus B)) allows the difference of two measures to be written:

\mu(A)-\mu(B)=\mu(A\setminus B)-\mu(B\setminus A)
This can be a handy mnemonic device in some situations, e.g.
\begin{align} Η(X,Y)&=Η(X)+Η(Y\mid X)&\mu(\tilde X\cup\tilde Y)&=\mu(\tilde X)+\mu(\tilde Y\setminus\tilde X)\\ \operatorname{I}(X;Y)&=Η(X)-Η(X\mid Y)&\mu(\tilde X\cap\tilde Y)&=\mu(\tilde X)-\mu(\tilde X\setminus\tilde Y) \end{align}
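These identities can be checked numerically from the definitions. The Python sketch below (the joint distribution is an arbitrary example) verifies the chain rule Η(X,Y)=Η(X)+Η(Y∣X):

```python
import math

def H(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# An arbitrary joint distribution of (X, Y), for illustration only.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
pX = {0: 0.75, 1: 0.25}  # marginal distribution of X

# Η(Y|X) = Σ_x p(x) Η(Y | X = x), computed directly from the definition.
H_Y_given_X = sum(
    pX[x] * H({y: p / pX[x] for (xx, y), p in joint.items() if xx == x})
    for x in pX
)

# Chain rule, i.e. μ(X̃∪Ỹ) = μ(X̃) + μ(Ỹ∖X̃):
assert abs(H(joint) - (H(pX) + H_Y_given_X)) < 1e-12
```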
Note that measures (expectation values of the logarithm) of true probabilities are called "entropy" and generally represented by the letter H, while other measures are often referred to as "information" or "correlation" and generally represented by the letter I. For notational simplicity, the letter I is sometimes used for all measures.
See main article: Multivariate mutual information. Certain extensions to the definitions of Shannon's basic measures of information are necessary to deal with the σ-algebra generated by the sets that would be associated to three or more arbitrary random variables. (See Reza pp. 106–108 for an informal but rather complete discussion.) Namely
the joint entropy Η(X,Y,Z,\ldots) and the multivariate mutual information \operatorname{I}(X;Y;Z;\ldots) must be defined such that

\begin{align} Η(X,Y,Z,\ldots)&=\mu(\tilde X\cup\tilde Y\cup\tilde Z\cup\cdots),\\ \operatorname{I}(X;Y;Z;\ldots)&=\mu(\tilde X\cap\tilde Y\cap\tilde Z\cap\cdots); \end{align}
in order to define the (signed) measure over the whole σ-algebra. There is no single universally accepted definition for the multivariate mutual information, but the one that corresponds here to the measure of a set intersection is due to Fano (1966: pp. 57–59). The definition is recursive. As a base case the mutual information of a single random variable is defined to be its entropy:
\operatorname{I}(X)=Η(X).

Then for n\geq 2 we set

\operatorname{I}(X_1;\ldots;X_n)=\operatorname{I}(X_1;\ldots;X_{n-1})-\operatorname{I}(X_1;\ldots;X_{n-1}\mid X_n),

where the conditional mutual information is defined as

\operatorname{I}(X_1;\ldots;X_{n-1}\mid X_n)=\operatorname{E}_{X_n}\bigl(\operatorname{I}(X_1;\ldots;X_{n-1})\mid X_n\bigr).
For two variables this reduces to the ordinary mutual information, \operatorname{I}(X_1;X_2)=Η(X_1)-Η(X_1\mid X_2). Note that, unlike entropy, the multivariate mutual information can be negative: for example, if X and Y are independent fair bits and Z is their exclusive or, then \operatorname{I}(X;Y;Z)=-1 bit.
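As a numerical check on the sign, the measure analogy gives the inclusion–exclusion expansion \operatorname{I}(X;Y;Z)=Η(X)+Η(Y)+Η(Z)-Η(X,Y)-Η(X,Z)-Η(Y,Z)+Η(X,Y,Z), which the following Python sketch (helper names are our own) evaluates for the exclusive-or example:

```python
import math
from itertools import product

def H(pmf):
    """Shannon entropy in bits of {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(pmf, keep):
    """Marginal pmf over the coordinate indices listed in `keep`."""
    out = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# X, Y independent fair bits, Z = X XOR Y: four equally likely triples.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

# Inclusion–exclusion for μ(X̃∩Ỹ∩Z̃): singles − pairs + triple.
mmi = (H(marginal(joint, (0,))) + H(marginal(joint, (1,)))
       + H(marginal(joint, (2,)))
       - H(marginal(joint, (0, 1))) - H(marginal(joint, (0, 2)))
       - H(marginal(joint, (1, 2)))
       + H(joint))
# mmi evaluates to -1.0: the three-way "intersection" has negative measure.
```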
Many other variations are possible for three or more random variables: for example, \operatorname{I}(X,Y;Z) is the mutual information of the joint distribution of X and Y relative to Z, and can be interpreted as \mu((\tilde X\cup\tilde Y)\cap\tilde Z). Many more complicated expressions can be built this way and still have meaning, e.g. \operatorname{I}(X,Y;Z\mid W) or Η(X,Z\mid W,Y).