In probability theory and in particular in information theory, total correlation (Watanabe 1960) is one of several generalizations of the mutual information. It is also known as the multivariate constraint (Garner 1962) or multiinformation (Studený & Vejnarová 1999). It quantifies the redundancy or dependency among a set of n random variables.
For a given set of n random variables \{X_1,X_2,\ldots,X_n\}, the total correlation C(X_1,X_2,\ldots,X_n) is defined as the Kullback–Leibler divergence from the joint distribution p(X_1,\ldots,X_n) to the independent distribution p(X_1)p(X_2)\cdots p(X_n),

C(X_1,X_2,\ldots,X_n)\equiv\operatorname{D_{KL}}\left[p(X_1,\ldots,X_n)\,\|\,p(X_1)p(X_2)\cdots p(X_n)\right].
This divergence reduces to the simpler difference of entropies,
C(X_1,X_2,\ldots,X_n)=\left[\sum_{i=1}^{n}H(X_i)\right]-H(X_1,X_2,\ldots,X_n),

where H(X_i) is the information entropy of variable X_i, and H(X_1,X_2,\ldots,X_n) is the joint entropy of the variable set \{X_1,X_2,\ldots,X_n\}. In terms of the discrete probability distributions on the variables \{X_1,X_2,\ldots,X_n\}, the total correlation is given by

C(X_1,X_2,\ldots,X_n)=\sum_{x_1\in\mathcal{X}_1}\sum_{x_2\in\mathcal{X}_2}\cdots\sum_{x_n\in\mathcal{X}_n}p(x_1,x_2,\ldots,x_n)\log\frac{p(x_1,x_2,\ldots,x_n)}{p(x_1)p(x_2)\cdots p(x_n)}.
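The equivalence of the two expressions above can be checked numerically. The following Python sketch is an illustrative addition (the function name, array-based interface, and base-2 logarithm are assumptions, not part of the original article); it computes the total correlation of a discrete joint distribution via the difference-of-entropies form.

import numpy as np

def total_correlation(joint):
    """Total correlation (in bits) of a discrete joint distribution.

    joint: n-dimensional NumPy array of joint probabilities p(x1, ..., xn),
    with axis i indexing the variable X_i.
    """
    joint = np.asarray(joint, dtype=float)
    n = joint.ndim

    # Marginal of each variable: sum out every other axis.
    marginals = [joint.sum(axis=tuple(j for j in range(n) if j != i))
                 for i in range(n)]

    def entropy(p):
        p = p[p > 0]                     # ignore zero-probability cells
        return -np.sum(p * np.log2(p))

    # Difference-of-entropies form: sum_i H(X_i) - H(X_1, ..., X_n).
    return sum(entropy(m) for m in marginals) - entropy(joint.ravel())

# Example: two fair bits with X_2 = X_1 share exactly one bit,
# so C = H(X_1) + H(X_2) - H(X_1, X_2) = 1 + 1 - 1 = 1.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(total_correlation(joint))          # 1.0

For n = 2 this quantity reduces to the ordinary mutual information I(X_1; X_2).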
The total correlation is the amount of information shared among the variables in the set. The sum \sum_{i=1}^{n}H(X_i) represents the amount of information the variables would possess if they were totally independent of one another, whereas H(X_1,X_2,\ldots,X_n) is the amount of information the variable set actually contains. The difference between these two quantities therefore measures the redundancy present in the set, which is the same as the divergence of the actual joint distribution p(X_1,\ldots,X_n) from the independent distribution p(X_1)p(X_2)\cdots p(X_n).
Total correlation quantifies the amount of dependence among a group of variables. A near-zero total correlation indicates that the variables in the group are essentially statistically independent; they are completely unrelated, in the sense that knowing the value of one variable does not provide any clue as to the values of the other variables. On the other hand, the maximum total correlation (for a fixed set of individual entropies
H(X_1),\ldots,H(X_n)) is

C_{\max}=\sum_{i=1}^{n}H(X_i)-\max\limits_{X_i}H(X_i),
and occurs when one of the variables determines all of the other variables. The variables are then maximally related in the sense that knowing the value of one variable provides complete information about the values of all the other variables, and the variables can be figuratively regarded as cogs, in which the position of one cog determines the positions of all the others (Rothstein 1952).
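For instance, under the assumption of n identical copies of a single fair bit (a hypothetical example, not taken from the original article), each H(X_i) = 1 bit while the joint entropy is also 1 bit, so the total correlation reaches the maximum C_max = n − 1 bits. Using the total_correlation sketch given earlier:

import numpy as np

# Three perfectly coupled fair bits: X_1 = X_2 = X_3, each with H(X_i) = 1 bit.
# The maximum C_max = (1 + 1 + 1) - max_i H(X_i) = 2 bits is attained.
joint = np.zeros((2, 2, 2))
joint[0, 0, 0] = 0.5
joint[1, 1, 1] = 0.5
print(total_correlation(joint))          # 2.0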
It is important to note that the total correlation counts up all the redundancies among a set of variables, but that these redundancies may be distributed throughout the variable set in a variety of complicated ways (Garner 1962). For example, some variables in the set may be totally inter-redundant while others in the set are completely independent. Perhaps more significantly, redundancy may be carried in interactions of various degrees: a group of variables may not possess any pairwise redundancies, but may possess higher-order interaction redundancies of the kind exemplified by the parity function. The decomposition of total correlation into its constituent redundancies is explored in a number of sources (McGill 1954, Watanabe 1960, Garner 1962, Studený & Vejnarová 1999, Jakulin & Bratko 2003a, Jakulin & Bratko 2003b, Nemenman 2004, Margolin et al. 2008, Han 1978, Han 1980).
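The parity case can be made concrete. In the following sketch (a hypothetical example reusing the total_correlation helper defined above), X_1 and X_2 are independent fair bits and X_3 = X_1 XOR X_2; every pair of variables is independent, yet the triple carries one bit of total correlation.

import numpy as np

# X_1, X_2 independent fair bits; X_3 = X_1 XOR X_2 (parity).
# Every pairwise mutual information is zero, but the triple has
# C = (1 + 1 + 1) - H(X_1, X_2, X_3) = 3 - 2 = 1 bit of higher-order redundancy.
joint = np.zeros((2, 2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        joint[x1, x2, x1 ^ x2] = 0.25
print(total_correlation(joint))          # 1.0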
Conditional total correlation is defined analogously to the total correlation, with a condition added to each term; it is the Kullback–Leibler divergence between the conditional joint distribution and the product of the conditional marginals,

C(X_1,X_2,\ldots,X_n|Y=y)\equiv\operatorname{D_{KL}}\left[p(X_1,\ldots,X_n|Y=y)\,\|\,p(X_1|Y=y)p(X_2|Y=y)\cdots p(X_n|Y=y)\right].
Analogous to the above, conditional total correlation reduces to a difference of conditional entropies,
C(X_1,X_2,\ldots,X_n|Y=y)=\left[\sum_{i=1}^{n}H(X_i|Y=y)\right]-H(X_1,X_2,\ldots,X_n|Y=y).
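Computationally, conditioning amounts to slicing the joint distribution at Y = y and renormalizing, after which the unconditional formula applies. The sketch below is an assumed illustration (the (n+1)-dimensional array layout with Y on the last axis is an arbitrary convention), reusing the total_correlation helper from above.

import numpy as np

def conditional_total_correlation(joint_with_y, y):
    """Conditional total correlation C(X_1, ..., X_n | Y = y), in bits.

    joint_with_y: (n+1)-dimensional array of p(x1, ..., xn, y), with the
    last axis indexing the conditioning variable Y.
    """
    slice_y = np.asarray(joint_with_y, dtype=float)[..., y]
    # Renormalize the slice to obtain p(x1, ..., xn | Y = y), then apply
    # the unconditional total correlation to the conditional distribution.
    return total_correlation(slice_y / slice_y.sum())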
Clustering and feature selection algorithms based on total correlation have been explored by Watanabe. Alfonso et al. (2010) applied the concept of total correlation to the optimisation of water monitoring networks.