In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.
If X is a continuous random variable with probability density p(x), then the differential entropy of X is defined as

H(X) = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx~.
If X is a discrete random variable with distribution given by

\operatorname{Pr}\{X = x_k\} = p_k \quad \text{for } k = 1, 2, \ldots

then the entropy of X is defined as

H(X) = -\sum_{k \ge 1} p_k \log p_k~.
The seemingly divergent term p(x) \log p(x) is replaced by zero whenever p(x) = 0~.
This is a special case of more general forms described in the articles Entropy (information theory), Principle of maximum entropy, and differential entropy. In connection with maximum entropy distributions, this is the only one needed, because maximizing H(X) will also maximize the more general forms.
The base of the logarithm is not important, as long as the same one is used consistently: a change of base merely rescales the entropy. Information theorists may prefer base 2 in order to express the entropy in bits; mathematicians and physicists often prefer the natural logarithm, giving the entropy in nats.

However, the choice of the measure dx is crucial in determining the entropy and the resulting maximum entropy distribution, even though the usual recourse to the Lebesgue measure is often defended as "natural".
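As a quick illustration (not drawn from the article's sources), the discrete entropy above can be computed directly, with the convention that terms with p_k = 0 contribute nothing; the function name and the example probabilities are placeholders, and NumPy is assumed:

```python
import numpy as np

def discrete_entropy(p, base=np.e):
    """Entropy of a discrete distribution given as an array of probabilities.

    Terms with p_k == 0 are skipped, matching the convention that
    p log p is read as zero wherever p vanishes.
    """
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

print(discrete_entropy([0.5, 0.5]))          # ln 2 ≈ 0.6931 nats for a fair coin
print(discrete_entropy([0.5, 0.5], base=2))  # the same entropy expressed as 1 bit
```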
Many statistical distributions of practical interest are those for which the moments or other measurable quantities are constrained to be constants. The following theorem by Ludwig Boltzmann gives the form of the probability density under these constraints.
Suppose S is a closed subset of the real numbers R and we choose to specify n measurable functions f_1, \ldots, f_n and n numbers a_1, \ldots, a_n~. We consider the class C of all real-valued random variables which are supported on S (i.e. whose density function is zero outside of S) and which satisfy the n moment conditions:

\operatorname{E}\{f_j(X)\} \ge a_j \quad \text{for } j = 1, \ldots, n
If there is a member in C whose density function is positive everywhere in S, and if there exists a maximum entropy distribution for C, then its probability density p(x) has the following form:

p(x) = \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x) \right) \quad \text{for all } x \in S
where we assume that f_0(x) = 1~. The constant \lambda_0 and the n Lagrange multipliers \boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_n) solve the constrained optimization problem with a_0 = 1 (which guarantees that p integrates to one):
\max_{\lambda_0; \boldsymbol{\lambda}} \left\{ \sum_{j=0}^{n} \lambda_j a_j - \int \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x) \right) dx \right\} \quad \text{subject to } \boldsymbol{\lambda} \ge 0
Using the Karush–Kuhn–Tucker conditions, it can be shown that the optimization problem has a unique solution because the objective function in the optimization is concave in \boldsymbol{\lambda}~.
Note that when the moment constraints are equalities (instead of inequalities), that is,

\operatorname{E}\{f_j(X)\} = a_j \quad \text{for } j = 1, \ldots, n,

then the constraint condition \boldsymbol{\lambda} \ge 0 is dropped, making the optimization over the Lagrange multipliers unconstrained.
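As a hedged numerical sketch of the constrained optimization above (my own illustration, not from the article): with equality constraints E[X] = 0 and E[X²] = 1 on the whole real line, maximizing the dual objective over (λ₀, λ₁, λ₂) should recover the standard normal density, i.e. λ₁ ≈ 0, λ₂ ≈ −1/2 and exp(λ₀) ≈ 1/√(2π). SciPy is assumed; the truncation of the integral to [−20, 20], the starting point, and the bound keeping λ₂ negative are arbitrary choices made for numerical convenience.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

# Constraints: f_0 = 1 with a_0 = 1 (normalization), f_1(x) = x with a_1 = 0,
# and f_2(x) = x^2 with a_2 = 1.
fs = [lambda x: 1.0, lambda x: x, lambda x: x**2]
a = np.array([1.0, 0.0, 1.0])

def neg_dual(lam):
    # Negative of  sum_j lambda_j a_j - \int exp(sum_j lambda_j f_j(x)) dx,
    # with the integral crudely truncated to [-20, 20].
    integrand = lambda x: np.exp(sum(l * f(x) for l, f in zip(lam, fs)))
    return -(lam @ a - quad(integrand, -20.0, 20.0)[0])

res = minimize(neg_dual, x0=np.array([-1.0, 0.0, -1.0]), method="L-BFGS-B",
               bounds=[(None, None), (None, None), (None, -1e-3)])
lam0, lam1, lam2 = res.x
print(lam1, lam2, np.exp(lam0))   # expect ~0, ~-0.5, ~1/sqrt(2*pi) ≈ 0.3989
```

The recovered density exp(λ₀ + λ₁x + λ₂x²) can then be compared pointwise against the standard normal density.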
Suppose S = \{x_1, x_2, \ldots\} is a (finite or countably infinite) subset of the reals, and we choose to specify n functions f_1, \ldots, f_n and n numbers a_1, \ldots, a_n~. We consider the class C of all discrete random variables X which are supported on S and which satisfy the n moment conditions:

\operatorname{E}\{f_j(X)\} \ge a_j \quad \text{for } j = 1, \ldots, n
If there exists a member of class C which assigns positive probability to all members of S, and if there exists a maximum entropy distribution for C, then this distribution has the following shape:

\operatorname{P}\{X = x_k\} = \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x_k) \right) \quad \text{for } k = 1, 2, \ldots
where we assume that f_0 = 1 and the constants \lambda_0, \boldsymbol{\lambda} \equiv (\lambda_1, \ldots, \lambda_n) solve the constrained optimization problem with a_0 = 1:
\max_{\lambda_0; \boldsymbol{\lambda}} \left\{ \sum_{j=0}^{n} \lambda_j a_j - \sum_k \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x_k) \right) \right\} \quad \text{subject to } \boldsymbol{\lambda} \ge 0
Again as above, if the moment conditions are equalities (instead of inequalities), then the constraint condition \boldsymbol{\lambda} \ge 0 is not present in the optimization.
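The discrete case can be sketched the same way (again my own illustration, with an illustrative support and constraint): for S = {0, 1, 2, 3} and the single equality constraint E[X] = 1, maximizing the discrete dual objective yields the maximum-entropy probabilities directly as p_k = exp(∑_j λ_j f_j(x_k)). NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.optimize import minimize

xs = np.array([0.0, 1.0, 2.0, 3.0])
F = np.vstack([np.ones_like(xs), xs])   # rows: f_0(x_k) = 1, f_1(x_k) = x_k
a = np.array([1.0, 1.0])                # a_0 = 1 (normalization), a_1 = E[X] = 1

def neg_dual(lam):
    # Negative of  sum_j lambda_j a_j - sum_k exp(sum_j lambda_j f_j(x_k))
    return -(lam @ a - np.exp(lam @ F).sum())

lam = minimize(neg_dual, x0=np.zeros(2)).x
p = np.exp(lam @ F)                     # maximum-entropy probabilities
print(p, p.sum(), p @ xs)               # p sums to 1 and has mean 1
```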
In the case of equality constraints, this theorem is proved with the calculus of variations and Lagrange multipliers. The constraints can be written as

\int_{-\infty}^{\infty} f_j(x)\, p(x)\, dx = a_j
We consider the functional

J(p) = \int_{-\infty}^{\infty} p(x) \ln p(x)\, dx - \eta_0 \left( \int_{-\infty}^{\infty} p(x)\, dx - 1 \right) - \sum_{j=1}^{n} \lambda_j \left( \int_{-\infty}^{\infty} f_j(x)\, p(x)\, dx - a_j \right)
where \eta_0 and the \lambda_j, j \ge 1 are the Lagrange multipliers. The zeroth constraint ensures the second axiom of probability; the other constraints require that the measurements of the functions equal the given constants up to order n. The entropy attains an extremum when the functional derivative is equal to zero:

\frac{\delta J}{\delta p}(p) = \ln p(x) + 1 - \eta_0 - \sum_{j=1}^{n} \lambda_j f_j(x) = 0
Therefore, the extremal entropy probability distribution in this case must be of the form (\lambda_0 := \eta_0 - 1):

p(x) = e^{-1+\eta_0} \cdot e^{\sum_{j=1}^{n} \lambda_j f_j(x)} = \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x) \right),

remembering that f_0(x) = 1.
Suppose p and p' are distributions satisfying the expectation-constraints. Let \alpha \in (0,1) and consider the mixture q = \alpha\, p + (1-\alpha)\, p'~. This distribution also satisfies the expectation-constraints and has support \operatorname{supp}(q) = \operatorname{supp}(p) \cup \operatorname{supp}(p')~. From basic properties of entropy, \mathcal{H}(q) \ge \alpha\, \mathcal{H}(p) + (1-\alpha)\, \mathcal{H}(p')~. Taking the limits \alpha \to 1 and \alpha \to 0 respectively yields \mathcal{H}(q) \ge \mathcal{H}(p) and \mathcal{H}(q) \ge \mathcal{H}(p')~.
It follows that a distribution satisfying the expectation-constraints and maximising entropy must necessarily have full support; that is, the distribution is almost everywhere strictly positive. In particular, the maximising distribution must be an internal point in the space of distributions satisfying the expectation-constraints, i.e. it must be a local extreme. Thus it suffices to show that the local extreme is unique; this shows both that the entropy-maximising distribution is unique and that the local extreme is the global maximum.
Suppose p and p' are local extremes. Reformulating the above calculations, these are characterised by parameters \vec{\lambda}, \vec{\lambda}' \in \mathbb{R}^n via

p(x) = \frac{\exp\left( \langle \vec{\lambda}, \vec{f}(x) \rangle \right)}{C(\vec{\lambda})}

and similarly for p', where

C(\vec{\lambda}) = \int_x \exp\left( \langle \vec{\lambda}, \vec{f}(x) \rangle \right) dx~.

Using the expectation-constraints and taking directional derivatives, one has

D \log C(\cdot)\big|_{\vec{\lambda}} = \frac{D C(\cdot)}{C(\cdot)}\bigg|_{\vec{\lambda}} = \operatorname{E}_p\left\{ \vec{f}(X) \right\} = \vec{a}

and similarly for \vec{\lambda}'~.
Letting u = \vec{\lambda}' - \vec{\lambda} \in \mathbb{R}^n, one has

0 = \langle u, \vec{a} - \vec{a} \rangle = D_u \log C(\cdot)\big|_{\vec{\lambda}'} - D_u \log C(\cdot)\big|_{\vec{\lambda}} = D_u^2 \log C(\cdot)\big|_{\vec{\gamma}}~,

where \vec{\gamma} = \theta \vec{\lambda} + (1-\theta) \vec{\lambda}' for some \theta \in (0,1), by the mean value theorem.
Computing further, one has

\begin{align}
0 &= D_u^2 \log C(\cdot)\big|_{\vec{\gamma}} \\
&= D_u \left( \frac{D_u C(\cdot)}{C(\cdot)} \right)\bigg|_{\vec{\gamma}} \\
&= \frac{D_u^2 C(\cdot)}{C(\cdot)}\bigg|_{\vec{\gamma}} - \frac{\left( D_u C(\cdot) \right)^2}{C(\cdot)^2}\bigg|_{\vec{\gamma}} \\
&= \operatorname{E}_q\left\{ \langle u, \vec{f}(X) \rangle^2 \right\} - \left( \operatorname{E}_q\left\{ \langle u, \vec{f}(X) \rangle \right\} \right)^2 \\
&= \operatorname{Var}_q\left\{ \langle u, \vec{f}(X) \rangle \right\}
\end{align}
where q is a distribution of the same parametric form as above, only with parameter \vec{\gamma}~. Assuming that no non-trivial linear combination of the observables is almost everywhere constant, the random variable \langle u, \vec{f}(X) \rangle has non-zero variance unless u = 0~. By the above equation, the latter must therefore be the case: \vec{\lambda}' - \vec{\lambda} = u = 0, so the parameters characterising the local extremes p and p' are identical, which means the distributions themselves are identical. Thus the local extreme is unique and, by the discussion above, the maximum entropy distribution is unique, provided a local extreme actually exists.
Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy.[6] It is also possible that the expected value restrictions for the class C force the probability distribution to be zero in certain subsets of S. In that case our theorem doesn't apply, but one can work around this by shrinking the set S.
Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution has its own entropy. To see this, rewrite the density as
p(x) = \exp{(\ln p(x))}~.

By choosing \ln p(x) \to f(x) to be the measurable function and

\int \exp{(f(x))}\, f(x)\, dx = -H

to be the constant, p(x) is the maximum entropy probability distribution under the constraint

\int p(x)\, f(x)\, dx = -H~.
Nontrivial examples are distributions that are subject to multiple constraints that are different from the assignment of the entropy. These are often found by starting with the same procedure \ln p(x) \to f(x) and then finding that f(x) can be separated into parts.
A table of examples of maximum entropy distributions is given in Lisman (1972)[7] and Park & Bera (2009).[8]
The uniform distribution on the interval [''a'',''b''] is the maximum entropy distribution among all continuous distributions which are supported in the interval [''a'', ''b''], and thus the probability density is 0 outside of the interval. This uniform density can be related to Laplace's principle of indifference, sometimes called the principle of insufficient reason. More generally, if we are given a subdivision a=a0 < a1 < ... < ak = b of the interval [''a'',''b''] and probabilities p1,...,pk that add up to one, then we can consider the class of all continuous distributions such that
\operatorname{Pr}(a_{j-1} \le X < a_j) = p_j \quad \text{for } j = 1, \ldots, k~.

The density of the maximum entropy distribution for this class is piecewise constant: p(x) = \frac{p_j}{a_j - a_{j-1}} for a_{j-1} \le x < a_j.
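A small sketch of this piecewise-uniform construction (assuming the piecewise-constant density stated above; the edges and probabilities are arbitrary example values):

```python
import numpy as np

edges = np.array([0.0, 1.0, 3.0, 6.0])   # a_0 < a_1 < ... < a_k, here [a, b] = [0, 6]
probs = np.array([0.2, 0.5, 0.3])        # p_1, ..., p_k, summing to one

def max_ent_density(x):
    """Density p_j / (a_j - a_{j-1}) on [a_{j-1}, a_j), zero outside [a, b]."""
    j = np.searchsorted(edges, x, side="right") - 1
    inside = (x >= edges[0]) & (x < edges[-1])
    j = np.clip(j, 0, len(probs) - 1)
    return np.where(inside, probs[j] / np.diff(edges)[j], 0.0)

print(max_ent_density(np.array([0.5, 2.0, 5.0, 7.0])))  # [0.2, 0.25, 0.1, 0.0]
```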
p(x \mid \lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0, \\ 0 & x < 0, \end{cases}
is the maximum entropy distribution among all continuous distributions supported in [0, ∞) that have a specified mean of 1/λ.

In the case of distributions supported on [0, ∞), the maximum entropy distribution depends on relationships between the first and second moments. In specific cases, it may be the exponential distribution, or may be another distribution, or may be undefinable (Dowson & Wragg, "Maximum-entropy distributions having prescribed first and second moments", IEEE Transactions on Information Theory 19(5), 1973, pp. 689–693).

=== Specified mean and variance: the normal distribution ===

The [[normal distribution]] N(μ, σ²), for which the density function is
p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}~,
has maximum entropy among all real-valued distributions supported on (-∞, ∞) with a specified variance σ² (a particular moment). The same is true when the mean μ and the variance σ² are specified (the first two moments), since entropy is translation invariant on (-∞, ∞). Therefore, the assumption of normality imposes the minimal prior structural constraint beyond these moments. (See the differential entropy article for a derivation.)
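A quick numerical check of this statement (my own illustration, assuming SciPy): among a few unit-variance distributions, the normal attains the largest differential entropy, equal to the bound ½ ln(2πe) ≈ 1.4189 nats.

```python
from math import e, pi, log, sqrt
from scipy import stats

print(stats.norm(scale=1.0).entropy())           # normal with variance 1: ≈ 1.4189 nats
print(stats.uniform(scale=sqrt(12)).entropy())   # uniform with variance 1: ≈ 1.2425 nats
print(stats.laplace(scale=1/sqrt(2)).entropy())  # Laplace with variance 1: ≈ 1.3466 nats
print(0.5 * log(2 * pi * e))                     # the theoretical maximum, ≈ 1.4189
```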
Among all the discrete distributions supported on the set \{x_1, \ldots, x_n\} with a specified mean \mu, the maximum entropy distribution has the following shape:

\operatorname{Pr}(X = x_k) = C r^{x_k} \quad \text{for } k = 1, \ldots, n

where the positive constants C and r are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be \mu.
For example, suppose a large number N of dice are thrown, and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with \{x_1, \ldots, x_6\} = \{1, \ldots, 6\} and \mu = S/N.
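A hedged sketch of the dice example (SciPy assumed; the observed mean 4.5 is an arbitrary example value): since the maximum-entropy assignment has the form Pr(X = k) = C r^k on {1, ..., 6}, r can be found by a one-dimensional root find and C by normalization.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
mu = 4.5                                  # the observed average S / N

def mean_given_r(r):
    w = r ** faces
    return (faces * w).sum() / w.sum()

r = brentq(lambda r: mean_given_r(r) - mu, 1e-6, 1e6)  # r > 1 exactly when mu > 3.5
p = r ** faces
p /= p.sum()
print(r, p, (faces * p).sum())            # probabilities tilted toward high faces, mean 4.5
```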
Finally, among all the discrete distributions supported on the infinite set \{x_1, x_2, \ldots\} with mean \mu, the maximum entropy distribution has the shape:

\operatorname{Pr}(X = x_k) = C r^{x_k} \quad \text{for } k = 1, 2, \ldots,

where again the constants C and r are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be \mu. For example, in the case that x_k = k, this gives

C = \frac{1}{\mu - 1}\,, \qquad r = \frac{\mu - 1}{\mu}\,,

such that the respective maximum entropy distribution is the geometric distribution.
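A short numerical check of these constants (my own illustration): with the stated C and r, the probabilities sum to one, the mean is μ, and Pr(X = 1) equals 1/μ, the success probability of the corresponding geometric distribution.

```python
import numpy as np

mu = 3.0
C, r = 1.0 / (mu - 1.0), (mu - 1.0) / mu
k = np.arange(1, 2000)                 # truncate the infinite sum; the tail is negligible
p = C * r ** k
print(p.sum(), (k * p).sum())          # ≈ 1.0 and ≈ mu
print(p[0], 1.0 / mu)                  # Pr(X = 1) equals the geometric success probability
```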
For a continuous random variable \theta_i distributed about the unit circle, the von Mises distribution maximizes the entropy when the real and imaginary parts of the first circular moment are specified or, equivalently, when the circular mean and circular variance are specified. When the mean and variance of the angles \theta_i modulo 2\pi are specified, the wrapped normal distribution maximizes the entropy.
There exists an upper bound on the entropy of continuous random variables on R with a specified mean, variance, and skew. However, there is no distribution which achieves this upper bound, because

p(x) = c\exp\left( \lambda_1 x + \lambda_3 x^3 \right)

is unbounded when \lambda_3 \ne 0~.

However, the maximum entropy is \varepsilon-achievable: a distribution's entropy can be arbitrarily close to the upper bound. Start with a normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by a small amount at a value many \sigma larger than the mean. The skewness, being proportional to the third moment, will be affected more than the lower-order moments.
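A numerical sketch of this perturbation argument (my own construction, assuming SciPy; the bump weight, location, and width are arbitrary): mixing a standard normal with a tiny, narrow bump far above the mean makes the skewness clearly positive while the differential entropy stays close to the Gaussian bound for the perturbed variance.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

eps, x0 = 1e-3, 8.0                      # bump weight and location (many sigma out)
bump = stats.norm(loc=x0, scale=0.1)     # a narrow bump far above the mean

def q(x):
    # Perturbed density: still normalized, nearly normal, but positively skewed.
    return (1 - eps) * stats.norm.pdf(x) + eps * bump.pdf(x)

m1 = quad(lambda x: x * q(x), -30, 30, points=[x0])[0]
var = quad(lambda x: (x - m1) ** 2 * q(x), -30, 30, points=[x0])[0]
skew = quad(lambda x: (x - m1) ** 3 * q(x), -30, 30, points=[x0])[0] / var ** 1.5
H = -quad(lambda x: q(x) * np.log(q(x)), -30, 30, points=[x0])[0]
print(skew)                                      # clearly positive
print(H, 0.5 * np.log(2 * np.pi * np.e * var))   # entropy stays close to the Gaussian bound
```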
This is a special case of the general fact that the exponential of any odd-order polynomial in x is unbounded on R. For example, c e^{\lambda x} is likewise unbounded on R, but when the support is limited to a bounded or semi-bounded interval the upper entropy bound may be achieved (e.g. if x lies in the interval [0, \infty) and \lambda < 0, the exponential distribution results).
Every distribution with log-concave density is a maximal entropy distribution with specified mean \mu and deviation risk measure D.[10]
In particular, the maximal entropy distribution with specified mean \operatorname{E}(X) \equiv \mu and deviation risk measure D(X) \equiv d is:

* the normal distribution \mathcal{N}(\mu, d^2), if D(X) = \sqrt{\operatorname{E}\left\{ (X-\mu)^2 \right\}} is the standard deviation;
* the Laplace distribution, if D(X) = \operatorname{E}\left\{ \left| X-\mu \right| \right\} is the mean absolute deviation;
* a distribution with density of the form f(x) = c\exp\left( a x + b\,[x-\mu]_-^2 \right), if D(X) = \sqrt{\operatorname{E}\left\{ [X-\mu]_-^2 \right\}} is the standard lower semi-deviation, where a, b, c are constants and [y]_- \equiv \min\{0, y\} denotes the negative part of y \in \mathbb{R}.
In the table below, each listed distribution maximizes the entropy for the particular set of functional constraints listed in the third column, together with the constraint that x be included in the support of the probability density, which is listed in the fourth column.
Several listed examples (Bernoulli, geometric, exponential, Laplace, Pareto) are trivially true, because their associated constraints are equivalent to the assignment of their entropy. They are included anyway because their constraint is related to a common or easily measured quantity.
For reference,

\Gamma(x) = \int_0^\infty e^{-t}\, t^{x-1}\, dt

is the gamma function,

\psi(x) = \frac{d}{dx} \ln\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}

is the digamma function,

B(p, q) = \frac{\Gamma(p)\,\Gamma(q)}{\Gamma(p+q)}

is the beta function, and \gamma_E is the Euler–Mascheroni constant.
Distribution name | Probability density / mass function | Maximum entropy constraint | Support
---|---|---|---
Uniform (discrete) | f(k) = \frac{1}{b-a+1} | None | \{a, a+1, \ldots, b-1, b\}
Uniform (continuous) | f(x) = \frac{1}{b-a} | None | [a, b]
Bernoulli | f(k) = p^k (1-p)^{1-k} | \operatorname{E}[K] = p | \{0, 1\}
Geometric | f(k) = (1-p)^{k-1} p | \operatorname{E}[K] = \frac{1}{p} | \mathbb{N}\smallsetminus\{0\} = \{1, 2, 3, \ldots\}
Exponential | f(x) = \lambda \exp(-\lambda x) | \operatorname{E}[X] = \frac{1}{\lambda} | [0, \infty)
Laplace | f(x) = \frac{1}{2b} \exp\left(-\frac{\lvert x-\mu\rvert}{b}\right) | \operatorname{E}[\lvert X-\mu\rvert] = b | (-\infty, \infty)
Asymmetric Laplace | f(x) = \frac{\lambda \exp\left(-(x-m)\lambda s \kappa^s\right)}{\kappa + 1/\kappa} where s \equiv \operatorname{sgn}(x-m) | \operatorname{E}[(X-m) s \kappa^s] = \frac{1}{\lambda} | (-\infty, \infty)
Pareto | f(x) = \frac{\alpha x_m^\alpha}{x^{\alpha+1}} | \operatorname{E}[\ln X] = \frac{1}{\alpha} + \ln(x_m) | [x_m, \infty)
Normal | f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) | \operatorname{E}[X] = \mu,~ \operatorname{E}[X^2] = \sigma^2 + \mu^2 | (-\infty, \infty)
Truncated normal | (see Truncated normal distribution) | \operatorname{E}[X] = \mu_T,~ \operatorname{E}[X^2] = \sigma_T^2 + \mu_T^2 | [a, b]
von Mises | f(\theta) = \frac{1}{2\pi I_0(\kappa)} \exp(\kappa \cos(\theta - \mu)) | \operatorname{E}[\cos\Theta] = \frac{I_1(\kappa)}{I_0(\kappa)} \cos\mu,~ \operatorname{E}[\sin\Theta] = \frac{I_1(\kappa)}{I_0(\kappa)} \sin\mu | [0, 2\pi)
Rayleigh | f(x) = \frac{x}{\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right) | \operatorname{E}[X^2] = 2\sigma^2,~ \operatorname{E}[\ln X] = \frac{\ln(2\sigma^2) - \gamma_E}{2} | [0, \infty)
Beta | f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} for 0 \le x \le 1 | \operatorname{E}[\ln X] = \psi(\alpha) - \psi(\alpha+\beta),~ \operatorname{E}[\ln(1-X)] = \psi(\beta) - \psi(\alpha+\beta) | [0, 1]
Cauchy | f(x) = \frac{1}{\pi(1+x^2)} | \operatorname{E}[\ln(1+X^2)] = 2\ln 2 | (-\infty, \infty)
Chi | f(x) = \frac{2}{2^{k/2}\Gamma(k/2)} x^{k-1} \exp\left(-\frac{x^2}{2}\right) | \operatorname{E}[X^2] = k,~ \operatorname{E}[\ln X] = \frac{1}{2}\left[\psi\left(\frac{k}{2}\right) + \ln(2)\right] | [0, \infty)
Chi-squared | f(x) = \frac{1}{2^{k/2}\Gamma(k/2)} x^{\frac{k}{2}-1} \exp\left(-\frac{x}{2}\right) | \operatorname{E}[X] = k,~ \operatorname{E}[\ln X] = \psi\left(\frac{k}{2}\right) + \ln(2) | [0, \infty)
Erlang | f(x) = \frac{\lambda^k}{(k-1)!} x^{k-1} \exp(-\lambda x) | \operatorname{E}[X] = k/\lambda,~ \operatorname{E}[\ln X] = \psi(k) - \ln(\lambda) | [0, \infty)
Gamma | f(x) = \frac{x^{k-1} \exp\left(-\frac{x}{\theta}\right)}{\theta^k \Gamma(k)} | \operatorname{E}[X] = k\theta,~ \operatorname{E}[\ln X] = \psi(k) + \ln\theta | [0, \infty)
Lognormal | f(x) = \frac{1}{\sigma x \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right) | \operatorname{E}[\ln X] = \mu,~ \operatorname{E}[(\ln X)^2] = \sigma^2 + \mu^2 | (0, \infty)
Maxwell–Boltzmann | f(x) = \frac{1}{a^3}\sqrt{\frac{2}{\pi}}\, x^2 \exp\left(-\frac{x^2}{2a^2}\right) | \operatorname{E}[X^2] = 3a^2,~ \operatorname{E}[\ln X] = 1 + \ln\left(\frac{a}{\sqrt{2}}\right) - \frac{\gamma_E}{2} | [0, \infty)
Weibull | f(x) = \frac{k}{\lambda^k} x^{k-1} \exp\left(-\frac{x^k}{\lambda^k}\right) | \operatorname{E}[X^k] = \lambda^k,~ \operatorname{E}[\ln X] = \ln(\lambda) - \frac{\gamma_E}{k} | [0, \infty)
Multivariate normal | f_X(\vec{x}) = \frac{\exp\left(-\frac{1}{2}\left[\vec{x}-\vec{\mu}\right]^\top \Sigma^{-1} \left[\vec{x}-\vec{\mu}\right]\right)}{\sqrt{(2\pi)^N \lvert\Sigma\rvert}} | \operatorname{E}[\vec{X}] = \vec{\mu},~ \operatorname{E}[(\vec{X}-\vec{\mu})(\vec{X}-\vec{\mu})^\top] = \Sigma | \mathbb{R}^N
Binomial | f(k) = \binom{n}{k} p^k (1-p)^{n-k} | \operatorname{E}[X] = \mu, with f in the class of n-generalized binomial distributions | \{0, \ldots, n\}
Poisson | f(k) = \frac{\lambda^k \exp(-\lambda)}{k!} | \operatorname{E}[X] = \lambda, with f in the class of \infty-generalized binomial distributions | \mathbb{N} = \{0, 1, \ldots\}
Logistic | f(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{4}\operatorname{sech}^2\left(\frac{x}{2}\right) | \operatorname{E}[X] = 0,~ \operatorname{E}[\ln(1+e^{-X})] = 1 | (-\infty, \infty)

The maximum entropy principle can be used to upper bound the entropy of statistical mixtures.[12]