In information geometry, the Fisher information metric[1] is a particular Riemannian metric which can be defined on a smooth statistical manifold, i.e., a smooth manifold whose points are probability measures defined on a common probability space. It can be used to calculate the informational difference between measurements.
The metric is interesting in several respects. By Chentsov’s theorem, the Fisher information metric on statistical models is the only Riemannian metric (up to rescaling) that is invariant under sufficient statistics.[2] [3]
It can also be understood to be the infinitesimal form of the relative entropy (i.e., the Kullback–Leibler divergence); specifically, it is the Hessian of the divergence. Alternately, it can be understood as the metric induced by the flat space Euclidean metric, after appropriate changes of variable. When extended to complex projective Hilbert space, it becomes the Fubini–Study metric; when written in terms of mixed states, it is the quantum Bures metric.
Considered purely as a matrix, it is known as the Fisher information matrix. Considered as a measurement technique, where it is used to estimate hidden parameters in terms of observed random variables, it is known as the observed information.
Given a statistical manifold with coordinates $\theta=(\theta_1,\theta_2,\ldots,\theta_n)$, one writes $p(x\mid\theta)$ for the probability distribution as a function of $\theta$. Here $x$ is drawn from the value space $R$ of a (discrete or continuous) random variable $X$. The probability is normalized so that $\int_R p(x\mid\theta)\,dx=1$.
The Fisher information metric then takes the form:
$$g_{jk}(\theta) = -\int_R \frac{\partial^2 \log p(x\mid\theta)}{\partial\theta_j\,\partial\theta_k}\; p(x\mid\theta)\,dx.$$
The integral is performed over all values $x$ in $R$. The variable $\theta$ is now a coordinate on a Riemannian manifold; the labels $j$ and $k$ index the local coordinate axes (i.e., the axes of the tangent space).
When the probability is derived from the Gibbs measure, as it would be for any Markovian process, then $\theta$ can also be understood to be a Lagrange multiplier; Lagrange multipliers are used to enforce constraints, such as holding the expectation value of some quantity constant.
Substituting $i(x\mid\theta) = -\log p(x\mid\theta)$ from information theory, an equivalent form of the above definition is:

$$g_{jk}(\theta) = \int_R \frac{\partial^2 i(x\mid\theta)}{\partial\theta_j\,\partial\theta_k}\; p(x\mid\theta)\,dx = \mathrm{E}\!\left[\frac{\partial^2 i(x\mid\theta)}{\partial\theta_j\,\partial\theta_k}\right].$$
To show that the equivalent form equals the above definition, note that

$$\mathrm{E}\!\left[\frac{\partial \log p(x\mid\theta)}{\partial\theta_j}\right]=0$$

and apply $\frac{\partial}{\partial\theta_k}$ to both sides.
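As a concrete check, the two expressions can be compared numerically for a simple one-parameter family. The following minimal Python sketch (illustrative, not from the original text; the exponential family and the Monte Carlo sample size are arbitrary choices) estimates both the expected squared score and the negative expected second derivative of $\log p$, which should both equal $1/\theta^2$.

```python
# Illustrative sketch: for the exponential distribution
# p(x|theta) = theta * exp(-theta * x), both forms of the Fisher
# information equal 1/theta^2.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.7
x = rng.exponential(scale=1.0 / theta, size=1_000_000)

# score = d/dtheta log p(x|theta) = 1/theta - x
score = 1.0 / theta - x
# second derivative: d^2/dtheta^2 log p(x|theta) = -1/theta^2 (constant in x)
hessian = -np.ones_like(x) / theta**2

g_from_score = np.mean(score**2)        # E[(d log p / d theta)^2]
g_from_hessian = -np.mean(hessian)      # -E[d^2 log p / d theta^2]
print(g_from_score, g_from_hessian, 1.0 / theta**2)   # all ~ 0.346
```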
The Fisher information metric is particularly simple for the exponential family, which has

$$p(x\mid\theta) = \exp\!\bigl[\,\eta(\theta)\cdot T(x) - A(\theta) + B(x)\,\bigr].$$

The metric is

$$g_{jk}(\theta) = \partial_j\partial_k A(\theta) - \partial_j\partial_k\,\eta(\theta)\cdot \mathrm{E}[T(x)].$$

The metric has a particularly simple form if we are using the natural parameters. In this case $\eta(\theta)=\theta$, so the metric is just $\nabla^2_\theta A$.
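As an illustration (not from the original text; the Bernoulli family in its log-odds parameterization and the finite-difference check are choices made here), the log-partition function is $A(\theta)=\log(1+e^\theta)$, and its second derivative reproduces the familiar Fisher information $p(1-p)$.

```python
# Illustrative sketch: Bernoulli in natural parameter theta (the log-odds).
# A(theta) = log(1 + exp(theta)), and g(theta) = A''(theta) = p(1-p).
import numpy as np

theta = 0.4
p = 1.0 / (1.0 + np.exp(-theta))        # mean parameter

# Second derivative of A via a central finite difference
eps = 1e-4
A = lambda t: np.log1p(np.exp(t))
g_natural = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2

print(g_natural, p * (1 - p))           # both ~ 0.240
```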
For the multivariate normal distribution $\mathcal N(\mu,\Sigma)$, let $T=\Sigma^{-1}$ be the precision matrix. The metric splits into a mean part and a precision/variance part, because $g_{\mu,T}=0$. The mean part is the precision matrix:

$$g_{\mu_i,\mu_j} = T_{ij},$$

while the precision part is

$$g_{T,T} = -\tfrac12\,\nabla^2_T \ln\det T.$$
In particular, for the single-variable normal distribution, the metric is

$$g=\begin{bmatrix} t & 0\\ 0 & (2t^2)^{-1}\end{bmatrix}$$

in the coordinates $(\mu,t)$, or equivalently

$$g=\sigma^{-2}\begin{bmatrix} 1 & 0\\ 0 & 2\end{bmatrix}$$

in the coordinates $(\mu,\sigma)$. With $x=\mu/\sqrt2$ and $y=\sigma$, this is

$$ds^2 = 2\,\frac{dx^2+dy^2}{y^2},$$

i.e., twice the metric of the hyperbolic upper half-plane (the Poincaré half-plane model).
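The diagonal form in $(\mu,\sigma)$ can be verified numerically. The sketch below (illustrative, not from the original text; sample size and parameter values are arbitrary) estimates the covariance of the score of $\mathcal N(\mu,\sigma^2)$ by Monte Carlo and compares it with $\sigma^{-2}\operatorname{diag}(1,2)$.

```python
# Illustrative sketch: Monte Carlo estimate of the Fisher metric of
# N(mu, sigma^2) in the coordinates (mu, sigma); expected result is
# diag(1, 2) / sigma^2.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.3, 1.5
x = rng.normal(mu, sigma, size=2_000_000)

# Score components: d log p / d mu and d log p / d sigma
s_mu = (x - mu) / sigma**2
s_sigma = (x - mu) ** 2 / sigma**3 - 1.0 / sigma

scores = np.stack([s_mu, s_sigma])
g = scores @ scores.T / x.size          # 2x2 empirical score covariance
print(g)                                 # ~ [[1/sigma^2, 0], [0, 2/sigma^2]]
```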
The shortest paths (geodesics) between two univariate normal distributions are, in these coordinates, either segments parallel to the $\sigma$ axis or half-circular arcs centered on the $\mu/\sqrt2$ axis (the line $\sigma=0$), exactly as in the Poincaré half-plane model.
The geodesic connecting the degenerate limits $\delta_{\mu_0}$ and $\delta_{\mu_1}$ (that is, $\sigma\to 0$ at $\mu_0$ and at $\mu_1$) is the half-circular arc whose highest point has

$$\sigma = \frac{\mu_1-\mu_0}{2\sqrt2}.$$

Parameterized by the angle $\phi$ along the arc, the arc length along this geodesic is $s=\sqrt2\,\ln\tan(\phi/2)$.
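Because the metric is twice a Poincaré half-plane metric, the Fisher–Rao distance between two univariate normals has a closed form in these coordinates. The sketch below (illustrative; the helper name `fisher_rao_normal` is hypothetical, and the formula is the standard half-plane distance rescaled by $\sqrt2$) computes it.

```python
# Illustrative sketch: Fisher-Rao distance between N(mu1, sigma1^2) and
# N(mu2, sigma2^2), using x = mu / sqrt(2), y = sigma and twice the
# Poincare half-plane metric (so half-plane distances scale by sqrt(2)).
import numpy as np

def fisher_rao_normal(mu1, sigma1, mu2, sigma2):
    dx2 = (mu1 - mu2) ** 2 / 2.0          # (x1 - x2)^2
    dy2 = (sigma1 - sigma2) ** 2
    return np.sqrt(2.0) * np.arccosh(1.0 + (dx2 + dy2) / (2.0 * sigma1 * sigma2))

print(fisher_rao_normal(0.0, 1.0, 0.0, 2.0))   # pure scale change: sqrt(2)*ln(2)
```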
Alternatively, the metric can be obtained as the second derivative of the relative entropy or Kullback–Leibler divergence.[4] To obtain this, one considers two probability distributions
$P(\theta)$ and $P(\theta_0)$, which are infinitesimally close to one another, so that

$$P(\theta) = P(\theta_0) + \sum_j \Delta\theta_j \left.\frac{\partial P}{\partial\theta_j}\right|_{\theta_0}$$

with $\Delta\theta_j$ an infinitesimally small change of $\theta$ in the $j$ direction.
Then, since the Kullback–Leibler divergence $D_{\mathrm{KL}}[P(\theta_0)\|P(\theta)]$ has an absolute minimum of 0 when $P(\theta)=P(\theta_0)$, one has an expansion up to second order around $\theta=\theta_0$ of the form

$$f_{\theta_0}(\theta) := D_{\mathrm{KL}}[P(\theta_0)\|P(\theta)] = \frac12\sum_{jk}\Delta\theta_j\,\Delta\theta_k\, g_{jk}(\theta_0) + \mathrm{O}(\Delta\theta^3).$$

The symmetric matrix $g_{jk}$ is positive (semi-)definite and is the Hessian matrix of the function $f_{\theta_0}(\theta)$ at the stationary point $\theta_0$.
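As a numerical illustration (not from the original text; the Bernoulli family and finite-difference step are choices made here), the second derivative of $D_{\mathrm{KL}}$ at $\theta_0$ recovers the Fisher information $1/(\theta_0(1-\theta_0))$.

```python
# Illustrative sketch: for Bernoulli(theta), the Hessian of the KL divergence
# at theta0 should equal the Fisher information 1 / (theta0 * (1 - theta0)).
import numpy as np

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta0, eps = 0.3, 1e-4
f = lambda t: kl_bernoulli(theta0, t)
hessian = (f(theta0 + eps) - 2 * f(theta0) + f(theta0 - eps)) / eps**2

print(hessian, 1.0 / (theta0 * (1 - theta0)))   # both ~ 4.76
```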
The Ruppeiner metric and Weinhold metric arise as the Fisher information metric calculated for Gibbs distributions, such as those found in equilibrium statistical mechanics.[5] [6]
The action of a curve on a Riemannian manifold is given by
$$A = \frac12 \int_a^b \frac{\partial\theta_j}{\partial t}\; g_{jk}(\theta)\; \frac{\partial\theta_k}{\partial t}\; dt$$
The path parameter here is time t; this action can be understood to give the change in free entropy of a system as it is moved from time a to time b.[6] Specifically, one has
$$\Delta S = (b-a)\,A$$
as the change in free entropy. This observation has resulted in practical applications in the chemical and processing industries: to minimize the change in free entropy of a system, one should follow the minimum (geodesic) path between the desired endpoints of the process. The geodesic minimizes the entropy change, as a consequence of the Cauchy–Schwarz inequality, which states that the action is bounded below by the squared length of the curve.
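The Cauchy–Schwarz bound can be checked on a discretized path. The sketch below (illustrative, not from the original text; the particular path $\theta(t)=(\sin t,\,1+t^2)$ through the univariate normal family is hypothetical) computes the action $A$ and the curve length $L$ and verifies $2(b-a)A \ge L^2$.

```python
# Illustrative sketch: discretize a path theta(t) = (mu(t), sigma(t)) through
# the univariate normals, compute the action A and the curve length L, and
# check the Cauchy-Schwarz bound 2*(b - a)*A >= L**2.
import numpy as np

a, b, n = 0.0, 1.0, 10_000
t = np.linspace(a, b, n)
dt = t[1] - t[0]
mu, sigma = np.sin(t), 1.0 + t**2                # a hypothetical path
dmu, dsigma = np.gradient(mu, t), np.gradient(sigma, t)

# Quadratic form theta'^T g theta' with g = diag(1, 2) / sigma^2
speed2 = (dmu**2 + 2.0 * dsigma**2) / sigma**2

A = 0.5 * np.sum(speed2) * dt                    # action (Riemann sum)
L = np.sum(np.sqrt(speed2)) * dt                 # curve length
print(2 * (b - a) * A >= L**2)                   # True
```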
The Fisher metric also allows the action and the curve length to be related to the Jensen–Shannon divergence.[6] Specifically, one has
$$(b-a)\int_a^b \frac{\partial\theta_j}{\partial t}\; g_{jk}\; \frac{\partial\theta_k}{\partial t}\; dt = 8\int_a^b dJSD$$
where the integrand dJSD is understood to be the infinitesimal change in the Jensen–Shannon divergence along the path taken. Similarly, for the curve length, one has
$$\int_a^b \sqrt{\frac{\partial\theta_j}{\partial t}\; g_{jk}\; \frac{\partial\theta_k}{\partial t}}\;\, dt = \int_a^b \sqrt{8\, dJSD}.$$
That is, the square root of the Jensen–Shannon divergence is just the Fisher metric (divided by the square root of 8).
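The infinitesimal relation can be illustrated numerically: for two nearby Bernoulli distributions, the Jensen–Shannon divergence is approximately one-eighth of the squared Fisher line element (a sketch, not from the original text; the parameter values are arbitrary).

```python
# Illustrative sketch: for nearby Bernoulli distributions theta and theta + d,
# JSD ~ (1/8) * g(theta) * d^2 with g(theta) = 1 / (theta * (1 - theta)).
import numpy as np

def kl(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

theta, d = 0.3, 1e-3
g = 1.0 / (theta * (1 - theta))
print(jsd(theta, theta + d), g * d**2 / 8.0)    # both ~ 5.95e-7
```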
For a discrete probability space, that is, a probability space on a finite set of objects, the Fisher metric can be understood to simply be the Euclidean metric restricted to a positive orthant (e.g. a "quadrant" in $\mathbb R^2$) of a unit sphere, after appropriate changes of variable.
Consider a flat, Euclidean space of dimension $N+1$, parametrized by points $y=(y_0,\ldots,y_N)$. The metric for Euclidean space is given by
$$h = \sum_{i=0}^N dy_i\; dy_i$$
where the $dy_i$ are 1-forms; they are the basis vectors for the cotangent space. Writing $\frac{\partial}{\partial y_j}$ for the basis vectors of the tangent space, so that

$$dy_j\!\left(\frac{\partial}{\partial y_k}\right) = \delta_{jk},$$
the Euclidean metric may be written as
$$h^{\mathrm{flat}}_{jk} = h\!\left(\frac{\partial}{\partial y_j},\, \frac{\partial}{\partial y_k}\right) = \delta_{jk}.$$
The superscript 'flat' is there to remind that, when written in coordinate form, this metric is with respect to the flat-space coordinate $y$.
An N-dimensional unit sphere embedded in (N + 1)-dimensional Euclidean space may be defined as
$$\sum_{i=0}^N y_i^2 = 1.$$
This embedding induces a metric on the sphere; it is inherited directly from the Euclidean metric on the ambient space. It takes exactly the same form as the above, taking care to ensure that the coordinates are constrained to lie on the surface of the sphere. This can be done, e.g., with the technique of Lagrange multipliers.
Consider now the change of variable $p_i=y_i^2$. The sphere condition then becomes the probability normalization condition

$$\sum_i p_i = 1,$$
while the metric becomes
$$h = \sum_i dy_i\; dy_i = \sum_i d\sqrt{p_i}\; d\sqrt{p_i} = \frac14 \sum_i \frac{dp_i\; dp_i}{p_i} = \frac14 \sum_i p_i\; d(\log p_i)\; d(\log p_i).$$
The last can be recognized as one-fourth of the Fisher information metric. To complete the process, recall that the probabilities are parametric functions of the manifold variables $\theta$, that is, one has $p_i=p_i(\theta)$. Thus, the above induces a metric on the parameter manifold:
$$h = \frac14\sum_i p_i(\theta)\; d(\log p_i(\theta))\; d(\log p_i(\theta)) = \frac14 \sum_{jk}\sum_i p_i(\theta)\, \frac{\partial\log p_i(\theta)}{\partial\theta_j}\, \frac{\partial\log p_i(\theta)}{\partial\theta_k}\; d\theta_j\, d\theta_k$$
or, in coordinate form, the Fisher information metric is:
\begin{align}
g_{jk}(\theta) = 4\, h^{\mathrm{fisher}}_{jk} &= 4\, h\!\left(\frac{\partial}{\partial\theta_j},\, \frac{\partial}{\partial\theta_k}\right)\\
&= \sum_i p_i(\theta)\, \frac{\partial\log p_i(\theta)}{\partial\theta_j}\, \frac{\partial\log p_i(\theta)}{\partial\theta_k}\\
&= \mathrm{E}\!\left[\frac{\partial\log p_i(\theta)}{\partial\theta_j}\, \frac{\partial\log p_i(\theta)}{\partial\theta_k}\right]
\end{align}
where, as before,
$$d\theta_j\!\left(\frac{\partial}{\partial\theta_k}\right) = \delta_{jk}.$$
The superscript 'fisher' is present to remind that this expression is applicable for the coordinates $\theta$; whereas the non-coordinate form is the same as the Euclidean (flat-space) metric.
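For a concrete discrete example (illustrative, not from the original text; the parameterization of the 2-simplex by $\theta=(p_1,p_2)$ with $p_3=1-p_1-p_2$ is a choice made here), four times the pullback of the Euclidean metric through $y_i=\sqrt{p_i}$ agrees with the Fisher information matrix.

```python
# Illustrative sketch: categorical distribution on 3 outcomes with
# theta = (p1, p2), p3 = 1 - p1 - p2.  Compare 4 * J^T J, where J is the
# Jacobian of y_i = sqrt(p_i(theta)), with the Fisher matrix
# g_jk = sum_i p_i (dlog p_i/dtheta_j)(dlog p_i/dtheta_k)
#      = sum_i (1/p_i) (dp_i/dtheta_j)(dp_i/dtheta_k).
import numpy as np

theta = np.array([0.2, 0.5])
p = np.array([theta[0], theta[1], 1.0 - theta.sum()])

# dp_i / dtheta_j  (3 x 2)
dp = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [-1.0, -1.0]])

J = dp / (2.0 * np.sqrt(p)[:, None])          # dy_i / dtheta_j
g_sphere = 4.0 * J.T @ J                      # 4 x Euclidean pullback
g_fisher = (dp / p[:, None]).T @ dp           # sum_i (1/p_i) dp_i dp_i
print(np.allclose(g_sphere, g_fisher))        # True
```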
When the random variable $p$ is not discrete, but continuous, the argument still holds.
The above manipulations deriving the Fisher metric from the Euclidean metric can be extended to complex projective Hilbert spaces. In this case, one obtains the Fubini–Study metric.[8] This should perhaps be no surprise, as the Fubini–Study metric provides the means of measuring information in quantum mechanics. The Bures metric, also known as the Helstrom metric, is identical to the Fubini–Study metric,[8] although the latter is usually written in terms of pure states, as below, whereas the Bures metric is written for mixed states. By setting the phase of the complex coordinate to zero, one obtains exactly one-fourth of the Fisher information metric, just as above.
One begins with the same trick, of constructing a probability amplitude, written in polar coordinates, so:
$$\psi(x;\theta) = \sqrt{p(x;\theta)}\; e^{i\alpha(x;\theta)}$$
Here, $\psi(x;\theta)$ is the probability amplitude; $p(x;\theta)$ and $\alpha(x;\theta)$ are strictly real. The previous calculations are obtained by setting $\alpha(x;\theta)=0$. The usual condition that probabilities lie within a simplex, namely that

$$\int_X p(x;\theta)\,dx=1,$$
is equivalently expressed by requiring that the squared amplitude be normalized:
$$\int_X \vert\psi(x;\theta)\vert^2\,dx=1.$$
When $\psi(x;\theta)$ is real, this is the surface of a sphere.
The Fubini–Study metric, written in infinitesimal form, using quantum-mechanical bra–ket notation, is
$$ds^2 = \frac{\langle\delta\psi\mid\delta\psi\rangle}{\langle\psi\mid\psi\rangle} - \frac{\langle\delta\psi\mid\psi\rangle\, \langle\psi\mid\delta\psi\rangle}{\langle\psi\mid\psi\rangle^2}.$$
In this notation, one has that $\langle x\mid\psi\rangle=\psi(x;\theta)$ and the integral over the entire measure space $X$ is written as

$$\langle\phi\mid\psi\rangle = \int_X \phi^*(x;\theta)\,\psi(x;\theta)\,dx.$$
The expression $\vert\delta\psi\rangle$ can be understood to be an infinitesimal variation; equivalently, it can be understood to be a 1-form in the cotangent space. Using the infinitesimal notation, the polar form of the probability above is simply

$$\delta\psi = \left(\frac{\delta p}{2p} + i\,\delta\alpha\right)\psi.$$
Inserting the above into the Fubini–Study metric gives:
\begin{align}
ds^2 ={}& \frac14\int_X (\delta\log p)^2\; p\,dx\\
&{}+ \int_X (\delta\alpha)^2\; p\,dx - \left(\int_X \delta\alpha\; p\,dx\right)^2\\
&{}+ \frac{i}{2}\int_X (\delta\log p\,\delta\alpha - \delta\alpha\,\delta\log p)\; p\,dx
\end{align}
Setting $\delta\alpha=0$ in the above makes it clear that the first term is (one-fourth of) the Fisher information metric. The full form of the above can be made slightly clearer by changing notation to that of standard Riemannian geometry, so that the metric becomes a symmetric 2-form acting on the tangent space. The change of notation is made by replacing $\delta\to d$ and $ds^2\to h$ and noting that the integrals are just expectation values; so:
\begin{align}
h ={}& \frac14\,\mathrm{E}\!\left[(d\log p)^2\right] + \mathrm{E}\!\left[(d\alpha)^2\right] - \bigl(\mathrm{E}\!\left[d\alpha\right]\bigr)^2\\
&{}+ \frac{i}{2}\,\mathrm{E}\!\left[d\log p\wedge d\alpha\right]
\end{align}
The imaginary term is a symplectic form; it is the Berry phase or geometric phase. In index notation, the metric is:
\begin{align}
h_{jk} ={}& h\!\left(\frac{\partial}{\partial\theta_j},\, \frac{\partial}{\partial\theta_k}\right)\\
={}& \frac14\,\mathrm{E}\!\left[\frac{\partial\log p}{\partial\theta_j}\, \frac{\partial\log p}{\partial\theta_k}\right] + \mathrm{E}\!\left[\frac{\partial\alpha}{\partial\theta_j}\, \frac{\partial\alpha}{\partial\theta_k}\right] - \mathrm{E}\!\left[\frac{\partial\alpha}{\partial\theta_j}\right] \mathrm{E}\!\left[\frac{\partial\alpha}{\partial\theta_k}\right]\\
&{}+ \frac{i}{2}\,\mathrm{E}\!\left[\frac{\partial\log p}{\partial\theta_j}\, \frac{\partial\alpha}{\partial\theta_k} - \frac{\partial\alpha}{\partial\theta_j}\, \frac{\partial\log p}{\partial\theta_k}\right]
\end{align}
Again, the first term can be clearly seen to be (one fourth of) the Fisher information metric, by setting $\alpha=0$.
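For a finite-dimensional illustration (not from the original text; the three-outcome distribution and the particular variation $\delta p$ are choices made here), take a real amplitude $\psi_i=\sqrt{p_i}$ and compare the Fubini–Study line element evaluated on a small variation with one-fourth of the Fisher line element $\frac14\sum_i \delta p_i^2/p_i$.

```python
# Illustrative sketch: with alpha = 0 and psi_i = sqrt(p_i), the Fubini-Study
# line element for a small variation dp should be ~ (1/4) * sum_i dp_i^2 / p_i.
import numpy as np

p = np.array([0.2, 0.3, 0.5])
dp = 1e-4 * np.array([1.0, -2.0, 1.0])        # sums to 0, so sum(p + dp) = 1

psi = np.sqrt(p)
dpsi = np.sqrt(p + dp) - psi                  # real variation of the amplitude

norm = psi @ psi                               # = 1
ds2_fs = (dpsi @ dpsi) / norm - (dpsi @ psi) * (psi @ dpsi) / norm**2
ds2_fisher = np.sum(dp**2 / p)
print(ds2_fs, ds2_fisher / 4.0)               # both ~ 5.1e-9 (to leading order)
```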
A slightly more formal, abstract definition can be given, as follows.[9]
Let X be an orientable manifold, and let $(X,\Sigma,\mu)$ be a measure on X. Equivalently, let $(\Omega,\mathcal F,P)$ be a probability space on $\Omega=X$, with sigma-algebra $\mathcal F=\Sigma$ and probability $P=\mu$.
The statistical manifold S(X) of X is defined as the space of all measures $\mu$ on X (with the sigma-algebra $\Sigma$ held fixed).
Pick a point $\mu\in S(X)$ and consider the tangent space $T_\mu S$. The Fisher information metric is then an inner product on the tangent space. With some abuse of notation, one may write this as

$$g(\sigma_1,\sigma_2) = \int_X \frac{d\sigma_1}{d\mu}\, \frac{d\sigma_2}{d\mu}\; d\mu.$$
Here, $\sigma_1$ and $\sigma_2$ are vectors in the tangent space; that is, $\sigma_1,\sigma_2\in T_\mu S$. The abuse of notation is to write the tangent vectors as if they were Radon–Nikodym derivatives with respect to the measure $\mu$, which is used to carry out the integration over the whole space X.
In order for the integral to be well-defined, the space S(X) must have the Radon–Nikodym property, and more specifically, the tangent space is restricted to those vectors that are square-integrable. Square integrability is equivalent to saying that a Cauchy sequence converges to a finite value under the weak topology: the space contains its limit points. Note that Hilbert spaces possess this property.
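A discrete toy version of this inner product (illustrative, not from the original text; the finite base set and the particular signed measures are arbitrary) treats $\mu$ as a measure on a finite set and the tangent vectors as signed measures whose densities with respect to $\mu$ are pointwise ratios.

```python
# Illustrative sketch: on a finite set, mu is a vector of positive weights and
# a tangent vector sigma is a signed measure; its Radon-Nikodym derivative is
# the pointwise ratio sigma / mu, and g integrates the product of two of them.
import numpy as np

mu = np.array([0.2, 0.3, 0.5])                 # base measure (here a probability)
sigma1 = np.array([0.01, -0.03, 0.02])         # signed measures, each summing to 0
sigma2 = np.array([-0.02, 0.01, 0.01])

def g(s1, s2, mu):
    # g(s1, s2) = integral of (ds1/dmu)(ds2/dmu) dmu
    return np.sum((s1 / mu) * (s2 / mu) * mu)

print(g(sigma1, sigma2, mu))                   # -0.0016
```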
This definition of the metric can be seen to be equivalent to the previous, in several steps. First, one selects a submanifold of S(X) by considering only those measures $\mu$ that are parameterized by some smoothly varying parameter $\theta$. Then, if $\theta$ is finite-dimensional, so is the submanifold; likewise, the tangent space has the same dimension as $\theta$.
With some additional abuse of language, one notes that the exponential map provides a map from vectors in a tangent space to points in an underlying manifold. Thus, if $\sigma\in T_\mu S$ is a vector in the tangent space, then $p=\exp(\sigma)$ is the corresponding probability, a point $p\in S(X)$ (after transporting the exponential map to the point $\mu$). Conversely, given a point $p\in S(X)$, the logarithm gives a point in the tangent space (again, up to the appropriate transport to $\mu$).