In information theory, information dimension is an information measure for random vectors in Euclidean space, based on the normalized entropy of finely quantized versions of the random vectors. This concept was first introduced by Alfréd Rényi in 1959.[1]
Simply speaking, it is a measure of the fractal dimension of a probability distribution. It characterizes the growth rate of the Shannon entropy given by successively finer discretizations of the space.
In 2010, Wu and Verdú gave an operational characterization of Rényi information dimension as the fundamental limit of almost lossless data compression for analog sources under various regularity constraints of the encoder/decoder.
The entropy of a discrete random variable Z is
H_0(Z)=\sum_{z\in\mathrm{supp}(P_Z)}P_Z(z)\log_2\frac{1}{P_Z(z)}
where P_Z(z) is the probability that Z takes the value z, and \mathrm{supp}(P_Z)=\{z\mid z\in\mathcal{Z},P_Z(z)>0\}.
Let X be an arbitrary real-valued random variable. Given a positive integer m, we create a new discrete random variable
\langle X\rangle_m=\frac{\lfloor mX\rfloor}{m}
where \lfloor\cdot\rfloor is the floor operator, which maps a real number to the largest integer not exceeding it. Then
\underline{d}(X)=\liminf_{m\to\infty}\frac{H_0(\langle X\rangle_m)}{\log_2 m}
and
\bar{d}(X)=\limsup_{m\to\infty}\frac{H_0(\langle X\rangle_m)}{\log_2 m}
are called the lower and upper information dimensions of X respectively. When \underline{d}(X)=\bar{d}(X), the common value is called the information dimension of X:
d(X)=\lim_{m\to\infty}\frac{H_0(\langle X\rangle_m)}{\log_2 m}
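As a numerical illustration (a minimal sketch, not part of the original treatment), the ratio H_0(\langle X\rangle_m)/\log_2 m can be estimated from i.i.d. samples with a plug-in entropy estimate, assuming m is kept small relative to the sample size so that the empirical entropy is reliable:

```python
import numpy as np

def entropy_bits(values):
    """Plug-in Shannon entropy (in bits) of an empirical discrete distribution."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_dim_estimate(samples, m):
    """Estimate H_0(<X>_m) / log2(m), with <X>_m = floor(m*X)/m."""
    quantized = np.floor(m * samples) / m
    return entropy_bits(quantized) / np.log2(m)

rng = np.random.default_rng(0)
m = 2**12  # keep m well below the number of samples so the estimate is not saturated

# Absolutely continuous distribution: the ratio approaches 1.
print(info_dim_estimate(rng.uniform(size=200_000), m))

# Purely discrete distribution (4 atoms): the ratio tends to 0 as m grows.
print(info_dim_estimate(rng.integers(0, 4, size=200_000) / 4.0, m))
```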
Some important properties of the information dimension d(X):
- If the mild condition H(\lfloor X\rfloor)<\infty is fulfilled, then 0\leq\underline{d}(X)\leq\bar{d}(X)\leq1.
- For an n-dimensional random vector \vec{X}, the first property generalizes to 0\leq\underline{d}(\vec{X})\leq\bar{d}(\vec{X})\leq n.
- It is sufficient to compute \underline{d}(X) and \bar{d}(X) along the exponential subsequence m=2^l.
If the information dimension d exists, one can define the d-dimensional entropy of the distribution by
H_{d(X)}(X)=\lim_{n\to\infty}\bigl(H_0(\langle X\rangle_n)-d(X)\log_2 n\bigr)
provided the limit exists. If d=0, the zero-dimensional entropy equals the standard Shannon entropy H_0(X); for integer dimension d=n\geq1, the n-dimensional entropy is the n-fold integral defining the respective differential entropy. For example, for X uniform on [0,1) one has H_0(\langle X\rangle_n)=\log_2 n, so d(X)=1 and H_1(X)=0, which is the differential entropy of the uniform distribution.
In 1994, Kawabata and Dembo proposed a new way of measuring information based on the rate-distortion function of a random variable. The measure is defined as
d_R(X)=2\lim_{D\to0}\frac{R(X,D)}{\log_2\frac{1}{D}},
where R(X,D) is the rate-distortion function under squared-error distortion,
R(X,D)=\min_{\mathbb{E}\|X-\hat{X}\|^2\leq D}I(X;\hat{X}),
that is, the minimum number of bits needed to represent X with expected distortion at most D.
They further proved that this definition is equivalent to the definition of information dimension. Formally,
d_R(X)=d(X).
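As an illustrative check (an assumed example, not from Kawabata and Dembo): for a Gaussian source with variance \sigma^2 under squared-error distortion, the closed form R(X,D)=\tfrac{1}{2}\log_2(\sigma^2/D) for D\leq\sigma^2 gives d_R(X)=1, matching the information dimension of an absolutely continuous distribution. A few lines of Python make the limit visible:

```python
import numpy as np

# Rate-distortion dimension of a Gaussian source (variance sigma2), using the
# closed-form R(X, D) = 0.5 * log2(sigma2 / D) bits for D <= sigma2.
sigma2 = 3.0
for D in (1e-2, 1e-5, 1e-10):
    R = 0.5 * np.log2(sigma2 / D)
    print(D, 2 * R / np.log2(1 / D))  # 2*R(X,D)/log2(1/D) -> d_R(X) = 1 as D -> 0
```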
Using the above definition of Rényi information dimension, a measure similar to the d-dimensional entropy, named the dimensional-rate bias, has been introduced. This value b(X) is defined so as to capture the finite term of the rate-distortion function in the small-distortion limit:
R(X,D)=-\frac{d(X)}{2}\log_2\left(\frac{2\pi eD}{d(X)}\right)+b(X).
The dimensional-rate bias equals the d-dimensional entropy for continuous, discrete, and discrete-continuous mixed distributions. Furthermore, it is calculable for a class of singular random variables, for which the d-dimensional entropy does not necessarily exist.
Finally, the dimensional-rate bias generalizes Shannon entropy and differential entropy, in that the mutual information I(X;Y) can be computed as
I(X;Y)=b(X)+b(Y)-b(X,Y).
For jointly continuous (X,Y), for instance, the dimensional-rate bias reduces to differential entropy and this identity recovers the familiar I(X;Y)=h(X)+h(Y)-h(X,Y).
According to the Lebesgue decomposition theorem,[2] a probability distribution can be uniquely represented by the mixture
v=pP_{X_d}+qP_{X_c}+rP_{X_s}
where p+q+r=1 and p,q,r\geq0; P_{X_d} is a purely atomic (discrete) probability measure, P_{X_c} is an absolutely continuous probability measure, and P_{X_s} is a probability measure that is singular with respect to Lebesgue measure yet has no atoms.
Let X be a random variable with H(\lfloor X\rfloor)<\infty. Assume the distribution of X can be represented as
v=(1-\rho)P_{X_d}+\rho P_{X_c}
where P_{X_d} is a discrete measure, P_{X_c} is an absolutely continuous probability measure, and 0\leq\rho\leq1. Then
d(X)=\rho.
Moreover, given the Shannon entropy H_0(P_{X_d}) and the differential entropy h(P_{X_c}), the d-dimensional entropy is simply given by
H_\rho(X)=(1-\rho)H_0(P_{X_d})+\rho h(P_{X_c})+H_0(\rho)
where H_0(\rho) is the Shannon entropy of a binary random variable Z with P_Z(1)=\rho and P_Z(0)=1-\rho, that is,
H_0(\rho)=\rho\log_2\frac{1}{\rho}+(1-\rho)\log_2\frac{1}{1-\rho}.
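A small helper sketch (with hypothetical function names) that evaluates these formulas, returning d(X)=\rho and H_\rho(X) from the mixture weight \rho, the Shannon entropy of the discrete part and the differential entropy of the continuous part; the rectified-Gaussian example that follows can be evaluated with it:

```python
import numpy as np

def binary_entropy(rho):
    """H_0(rho) = rho*log2(1/rho) + (1-rho)*log2(1/(1-rho)), in bits."""
    if rho in (0.0, 1.0):
        return 0.0
    return float(-rho * np.log2(rho) - (1 - rho) * np.log2(1 - rho))

def mixture_information(rho, H_discrete, h_continuous):
    """Return (d(X), H_rho(X)) for a discrete-continuous mixture."""
    d = rho
    H = (1 - rho) * H_discrete + rho * h_continuous + binary_entropy(rho)
    return d, H
```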
Consider a signal which has a Gaussian probability distribution.
We pass the signal through a half-wave rectifier which converts all negative values to 0 and maintains all other values. The half-wave rectifier can be characterized by the function
f(x)=\begin{cases}x,&x\geq0\\0,&x<0\end{cases}
Then, at the output of the rectifier, the signal has a rectified Gaussian distribution. It is characterized by an atomic mass of weight 0.5 at zero and a Gaussian PDF for all x>0.
With this mixture distribution, we apply the formula above and get the information dimension
d(X)=\rho=0.5.
The normalized right part of the zero-mean Gaussian distribution has differential entropy
h(P_{X_c})=\frac{1}{2}\log_2(2\pi e\sigma^2)-1,
so the d-dimensional entropy is
\begin{align}
H_{0.5}(X)&=(1-0.5)(1\log_2 1)+0.5\,h(P_{X_c})+H_0(0.5)\\
&=0+\frac{1}{2}\left(\frac{1}{2}\log_2(2\pi e\sigma^2)-1\right)+1\\
&=\frac{1}{4}\log_2(2\pi e\sigma^2)+\frac{1}{2}\ \text{bit(s)}.
\end{align}
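The example can also be checked empirically (a sketch assuming \sigma=1 and Monte Carlo sampling): the dyadic increment H_0(\langle X\rangle_{2m})-H_0(\langle X\rangle_m) estimates d(X)\approx0.5, while H_0(\langle X\rangle_m)-0.5\log_2 m approaches H_{0.5}(X)=\tfrac{1}{4}\log_2(2\pi e)+\tfrac{1}{2}\approx1.52 bits:

```python
import numpy as np

def quantized_entropy(x, m):
    """Plug-in entropy (bits) of <X>_m = floor(m*X)/m from samples x."""
    _, counts = np.unique(np.floor(m * x), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
x = np.maximum(rng.normal(size=1_000_000), 0.0)  # half-wave rectified N(0, 1)

m = 512
H_m, H_2m = quantized_entropy(x, m), quantized_entropy(x, 2 * m)
print("d(X)  estimate:", H_2m - H_m)              # ~0.5
print("H_0.5 estimate:", H_m - 0.5 * np.log2(m))  # ~1.52 bits
print("closed form   :", 0.25 * np.log2(2 * np.pi * np.e) + 0.5)
```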
It is shown [3] that information dimension and differential entropy are tightly connected.
Let X be a random variable with a continuous density f(x). Suppose we divide the range of X into bins of length \Delta. By the mean value theorem, within each bin there exists a value x_i such that
f(x_i)\Delta=\int_{i\Delta}^{(i+1)\Delta}f(x)\,dx.
Consider the discretized random variable X^\Delta=x_i if i\Delta\leq X<(i+1)\Delta.
The probability of each point x_i is
P_{X^\Delta}(x_i)=\int_{i\Delta}^{(i+1)\Delta}f(x)\,dx=f(x_i)\Delta.
Let S=\operatorname{supp}(P_{X^\Delta}). The entropy of X^\Delta is
\begin{align}
H_0(X^\Delta)&=-\sum_{x_i\in S}P_{X^\Delta}(x_i)\log_2 P_{X^\Delta}(x_i)\\
&=-\sum_{x_i\in S}f(x_i)\Delta\log_2(f(x_i)\Delta)\\
&=-\sum_{x_i\in S}\Delta f(x_i)\log_2 f(x_i)-\sum_{x_i\in S}f(x_i)\Delta\log_2\Delta\\
&=-\sum_{x_i\in S}\Delta f(x_i)\log_2 f(x_i)-\log_2\Delta.
\end{align}
If we set \Delta=1/m and x_i=i/m, then
H_0(X^{1/m})=H_0(\langle X\rangle_m).
This yields
H_0(\langle X\rangle_m)=-\sum\frac{1}{m}f(x_i)\log_2 f(x_i)+\log_2 m,
and when m is large enough,
-\sum\Delta f(x_i)\log_2 f(x_i)\approx\int f(x)\log_2\frac{1}{f(x)}\,dx,
which is the differential entropy h(X) of the continuous random variable with density f(x). Hence
h(X)=\lim_{m\to\infty}\bigl(H_0(\langle X\rangle_m)-\log_2 m\bigr).
Comparing this with the d-dimensional entropy shows that the differential entropy is exactly the one-dimensional entropy:
h(X)=H_1(X).
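A quick numerical illustration of this limit (assuming a standard Gaussian as the test density, for which h(X)=\tfrac{1}{2}\log_2(2\pi e)\approx2.05 bits):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)        # standard Gaussian samples
m = 256
_, counts = np.unique(np.floor(m * x), return_counts=True)
p = counts / counts.sum()
H0 = float(-(p * np.log2(p)).sum())   # plug-in estimate of H_0(<X>_m)

print(H0 - np.log2(m))                  # empirical: close to h(X)
print(0.5 * np.log2(2 * np.pi * np.e))  # closed form: 2.047... bits
```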
In fact, this can be generalized to higher dimensions. Rényi shows that, if \vec{X} is a random vector in the n-dimensional Euclidean space \real^n with an absolutely continuous distribution whose probability density function is f_{\vec{X}}(\vec{x}), and if H_0(\langle\vec{X}\rangle_m)<\infty, then
d(\vec{X})=n
and
H_n(\vec{X})=\int\cdots\int f_{\vec{X}}(\vec{x})\log_2\frac{1}{f_{\vec{X}}(\vec{x})}\,d\vec{x},
if the integral exists.
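A minimal sketch of the vector case (illustrative assumption: two i.i.d. components, each uniform on [0,1), so that n=2):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2, 32
x = rng.uniform(size=(500_000, n))     # random vector in [0,1)^2

# Joint quantization: each component is mapped to floor(m * x_i).
cells = np.floor(m * x).astype(int)
_, counts = np.unique(cells, axis=0, return_counts=True)
p = counts / counts.sum()
H0 = float(-(p * np.log2(p)).sum())

print(H0 / np.log2(m))   # -> n = 2, since H_0 = 2*log2(m) up to sampling error
```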
The information dimension of a distribution gives a theoretical upper bound on the compression rate when one wants to compress a variable coming from this distribution. In the context of lossless data compression, we try to compress real numbers with fewer real numbers, both having infinite precision.
The main objective of lossless data compression is to find efficient representations of source realizations x^n\in\mathcal{X}^n by y^n\in\mathcal{Y}^n. An (n,k)-coding scheme for \{X_i:i\in\mathbb{N}\} consists of two mappings: an encoder f_n:\mathcal{X}^n\to\mathcal{Y}^k, which converts the source block into its low-dimensional representation, and a decoder g_n:\mathcal{Y}^k\to\mathcal{X}^n, which recovers it. The block error probability is \mathbb{P}\{g_n(f_n(X^n))\neq X^n\}.
Define r(\epsilon) to be the infimum of r\geq0 such that there exists a sequence of (n,\lfloor rn\rfloor)-coding schemes satisfying \mathbb{P}\{g_n(f_n(X^n))\neq X^n\}\leq\epsilon for all sufficiently large n.
So r(\epsilon) characterizes the ratio between the code length and the source block length for a tolerated error probability \epsilon; it measures how well a given family of encoder-decoder pairs compresses the source.
Consider a continuous encoder function f(x):\real^n\to\real^{\lfloor Rn\rfloor} and a continuous decoder function g(x):\real^{\lfloor Rn\rfloor}\to\real^n. If we impose no further regularity on f(x) and g(x), then, due to the rich structure of \real, the minimum \epsilon-achievable rate is R_0(\epsilon)=0 for every 0<\epsilon\leq1; that is, one can build encoder-decoder pairs with arbitrarily large compression ratio.
In order to get some nontrivial and meaningful conclusions, let R^*(\epsilon) denote the minimum \epsilon-achievable rate under regularity constraints on the encoder and decoder (for instance, linear encoders with Borel-measurable decoders). If the random variable X has a discrete-continuous mixed distribution, then R^*(\epsilon)=d(X) for all 0<\epsilon\leq1. Moreover, under Lipschitz regularity constraints, if \bar{d}(X)<\infty then any \epsilon-achievable rate satisfies R(\epsilon)\geq\bar{d}(X) for all 0<\epsilon\leq1.
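The operational meaning of d(X) as a compression rate can be illustrated with a hedged sketch (an intuition-level construction, not the coding scheme behind the theorem): for the sparse source X_i=B_iU_i with B_i\sim\mathrm{Bernoulli}(\rho) and U_i continuous, the marginal is a discrete-continuous mixture with d(X)=\rho, and in a block of n symbols only about \rho n entries carry real-valued information, so roughly \rho n real numbers per block (plus the positions of the nonzero entries as discrete side information) suffice to describe the source losslessly.

```python
import numpy as np

# Illustrative sparse source X_i = B_i * U_i with B_i ~ Bernoulli(rho), U_i ~ N(0,1).
# Its marginal is a discrete-continuous mixture with d(X) = rho, and the fraction
# of coordinates that must be stored at full (real-valued) precision concentrates
# around rho -- the information-dimension compression rate.
rng = np.random.default_rng(3)
rho, n, blocks = 0.3, 10_000, 20

rates = []
for _ in range(blocks):
    x = rng.binomial(1, rho, size=n) * rng.normal(size=n)
    nonzero = np.flatnonzero(x)      # positions kept as discrete side information
    rates.append(len(nonzero) / n)   # real numbers stored per source symbol

print("d(X) =", rho, " empirical rate ~", float(np.mean(rates)))
```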
The fundamental role of information dimension in lossless data compression further extends beyond i.i.d. data. It has been shown that for certain processes (e.g., moving-average processes) the fundamental limit of lossless compression is likewise equal to the information dimension rate.[5] This result allows further compression that would not be possible by considering only the marginal distribution of the process.