In various science and engineering applications, such as independent component analysis,[1] image analysis,[2] genetic analysis,[3] speech recognition,[4] manifold learning,[5] and time delay estimation,[6] it is useful to estimate the differential entropy of a system or process, given some observations.
The simplest and most common approach uses histogram-based estimation, but other approaches have been developed and used, each with its own benefits and drawbacks.[7] The main factor in choosing a method is often a trade-off between the bias and the variance of the estimate,[8] although the nature of the (suspected) distribution of the data may also be a factor,[7] as well as the sample size and the size of the alphabet of the probability distribution.[9]
The histogram approach uses the idea that the differential entropy of a probability distribution f(x) for a continuous random variable x,

h(X) = -\int_X f(x) \log f(x) \, dx,

can be approximated by first approximating f(x) with a histogram of the observations, and then finding the discrete entropy of that quantization,

H(X) = -\sum_{i=1}^{n} f(x_i) \log \left( \frac{f(x_i)}{w(x_i)} \right),

with bin probabilities given by that histogram. The histogram is itself a maximum-likelihood (ML) estimate of the discretized frequency distribution, where w(x_i) is the width of the i-th bin.
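As a concrete illustration, here is a minimal Python sketch of this histogram estimate, interpreting f(x_i) as the probability mass of the i-th bin and w(x_i) as its width; the function name, the NumPy dependency, and the default bin count are illustrative choices rather than part of the cited method.

```python
import numpy as np

def histogram_entropy(samples, bins=30):
    """Histogram estimate of differential entropy, in nats.

    Approximates f(x) by a histogram and evaluates
    H = -sum_i p_i * log(p_i / w_i), where p_i is the bin
    probability and w_i is the bin width.
    """
    counts, edges = np.histogram(samples, bins=bins)
    widths = np.diff(edges)
    p = counts / counts.sum()
    nz = p > 0                              # skip empty bins (0 log 0 = 0)
    return -np.sum(p[nz] * np.log(p[nz] / widths[nz]))

# Example: a standard normal has differential entropy 0.5*log(2*pi*e) ≈ 1.42 nats
rng = np.random.default_rng(0)
print(histogram_entropy(rng.standard_normal(10_000)))
```

The result depends on the bin count: too few bins oversmooth the density, while too many leave bins nearly empty, which is one way the bias–variance trade-off mentioned above shows up in practice.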
A method better suited for multidimensional probability density functions (pdf) is to first make a pdf estimate with some method, and then, from the pdf estimate, compute the entropy. A useful pdf estimation method is, for example, Gaussian mixture modeling (GMM), in which the expectation-maximization (EM) algorithm is used to find an ML estimate of a weighted sum of Gaussian pdfs approximating the data pdf.
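As an illustration, the sketch below fits a GMM with scikit-learn's GaussianMixture (an assumed dependency) and then estimates the entropy by Monte Carlo averaging of -log f̂(x) over samples drawn from the fitted model, since a Gaussian mixture's differential entropy has no closed form; the function name, component count, and sample sizes are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_entropy(data, n_components=3, n_mc=100_000, seed=0):
    """Fit a GMM via EM, then estimate h(X) = -E[log f(X)] by Monte Carlo."""
    x = np.asarray(data, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat 1-D input as n points in 1 dimension
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(x)
    samples, _ = gmm.sample(n_mc)           # draw from the fitted mixture
    return -np.mean(gmm.score_samples(samples))   # score_samples gives log f_hat(x)

# Example: 2-D standard normal, true differential entropy log(2*pi*e) ≈ 2.84 nats
rng = np.random.default_rng(0)
print(gmm_entropy(rng.standard_normal((5_000, 2))))
```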
If the data is one-dimensional, we can imagine taking all the observations and putting them in order of their value. The spacing between one value and the next then gives us a rough idea of (the reciprocal of) the probability density in that region: the closer together the values are, the higher the probability density. This is a very rough estimate with high variance, but it can be improved, for example by considering the spacing between a given value and the one m places further along in the ordering, where m is some fixed number.[7]
The probability density estimated in this way can then be used to calculate the entropy estimate, in a similar way to that given above for the histogram, but with some slight tweaks.
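One common bias-corrected form of this idea is the m-spacing estimator sketched below; the log(m) − ψ(m) correction term and the small guard against tied values are standard choices and are not necessarily those of the cited reference.

```python
import numpy as np
from scipy.special import digamma

def m_spacing_entropy(samples, m=5):
    """Bias-corrected m-spacing estimate of differential entropy, in nats.

    Sorts the data and uses the spacing between each value and the one
    m positions further on as a local estimate of 1/density; the
    log(m) - digamma(m) term removes the asymptotic bias of the raw average.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    spacings = x[m:] - x[:-m]                    # x_(i+m) - x_(i) for i = 1..n-m
    spacings = np.maximum(spacings, 1e-12)       # guard against tied values
    return np.mean(np.log(n * spacings / m)) + np.log(m) - digamma(m)

# Example: standard normal, true value ≈ 1.42 nats
rng = np.random.default_rng(0)
print(m_spacing_entropy(rng.standard_normal(10_000), m=5))
```

Larger m smooths the density estimate, trading lower variance for more bias, in line with the trade-off noted above.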
One of the main drawbacks of this approach is that it does not extend easily beyond one dimension: the idea of lining the data points up in order breaks down in more than one dimension. However, using analogous methods, some multidimensional entropy estimators have been developed.[11][12]
For each point in our dataset, we can find the distance to its nearest neighbour. We can in fact estimate the entropy from the distribution of the nearest-neighbour distances of our data points.[7] (In a uniform distribution these distances all tend to be fairly similar, whereas in a strongly nonuniform distribution they may vary a lot more.)
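A well-known estimator of this kind is the Kozachenko–Leonenko nearest-neighbour estimator; the sketch below gives one standard k-nearest-neighbour formulation (Euclidean distances, result in nats) and is not taken from the cited references, so the exact constants and names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(data, k=1):
    """Kozachenko-Leonenko-type k-nearest-neighbour entropy estimate, in nats.

    h_hat = psi(n) - psi(k) + log(V_d) + (d/n) * sum_i log(r_i),
    where r_i is the Euclidean distance from point i to its k-th nearest
    neighbour and V_d is the volume of the d-dimensional unit ball.
    """
    x = np.asarray(data, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                       # treat 1-D input as n points in 1 dimension
    n, d = x.shape
    tree = cKDTree(x)
    # query k+1 neighbours: the closest "neighbour" of each point is the point itself
    r = tree.query(x, k=k + 1)[0][:, -1]
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(r))

# Example: 2-D standard normal, true value log(2*pi*e) ≈ 2.84 nats
rng = np.random.default_rng(0)
print(knn_entropy(rng.standard_normal((5_000, 2)), k=3))
```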
In the under-sampled regime, having a prior on the distribution can help the estimation. One such Bayesian estimator, proposed in the neuroscience context, is the NSB (Nemenman–Shafee–Bialek) estimator.[13][14] The NSB estimator uses a mixture-of-Dirichlet prior, chosen such that the induced prior over the entropy is approximately uniform.
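To illustrate the building block, the sketch below computes the posterior-mean entropy of a discrete distribution under a single symmetric Dirichlet prior with fixed concentration beta; the full NSB estimator additionally averages over beta with a weight chosen to flatten the induced prior on the entropy, which is not implemented here, and the function name is illustrative.

```python
import numpy as np
from scipy.special import digamma

def dirichlet_posterior_mean_entropy(counts, beta=1.0):
    """Posterior-mean entropy (nats) under a symmetric Dirichlet(beta) prior.

    With posterior parameters alpha_i = n_i + beta and A = sum_i alpha_i,
    E[H] = psi(A + 1) - sum_i (alpha_i / A) * psi(alpha_i + 1).
    """
    alpha = np.asarray(counts, dtype=float) + beta
    A = alpha.sum()
    return digamma(A + 1) - np.sum((alpha / A) * digamma(alpha + 1))

# Example: 1000 draws from a uniform 10-symbol alphabet, true entropy log(10) ≈ 2.30 nats
rng = np.random.default_rng(0)
counts = np.bincount(rng.integers(0, 10, size=1_000), minlength=10)
print(dirichlet_posterior_mean_entropy(counts, beta=0.5))
```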
Another approach to the problem of entropy evaluation is to compare the expected entropy of a sample of a random sequence with the calculated entropy of the sample. The method gives very accurate results, but it is limited to random sequences modeled as first-order Markov chains with small values of bias and correlations. This is the first known method that takes into account the size of the sample sequence and its impact on the accuracy of the entropy calculation.[15][16]
A deep neural network (DNN) can be used to estimate the joint entropy; such an estimator is called the Neural Joint Entropy Estimator (NJEE).[17] Practically, the DNN is trained as a classifier that maps an input vector or matrix X to an output probability distribution over the possible classes of a random variable Y, given input X. For example, in an image classification task, the NJEE maps a vector of pixel values to probabilities over possible image classes. In practice, the probability distribution of Y is obtained by a softmax layer with a number of nodes equal to the alphabet size of Y. NJEE uses continuously differentiable activation functions, so that the conditions for the universal approximation theorem hold. This method has been shown to provide a strongly consistent estimator and to outperform other methods in the case of large alphabet sizes.[9]
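The following simplified sketch (using scikit-learn rather than the setup of the cited work) illustrates the core idea: the held-out cross-entropy of a softmax classifier trained to predict Y from X serves as an estimate of the conditional entropy H(Y | X) in nats, and a joint-entropy estimate in the NJEE spirit chains such terms over the components of Y via the chain rule. Function names, network sizes, and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

def conditional_entropy_nn(X, y, hidden=(64, 64), seed=0):
    """Estimate H(Y | X) in nats with a neural softmax classifier.

    The classifier's held-out cross-entropy (log loss) approximates the
    conditional entropy; H(Y_1, ..., Y_m) can then be estimated as
    sum_j H(Y_j | Y_1, ..., Y_{j-1}) using one classifier per term.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500, random_state=seed)
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)
    return log_loss(y_te, proba, labels=clf.classes_)

# Example: one bit of Y is determined by X, the other is a fair coin,
# so H(Y | X) should be close to log(2) ≈ 0.69 nats.
rng = np.random.default_rng(0)
X = rng.standard_normal((4_000, 5))
y = 2 * (X[:, 0] > 0).astype(int) + rng.integers(0, 2, size=4_000)
print(conditional_entropy_nn(X, y))
```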