Kernel density estimation explained

In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function of a random variable based on kernels as weights. KDE answers a fundamental data smoothing problem where inferences about the population are made based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form.[1] [2] One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.[3]

Definition

Let (x1, x2, ..., xn) be independent and identically distributed samples drawn from some univariate distribution with an unknown density ƒ at any given point x. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is

\widehat{f}_h(x)=\frac{1}{n}\sum_{i=1}^n K_h(x-x_i)=\frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-x_i}{h}\right),

where K is the kernel, a non-negative function, and h > 0 is a smoothing parameter called the bandwidth or simply width. A kernel with subscript h is called the scaled kernel and is defined as Kh(x) = K(x/h)/h. Intuitively one wants to choose h as small as the data will allow; however, there is always a trade-off between the bias of the estimator and its variance. The choice of bandwidth is discussed in more detail below.

A range of kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov (parabolic), normal, and others. The Epanechnikov kernel is optimal in a mean square error sense,[4] though the loss of efficiency is small for the kernels listed previously.[5] Due to its convenient mathematical properties, the normal kernel is often used, which means K(x) = ϕ(x), where ϕ is the standard normal density function.
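As a minimal sketch of the estimator defined above, the following NumPy code evaluates the kernel density estimate on a grid for two of the kernels listed. The names `kde` and `KERNELS` are illustrative choices, not taken from any particular library, and the bandwidth of 1.5 is an arbitrary value used only for demonstration.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density phi(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def epanechnikov_kernel(u):
    """Epanechnikov (parabolic) kernel, supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

KERNELS = {"normal": gaussian_kernel, "epanechnikov": epanechnikov_kernel}

def kde(x, samples, h, kernel="normal"):
    """Evaluate (1 / (n h)) * sum_i K((x - x_i) / h) at each point of x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float)
    u = (x[:, None] - samples[None, :]) / h   # shape (len(x), n)
    return KERNELS[kernel](u).sum(axis=1) / (samples.size * h)

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
grid = np.linspace(-15.0, 20.0, 7001)
dx = grid[1] - grid[0]
for name in KERNELS:
    density = kde(grid, samples, h=1.5, kernel=name)
    # A valid density estimate is non-negative and integrates to about 1.
    print(name, float(density.min()) >= 0.0, float(density.sum() * dx))
```

Because both kernels are non-negative and integrate to one, so does each resulting estimate, which the printed Riemann sums should confirm up to discretization error.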

The construction of a kernel density estimate finds interpretations in fields outside of density estimation.[6] For example, in thermodynamics, this is equivalent to the amount of heat generated when heat kernels (the fundamental solution to the heat equation) are placed at each data point location xi. Similar methods are used to construct discrete Laplace operators on point clouds for manifold learning (e.g. diffusion map).

Example

Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel. The diagram below based on these 6 data points illustrates this relationship:

Sample 1 2 3 4 5 6
Value −2.1 −1.3 −0.4 1.9 5.1 6.2

For the histogram, first, the horizontal axis is divided into sub-intervals or bins which cover the range of the data: in this case, six bins each of width 2. Whenever a data point falls inside a bin, a box of height 1/12 is placed there. If more than one data point falls inside the same bin, the boxes are stacked on top of each other.

For the kernel density estimate, normal kernels with a standard deviation of 1.5 (indicated by the red dashed lines) are placed on each of the data points xi. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate (compared to the discreteness of the histogram) illustrates how kernel density estimates converge faster to the true underlying density for continuous random variables.[7]
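The two constructions can be reproduced numerically. The sketch below uses the six data points from the table; the bin edges −4, −2, ..., 8 are an assumption (the text gives only the number and width of the bins), and `normal_kde` is a hypothetical helper name.

```python
import numpy as np

data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])

# Histogram: six bins of width 2. The edges -4, -2, ..., 8 are assumed.
edges = np.arange(-4.0, 10.0, 2.0)
counts, _ = np.histogram(data, bins=edges)
hist_density = counts / (data.size * 2.0)   # each point adds a box of height 1/12

def normal_kde(x, samples, h):
    """Sum of normal kernels with standard deviation h, one per sample."""
    u = (np.asarray(x, dtype=float)[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

grid = np.linspace(-10.0, 14.0, 2401)
dx = grid[1] - grid[0]
kde_density = normal_kde(grid, data, h=1.5)

print("boxes per bin:", counts)
print("histogram area:", hist_density.sum() * 2.0)
print("KDE area:", kde_density.sum() * dx)
```

Both estimates integrate to one; the difference is that the histogram is piecewise constant while the kernel estimate inherits the smoothness of the normal kernel.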

Bandwidth selection

The bandwidth of the kernel is a free parameter which exhibits a strong influence on the resulting estimate. To illustrate its effect, we take a simulated random sample from the standard normal distribution (plotted at the blue spikes in the rug plot on the horizontal axis). The grey curve is the true density (a normal density with mean 0 and variance 1). In comparison, the red curve is undersmoothed since it contains too many spurious data artifacts arising from using a bandwidth h = 0.05, which is too small. The green curve is oversmoothed since using the bandwidth h = 2 obscures much of the underlying structure. The black curve with a bandwidth of h = 0.337 is considered to be optimally smoothed since its density estimate is close to the true density. An extreme situation is encountered in the limit

h\to0

(no smoothing), where the estimate is a sum of n delta functions centered at the coordinates of analyzed samples. In the other extreme limit

h\to\infty

the estimate retains the shape of the used kernel, centered on the mean of the samples (completely smooth).
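The effect of the three bandwidths mentioned above can be sketched with a simulation. The sample size of 100 and the random seed are assumptions made for this illustration, so the numbers will not match the figure described in the text; the integrated squared error against the true standard normal density is used as a simple measure of fit.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the sketch is repeatable
sample = rng.standard_normal(100)     # n = 100 is an assumed sample size

grid = np.linspace(-6.0, 6.0, 2401)
dx = grid[1] - grid[0]
true_density = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)

def normal_kde(x, samples, h):
    """Kernel density estimate with a normal kernel and bandwidth h."""
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

ise = {}
for h in (0.05, 0.337, 2.0):
    estimate = normal_kde(grid, sample, h)
    # Integrated squared error, approximated by a Riemann sum on the grid.
    ise[h] = float(((estimate - true_density) ** 2).sum() * dx)
    print(f"h = {h}: integrated squared error = {ise[h]:.4f}")
```

Under these assumptions the intermediate bandwidth should give a clearly smaller error than either the undersmoothed or the oversmoothed choice.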

The most common optimality criterion used to select this parameter is the expected L2 risk function, also termed the mean integrated squared error:

\operatorname{MISE}(h)=\operatorname{E}\left[\int(\hat{f}_h(x)-f(x))^2\,dx\right]

Under weak assumptions on ƒ and K (where ƒ is the generally unknown true density function),[1] [2]

\operatorname{MISE}(h)=\operatorname{AMISE}(h)+o\left((nh)^{-1}+h^4\right)

where o is the little o notation, and n the sample size (as above). The AMISE is the asymptotic MISE, i.e. the two leading terms,

\operatorname{AMISE}(h)=\frac{R(K)}{nh}+\frac{1}{4}m_2(K)^2h^4R(f'')

where

R(g)=\int g(x)^2\,dx

for a function g,

m_2(K)=\int x^2K(x)\,dx

and f'' is the second derivative of f and K is the kernel. The minimum of this AMISE is the solution to this differential equation

\frac{\partial}{\partial h}\operatorname{AMISE}(h)=-\frac{R(K)}{nh^2}+m_2(K)^2h^3R(f'')=0

or

h_{\operatorname{AMISE}}=\frac{R(K)^{1/5}}{m_2(K)^{2/5}R(f'')^{1/5}}\,n^{-1/5}=Cn^{-1/5}
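The minimizer can be checked numerically. The sketch below assumes, purely for illustration, a normal kernel and a standard normal target density, for which R(K) = 1/(2√π), m2(K) = 1 and R(f'') = 3/(8√π); the function names `amise` and `h_amise` are hypothetical.

```python
import numpy as np

def amise(h, n, R_K, m2_K, R_f2):
    """AMISE(h) = R(K) / (n h) + (1/4) m2(K)^2 h^4 R(f'')."""
    return R_K / (n * h) + 0.25 * m2_K**2 * h**4 * R_f2

def h_amise(n, R_K, m2_K, R_f2):
    """Bandwidth obtained by setting the derivative of the AMISE to zero."""
    return (R_K / (m2_K**2 * R_f2)) ** 0.2 * n ** (-0.2)

# Assumed inputs: normal kernel, standard normal density, n = 100.
n = 100
R_K = 1.0 / (2.0 * np.sqrt(np.pi))
m2_K = 1.0
R_f2 = 3.0 / (8.0 * np.sqrt(np.pi))

h_star = h_amise(n, R_K, m2_K, R_f2)

# Brute-force minimisation over a fine grid as an independent check.
h_grid = np.linspace(0.05, 2.0, 20001)
h_numeric = h_grid[np.argmin(amise(h_grid, n, R_K, m2_K, R_f2))]

print("closed form:", h_star)
print("grid search:", h_numeric)
```

The two values should agree to within the grid spacing, which is consistent with the stationary point being the minimum of the AMISE.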

Neither the AMISE nor the hAMISE formulas can be used directly since they involve the unknown density function f or its second derivative f''. To overcome that difficulty, a variety of automatic, data-based methods have been developed to select the bandwidth. Several review studies have been undertaken to compare their efficacies,[8] [9] [10] [11] [12] [13] [14] with the general consensus that the plug-in selectors[6] [15] [16] and cross validation selectors[17] [18] [19] are the most useful over a wide range of data sets.

Substituting any bandwidth h which has the same asymptotic order n−1/5 as hAMISE into the AMISE gives that AMISE(h) = O(n−4/5), where O is the big O notation. It can be shown that, under weak assumptions, there cannot exist a non-parametric estimator that converges at a faster rate than the kernel estimator.[20] Note that the n−4/5 rate is slower than the typical n−1 convergence rate of parametric methods.

If the bandwidth is not held fixed, but is varied depending upon the location of either the estimate (balloon estimator) or the samples (pointwise estimator), this produces a particularly powerful method termed adaptive or variable bandwidth kernel density estimation.

Bandwidth selection for kernel density estimation of heavy-tailed distributions is relatively difficult.[21]

A rule-of-thumb bandwidth estimator

If Gaussian basis functions are used to approximate univariate data, and the underlying density being estimated is Gaussian, the optimal choice for h (that is, the bandwidth that minimises the mean integrated squared error) is:[22]

h=\left(\frac{4\hat{\sigma}^5}{3n}\right)^{\frac{1}{5}}\approx 1.06\,\hat{\sigma}\,n^{-1/5},
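The constant can be traced back to the AMISE-optimal bandwidth. As a worked sketch, assuming a normal kernel K = ϕ and a normal density with standard deviation \sigma, the three ingredients of that bandwidth are

R(K)=\frac{1}{2\sqrt{\pi}},\qquad m_2(K)=1,\qquad R(f'')=\frac{3}{8\sqrt{\pi}\,\sigma^5},

so that

h^5=\frac{R(K)}{m_2(K)^2R(f'')\,n}=\frac{1}{2\sqrt{\pi}}\cdot\frac{8\sqrt{\pi}\,\sigma^5}{3n}=\frac{4\sigma^5}{3n},\qquad\left(\tfrac{4}{3}\right)^{1/5}\approx 1.06.

In practice the unknown \sigma is replaced by the sample estimate \hat{\sigma}.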

An h value is considered more robust when it improves the fit for long-tailed and skewed distributions or for bimodal mixture distributions. This is often done empirically by replacing the standard deviation \hat{\sigma} by the parameter A below:

A=\min\left(\hat{\sigma},\frac{\operatorname{IQR}}{1.34}\right)

where IQR is the interquartile range. Another modification that will improve the model is to reduce the factor from 1.06 to 0.9. Then the final formula would be:

h=0.9\,\min\left(\hat{\sigma},\frac{\operatorname{IQR}}{1.34}\right)n^{-\frac{1}{5}}

where n is the sample size.
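The final formula translates directly into code. In this sketch the function name `rule_of_thumb_bandwidth` is hypothetical, and two details are assumptions not fixed by the text: the standard deviation uses the n − 1 denominator, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

def rule_of_thumb_bandwidth(samples):
    """Return h = 0.9 * min(sigma_hat, IQR / 1.34) * n^(-1/5)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    sigma_hat = samples.std(ddof=1)               # assumed: n - 1 denominator
    q75, q25 = np.percentile(samples, [75, 25])   # assumed: linear interpolation
    spread = min(sigma_hat, (q75 - q25) / 1.34)   # the parameter A
    return 0.9 * spread * n ** (-0.2)

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
print(rule_of_thumb_bandwidth(samples))
```

Because both spread measures scale with the data, the resulting bandwidth is expressed in the same units as the samples.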

This approximation is termed the normal distribution approximation, Gaussian approximation, or Silverman's rule of thumb. While this rule of thumb is easy to compute, it should be used with caution as it can yield wildly inaccurate estimates when the density is not close to being normal. For example, when estimating the bimodal Gaussian mixture model

\frac{1}{2\sqrt{2\pi}}e^{-\frac{1}{2}(x-\mu_1)^2}+\frac{1}{2\sqrt{2\pi}}e^{-\frac{1}{2}(x-\mu_2)^2}

(an equal-weight mixture of two unit-variance normal densities with well-separated means \mu_1 and \mu_2) from a sample of 200 points, the figure on the right shows the true density and two kernel density estimates: one using the rule-of-thumb bandwidth, and the other using a solve-the-equation bandwidth. The estimate based on the rule-of-thumb bandwidth is significantly oversmoothed.
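The oversmoothing can be illustrated with a small simulation. The component means of −10 and +10, the seed, and the use of the 0.9 rule-of-thumb formula are all assumptions made for this sketch; it does not reproduce the figure mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Assumed mixture: equal weights, unit variances, means -10 and +10.
centers = rng.choice([-10.0, 10.0], size=n)
sample = centers + rng.standard_normal(n)

sigma_hat = sample.std(ddof=1)
q75, q25 = np.percentile(sample, [75, 25])
h_rot = 0.9 * min(sigma_hat, (q75 - q25) / 1.34) * n ** (-0.2)

# Compare the estimate with the true density at one of the assumed modes.
x0 = 10.0
u = (x0 - sample) / h_rot
estimate_at_mode = float(np.exp(-0.5 * u**2).sum() / (n * h_rot * np.sqrt(2.0 * np.pi)))
true_at_mode = 0.5 / np.sqrt(2.0 * np.pi) * (1.0 + np.exp(-0.5 * 20.0**2))

print(f"rule-of-thumb bandwidth: {h_rot:.2f} (component standard deviation is 1)")
print(f"estimate at x = 10: {estimate_at_mode:.3f}, true density: {true_at_mode:.3f}")
```

Under these assumptions the overall spread of the sample is dominated by the distance between the two components, so the rule of thumb selects a bandwidth several times the width of either component and flattens both peaks.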

Relation to the characteristic function density estimator

Given the sample (x1, x2, ..., xn), it is natural to estimate the characteristic function as

\widehat\varphi(t)=\frac{1}{n}\sum_{j=1}^n e^{itx_j}
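A minimal sketch of this estimate in NumPy follows; the function name `empirical_cf` and the evaluation points are illustrative assumptions.

```python
import numpy as np

def empirical_cf(t, samples):
    """phi_hat(t) = (1 / n) * sum_j exp(i * t * x_j), evaluated for each t."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    samples = np.asarray(samples, dtype=float)
    return np.exp(1j * t[:, None] * samples[None, :]).mean(axis=1)

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
t = np.array([0.0, 0.5, 5.0, 50.0])
phi_hat = empirical_cf(t, samples)

# The modulus equals 1 at t = 0 and can never exceed 1.
print(np.abs(phi_hat))
```

Each term in the average has unit modulus, so the estimate is bounded by one everywhere but does not need to decay as t grows.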

Knowing the characteristic function, it is possible to find the corresponding probability density function through the Fourier transform formula. One difficulty with applying this inversion formula is that it leads to a diverging integral, since the estimate \widehat\varphi(t) is unreliable for large t's. To circumvent this problem, the estimator \widehat\varphi(t) is multiplied by a damping function ψh(t) = ψ(ht), which is equal to 1 at the origin and then falls to 0 at infinity. The "bandwidth parameter" h controls how fast we try to dampen the function \widehat\varphi(t). In particular when h is small, then ψh(t) will be approximately one for a large range of t's, which means that \widehat\varphi(t) remains practically unaltered in the most important region of t's.

The most common choice for the function ψ is either the uniform function, which effectively means truncating the interval of integration in the inversion formula to a finite interval whose length grows as h shrinks, or the Gaussian function. Once the function ψ has been chosen, the inversion formula may be applied, and the density estimator will be

\begin{align} \widehat{f}(x)&=\frac{1}{2\pi}\int_{-\infty}^{+\infty}\widehat\varphi(t)\psi_h(t)e^{-itx}\,dt=\frac{1}{2\pi}\int_{-\infty}^{+\infty}\frac{1}{n}\sum_{j=1}^n e^{it(x_j-x)}\psi(ht)\,dt\\[5pt] &=\frac{1}{nh}\sum_{j=1}^n\frac{1}{2\pi}\int_{-\infty}^{+\infty}e^{-i(ht)\frac{x-x_j}{h}}\psi(ht)\,d(ht)=\frac{1}{nh}\sum_{j=1}^n K\left(\frac{x-x_j}{h}\right), \end{align}

where K is the Fourier transform of the damping function ψ. Thus the kernel density estimator coincides with the characteristic function density estimator.
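This equivalence can be checked numerically. The sketch below assumes the Gaussian damping function ψ(t) = exp(−t²/2), chosen here because its transform under the convention above is the standard normal density; the sample, bandwidth and integration grid are likewise illustrative.

```python
import numpy as np

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
h = 1.5
x = np.linspace(-6.0, 10.0, 9)

# Assumed damping function: psi(t) = exp(-t^2 / 2), so psi_h(t) = psi(h t).
t = np.linspace(-10.0, 10.0, 4001)
dt = t[1] - t[0]
phi_hat = np.exp(1j * t[:, None] * samples[None, :]).mean(axis=1)
damped = phi_hat * np.exp(-0.5 * (h * t) ** 2)

# Inversion formula: f_hat(x) = (1 / 2 pi) * integral of damped(t) exp(-i t x) dt,
# approximated by a Riemann sum (the integrand is negligible at the grid ends).
f_fourier = (np.exp(-1j * np.outer(x, t)) @ damped).real * dt / (2.0 * np.pi)

# Direct kernel density estimate with the standard normal kernel.
u = (x[:, None] - samples[None, :]) / h
f_kernel = np.exp(-0.5 * u**2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

print("largest absolute difference:", np.max(np.abs(f_fourier - f_kernel)))
```

With this choice of ψ the two estimates should agree up to numerical integration error, in line with the derivation above.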

Geometric and topological features

We can extend the definition of the (global) mode to a local sense and define the local modes:

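M=\{x:g(x)=0,\ \lambda_1(x)<0\}

Here g(x) is the gradient of the density and \lambda_1(x) is the largest eigenvalue of its Hessian matrix. Namely, M is the collection of points for which the density function is locally maximized. A natural estimator of M is a plug-in from KDE,[23] [24]

\widehat{M}_c=\{x:\hat{g}(x)=0,\ \hat{\lambda}_1(x)<0\}

where \hat{g}(x) and \hat{\lambda}_1(x) are the KDE versions of g(x) and \lambda_1(x). Under mild assumptions, \widehat{M}_c is a consistent estimator of M. Note that one can use the mean shift algorithm[25] [26] [27] to compute the estimator \widehat{M}_c numerically.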
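A minimal one-dimensional sketch of mean shift with a Gaussian kernel is given below. The function name `mean_shift_modes`, the two-cluster data, the bandwidth and the stopping rule are all illustrative assumptions rather than details taken from the cited papers.

```python
import numpy as np

def mean_shift_modes(samples, h, tol=1e-10, max_iter=1000):
    """Run Gaussian-kernel mean shift from every sample; return the end points."""
    samples = np.asarray(samples, dtype=float)
    points = samples.copy()
    for _ in range(max_iter):
        # Gaussian weight of every sample relative to every current point.
        w = np.exp(-0.5 * ((points[:, None] - samples[None, :]) / h) ** 2)
        # Move each point to the weighted mean of the samples.
        shifted = (w * samples[None, :]).sum(axis=1) / w.sum(axis=1)
        converged = np.max(np.abs(shifted - points)) < tol
        points = shifted
        if converged:
            break
    return points

samples = np.array([-1.2, -1.0, -0.8, 4.8, 5.0, 5.2])   # illustrative two-cluster data
end_points = mean_shift_modes(samples, h=0.5)
modes = np.unique(np.round(end_points, 6))
print(modes)
```

A point is left unchanged by the update exactly when the gradient of the Gaussian kernel density estimate vanishes there, so the end points of the iteration approximate local modes of the estimate.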

Statistical implementation

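Software implementations of kernel density estimators are documented for, among others, gnuplot,[28] MATLAB,[29] [30] the Wolfram Language,[31] [32] the NAG Library,[34] [35] Python,[36] [37] [38] Apache Spark[39] and Stata.[40]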


Notes and References

  1. Rosenblatt . M. . Murray Rosenblatt. Remarks on Some Nonparametric Estimates of a Density Function . 10.1214/aoms/1177728190 . The Annals of Mathematical Statistics . 27 . 3 . 832–837 . 1956 . free .
  2. Parzen . E. . Emanuel Parzen. On Estimation of a Probability Density Function and Mode . 10.1214/aoms/1177704472 . The Annals of Mathematical Statistics. 33 . 3 . 1065–1076 . 1962 . 2237880. free .
  3. Book: Hastie . Trevor . Trevor Hastie . The Elements of Statistical Learning : Data Mining, Inference, and Prediction : with 200 full-color illustrations . Tibshirani . Robert . Robert Tibshirani . Friedman . Jerome H. . Jerome H. Friedman . 2001 . Springer . 0-387-95284-5 . New York . 46809224.
  4. 10.1137/1114019 . Epanechnikov, V.A. . Non-parametric estimation of a multivariate probability density . Theory of Probability and Its Applications . 14 . 153–158 . 1969.
  5. Book: Wand, M.P . Jones, M.C. . Kernel Smoothing . Chapman & Hall/CRC . London . 1995 . 978-0-412-55270-0.
  6. Zdravko . Botev . Nonparametric Density Estimation via Diffusion Mixing . University of Queensland . 2007 .
  7. Scott, D. . On optimal and data-based histograms . Biometrika . 1979 . 66 . 605–610 . 10.1093/biomet/66.3.605 . 3.
  8. Park, B.U. . Marron, J.S. . 1990 . Comparison of data-driven bandwidth selectors . Journal of the American Statistical Association . 85 . 409 . 66–72 . 2289526 . 10.1080/01621459.1990.10475307. 10.1.1.154.7321 .
  9. Park, B.U. . Turlach, B.A. . 1992 . Practical performance of several data driven bandwidth selectors (with discussion) . Computational Statistics . 7 . 251–270.
  10. Cao, R. . Cuevas, A. . Manteiga, W. G. . 1994 . A comparative study of several smoothing methods in density estimation . Computational Statistics and Data Analysis . 17 . 153–176 . 10.1016/0167-9473(92)00066-Z. 2.
  11. 10.2307/2291420 . Jones, M.C. . Marron, J.S. . Sheather, S. J. . 1996 . A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association . 91 . 433 . 401–407 . 2291420.
  12. Sheather, S.J. . 1992 . The performance of six popular bandwidth selection methods on some real data sets (with discussion) . Computational Statistics . 7 . 225–250, 271–281.
  13. Agarwal, N. . Aluru, N.R. . 2010 . A data-driven stochastic collocation approach for uncertainty quantification in MEMS . International Journal for Numerical Methods in Engineering . 83 . 5 . 575–597 . 10.1002/nme.2844 . 2010IJNME..83..575A . 84834908 .
  14. Xu, X. . Yan, Z. . Xu, S. . 2015 . Estimating wind speed probability distribution by diffusion-based kernel density method . Electric Power Systems Research. 121 . 28–37 . 10.1016/j.epsr.2014.11.029 . 2015EPSR..121...28X .
  15. Botev, Z.I. . Grotowski, J.F. . Kroese, D.P. . Kernel density estimation via diffusion . . 38 . 5 . 2916–2957 . 2010 . 10.1214/10-AOS799. 1011.2602 . 41350591 .
  16. Sheather, S.J. . Jones, M.C. . 1991 . A reliable data-based bandwidth selection method for kernel density estimation . Journal of the Royal Statistical Society, Series B . 53 . 3 . 683–690 . 2345597 . 10.1111/j.2517-6161.1991.tb01857.x.
  17. Rudemo, M. . 1982 . Empirical choice of histograms and kernel density estimators . Scandinavian Journal of Statistics . 9 . 2 . 65–78 . 4615859.
  18. Bowman, A.W. . 1984 . An alternative method of cross-validation for the smoothing of density estimates . Biometrika . 71 . 353–360 . 10.1093/biomet/71.2.353 . 2.
  19. Hall, P. . Marron, J.S. . Park, B.U. . 1992 . Smoothed cross-validation . Probability Theory and Related Fields . 92 . 1–20 . 10.1007/BF01205233. 121181481 . free .
  20. 10.1214/aos/1176342997. Wahba. G.. Optimal convergence properties of variable knot, kernel, and orthogonal series methods for density estimation. Annals of Statistics. 1975. 3. 1. 15–29. free.
  21. Buch-Larsen . Tine . Kernel density estimation for heavy-tailed distributions using the Champernowne transformation . 10.1080/02331880500439782 . Statistics . 39 . 6 . 503–518 . 2005 . 10.1.1.457.1544 . 219697435 .
  22. Book: Silverman, B.W. . Bernard Silverman . Density Estimation for Statistics and Data Analysis . Chapman & Hall/CRC . London . 1986 . 978-0-412-24620-3 . 45 . registration .
  23. Chen. Yen-Chi. Genovese. Christopher R.. Wasserman. Larry. 2016. A comprehensive approach to mode clustering. Electronic Journal of Statistics. 10. 1. 210–241. 10.1214/15-ejs1102. 1935-7524. free. 1406.1780.
  24. Book: Chazal. Frédéric. Fasy. Brittany Terese. Lecci. Fabrizio. Rinaldo. Alessandro. Wasserman. Larry. Proceedings of the thirtieth annual symposium on Computational geometry . Stochastic Convergence of Persistence Landscapes and Silhouettes . 2014. 6. 2. 474–483. New York, New York, USA. ACM Press. 10.1145/2582112.2582128. 978-1-4503-2594-3. 6029340. https://jocg.org/index.php/jocg/article/view/2982.
  25. Fukunaga. K.. Hostetler. L.. January 1975. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory. 21. 1. 32–40. 10.1109/tit.1975.1055330. 0018-9448.
  26. Yizong Cheng. 1995. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 17. 8. 790–799. 10.1109/34.400568. 0162-8828. 10.1.1.510.1222.
  27. Comaniciu. D.. Meer. P.. May 2002. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 24. 5. 603–619. 10.1109/34.1000236. 691081 . 0162-8828.
  28. Book: Janert, Philipp K . Gnuplot in action : understanding data with graphs . 2009 . Manning Publications . Connecticut, USA . 978-1-933988-39-9 . See section 13.2.2 entitled Kernel density estimates.
  29. Web site: Kernel smoothing function estimate for univariate and bivariate data - MATLAB ksdensity. 2020-11-05. www.mathworks.com.
  30. Book: Horová. I.. Koláček. J.. Zelinka. J.. Kernel Smoothing in MATLAB: Theory and Practice of Kernel Smoothing. 2012. World Scientific Publishing. Singapore. 978-981-4405-48-5.
  31. Web site: SmoothKernelDistribution—Wolfram Language Documentation. 2020-11-05. reference.wolfram.com.
  32. Web site: KernelMixtureDistribution—Wolfram Language Documentation. 2020-11-05. reference.wolfram.com.
  33. Web site: Software for calculating kernel densities. 2020-11-05. www.rsc.org.
  34. Web site: The Numerical Algorithms Group . NAG Library Routine Document: nagf_smooth_kerndens_gauss (g10baf) . NAG Library Manual, Mark 23 . 2012-02-16 .
  35. Web site: The Numerical Algorithms Group . NAG Library Routine Document: nag_kernel_density_estim (g10bac) . NAG Library Manual, Mark 9 . 2012-02-16 . https://web.archive.org/web/20111124062333/http://nag.co.uk/numeric/cl/nagdoc_cl09/pdf/G10/g10bac.pdf . 2011-11-24 . dead .
  36. Web site: Vanderplas. Jake. Kernel Density Estimation in Python. 2013-12-01. 2014-03-12.
  37. Web site: seaborn.kdeplot — seaborn 0.10.1 documentation. seaborn.pydata.org. 2020-05-12.
  38. Web site: Kde-gpu: We implemented nadaraya waston kernel density and kernel conditional probability estimator using cuda through cupy. It is much faster than cpu version but it requires GPU with high memory.
  39. Web site: Basic Statistics - RDD-based API - Spark 3.0.1 Documentation. 2020-11-05. spark.apache.org.
  40. Web site: kdensity — Univariate kernel density estimation. Stata 15 manual.