Hopkins statistic explained
The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set.[1] It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test in which the null hypothesis is that the data are generated by a Poisson point process and are thus uniformly randomly distributed.[2] If the individuals are randomly distributed, the value tends to 0.5. If they are aggregated, the value approaches 1 when the statistic is written with the distances from uniformly generated points in the numerator, or 0 under the complementary convention (one minus that value).[3]
Preliminaries
A typical formulation of the Hopkins statistic follows.[2]
Let $X$ be the set of $n$ data points.
Generate a random sample $\overset{\sim}{X}$ of $m \ll n$ data points sampled without replacement from $X$.
Generate a set $Y$ of $m$ uniformly randomly distributed data points.
Define two distance measures:
- $u_i$, the minimum distance (given some suitable metric) of $y_i \in Y$ to its nearest neighbour in $X$, and
- $w_i$, the minimum distance of $\overset{\sim}{x}_i \in \overset{\sim}{X} \subseteq X$ to its nearest neighbour $x_j \in X$, $\overset{\sim}{x}_i \ne x_j$.
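The sampling steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not a reference implementation: the function names `nearest_distance` and `hopkins_distances` are hypothetical, the metric is taken to be Euclidean, and the uniformly distributed points are drawn over the axis-aligned bounding box of the data, which is one common choice of sampling window among several.

```python
import math
import random

def nearest_distance(point, others):
    """Smallest Euclidean distance from `point` to any element of `others`."""
    return min(math.dist(point, q) for q in others)

def hopkins_distances(X, m, rng):
    """Return the two distance lists (u, w) for a list X of equal-length tuples."""
    n, d = len(X), len(X[0])
    # Draw m indices without replacement; they select the sample from X.
    sample_idx = rng.sample(range(n), m)
    # Assumption: uniform points are generated over the bounding box of X.
    lows = [min(p[k] for p in X) for k in range(d)]
    highs = [max(p[k] for p in X) for k in range(d)]
    Y = [tuple(rng.uniform(lows[k], highs[k]) for k in range(d)) for _ in range(m)]
    # u_i: distance from each generated point to its nearest neighbour in X.
    u = [nearest_distance(y, X) for y in Y]
    # w_i: distance from each sampled point to its nearest neighbour in X,
    # excluding the sampled point itself (by index, so exact duplicates give 0).
    w = [nearest_distance(X[i], X[:i] + X[i + 1:]) for i in sample_idx]
    return u, w

rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(200)]
u, w = hopkins_distances(X, m=20, rng=rng)
```

The brute-force nearest-neighbour search costs on the order of n distance evaluations per query point; for large data sets a spatial index would normally replace it.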
Definition
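With the above notation, if the data is $d$-dimensional, then the Hopkins statistic is defined as:[4]
$$H = \frac{\sum_{i=1}^{m} u_i^d}{\sum_{i=1}^{m} u_i^d + \sum_{i=1}^{m} w_i^d}$$
Here $u_i$ is the nearest-neighbour distance of the $i$-th uniformly generated point and $w_i$ that of the $i$-th sampled data point.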
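As a small worked illustration of the definition, the following sketch evaluates the ratio for given distance lists. The function name `hopkins_statistic` and the numerical distances are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def hopkins_statistic(u, w, d):
    """Ratio sum(u_i^d) / (sum(u_i^d) + sum(w_i^d)) for equal-length lists."""
    if len(u) != len(w):
        raise ValueError("u and w must both contain m distances")
    su = sum(ui ** d for ui in u)
    sw = sum(wi ** d for wi in w)
    return su / (su + sw)

# Hypothetical distances for m = 3 points in d = 2 dimensions.
u = [0.30, 0.40, 0.50]  # generated points to their nearest data point
w = [0.10, 0.10, 0.20]  # sampled data points to their nearest other data point
H = hopkins_statistic(u, w, d=2)  # 0.50 / (0.50 + 0.06), about 0.893
```

In this example the generated points lie much farther from the data than the data points lie from each other, so the ratio is well above 0.5. Equal distance lists give exactly 0.5.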
Under the null hypothesis, this statistic has a Beta(m, m) distribution.
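Because both parameters of the null distribution are the integer m, its cumulative distribution function can be evaluated exactly with a binomial sum, using the identity that the Beta(a, b) CDF at x equals the probability that a Binomial(a + b − 1, x) variable is at least a. The sketch below uses that identity to obtain a one-sided p-value. The names `beta_mm_cdf` and `hopkins_p_value` are hypothetical, and the choice of the upper tail assumes the form of the statistic in which large values indicate aggregation.

```python
import math

def beta_mm_cdf(h, m):
    """CDF of Beta(m, m) at h, for a positive integer m and 0 <= h <= 1."""
    n = 2 * m - 1
    # P(Binomial(2m - 1, h) >= m) equals the Beta(m, m) CDF at h.
    return sum(math.comb(n, j) * h ** j * (1 - h) ** (n - j)
               for j in range(m, n + 1))

def hopkins_p_value(h, m):
    """Upper-tail probability P(H >= h) under the null hypothesis."""
    return 1.0 - beta_mm_cdf(h, m)

p_mid = hopkins_p_value(0.5, 20)    # centre of the symmetric null distribution
p_high = hopkins_p_value(0.75, 20)  # far into the upper tail
```

With m = 20, an observed value of 0.5 sits at the median of the null distribution, whereas 0.75 lies several standard deviations above it and yields a very small p-value.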
Notes and references
- Hopkins, Brian; Skellam, John Gordon (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany. 18 (2): 213–227. Annals Botany Co. doi:10.1093/oxfordjournals.aob.a083391.
- Banerjee, A. (2004). "Validating clusters using the Hopkins statistic". 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542). Vol. 1. pp. 149–153. doi:10.1109/FUZZY.2004.1375706. ISBN 0-7803-8353-2. S2CID 36701919.
- Aggarwal, Charu C. (2015). Data Mining. Cham: Springer International Publishing. p. 158. doi:10.1007/978-3-319-14142-8. ISBN 978-3-319-14141-1. S2CID 13595565.
- Cross, G.R.; Jain, A.K. (1982). "Measurement of clustering tendency". Theory and Application of Digital Control. pp. 315–320. doi:10.1016/B978-0-08-027618-2.50054-1.
External links
- http://www.sthda.com/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning