Hopkins statistic explained

The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set.^[1] It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test where the null hypothesis is that the data are generated by a Poisson point process and are thus uniformly randomly distributed.^[2] Under the definition used in this article, if the data points are aggregated (clustered), the value approaches 1, and if they are randomly distributed, the value tends to 0.5.^[3]

Preliminaries

A typical formulation of the Hopkins statistic follows.^[2]

Let $X$ be the set of $n$ data points.

Generate a random sample $\overset{\sim}{X}$ of $m \ll n$ data points sampled without replacement from $X$.

Generate a set $Y$ of $m$ uniformly randomly distributed data points.

Define two distance measures: $u_i$, the minimum distance (given some suitable metric) of $y_i \in Y$ to its nearest neighbour in $X$, and $w_i$, the minimum distance of $\overset{\sim}{x}_i \in \overset{\sim}{X} \subseteq X$ to its nearest neighbour $x_j \in X$, $\overset{\sim}{x}_i \ne x_j$.
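The sampling steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names `nearest_distance` and `hopkins_distances` are hypothetical, the metric is assumed to be Euclidean, and the uniform points are assumed to be drawn from the axis-aligned bounding box of the data (the text does not specify the sampling window).

```python
import math
import random


def nearest_distance(point, data, skip_index=None):
    """Euclidean distance from `point` to its nearest neighbour in `data`.

    `skip_index` excludes one element of `data`, so that a sampled data
    point is not matched with itself.
    """
    return min(
        math.dist(point, other)
        for j, other in enumerate(data)
        if j != skip_index
    )


def hopkins_distances(data, m, rng=None):
    """Return the lists (u, w) of nearest-neighbour distances.

    u[i]: distance from a uniformly generated point to its nearest data point.
    w[i]: distance from a sampled data point to its nearest other data point.
    """
    rng = rng or random.Random()
    dim = len(data[0])

    # Sample m data points without replacement (by index).
    sample_idx = rng.sample(range(len(data)), m)

    # Assumption: uniform points are drawn from the bounding box of the data.
    lows = [min(p[k] for p in data) for k in range(dim)]
    highs = [max(p[k] for p in data) for k in range(dim)]
    uniform = [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in range(m)
    ]

    u = [nearest_distance(y, data) for y in uniform]
    w = [nearest_distance(data[i], data, skip_index=i) for i in sample_idx]
    return u, w
```

The brute-force search costs on the order of $m \cdot n$ distance evaluations, which is acceptable for a sketch; a spatial index would be the usual choice for large $n$.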

Definition

With the above notation, if the data is $d$-dimensional, then the Hopkins statistic is defined as:^[4]

$$H = \frac{\sum_{i=1}^{m} u_i^d}{\sum_{i=1}^{m} u_i^d + \sum_{i=1}^{m} w_i^d}$$
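As a small worked illustration, the statistic can be computed from the two lists of distances as the ratio of the summed $d$-th powers of the $u_i$ to the combined sum over the $u_i$ and $w_i$. The function name `hopkins_statistic` and the input values below are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
def hopkins_statistic(u, w, d):
    """Hopkins statistic from nearest-neighbour distances.

    u: distances from uniformly generated points to the data.
    w: distances from sampled data points to their nearest other data point.
    d: dimensionality of the data.
    """
    if len(u) != len(w):
        raise ValueError("u and w must both contain m distances")
    sum_u = sum(dist ** d for dist in u)
    sum_w = sum(dist ** d for dist in w)
    return sum_u / (sum_u + sum_w)


# Hand-checkable case with m = 2, d = 2:
# sum of u_i^2 = 4 + 4 = 8, sum of w_i^2 = 1 + 1 = 2, so H = 8 / 10.
print(hopkins_statistic([2.0, 2.0], [1.0, 1.0], d=2))  # 0.8
```

When the uniform points lie much farther from the data than the data points lie from each other, the $u_i$ terms dominate and the value moves toward 1; equal distances give exactly 0.5.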

Under the null hypothesis, this statistic has a Beta(m, m) distribution.
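The Beta(m, m) null distribution can be turned into a p-value. The sketch below is one possible way to do this with the standard library alone; it relies on the identity that, for positive integers $a$ and $b$, the Beta($a$, $b$) distribution function at $x$ equals the probability that a Binomial($a+b-1$, $x$) variable is at least $a$. The function names `beta_mm_cdf` and `clustering_p_value` are hypothetical, and the one-sided direction (large values taken as evidence of clustering) is an assumption that matches a statistic with the $u_i$ terms in the numerator.

```python
import math


def beta_mm_cdf(h, m):
    """Distribution function of Beta(m, m) at h, for a positive integer m.

    Computed as P(Binomial(2m - 1, h) >= m), which equals the regularized
    incomplete beta function I_h(m, m) when m is an integer.
    """
    n = 2 * m - 1
    return sum(
        math.comb(n, j) * h ** j * (1 - h) ** (n - j)
        for j in range(m, n + 1)
    )


def clustering_p_value(h, m):
    """One-sided p-value: probability of a value at least h under the null."""
    return 1.0 - beta_mm_cdf(h, m)
```

Because Beta(m, m) is symmetric about 0.5, `beta_mm_cdf(0.5, m)` is 0.5 for every m, so an observed value of 0.5 gives no evidence against the null hypothesis.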

Notes and references

1. Hopkins, Brian; Skellam, John Gordon (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany. 18 (2): 213–227. Annals Botany Co. doi:10.1093/oxfordjournals.aob.a083391.
2. Banerjee, A. (2004). "Validating clusters using the Hopkins statistic". 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542). Vol. 1. pp. 149–153. doi:10.1109/FUZZY.2004.1375706. ISBN 0-7803-8353-2. S2CID 36701919.
3. Aggarwal, Charu C. (2015). Data Mining. Cham: Springer International Publishing. p. 158. doi:10.1007/978-3-319-14142-8. ISBN 978-3-319-14141-1. S2CID 13595565.
4. Cross, G.R.; Jain, A.K. (1982). "Measurement of clustering tendency". Theory and Application of Digital Control. pp. 315–320. doi:10.1016/B978-0-08-027618-2.50054-1.

External links

http://www.sthda.com/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning