In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other.
This method tends to produce long thin clusters in which nearby elements of the same cluster have small distances, but elements at opposite ends of a cluster may be much farther from each other than two elements of other clusters. For some classes of data, this may lead to difficulties in defining classes that could usefully subdivide the data.[1] However, it is popular in astronomy for analyzing galaxy clusters, which may often involve long strings of matter; in this application, it is also known as the friends-of-friends algorithm.[2]
In the beginning of the agglomerative clustering process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters, until all elements end up being in the same cluster. At each step, the two clusters separated by the shortest distance are combined. The function used to determine the distance between two clusters, known as the linkage function, is what differentiates the agglomerative clustering methods.
In single-linkage clustering, the distance between two clusters is determined by a single pair of elements: those two elements (one in each cluster) that are closest to each other. The shortest of these pairwise distances that remain at any step causes the two clusters whose elements are involved to be merged. The method is also known as nearest neighbour clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence in which clusters were merged and the distance at which each merge took place.[3]
Mathematically, the linkage function – the distance D(X,Y) between clusters X and Y – is described by the expression
D(X,Y)=minx\ind(x,y),
The following algorithm is an agglomerative scheme that erases rows and columns in a proximity matrix as old clusters are merged into new ones. The
N x N
D
d(i,j)
0,1,\ldots,n-1
L(k)
k
(r)
(s)
d[(r),(s)]
The single linkage algorithm is composed of the following steps:
L(0)=0
m=0
(r),(s)
d[(r),(s)]=mind[(i),(j)]
m=m+1
(r)
(s)
m
L(m)=d[(r),(s)]
D
(r)
(s)
(r,s)
(k)
d[(r,s),(k)]=min\{d[(k),(r)],d[(k),(s)]\}
This working example is based on a JC69 genetic distance matrix computed from the 5S ribosomal RNA sequence alignment of five bacteria: Bacillus subtilis (
a
b
c
d
e
Let us assume that we have five elements
(a,b,c,d,e)
D1
a | b | c | d | e | ||
---|---|---|---|---|---|---|
a | 0 | style=background:#ffffcc; | 17 | 21 | 31 | 23 |
b | style=background:#ffffcc; | 17 | 0 | 30 | 34 | 21 |
c | 21 | 30 | 0 | 28 | 39 | |
d | 31 | 34 | 28 | 0 | 43 | |
e | 23 | 21 | 39 | 43 | 0 |
In this example,
D1(a,b)=17
D1
Let denote the node to which and are now connected. Setting
\delta(a,u)=\delta(b,u)=D1(a,b)/2
\delta(a,u)=\delta(b,u)=17/2=8.5
We then proceed to update the initial proximity matrix
D1
D2
D2
(a,b)
\begin{array}{lllllll} D2((a,b),c)&=&min(D1(a,c),D1(b,c))&=&min(21,30)&=&21 \\ D2((a,b),d)&=&min(D1(a,d),D1(b,d))&=&min(31,34)&=&31 \\ D2((a,b),e)&=&min(D1(a,e),D1(b,e))&=&min(23,21)&=&21 \end{array}
D2
We now reiterate the three previous actions, starting from the new distance matrix
D2
(a,b) | c | d | e | |||
---|---|---|---|---|---|---|
(a,b) | 0 | style=background:#ffffcc; | 21 | 31 | style=background:#ffffcc; | 21 |
c | style=background:#ffffcc; | 21 | 0 | 28 | 39 | |
d | 31 | 28 | 0 | 43 | ||
e | style=background:#ffffcc; | 21 | 39 | 43 | 0 |
Here,
D2((a,b),c)=21
D2((a,b),e)=21
D2
(a,b)
Let denote the node to which
(a,b)
\delta(a,v)=\delta(b,v)=\delta(c,v)=\delta(e,v)=21/2=10.5
We deduce the missing branch length:
\delta(u,v)=\delta(c,v)-\delta(a,u)=\delta(c,v)-\delta(b,u)=10.5-8.5=2
We then proceed to update the
D2
D3
(a,b)
D3(((a,b),c,e),d)=min(D2((a,b),d),D2(c,d),D2(e,d))=min(31,28,43)=28
The final
D3
((a,b),c,e) | d | |||
---|---|---|---|---|
((a,b),c,e) | 0 | style=background:#ffffcc; | 28 | |
d | style=background:#ffffcc; | 28 | 0 |
So we join clusters
((a,b),c,e)
d
Let
r
((a,b),c,e)
d
((a,b),c,e)
d
r
\delta(((a,b),c,e),r)=\delta(d,r)=28/2=14
We deduce the remaining branch length:
\delta(v,r)=\delta(a,r)-\delta(a,v)=\delta(b,r)-\delta(b,v)=\delta(c,r)-\delta(c,v)=\delta(e,r)-\delta(e,v)=14-10.5=3.5
The dendrogram is now complete. It is ultrametric because all tips (
a
b
c
e
d
r
\delta(a,r)=\delta(b,r)=\delta(c,r)=\delta(e,r)=\delta(d,r)=14
The dendrogram is therefore rooted by
r
The naive algorithm for single linkage clustering is essentially the same as Kruskal's algorithm for minimum spanning trees. However, in single linkage clustering, the order in which clusters are formed is important, while for minimum spanning trees what matters is the set of pairs of points that form distances chosen by the algorithm.
Alternative linkage schemes include complete linkage clustering, average linkage clustering (UPGMA and WPGMA), and Ward's method. In the naive algorithm for agglomerative clustering, implementing a different linkage scheme may be accomplished simply by using a different formula to calculate inter-cluster distances in the algorithm. The formula that should be adjusted has been highlighted using bold text in the above algorithm description. However, more efficient algorithms such as the one described below do not generalize to all linkage schemes in the same way.
- | Single-linkage clustering |
The naive algorithm for single-linkage clustering is easy to understand but slow, with time complexity
O(n3)
O(n2)
O(n)
n
C
i
\pi
i
C
λ
i
C
O(n)
O(n)
An alternative algorithm, running in the same optimal time and space bounds, is based on the equivalence between the naive algorithm and Kruskal's algorithm for minimum spanning trees. Instead of using Kruskal's algorithm, one can use Prim's algorithm, in a variation without binary heaps that takes time
O(n2)
O(n)
O(nlogn)
O(n)