Hierarchical clustering of networks explained

Hierarchical clustering is one method for finding community structures in a network. The technique arranges the network into a hierarchy of groups according to a specified weight function. The data can then be represented in a tree structure known as a dendrogram. Hierarchical clustering can either be agglomerative or divisive depending on whether one proceeds through the algorithm by adding links to or removing links from the network, respectively. One divisive technique is the Girvan–Newman algorithm.

Algorithm

Wij

is first assigned to each pair of vertices

(i,j)

in the network. The weight, which can vary depending on implementation (see section below), is intended to indicate how closely related the vertices are. Then, starting with all the nodes in the network disconnected, begin pairing nodes from highest to lowest weight between the pairs (in the divisive case, start from the original network and remove links from lowest to highest weight). As links are added, connected subsets begin to form. These represent the network's community structures.

The components at each iterative step are always a subset of other structures. Hence, the subsets can be represented using a tree diagram, or dendrogram. Horizontal slices of the tree at a given level indicate the communities that exist above and below a value of the weight.

Weights

There are many possible weights for use in hierarchical clustering algorithms. The specific weight used is dictated by the data as well as considerations for computational speed. Additionally, the communities found in the network are highly dependent on the choice of weighting function. Hence, when compared to real-world data with a known community structure, the various weighting techniques have been met with varying degrees of success.

Two weights that have been used previously with varying success are the number of node-independent paths between each pair of vertices and the total number of paths between vertices weighted by the length of the path. One disadvantage of these weights, however, is that both weighting schemes tend to separate single peripheral vertices from their rightful communities because of the small number of paths going to these vertices. For this reason, their use in hierarchical clustering techniques is far from optimal.[1]

Edge betweenness centrality has been used successfully as a weight in the Girvan–Newman algorithm. This technique is similar to a divisive hierarchical clustering algorithm, except the weights are recalculated with each step.

The change in modularity of the network with the addition of a node has also been used successfully as a weight.[2] This method provides a computationally less-costly alternative to the Girvan-Newman algorithm while yielding similar results.

See also

References

  1. Girvan . M. . Michelle Girvan. Newman . M. E. J. . Community structure in social and biological networks . Proceedings of the National Academy of Sciences . 99 . 12 . 2002-06-11 . 0027-8424 . 10.1073/pnas.122653799 . 7821–7826. 12060727 . 122977 . cond-mat/0112110. 2002PNAS...99.7821G . free .
  2. Newman . M. E. J. . Fast algorithm for detecting community structure in networks . Physical Review E . 69 . 6 . 2004-06-18 . 1539-3755 . 10.1103/physreve.69.066133 . 066133. 15244693 . cond-mat/0309508. 2004PhRvE..69f6133N . 301750 .