Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based[1] clusters in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander.[2] Its basic idea is similar to DBSCAN,[3] but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the points of the database are (linearly) ordered such that spatially closest points become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that must be accepted for a cluster so that both points belong to the same cluster. This is represented as a dendrogram.
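Like DBSCAN, OPTICS requires two parameters: $\varepsilon$, which describes the maximum distance (radius) to consider, and $\mathit{MinPts}$, describing the number of points required to form a cluster. A point $p$ is a core point if at least $\mathit{MinPts}$ points are found within its $\varepsilon$-neighborhood $N_\varepsilon(p)$. Each point is also assigned a core distance, the distance to its $\mathit{MinPts}$-th closest point:

$$\text{core-dist}_{\varepsilon,\mathit{MinPts}}(p)=\begin{cases}\text{UNDEFINED} & \text{if } |N_\varepsilon(p)| < \mathit{MinPts}\\ \mathit{MinPts}\text{-th smallest distance in } N_\varepsilon(p) & \text{otherwise}\end{cases}$$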
The reachability-distance of another point $o$ from a point $p$ is either the distance between $o$ and $p$, or the core distance of $p$, whichever is bigger:

$$\text{reachability-dist}_{\varepsilon,\mathit{MinPts}}(o,p)=\begin{cases}\text{UNDEFINED} & \text{if } |N_\varepsilon(p)| < \mathit{MinPts}\\ \max(\text{core-dist}_{\varepsilon,\mathit{MinPts}}(p),\ \text{dist}(p,o)) & \text{otherwise}\end{cases}$$
If $p$ and $o$ are nearest neighbors, this is the $\varepsilon' < \varepsilon$ that must be assumed for $p$ and $o$ to belong to the same cluster.
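The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names, the use of Euclidean distance, the convention that a point counts as a member of its own neighborhood, and the use of `None` for UNDEFINED are all assumptions, not part of the definitions.

```python
import math

UNDEFINED = None  # assumed stand-in for the UNDEFINED value


def neighbors(points, p, eps):
    """Indices of all points within distance eps of points[p] (p included)."""
    return [i for i, q in enumerate(points) if math.dist(points[p], q) <= eps]


def core_distance(points, p, eps, min_pts):
    """MinPts-th smallest distance in the eps-neighborhood of p, or UNDEFINED."""
    dists = sorted(math.dist(points[p], points[i])
                   for i in neighbors(points, p, eps))
    if len(dists) < min_pts:
        return UNDEFINED
    return dists[min_pts - 1]


def reachability_distance(points, o, p, eps, min_pts):
    """Reachability distance of points[o] from points[p]."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is UNDEFINED:
        return UNDEFINED
    return max(cd, math.dist(points[p], points[o]))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
print(core_distance(points, 0, eps=3.0, min_pts=3))             # 2.0
print(reachability_distance(points, 1, 0, eps=3.0, min_pts=3))  # 2.0
print(core_distance(points, 3, eps=3.0, min_pts=3))             # None
```

In the example, point 1 lies at distance 1 from point 0, but its reachability distance is 2.0 because the core distance of point 0 is the larger of the two values.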
Both core-distance and reachability-distance are undefined if no sufficiently dense cluster (w.r.t. $\varepsilon$) is available. Given a sufficiently large $\varepsilon$, this never happens, but then every $\varepsilon$-neighborhood query returns the entire database, resulting in $O(n^2)$ runtime.
The parameter $\varepsilon$ is, strictly speaking, not necessary. It can simply be set to the maximum possible value. When a spatial index is available, however, it does play a practical role with regard to complexity. OPTICS abstracts from DBSCAN by removing this parameter, at least to the extent of only having to give the maximum value.
The basic approach of OPTICS is similar to DBSCAN, but instead of maintaining known, but so far unprocessed cluster members in a set, they are maintained in a priority queue (e.g. using an indexed heap).
function OPTICS(DB, ε, MinPts) is
    for each point p of DB do
        p.reachability-distance = UNDEFINED
    for each unprocessed point p of DB do
        N = getNeighbors(p, ε)
        mark p as processed
        output p to the ordered list
        if core-distance(p, ε, MinPts) != UNDEFINED then
            Seeds = empty priority queue
            update(N, p, Seeds, ε, MinPts)
            for each next q in Seeds do
                N' = getNeighbors(q, ε)
                mark q as processed
                output q to the ordered list
                if core-distance(q, ε, MinPts) != UNDEFINED do
                    update(N', q, Seeds, ε, MinPts)
In update, the priority queue Seeds is updated with the $\varepsilon$-neighborhood of $p$ and $q$, respectively:
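function update(N, p, Seeds, ε, MinPts) is
    coredist = core-distance(p, ε, MinPts)
    for each o in N
        if o is not processed then
            new-reach-dist = max(coredist, dist(p,o))
            if o.reachability-distance == UNDEFINED then // o is not in Seeds
                o.reachability-distance = new-reach-dist
                Seeds.insert(o, new-reach-dist)
            else // o is in Seeds, check for improvement
                if new-reach-dist < o.reachability-distance then
                    o.reachability-distance = new-reach-dist
                    Seeds.move-up(o, new-reach-dist)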
OPTICS hence outputs the points in a particular ordering, annotated with their smallest reachability distance (in the original algorithm, the core distance is also exported, but this is not required for further processing).
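The procedure can be sketched as runnable Python. This is a minimal illustration under stated assumptions, not a reference implementation: neighborhood queries are brute force (no spatial index), distances are Euclidean, `None` stands for UNDEFINED, and a point counts as a member of its own neighborhood. Python's `heapq` has no move-up operation, so the sketch swaps in lazy deletion: an improved entry is pushed again, and outdated entries are skipped when popped.

```python
import heapq
import math


def optics(points, eps, min_pts):
    """Return (ordering, reachability, core) for a list of coordinate tuples."""
    n = len(points)
    reach = [None] * n        # None plays the role of UNDEFINED
    core = [None] * n
    processed = [False] * n
    ordering = []

    def get_neighbors(p):
        # Brute-force eps-neighborhood query (includes p itself), O(n) per call.
        return [i for i in range(n) if math.dist(points[p], points[i]) <= eps]

    def core_distance(p, nbrs):
        if len(nbrs) < min_pts:
            return None
        return sorted(math.dist(points[p], points[i]) for i in nbrs)[min_pts - 1]

    def update(nbrs, p, seeds):
        for o in nbrs:
            if processed[o]:
                continue
            new_reach = max(core[p], math.dist(points[p], points[o]))
            if reach[o] is None or new_reach < reach[o]:
                reach[o] = new_reach
                # Lazy stand-in for move-up: push again, skip stale entries later.
                heapq.heappush(seeds, (new_reach, o))

    def process(p, seeds):
        nbrs = get_neighbors(p)
        processed[p] = True
        ordering.append(p)
        core[p] = core_distance(p, nbrs)
        if core[p] is not None:
            update(nbrs, p, seeds)

    for p in range(n):
        if processed[p]:
            continue
        seeds = []
        process(p, seeds)
        while seeds:
            _, q = heapq.heappop(seeds)
            if not processed[q]:      # ignore stale heap entries
                process(q, seeds)
    return ordering, reach, core


# Two tight groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 20)]
ordering, reach, core = optics(points, eps=math.inf, min_pts=3)
print(ordering)
print([None if r is None else round(r, 3) for r in reach])
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
# [None, 1.0, 1.0, 1.0, 12.728, 1.0, 1.0, 1.0, 10.296]
```

The first point of the ordering keeps an undefined reachability distance. The large values at points 4 and 8 mark the jumps from one dense region to the next.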
Using a reachability-plot (a special kind of dendrogram), the hierarchical structure of the clusters can be obtained easily. It is a 2D plot, with the ordering of the points as processed by OPTICS on the x-axis and the reachability distance on the y-axis. Since points belonging to a cluster have a low reachability distance to their nearest neighbor, the clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.
The image above illustrates this concept. In its upper left area, a synthetic example data set is shown. The upper right part visualizes the spanning tree produced by OPTICS, and the lower part shows the reachability plot as computed by OPTICS. Colors in this plot are labels and are not computed by the algorithm, but it is clearly visible how the valleys in the plot correspond to the clusters in the data set above. The yellow points in this image are considered noise, and no valley is found in their reachability plot. They are usually not assigned to clusters, except the omnipresent "all data" cluster in a hierarchical result.
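Extracting clusters from this plot can be done manually by selecting ranges on the x-axis after visual inspection, by selecting a threshold on the y-axis (the result is then similar to a DBSCAN clustering result with the same $\varepsilon$ and $\mathit{MinPts}$ parameters), or automatically by algorithms that detect the valleys in the plot, such as the ξ extraction method.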
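Selecting a threshold on the y-axis can be sketched as a single pass over the ordering. The function name and the input layout (lists indexed by point, `None` for an undefined distance, `-1` for noise) are assumptions made for this illustration. The sketch also relies on each point's core distance being available, which it uses to decide whether a point above the threshold starts a new cluster or is noise.

```python
NOISE = -1


def extract_by_threshold(ordering, reach, core, threshold):
    """Label points by cutting the reachability plot at a fixed threshold."""
    labels = [NOISE] * len(ordering)
    cluster_id = NOISE
    for p in ordering:
        if reach[p] is None or reach[p] > threshold:
            # p is not density-reachable from the points before it.
            if core[p] is not None and core[p] <= threshold:
                cluster_id += 1          # p is a core point: start a new cluster
                labels[p] = cluster_id
            else:
                labels[p] = NOISE
        else:
            labels[p] = cluster_id       # p continues the current valley
    return labels


# Hypothetical output of an OPTICS run on nine points (values rounded).
ordering = [0, 1, 2, 3, 4, 5, 6, 7, 8]
reach = [None, 1.0, 1.0, 1.0, 12.728, 1.0, 1.0, 1.0, 10.296]
core = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.817]
print(extract_by_threshold(ordering, reach, core, threshold=2.0))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Raising the threshold above 12.728 would merge the two valleys into one cluster, which is why a single cut cannot separate clusters of very different density.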
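Like DBSCAN, OPTICS processes each point once, and performs one $\varepsilon$-neighborhood query during this processing. Given a spatial index that answers a neighborhood query in $O(\log n)$ runtime, an overall runtime of $O(n \cdot \log n)$ is obtained. In the worst case, however, the runtime is $O(n^2)$, as with DBSCAN. The value of $\varepsilon$ can heavily influence the cost of the algorithm, since too large a value raises the cost of a neighborhood query to linear complexity.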
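In particular, choosing $\varepsilon > \max_{x,y} d(x,y)$ (larger than the maximum distance in the data set) is possible, but leads to quadratic complexity, since every neighborhood query then returns the full data set. $\varepsilon$ should therefore be chosen appropriately for the data set.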
OPTICS-OF[5] is an outlier detection algorithm based on OPTICS. The main use is the extraction of outliers from an existing run of OPTICS at low cost compared to using a different outlier detection method. The better known version LOF is based on the same concepts.
DeLi-Clu[6] (Density-Link-Clustering) combines ideas from single-linkage clustering and OPTICS, eliminating the $\varepsilon$ parameter.
HiSC[7] is a hierarchical subspace clustering (axis-parallel) method based on OPTICS.
HiCO[8] is a hierarchical correlation clustering algorithm based on OPTICS.
DiSH[9] is an improvement over HiSC that can find more complex hierarchies.
FOPTICS[10] is a faster implementation using random projections.
HDBSCAN*[11] is based on a refinement of DBSCAN, excluding border-points from the clusters and thus following more strictly the basic definition of density-levels by Hartigan.[12]
Java implementations of OPTICS, OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH are available in the ELKI data mining framework (with index acceleration for several distance functions, and with automatic cluster extraction using the ξ extraction method). Other Java implementations include the Weka extension (no support for ξ cluster extraction).
The R package "dbscan" includes a C++ implementation of OPTICS (with both traditional dbscan-like and ξ cluster extraction) using a k-d tree for index acceleration for Euclidean distance only.
Python implementations of OPTICS are available in the PyClustering library and in scikit-learn. HDBSCAN* is available in the hdbscan library.