In statistics, and especially in biostatistics, cophenetic correlation[1] (more precisely, the cophenetic correlation coefficient) is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. Although it has been most widely applied in the field of biostatistics (typically to assess cluster-based models of DNA sequences, or other taxonomic models), it can also be used in other fields of inquiry where raw data tend to occur in clumps, or clusters.[2] This coefficient has also been proposed for use as a test for nested clusters.[3]
Suppose that the original data $\{X_i\}$ have been modeled using a clustering method to produce a dendrogram $\{T_i\}$; that is, a simplified model in which data that are "close" have been grouped into a hierarchical tree. Define the following distance measures.
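$x(i,j) = |X_i - X_j|$, the Euclidean distance between the $i$th and $j$th observations.
$t(i,j)$, the dendrogrammatic distance between the model points $T_i$ and $T_j$. This distance is the height of the node at which these two points are first joined together.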
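Then, letting $\bar{x}$ be the average of the $x(i,j)$, and letting $\bar{t}$ be the average of the $t(i,j)$, the cophenetic correlation coefficient $c$ is given by
$$c = \frac{\sum_{i<j} [x(i,j) - \bar{x}]\,[t(i,j) - \bar{t}]}{\sqrt{\sum_{i<j} [x(i,j) - \bar{x}]^2 \, \sum_{i<j} [t(i,j) - \bar{t}]^2}}.$$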
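The formula can be illustrated with a minimal sketch in plain Python. The sketch assumes single-linkage clustering and a small made-up data set of four points on a line; the function names (`pairwise_distances`, `single_linkage_cophenetic`, `cophenetic_correlation`) are hypothetical helpers written for this illustration and do not come from any library.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Return {(i, j): Euclidean distance x(i, j)} for all i < j."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def single_linkage_cophenetic(x, n):
    """Naive single-linkage clustering of n observations.

    Returns {(i, j): t(i, j)}, where t(i, j) is the height at which
    observations i and j are first joined in the same cluster.
    """
    clusters = [[i] for i in range(n)]
    t = {}
    while len(clusters) > 1:
        # Find the two clusters with the smallest minimum inter-point distance.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            h = min(
                x[min(p, q), max(p, q)]
                for p in clusters[a]
                for q in clusters[b]
            )
            if best is None or h < best[0]:
                best = (h, a, b)
        h, a, b = best
        # Every cross pair is joined for the first time at height h.
        for p in clusters[a]:
            for q in clusters[b]:
                t[min(p, q), max(p, q)] = h
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return t


def cophenetic_correlation(x, t):
    """Pearson correlation between x(i, j) and t(i, j) over all pairs i < j."""
    pairs = sorted(x)
    xs = [x[p] for p in pairs]
    ts = [t[p] for p in pairs]
    x_bar = sum(xs) / len(xs)
    t_bar = sum(ts) / len(ts)
    num = sum((a - x_bar) * (b - t_bar) for a, b in zip(xs, ts))
    den = math.sqrt(
        sum((a - x_bar) ** 2 for a in xs) * sum((b - t_bar) ** 2 for b in ts)
    )
    return num / den


# Made-up example data: four one-dimensional observations.
points = [(0.0,), (1.0,), (5.0,), (7.0,)]
x = pairwise_distances(points)
t = single_linkage_cophenetic(x, len(points))
c = cophenetic_correlation(x, t)
print(f"cophenetic correlation: {c:.3f}")
```

In this example the two left-hand points join at height 1, the two right-hand points at height 2, and the two groups at height 4, so every cross-group pair has the same dendrogrammatic distance even though the original distances differ. That loss of detail is what keeps the coefficient below 1.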
It is possible to calculate the cophenetic correlation in R using the dendextend R package.[5]
In Python, the SciPy package also has an implementation.[6]
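As a brief sketch of the SciPy route, the snippet below uses `scipy.cluster.hierarchy.cophenet` together with `linkage` and `scipy.spatial.distance.pdist`. The synthetic two-cluster data set and the choice of average linkage are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Synthetic data (an assumption for this sketch): two loose 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(10, 2)),
    rng.normal(5.0, 0.5, size=(10, 2)),
])

Y = pdist(X)                      # condensed vector of original distances x(i, j)
Z = linkage(Y, method="average")  # hierarchical clustering (average linkage)

# Given the original distances, cophenet returns the coefficient c and the
# condensed vector of dendrogrammatic distances t(i, j).
c, coph_dists = cophenet(Z, Y)
print(f"cophenetic correlation: {c:.3f}")
```

The returned coefficient is the ordinary Pearson correlation between `Y` and `coph_dists`, so it can be cross-checked with `np.corrcoef(Y, coph_dists)[0, 1]`.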
In MATLAB, the Statistics and Machine Learning Toolbox contains an implementation.[7]