In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient (after the Greek letter τ, tau), is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897.[2]
Intuitively, the Kendall correlation between two variables is high when observations have a similar rank (that is, relative position label within the variable: 1st, 2nd, 3rd, etc.) in both variables, reaching 1 when the ranks are identical. It is low when observations have dissimilar ranks in the two variables, reaching −1 when one ranking is the exact reverse of the other.
Both Kendall's $\tau$ and Spearman's $\rho$ can be formulated as special cases of a more general correlation coefficient.
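Let $(x_1,y_1),\ldots,(x_n,y_n)$ be a set of observations of the joint random variables $X$ and $Y$, such that all the values of $x_i$ and $y_i$ are unique. Any pair of observations $(x_i,y_i)$ and $(x_j,y_j)$, where $i<j$, is said to be concordant if the sort order of $(x_i,x_j)$ and $(y_i,y_j)$ agrees: that is, if either both $x_i>x_j$ and $y_i>y_j$ hold, or both $x_i<x_j$ and $y_i<y_j$ hold. Otherwise the pair is said to be discordant.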
The Kendall τ coefficient is defined as:
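$$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{(\text{number of pairs})} = 1 - \frac{2\,(\text{number of discordant pairs})}{\binom{n}{2}},$$
where $\binom{n}{2} = \frac{n(n-1)}{2}$ is the binomial coefficient for the number of ways to choose two items from $n$ items.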
The number of discordant pairs is equal to the inversion number that permutes the y-sequence into the same order as the x-sequence.
The denominator is the total number of pair combinations, so the coefficient must be in the range −1 ≤ τ ≤ 1.
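As an illustration of the definition, the following minimal Python sketch counts concordant and discordant pairs directly and evaluates τ for a small sample without ties. The function name `kendall_tau` and the sample data are made up for this example and are not taken from any library.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Return (n_concordant, n_discordant, tau) for paired data without ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = 0
    # Visit every unordered pair of observations exactly once.
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        product = (xi - xj) * (yi - yj)
        if product > 0:        # both coordinates ordered the same way
            concordant += 1
        elif product < 0:      # coordinates ordered in opposite ways
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return concordant, discordant, (concordant - discordant) / n_pairs


# Hypothetical sample: x is already sorted, y contains three inversions.
x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 5, 4]
n_c, n_d, tau = kendall_tau(x, y)
print(n_c, n_d, tau)  # 7 concordant, 3 discordant, tau = 0.4
```

Here the three discordant pairs are exactly the three inversions of the y-sequence, and both forms of the definition agree: (7 − 3)/10 = 1 − 2·3/10 = 0.4.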
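An explicit expression for Kendall's rank coefficient is
$$\tau = \frac{2}{n(n-1)} \sum_{i<j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j).$$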
The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any assumptions on the distributions of X or Y or the distribution of (X,Y).
Under the null hypothesis of independence of X and Y, the sampling distribution of τ has an expected value of zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated exactly for small samples; for larger samples, it is common to use a normal approximation with mean zero and variance $\frac{2(2n+5)}{9n(n-1)}$.
Theorem. If the samples are independent, then the variance of $\tau_A$ is given by $\operatorname{Var}[\tau_A] = \frac{2(2n+5)}{9n(n-1)}$.
If $(x_1,y_1),\ldots,(x_n,y_n)$ are IID samples from the same jointly normal distribution with a known Pearson correlation coefficient $\rho$, then the expectation of the Kendall rank correlation has a closed-form formula, known as Greiner's equality: $\operatorname{E}[\tau] = \frac{2}{\pi}\arcsin\rho$.[3]
The name is credited to Richard Greiner (1909)[4] by P. A. P. Moran.[5]
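The closed-form relation for jointly normal data is commonly stated as E[τ] = (2/π) arcsin ρ. The short Python sketch below evaluates it, together with its algebraic inverse ρ = sin(πτ/2). Both function names are hypothetical, and the inverse is simple algebra on the population relation rather than a statement about any particular estimator.

```python
import math


def expected_tau_from_rho(rho):
    """Expected Kendall tau for jointly normal data with Pearson correlation rho."""
    return 2.0 / math.pi * math.asin(rho)


def rho_from_tau(tau):
    """Algebraic inverse of the relation above (assumes joint normality)."""
    return math.sin(math.pi * tau / 2.0)


print(expected_tau_from_rho(0.5))  # arcsin(0.5) = pi/6, so this is 1/3
```

Because arcsin is nonlinear, τ is smaller in magnitude than ρ for every ρ strictly between 0 and ±1 under this model, so the two coefficients should not be compared on the same numerical scale.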
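A pair $\{(x_i,y_i),(x_j,y_j)\}$ is said to be tied if and only if $x_i = x_j$ or $y_i = y_j$; a tied pair is neither concordant nor discordant. When tied pairs arise in the data, the coefficient may be modified in several ways to keep it in the range [−1, 1].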
The Tau-a statistic tests the strength of association of the cross tabulations. Both variables have to be ordinal. Tau-a will not make any adjustment for ties. It is defined as:
$$\tau_A = \frac{n_c - n_d}{n_0}$$
where $n_c$, $n_d$ and $n_0$ are defined as in the next section.
The Tau-b statistic, unlike Tau-a, makes adjustments for ties.[6] Values of Tau-b range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.
The Kendall Tau-b coefficient is defined as:
$$\tau_B = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}$$
where
\begin{align}
n_0 &= n(n-1)/2 \\
n_1 &= \sum_i t_i(t_i-1)/2 \\
n_2 &= \sum_j u_j(u_j-1)/2 \\
n_c &= \text{Number of concordant pairs} \\
n_d &= \text{Number of discordant pairs} \\
t_i &= \text{Number of tied values in the } i\text{th group of ties for the first quantity} \\
u_j &= \text{Number of tied values in the } j\text{th group of ties for the second quantity}
\end{align}
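To make the tie corrections concrete, here is a minimal Python sketch that computes Tau-a and Tau-b from the quantities n0, n1, n2, nc and nd defined above. The function name `kendall_tau_a_b` and the sample data are illustrative assumptions, and the quadratic pair loop is chosen for clarity rather than speed.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_a_b(x, y):
    """Return (tau_a, tau_b) for paired data that may contain ties."""
    n = len(x)
    n_c = n_d = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        product = (xi - xj) * (yi - yj)
        if product > 0:
            n_c += 1
        elif product < 0:
            n_d += 1           # tied pairs (product == 0) count as neither
    n0 = n * (n - 1) // 2
    # Counter gives the size of each group of equal values; groups of
    # size 1 contribute zero, so no special handling is required.
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())
    n2 = sum(u * (u - 1) // 2 for u in Counter(y).values())
    tau_a = (n_c - n_d) / n0
    # Undefined (division by zero) if either variable is constant.
    tau_b = (n_c - n_d) / sqrt((n0 - n1) * (n0 - n2))
    return tau_a, tau_b


# Hypothetical data with one tie in x and one tie in y.
tau_a, tau_b = kendall_tau_a_b([1, 2, 2, 3], [1, 2, 3, 3])
print(tau_a, tau_b)
```

In this example nc = 4, nd = 0, n0 = 6 and n1 = n2 = 1, so Tau-a is 4/6 while Tau-b is 4/√(5·5) = 0.8: the tie adjustment shrinks the denominator and moves the coefficient toward the value it would take without ties.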
A simple algorithm developed in BASIC computes the Tau-b coefficient using an alternative formula.[7]
Be aware that some statistical packages, e.g. SPSS, use alternative formulas for computational efficiency, with double the 'usual' number of concordant and discordant pairs.[8]
Tau-c (also called Stuart-Kendall Tau-c)[9] is more suitable than Tau-b for the analysis of data based on non-square (i.e. rectangular) contingency tables.[9] [10] Tau-b is therefore appropriate if the underlying scale of both variables has the same number of possible values (before ranking), and Tau-c if they differ. For instance, one variable might be scored on a 5-point scale (very good, good, average, bad, very bad), whereas the other might be based on a finer 10-point scale.
The Kendall Tau-c coefficient is defined as:[10]
$$\tau_C = \frac{2(n_c - n_d)}{n^2\,\frac{(m-1)}{m}} = \tau_A \cdot \frac{n-1}{n} \cdot \frac{m}{m-1}$$
where
\begin{align}
n_c &= \text{Number of concordant pairs} \\
n_d &= \text{Number of discordant pairs} \\
r &= \text{Number of rows} \\
c &= \text{Number of columns} \\
m &= \min(r, c)
\end{align}
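Since Tau-c is intended for contingency tables, the following Python sketch computes it directly from a table of counts. It assumes rows and columns are both listed in increasing order of their ordinal category; the function name `stuart_tau_c` and the example table are hypothetical.

```python
def stuart_tau_c(table):
    """Tau-c from an r x c contingency table given as a list of rows."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    n_c = n_d = 0
    for i in range(r):
        for j in range(c):
            for k in range(i + 1, r):      # cells in strictly lower rows
                for l in range(c):
                    if l > j:              # lower and to the right: concordant
                        n_c += table[i][j] * table[k][l]
                    elif l < j:            # lower and to the left: discordant
                        n_d += table[i][j] * table[k][l]
    m = min(r, c)                          # must be at least 2
    return 2 * (n_c - n_d) / (n * n * (m - 1) / m)


# Hypothetical 2 x 3 table with n = 10 observations.
table = [[3, 1, 0],
         [1, 2, 3]]
tau_c = stuart_tau_c(table)
print(tau_c)
```

For this table nc = 18 and nd = 1, so Tau-c = 2·17/(100·1/2) = 0.68, which matches the second form of the definition: τ_A = 17/45, and (17/45)·(9/10)·(2/1) = 0.68.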
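When two quantities are statistically dependent, the distribution of $\tau$ is not easily characterizable in terms of known distributions. However, for $\tau_A$ the following statistic, $z_A$, is approximately distributed as a standard normal when the variables are statistically independent:
$$z_A = \frac{n_c - n_d}{\sqrt{\frac{1}{18} v_0}}$$
where $v_0 = n(n-1)(2n+5)$.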
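Thus, to test whether two variables are statistically dependent, one computes $z_A$ and finds the cumulative probability for a standard normal distribution at $-|z_A|$. For a two-tailed test, that probability is multiplied by two to obtain the p-value. If the p-value is below a given significance level, one rejects the null hypothesis (at that significance level) that the quantities are statistically independent.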
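The test procedure for data without ties can be sketched in a few lines of Python using only the standard library. The function name `tau_a_z_test` and the example counts are illustrative assumptions; the standard normal cumulative probability is obtained from the complementary error function via Φ(−a) = erfc(a/√2)/2.

```python
import math


def tau_a_z_test(n_c, n_d, n):
    """Return (z_A, two-sided p-value) under the normal approximation, no ties."""
    v0 = n * (n - 1) * (2 * n + 5)
    z_a = (n_c - n_d) / math.sqrt(v0 / 18)
    # Two-sided p-value: 2 * Phi(-|z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z_a) / math.sqrt(2))
    return z_a, p_value


# Hypothetical counts: n = 10 observations give 45 pairs in total.
z_a, p_value = tau_a_z_test(n_c=35, n_d=10, n=10)
print(z_a, p_value)
```

With these counts v0/18 = 125, so z_A = 25/√125 = √5 ≈ 2.24. As the article notes, the normal approximation is meant for larger samples; for small n the exact distribution is preferable.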
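Several adjustments to $z_A$ are needed when accounting for ties. The following statistic, $z_B$, is the counterpart for $\tau_B$ and is again approximately distributed as a standard normal when the quantities are statistically independent:
$$z_B = \frac{n_c - n_d}{\sqrt{v}}$$
where
$$\begin{array}{ccl}
v &=& \frac{1}{18} v_0 - (v_t + v_u)/18 + (v_1 + v_2) \\
v_0 &=& n(n-1)(2n+5) \\
v_t &=& \sum_i t_i(t_i-1)(2t_i+5) \\
v_u &=& \sum_j u_j(u_j-1)(2u_j+5) \\
v_1 &=& \sum_i t_i(t_i-1) \sum_j u_j(u_j-1) / (2n(n-1)) \\
v_2 &=& \sum_i t_i(t_i-1)(t_i-2) \sum_j u_j(u_j-1)(u_j-2) / (9n(n-1)(n-2))
\end{array}$$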
This is sometimes referred to as the Mann-Kendall test.[11]
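The tie-adjusted variance has many terms, so a direct transcription into code may help in checking them. The Python sketch below computes z_B term by term from the definitions of v0, vt, vu, v1 and v2 given above. The function name `tau_b_z_statistic` is hypothetical, the pair loop is quadratic for clarity, and at least three observations are assumed so that the v2 denominator is nonzero.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def tau_b_z_statistic(x, y):
    """Return z_B = (n_c - n_d) / sqrt(v) using the tie-adjusted variance v."""
    n = len(x)
    s = 0                                   # running value of n_c - n_d
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        d = (xi - xj) * (yi - yj)
        s += (d > 0) - (d < 0)              # +1 concordant, -1 discordant, 0 tied
    t = list(Counter(x).values())           # tie-group sizes for the first quantity
    u = list(Counter(y).values())           # tie-group sizes for the second quantity
    v0 = n * (n - 1) * (2 * n + 5)
    vt = sum(ti * (ti - 1) * (2 * ti + 5) for ti in t)
    vu = sum(uj * (uj - 1) * (2 * uj + 5) for uj in u)
    v1 = (sum(ti * (ti - 1) for ti in t) * sum(uj * (uj - 1) for uj in u)
          / (2 * n * (n - 1)))
    v2 = (sum(ti * (ti - 1) * (ti - 2) for ti in t)
          * sum(uj * (uj - 1) * (uj - 2) for uj in u)
          / (9 * n * (n - 1) * (n - 2)))
    v = v0 / 18 - (vt + vu) / 18 + (v1 + v2)
    return s / sqrt(v)


z_b = tau_b_z_statistic([1, 2, 2, 3], [1, 2, 3, 3])
print(z_b)
```

When there are no ties, every group has size one, so vt, vu, v1 and v2 all vanish and z_B reduces to z_A.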
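The direct computation of the numerator $n_c - n_d$ involves two nested iterations over all pairs of observations. Although quick to implement, this algorithm is $O(n^2)$ in complexity and becomes very slow on large samples. A more sophisticated algorithm built upon the Merge Sort algorithm can be used to compute the numerator in $O(n \log n)$ time.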
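The direct quadratic computation can be sketched in Python as two nested loops that accumulate the product of signs over all pairs. The helper names `sign` and `naive_numerator` are chosen for this example only.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def naive_numerator(x, y):
    """Compute n_c - n_d by visiting every pair (i, j) with j < i."""
    numer = 0
    for i in range(1, len(x)):
        for j in range(i):
            # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
            numer += sign(x[i] - x[j]) * sign(y[i] - y[j])
    return numer


print(naive_numerator([1, 2, 3, 4, 5], [3, 1, 2, 5, 4]))  # 7 - 3 = 4
```

The loop body runs n(n − 1)/2 times, which is the source of the quadratic cost.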
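Begin by sorting the data points by the first quantity, $x$, and secondarily (among ties in $x$) by the second quantity, $y$. With this initial ordering, $y$ is not sorted, and the core of the algorithm consists of computing how many steps a Bubble Sort would take to sort this initial $y$. An enhanced Merge Sort algorithm, with $O(n \log n)$ complexity, can be applied to compute the number of swaps, $S(y)$, that would be required by a Bubble Sort to sort $y_i$. Then the numerator for $\tau$ is computed as:
$$n_c - n_d = n_0 - n_1 - n_2 + n_3 - 2S(y),$$
where $n_3$ is computed like $n_1$ and $n_2$, but with respect to the joint ties in $x$ and $y$.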
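A Merge Sort partitions the data to be sorted, $y$, into two roughly equal halves, $y_\mathrm{left}$ and $y_\mathrm{right}$, then sorts each half recursively, and then merges the two sorted halves into a fully sorted vector. The number of Bubble Sort swaps is equal to:
$$S(y) = S(y_\mathrm{left}) + S(y_\mathrm{right}) + M(Y_\mathrm{left}, Y_\mathrm{right})$$
where $Y_\mathrm{left}$ and $Y_\mathrm{right}$ are the sorted versions of $y_\mathrm{left}$ and $y_\mathrm{right}$, and $M(\cdot,\cdot)$ characterizes the Bubble Sort swap-equivalent for a merge operation. $M(\cdot,\cdot)$ is computed as depicted in the following pseudocode: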
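    function M(L[1..n], R[1..m]) is
        i := 1
        j := 1
        nSwaps := 0
        while i ≤ n and j ≤ m do
            if R[j] < L[i] then
                nSwaps := nSwaps + n − i + 1
                j := j + 1
            else
                i := i + 1
        return nSwaps

A side effect of the above steps is that you end up with both a sorted version of $x$ and a sorted version of $y$. With these, the factors $t_i$ and $u_j$ used to compute $\tau_B$ are easily obtained in a single linear-time pass through the sorted arrays.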
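The merge-sort approach can be put together as the following Python sketch. It is one possible arrangement under stated assumptions, not a reference implementation: all function names are hypothetical, the merge step is written with 0-based indexing (so the pseudocode's n − i + 1 becomes `len(left) - i`), and tie counts are taken with `collections.Counter` rather than a pass over the sorted arrays.

```python
from collections import Counter


def merge_count(left, right):
    """Merge two sorted lists; return (merged, M(left, right))."""
    merged, swaps = [], 0
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            # right[j] must move past every remaining element of left.
            swaps += len(left) - i
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, swaps


def sort_count(y):
    """Return (sorted copy of y, S(y)) by recursive merge sort."""
    if len(y) <= 1:
        return list(y), 0
    mid = len(y) // 2
    left, s_left = sort_count(y[:mid])
    right, s_right = sort_count(y[mid:])
    merged, m = merge_count(left, right)
    return merged, s_left + s_right + m


def tied_pairs(values):
    """Number of pairs of equal items, i.e. the sum of t(t-1)/2 over tie groups."""
    return sum(c * (c - 1) // 2 for c in Counter(values).values())


def fast_numerator(x, y):
    """Compute n_c - n_d = n0 - n1 - n2 + n3 - 2*S(y) in O(n log n) time."""
    pairs = sorted(zip(x, y))          # by x, then by y among ties in x
    n = len(pairs)
    _, swaps = sort_count([p[1] for p in pairs])
    n0 = n * (n - 1) // 2
    n1 = tied_pairs(x)
    n2 = tied_pairs(y)
    n3 = tied_pairs(pairs)             # joint ties in both x and y
    return n0 - n1 - n2 + n3 - 2 * swaps


print(fast_numerator([1, 2, 3, 4, 5], [3, 1, 2, 5, 4]))  # 10 - 2*3 = 4
```

Because the merge step only counts a swap when `right[j]` is strictly smaller than `left[i]`, pairs tied in y are never counted as inversions, and the secondary sort on y guarantees that pairs tied in x are not counted either. The swap count therefore equals the number of discordant pairs.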
R implements the test for $\tau_B$ as cor.test(x, y, method = "kendall") in its "stats" package (cor(x, y, method = "kendall") also works, but it does not return the p-value). All three versions of the coefficient are available in the "DescTools" package along with confidence intervals: KendallTauA(x, y, conf.level = 0.95) for $\tau_A$, KendallTauB(x, y, conf.level = 0.95) for $\tau_B$, and StuartTauC(x, y, conf.level = 0.95) for $\tau_C$.
For Python, the SciPy library implements the computation of $\tau_B$ in scipy.stats.kendalltau.