The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets.
It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v)[1] and now is often called the critical success index in meteorology.[2] It was later developed independently by Paul Jaccard, who originally gave it the French name coefficient de communauté (community coefficient),[3][4] and was independently formulated again by T. Tanimoto. Thus, it is also called the Tanimoto index or Tanimoto coefficient in some fields. All of these formulations are identical, taking the ratio of the size of the intersection to the size of the union.
The Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}.
Note that by design, 0 \le J(A,B) \le 1. (If A and B are both empty, define J(A,B) = 1.)
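For illustration, a minimal Python sketch of this set definition (the function name and the empty-set convention shown are our own):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: J = 1 when both sets are empty
    return len(a & b) / len(a | b)

# A ∩ B = {0, 2, 5}, A ∪ B has 9 elements, so J = 3/9
print(jaccard({0, 1, 2, 5, 6}, {0, 2, 3, 4, 5, 7, 9}))  # 0.333...
```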
Jaccard similarity also applies to bags, i.e., multisets. This has a similar formula,[6] but the symbols used represent bag intersection and bag sum (not union). The maximum value is 1/2.
J(A,B) = \frac{|A \cap B|}{|A \uplus B|} = \frac{|A \cap B|}{|A| + |B|}.
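A sketch of the bag version using Python's collections.Counter, where & takes elementwise minimum counts and + is bag sum (the function name is our own):

```python
from collections import Counter

def jaccard_bag(a: Counter, b: Counter) -> float:
    """Bag Jaccard: |A ∩ B| / |A ⊎ B|; the maximum value is 1/2."""
    inter = sum((a & b).values())   # & keeps the minimum count per element
    total = sum((a + b).values())   # + is bag sum, so total = |A| + |B|
    return inter / total if total else 1.0

# Even identical bags only reach 1/2, since the denominator double-counts.
print(jaccard_bag(Counter("aab"), Counter("aab")))  # 3/6 = 0.5
```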
The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1 or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:
d_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}.
An alternative interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference A \triangle B = (A \cup B) - (A \cap B) to the size of the union.
This distance is a metric on the collection of all finite sets.[7] [8] [9]
There is also a version of the Jaccard distance for measures, including probability measures. If \mu is a measure on a measurable space X, then we define the Jaccard coefficient by

J_\mu(A,B) = \frac{\mu(A \cap B)}{\mu(A \cup B)},

and the Jaccard distance by

d_\mu(A,B) = 1 - J_\mu(A,B) = \frac{\mu(A \triangle B)}{\mu(A \cup B)}.

Care must be taken if \mu(A \cup B) = 0 or \infty, since these formulas are not well defined in those cases.
The MinHash min-wise independent permutations locality sensitive hashing scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function.
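A toy sketch of the idea, using seeded built-in hashes in place of true min-wise independent permutations (all names and parameter choices here are illustrative assumptions, not a specific library's API):

```python
import random

def minhash_signature(s: set, seeds: list) -> list:
    """Keep one minimum per seeded hash function: a constant-sized signature."""
    return [min(hash((seed, x)) for x in s) for seed in seeds]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Pr[min-hashes agree] equals J(A, B), so the match rate estimates J."""
    return sum(m == n for m, n in zip(sig_a, sig_b)) / len(sig_a)

seeds = [random.random() for _ in range(256)]
a, b = set(range(0, 60)), set(range(30, 90))   # true J = 30/90 ≈ 0.333
print(estimate_jaccard(minhash_signature(a, seeds),
                       minhash_signature(b, seeds)))
```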
Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B are specified as follows:
M11 represents the total number of attributes where A and B both have a value of 1.
M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
M00 represents the total number of attributes where A and B both have a value of 0.

|       | A = 0 | A = 1 |
|-------|-------|-------|
| B = 0 | M00   | M10   |
| B = 1 | M01   | M11   |
Each attribute must fall into one of these four categories, meaning that
M_{11} + M_{01} + M_{10} + M_{00} = n.
The Jaccard similarity coefficient, J, is given as
J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}.
The Jaccard distance, dJ, is given as
d_J = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}}.
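A small sketch computing J and d_J from two 0/1 attribute vectors by tallying these counts (the function name is our own):

```python
def jaccard_binary(a: list, b: list) -> float:
    """J = M11 / (M01 + M10 + M11) for equal-length 0/1 vectors."""
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    return m11 / (m01 + m10 + m11)

a, b = [1, 1, 0, 0, 1], [1, 0, 1, 0, 1]
print(jaccard_binary(a, b), 1 - jaccard_binary(a, b))  # J = 0.5, dJ = 0.5
```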
Statistical inference can be made based on the Jaccard similarity coefficients, and consequently related metrics. Given two sample sets A and B with n attributes, a statistical test can be conducted to see if an overlap is statistically significant. The exact solution is available, although computation can be costly as n increases. Estimation methods are available either by approximating a multinomial distribution or by bootstrapping.
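As one illustration, a percentile-bootstrap interval for J can be sketched by resampling the n attributes with replacement; the scheme and parameters below are illustrative assumptions, not the exact procedure of the methods referred to above:

```python
import random

def jaccard_binary(a, b):
    m11 = sum(1 for x, y in zip(a, b) if x == y == 1)
    mismatch = sum(1 for x, y in zip(a, b) if x != y)  # M01 + M10
    return m11 / (m11 + mismatch)

random.seed(0)
a = [random.randint(0, 1) for _ in range(200)]
b = [random.randint(0, 1) for _ in range(200)]
n = len(a)
boot = []
for _ in range(1000):
    idx = [random.randrange(n) for _ in range(n)]  # resample attribute indices
    boot.append(jaccard_binary([a[i] for i in idx], [b[i] for i in idx]))
boot.sort()
print(boot[24], boot[-25])  # rough 95% percentile interval for J
```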
When used for binary attributes, the Jaccard index is very similar to the simple matching coefficient (SMC). The main difference is that the SMC has the term M_{00} in its numerator and denominator, whereas the Jaccard index does not. Thus, the SMC counts mutual absences as matches, while the Jaccard index counts only shared presences.
In market basket analysis, for example, the basket of two consumers who we wish to compare might only contain a small fraction of all the available products in the store, so the SMC will usually return very high values of similarities even when the baskets bear very little resemblance, thus making the Jaccard index a more appropriate measure of similarity in that context. For example, consider a supermarket with 1000 products and two customers. The basket of the first customer contains salt and pepper and the basket of the second contains salt and sugar. In this scenario, the similarity between the two baskets as measured by the Jaccard index would be 1/3, but the similarity becomes 0.998 using the SMC.
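The supermarket figures can be checked directly; a minimal sketch, with SMC computed as (M11 + M00)/n:

```python
products = 1000
basket1, basket2 = {"salt", "pepper"}, {"salt", "sugar"}

m11 = len(basket1 & basket2)    # products in both baskets: 1
union = len(basket1 | basket2)  # products in either basket: 3
m00 = products - union          # products in neither basket: 997

print(m11 / union)              # Jaccard: 1/3
print((m11 + m00) / products)   # SMC: 998/1000 = 0.998
```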
In other contexts, where 0 and 1 carry equivalent information (symmetry), the SMC is a better measure of similarity. For example, vectors of demographic variables stored in dummy variables, such as gender, would be better compared with the SMC than with the Jaccard index since the impact of gender on similarity should be equal, independently of whether male is defined as a 0 and female as a 1 or the other way around. However, when we have symmetric dummy variables, one could replicate the behaviour of the SMC by splitting the dummies into two binary attributes (in this case, male and female), thus transforming them into asymmetric attributes, allowing the use of the Jaccard index without introducing any bias. The SMC remains, however, more computationally efficient in the case of symmetric dummy variables since it does not require adding extra dimensions.
If x = (x_1, x_2, \ldots, x_n) and y = (y_1, y_2, \ldots, y_n) are two vectors with all real x_i, y_i \geq 0, then their Jaccard similarity coefficient (also known as Ruzicka similarity) is defined as

J_{\mathcal{W}}(x,y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)},

and their Jaccard distance (also known as Soergel distance) as

d_{\mathcal{W}}(x,y) = 1 - J_{\mathcal{W}}(x,y).
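A minimal sketch of the weighted similarity (the function name is our own):

```python
def weighted_jaccard(x: list, y: list) -> float:
    """J_W = Σ min(x_i, y_i) / Σ max(x_i, y_i) for non-negative vectors."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 1.0

# mins sum to 1.25, maxes sum to 2.0; Soergel distance is 1 minus this
print(weighted_jaccard([0.5, 1.0, 0.0], [0.25, 1.0, 0.5]))  # 0.625
```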
With even more generality, if f and g are two non-negative measurable functions on a measurable space X with measure \mu, then we can define

J_{\mathcal{W}}(f,g) = \frac{\int \min(f,g)\,d\mu}{\int \max(f,g)\,d\mu},

where \max and \min are pointwise operators. Then the Jaccard distance is

d_{\mathcal{W}}(f,g) = 1 - J_{\mathcal{W}}(f,g).

Then, for example, for two measurable sets A, B \subseteq X, we have J_\mu(A,B) = J(\chi_A, \chi_B), where \chi_A and \chi_B are the characteristic functions of the corresponding sets.
The weighted Jaccard similarity described above generalizes the Jaccard index to positive vectors, where a set corresponds to a binary vector given by the indicator function, i.e. x_i \in \{0,1\}. However, it does not generalize the Jaccard index to probability distributions, where a set corresponds to a uniform probability distribution, i.e.

x_i = \begin{cases} \frac{1}{|X|} & i \in X \\ 0 & \text{otherwise} \end{cases}

It is always less if the sets differ in size. If |X| > |Y| and x_i = \mathbf{1}_X(i)/|X|, y_i = \mathbf{1}_Y(i)/|Y|, then

J_{\mathcal{W}}(x,y) = \frac{|X \cap Y|}{|X \setminus Y| + |X|} < J(X,Y).
Instead, a generalization that is continuous between probability distributions and their corresponding support sets is
J_{\mathcal{P}}(x,y) = \sum_{i : x_i \neq 0,\, y_i \neq 0} \frac{1}{\sum_j \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)},
which is called the "Probability" Jaccard.[10] It has the following bounds against the Weighted Jaccard on probability vectors.
J_{\mathcal{W}}(x,y) \leq J_{\mathcal{P}}(x,y) \leq \frac{2 J_{\mathcal{W}}(x,y)}{1 + J_{\mathcal{W}}(x,y)}
Here the upper bound is the (weighted) Sørensen–Dice coefficient. The corresponding distance, 1 - J_{\mathcal{P}}(x,y), is a metric over probability distributions, and a pseudo-metric over non-negative vectors.
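A direct sketch of the definition of J_P above (the function name is our own); note that uniform distributions over equal-size sets recover the set Jaccard index:

```python
def prob_jaccard(x: list, y: list) -> float:
    """J_P(x, y) = Σ_{i: x_i, y_i > 0} 1 / Σ_j max(x_j / x_i, y_j / y_i)."""
    total = 0.0
    for i in range(len(x)):
        if x[i] > 0 and y[i] > 0:
            total += 1.0 / sum(max(xj / x[i], yj / y[i])
                               for xj, yj in zip(x, y))
    return total

x = [1/3, 1/3, 1/3, 0.0]   # uniform distribution on {0, 1, 2}
y = [0.0, 1/3, 1/3, 1/3]   # uniform distribution on {1, 2, 3}
print(prob_jaccard(x, y))   # 0.5 = J({0,1,2}, {1,2,3}) = 2/4
```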
The Probability Jaccard index has a geometric interpretation as the area of an intersection of simplices. Every point on a unit k-simplex corresponds to a probability distribution on k+1 elements, because the unit k-simplex is the set of points in k+1 dimensions that sum to 1.
Consider the problem of constructing random variables that collide with each other as much as possible. That is, if X \sim x and Y \sim y, we would like to construct X and Y to maximize \Pr[X = Y]. If we look at just two distributions x, y in isolation, the highest \Pr[X = Y] we can achieve is 1 - TV(x,y), where TV is the total variation distance. However, suppose we want to maximize the collision probability of every arbitrary pair simultaneously: constructing one random variable for each distribution x and maximizing \Pr[X = Y] for all pairs x, y. In this sense, the Probability Jaccard index gives an optimal way to align these random variables.
For any sampling method G and discrete distributions x, y: if \Pr[G(x) = G(y)] > J_{\mathcal{P}}(x,y), then for some z with J_{\mathcal{P}}(x,z) > J_{\mathcal{P}}(x,y) and J_{\mathcal{P}}(y,z) > J_{\mathcal{P}}(x,y), either \Pr[G(x) = G(z)] < J_{\mathcal{P}}(x,z) or \Pr[G(y) = G(z)] < J_{\mathcal{P}}(y,z).

That is, no sampling method can achieve more collisions than J_{\mathcal{P}} on one pair without achieving fewer collisions than J_{\mathcal{P}} on another pair, where the reduced pair is more similar under J_{\mathcal{P}} than the increased pair. This theorem holds for the Jaccard index of sets (if interpreted as uniform distributions) and for the Probability Jaccard, but not for the weighted Jaccard.
This theorem has a visual proof on three-element distributions using the simplex representation.
Various forms of functions described as Tanimoto similarity and Tanimoto distance occur in the literature and on the Internet. Most of these are synonyms for Jaccard similarity and Jaccard distance, but some are mathematically different. Many sources[11] cite an IBM Technical Report[12] as the seminal reference. The report is available from several libraries.
In "A Computer Program for Classifying Plants", published in October 1960,[13] a method of classification based on a similarity ratio, and a derived distance function, is given. It seems that this is the most authoritative source for the meaning of the terms "Tanimoto similarity" and "Tanimoto Distance". The similarity ratio is equivalent to Jaccard similarity, but the distance function is not the same as Jaccard distance.
In that paper, a "similarity ratio" is given over bitmaps, where each bit of a fixed-size array represents the presence or absence of a characteristic in the plant being modelled. The definition of the ratio is the number of common bits, divided by the number of bits set (i.e. nonzero) in either sample.
Presented in mathematical terms, if samples X and Y are bitmaps, X_i is the ith bit of X, and \land, \lor are bitwise AND and OR operators respectively, then the similarity ratio T_s is

T_s(X,Y) = \frac{\sum_i (X_i \land Y_i)}{\sum_i (X_i \lor Y_i)}.
If each sample is modelled instead as a set of attributes, this value is equal to the Jaccard coefficient of the two sets. Jaccard is not cited in the paper, and it seems likely that the authors were not aware of it.
Tanimoto goes on to define a "distance coefficient" based on this ratio, defined for bitmaps with non-zero similarity:
T_d(X,Y) = -\log_2(T_s(X,Y)).
This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different from each other, to both be similar to a third. It is easy to construct an example that violates the triangle inequality, as shown below.
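For instance, with sets chosen purely for illustration, X = {a, c}, Y = {b, c}, and Z = {a, b, c}:

```python
from math import log2

def tanimoto_distance(x: set, y: set) -> float:
    """T_d = -log2(T_s); defined only for pairs with non-zero similarity."""
    return -log2(len(x & y) / len(x | y))

x, y, z = {"a", "c"}, {"b", "c"}, {"a", "b", "c"}
print(tanimoto_distance(x, y))                            # -log2(1/3) ≈ 1.585
print(tanimoto_distance(x, z) + tanimoto_distance(y, z))  # 2·(-log2(2/3)) ≈ 1.170
# T_d(x, y) > T_d(x, z) + T_d(z, y): the triangle inequality fails.
```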
Tanimoto distance is often referred to, erroneously, as a synonym for Jaccard distance, 1 - T_s. This function is a proper distance metric.
If Jaccard or Tanimoto similarity is expressed over a bit vector, then it can be written as
f(A,B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B},
where the same calculation is expressed in terms of vector scalar product and magnitude. This representation relies on the fact that, for a bit vector (where the value of each dimension is either 0 or 1) then
A \cdot B = \sum_i A_i B_i = \sum_i (A_i \land B_i)
and
\|A\|^2 = \sum_i A_i^2 = \sum_i A_i.
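A quick numerical check of this identity on 0/1 vectors (the function name is our own):

```python
def tanimoto_vector(a: list, b: list) -> float:
    """f(A, B) = A·B / (||A||² + ||B||² − A·B); equals Jaccard on 0/1 vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

a, b = [1, 1, 0, 1], [1, 0, 1, 1]
print(tanimoto_vector(a, b))  # 2 / (3 + 3 - 2) = 0.5, matching |A∩B|/|A∪B|
```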
This is a potentially confusing representation, because the function as expressed over vectors is more general, unless its domain is explicitly restricted. Properties of T_s do not necessarily extend to f. In particular, the difference function 1 - f does not preserve the triangle inequality, and is therefore not a proper distance metric, whereas 1 - T_s is.

There is a real danger that the combination of "Tanimoto distance" being defined using this formula, along with the statement "Tanimoto distance is a proper distance metric", will lead to the false conclusion that the function 1 - f is in fact a distance metric over vectors or multisets in general, whereas it is not.
Lipkus uses a definition of Tanimoto similarity which is equivalent to f, and refers to Tanimoto distance as the function 1 - f. It is, however, made clear within the paper that the context is restricted by the use of a (positive) weighting vector W such that, for any vector A being considered, A_i \in \{0, W_i\}. Under these circumstances, the function is a proper distance metric, and so a set of vectors governed by such a weighting vector forms a metric space under this function.
In confusion matrices employed for binary classification, the Jaccard index can be framed in the following formula:
\text{Jaccard index} = \frac{TP}{TP + FP + FN},
where TP are the true positives, FP the false positives and FN the false negatives.[14]
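For example, computed from true and predicted 0/1 labels (a minimal sketch; names are our own):

```python
def jaccard_from_labels(y_true: list, y_pred: list) -> float:
    """Jaccard index = TP / (TP + FP + FN) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp + fn)

print(jaccard_from_labels([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))  # 2/4 = 0.5
```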
The Jaccard index is also related to the Sørensen–Dice coefficient: J = S/(2 - S) and S = 2J/(1 + J), where J is the Jaccard index and S is the Sørensen–Dice coefficient.
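These conversions are easy to verify numerically, e.g.:

```python
def dice_from_jaccard(j: float) -> float:
    return 2 * j / (1 + j)   # S = 2J / (1 + J)

def jaccard_from_dice(s: float) -> float:
    return s / (2 - s)       # J = S / (2 − S)

j = 1 / 3
print(dice_from_jaccard(j))                     # 0.5
print(jaccard_from_dice(dice_from_jaccard(j)))  # back to 1/3
```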