In computer science, the earth mover's distance (EMD)[1] is a measure of dissimilarity between two frequency distributions, densities, or measures over a metric space D. Informally, if the distributions are interpreted as two different ways of piling up earth (dirt) over D, the EMD captures the minimum cost of building the smaller pile using dirt taken from the larger, where cost is defined as the amount of dirt moved multiplied by the distance over which it is moved.
The EMD between probability distributions $P$ and $Q$, which in this setting coincides with the Wasserstein metric $W_1$, can be defined as an infimum over joint probabilities:

$$\mathrm{EMD}(P,Q) = \inf_{\gamma \in \Pi(P,Q)} \mathbb{E}_{(x,y)\sim\gamma}\left[d(x,y)\right]$$

where $\Pi(P,Q)$ is the set of all joint distributions whose marginals are $P$ and $Q$.
By Kantorovich-Rubinstein duality, this can also be expressed as:
$$\mathrm{EMD}(P,Q) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)]$$
where the supremum is taken over all 1-Lipschitz continuous functions, i.e.
$\|\nabla f(x)\| \le 1$ for all $x$.
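As a small worked illustration (an example added here for concreteness, not drawn from the cited sources): take $P = \tfrac{1}{2}\delta_0 + \tfrac{1}{2}\delta_1$ and $Q = \tfrac{1}{2}\delta_1 + \tfrac{1}{2}\delta_2$ on the real line with $d(x,y) = |x-y|$. The coupling that sends the mass at $0$ to $1$ and the mass at $1$ to $2$ costs

$$\tfrac{1}{2}\,|0-1| + \tfrac{1}{2}\,|1-2| = 1,$$

while the 1-Lipschitz function $f(x) = -x$ gives $\mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] = -\tfrac{1}{2} - (-\tfrac{3}{2}) = 1$ in the dual, so $\mathrm{EMD}(P,Q) = 1$.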
In some applications, it is convenient to represent a distribution as a signature, or a collection of clusters, where the $i$-th cluster represents a feature of mass $w_i$ centered at $x_i$. In this formulation, consider signatures $P = \{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}$ and $Q = \{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\}$. Let $d_{i,j}$ be the ground distance between clusters $p_i$ and $q_j$. Then the EMD between $P$ and $Q$ is given by the optimal flow $F = [f_{i,j}]$, with $f_{i,j}$ the flow between $p_i$ and $q_j$, that minimizes the overall cost:
$$\min_{F} \sum_{i=1}^{m} \sum_{j=1}^{n} f_{i,j}\, d_{i,j}$$
subject to the constraints:
$$f_{i,j} \ge 0, \qquad 1 \le i \le m,\; 1 \le j \le n$$

$$\sum_{j=1}^{n} f_{i,j} \le w_{p_i}, \qquad 1 \le i \le m$$

$$\sum_{i=1}^{m} f_{i,j} \le w_{q_j}, \qquad 1 \le j \le n$$

$$\sum_{i=1}^{m} \sum_{j=1}^{n} f_{i,j} = \min\left\{\sum_{i=1}^{m} w_{p_i},\; \sum_{j=1}^{n} w_{q_j}\right\}$$
The optimal flow $F$ is found by solving this linear optimization problem. The earth mover's distance is defined as the work normalized by the total flow:
$$\mathrm{EMD}(P,Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{i,j}\, d_{i,j}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{i,j}}$$
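The linear program above can be solved with any general-purpose LP solver. The following is a minimal sketch of that computation in Python using scipy.optimize.linprog; the function name emd_signatures and the choice of Euclidean ground distance are illustrative assumptions, not part of the original formulation.

```python
# A minimal sketch: signature EMD as the linear program above (SciPy assumed).
import numpy as np
from scipy.optimize import linprog

def emd_signatures(p, wp, q, wq):
    """EMD between signatures P = {(p_i, w_{p_i})} and Q = {(q_j, w_{q_j})}.

    p, q   : cluster centers, shapes (m, k) and (n, k)
    wp, wq : nonnegative cluster weights, shapes (m,) and (n,)
    """
    p, q = np.atleast_2d(np.asarray(p, float)), np.atleast_2d(np.asarray(q, float))
    wp, wq = np.asarray(wp, float), np.asarray(wq, float)
    m, n = len(wp), len(wq)

    # Ground distances d_{i,j} (Euclidean here), flattened to match f_{i,j}.
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    c = d.ravel()

    # Row constraints sum_j f_{i,j} <= w_{p_i} and column constraints sum_i f_{i,j} <= w_{q_j}.
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([wp, wq])

    # Total flow must equal min(sum_i w_{p_i}, sum_j w_{q_j}).
    A_eq = np.ones((1, m * n))
    b_eq = [min(wp.sum(), wq.sum())]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    flow = res.x.reshape(m, n)
    return (flow * d).sum() / flow.sum()   # work normalized by total flow
```

For the two one-dimensional signatures used in the illustration above, emd_signatures([[0.0], [1.0]], [0.5, 0.5], [[1.0], [2.0]], [0.5, 0.5]) returns 1.0.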
Some applications may require the comparison of distributions with different total masses. One approach is to allow for partial matching,[1] where dirt from the more massive distribution is rearranged to make the less massive one, and any leftover "dirt" is discarded at no cost. Formally, let $w_P$ be the total weight of $P$, and $w_Q$ be the total weight of $Q$. We have:
$$\mathrm{EMD}(P,Q) = \frac{1}{\min(w_P, w_Q)} \inf_{\gamma \in \Pi_{\ge}(P,Q)} \int d(x,y)\, \mathrm{d}\gamma(x,y)$$

where $\Pi_{\ge}(P,Q)$ is the set of all measures whose projections are $\ge P$ and $\ge Q$, respectively.
An alternative approach is to allow for mass to be created or destroyed, on a global or local level, as an alternative to transportation, but with a cost penalty. In that case one must specify a real parameter $\alpha$ that sets the penalty per unit of mass created or destroyed; the resulting measure is denoted $\widehat{\mathrm{EMD}}_{\alpha}$.
The EMD can be extended naturally to the case where more than two distributions are compared. In this case, the "distance" between the many distributions is defined as the optimal value of a linear program. This generalized EMD may be computed exactly using a greedy algorithm, and the resulting functional has been shown to be Minkowski additive and convex monotone.[5]
The EMD can be computed by solving an instance of the transportation problem, using any algorithm for the minimum-cost flow problem, e.g. the network simplex algorithm.
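In practice, off-the-shelf optimal-transport solvers can also be used. As a sketch, assuming the third-party Python Optimal Transport (POT) package is installed, its ot.emd2 routine computes the transportation cost for two normalized histograms:

```python
# Sketch assuming the POT package ("pip install pot") is available.
import numpy as np
import ot

x = np.array([[0.0], [1.0], [2.0]])    # bin locations on the real line
a = np.array([0.5, 0.5, 0.0])          # histogram P (total mass 1)
b = np.array([0.0, 0.5, 0.5])          # histogram Q (total mass 1)

M = ot.dist(x, x, metric="euclidean")  # ground-distance matrix d_{i,j}
cost = ot.emd2(a, b, M)                # optimal transportation cost, here 1.0
```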
The Hungarian algorithm can be used to get the solution if the domain D is the set $\{0, 1\}$. If the domain is integral, it can be translated to the same algorithm by representing integral bins as multiple binary bins.
As a special case, if D is a one-dimensional array of "bins" of length n, the EMD can be efficiently computed by scanning the array and keeping track of how much dirt needs to be transported between consecutive bins. Here the bins are zero-indexed:
$$\begin{align} \mathrm{EMD}_0 &= 0 \\ \mathrm{EMD}_{i+1} &= P_i + \mathrm{EMD}_i - Q_i \\ \mathrm{TotalDistance} &= \sum_{i=0}^{n} |\mathrm{EMD}_i| \end{align}$$
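A direct transcription of this scan into Python might look as follows (a sketch assuming P and Q are equal-length histograms with equal total mass; the function name is illustrative):

```python
def emd_1d(P, Q):
    """One-dimensional EMD via the running-sum recurrence above."""
    emd_i = 0.0                       # EMD_0 = 0
    total = 0.0
    for p_i, q_i in zip(P, Q):
        emd_i = p_i + emd_i - q_i     # EMD_{i+1} = P_i + EMD_i - Q_i
        total += abs(emd_i)           # accumulate |EMD_i|
    return total                      # TotalDistance

# e.g. emd_1d([0, 1, 0], [0, 0, 1]) == 1.0  (one unit of dirt moved one bin)
```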
EMD-based similarity analysis (EMDSA) is an important and effective tool in many multimedia information retrieval[6] and pattern recognition[7] applications. However, the computational cost of EMD is super-cubic in the number of "bins" for an arbitrary ground distance "D". Efficient and scalable EMD computation techniques for large scale data have been investigated using MapReduce,[8] [9] as well as bulk synchronous parallel and resilient distributed dataset.[10]
An early application of the EMD in computer science was to compare two grayscale images that may differ due to dithering, blurring, or local deformations.[11] In this case, the region is the image's domain, and the total amount of light (or ink) is the "dirt" to be rearranged.
The EMD is widely used in content-based image retrieval to compute distances between the color histograms of two digital images. In this case, the region is the RGB color cube, and each image pixel is a parcel of "dirt". The same technique can be used for any other quantitative pixel attribute, such as luminance, gradient, or apparent motion in a video frame.
More generally, the EMD is used in pattern recognition to compare generic summaries or surrogates of data records called signatures.[1] A typical signature consists of a list of pairs $((x_1, m_1), \ldots, (x_n, m_n))$, where each $x_i$ is a certain "feature" (e.g., color in an image, letter in a text, etc.), and $m_i$ is its "mass" (how many times that feature occurs in the record). Alternatively, $x_i$ may be the centroid of a data cluster, and $m_i$ the number of entities in that cluster. To compare two such signatures with the EMD, one must define a distance between features, which is interpreted as the cost of turning a unit mass of one feature into a unit mass of the other. The EMD between two signatures is then the minimum cost of turning one of them into the other.
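Continuing the earlier sketch, two such signatures could be compared as follows; the emd_signatures helper and the specific feature coordinates and masses are illustrative assumptions:

```python
# Signatures as (feature, mass) pairs; features here are points in RGB space.
P_features = [[255, 0, 0], [0, 0, 255]]    # red and blue cluster centroids
P_masses   = [30, 70]                      # pixel counts per cluster
Q_features = [[250, 10, 10], [0, 255, 0]]  # near-red and green centroids
Q_masses   = [50, 50]

dissimilarity = emd_signatures(P_features, P_masses, Q_features, Q_masses)
```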
EMD analysis has been used for quantitating multivariate changes in biomarkers measured by flow cytometry, with potential applications to other technologies that report distributions of measurements.[12]
The concept was first introduced by Gaspard Monge in 1781,[13] in the context of transportation theory. The use of the EMD as a distance measure for monochromatic images was described in 1989 by S. Peleg, M. Werman and H. Rom.[11] The name "earth mover's distance" was proposed by J. Stolfi in 1994,[14] and was used in print in 1998 by Y. Rubner, C. Tomasi and L. G. Guibas.[1]