Shape context is a feature descriptor used in object recognition. Serge Belongie and Jitendra Malik proposed the term in their paper "Matching with Shape Contexts" in 2000.[1]
The shape context is intended to be a way of describing shapes that allows for measuring shape similarity and recovering point correspondences.[1] The basic idea is to pick n points on the contours of a shape. For each point p_i on the shape, consider the n − 1 vectors obtained by connecting p_i to all other points. The set of all these vectors is a rich description of the shape localized at that point, but it is far too detailed. The key insight is that the distribution of these relative positions is a robust, compact, and highly discriminative descriptor. So, for the point p_i, the coarse histogram of the relative coordinates of the remaining n − 1 points,
h_i(k) = \#\{q \ne p_i : (q - p_i) \in \text{bin}(k)\}

is defined to be the shape context of p_i. The bins are normally taken to be uniform in log-polar space, making the descriptor more sensitive to the positions of nearby sample points than to those of points farther away.
Figure: (a) and (b) are the sampled edge points of the two shapes; (c) is the diagram of the log-polar bins used to compute the shape context; (d) is the shape context for the point marked with a circle in (a), (e) is that for the point marked with a diamond in (b), and (f) is that for the point marked with a triangle. Since (d) and (e) are the shape contexts for two closely related points, they are quite similar, while the shape context in (f) is very different.
For a feature descriptor to be useful, it needs to have certain invariances. In particular, it needs to be invariant to translation, scaling, and small perturbations, and, depending on the application, rotation. Translational invariance comes naturally to shape context, since all measurements are taken relative to points on the object. Scale invariance is obtained by normalizing all radial distances by the mean distance \alpha between all point pairs in the shape.
One can provide complete rotational invariance in shape contexts. One way is to measure angles at each point relative to the direction of the tangent at that point (since the points are chosen on edges), which yields a completely rotationally invariant descriptor. But this is not always desired, since some local features lose their discriminative power if not measured relative to the same frame. Many applications in fact forbid rotational invariance, e.g. distinguishing a "6" from a "9".
A complete system that uses shape contexts for shape matching consists of the following steps (which will be covered in more detail in the Details of Implementation section):

1. Find a set of points on the edges of each shape.
2. Compute the shape context of each point.
3. Compute the cost matrix of matching each point on the first shape to each point on the second shape.
4. Find the one-to-one matching that minimizes the total matching cost.
5. Estimate a transformation (e.g. affine or thin plate spline) that warps one shape onto the other using the correspondences.
6. Compute the shape distance between the two shapes, and use it in a nearest-neighbor classifier for object recognition.
The approach assumes that the shape of an object is essentially captured by a finite subset of the points on the internal or external contours of the object. These can be obtained simply by running the Canny edge detector and picking a random set of points from the edges. Note that these points need not, and in general do not, correspond to key-points such as maxima of curvature or inflection points. It is preferable to sample the shape with roughly uniform spacing, though it is not critical.[2]
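A minimal sketch of this sampling step, assuming OpenCV and NumPy are available (the function name and Canny thresholds are illustrative choices, not part of the original method):

```python
import cv2
import numpy as np

def sample_edge_points(gray, n_points=100, rng=None):
    """Pick a random subset of Canny edge pixels as shape points."""
    rng = np.random.default_rng() if rng is None else rng
    edges = cv2.Canny(gray, 100, 200)          # binary edge map (thresholds illustrative)
    ys, xs = np.nonzero(edges)                 # coordinates of all edge pixels
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1).astype(float)  # (n, 2) array of (x, y)
```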
This step is described in detail in the Theory section.
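A sketch of the histogram computation under common assumptions (5 radial and 12 angular log-polar bins, distances normalized by the mean pairwise distance for scale invariance); the bin edges and names are illustrative:

```python
import numpy as np

def shape_contexts(points, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Log-polar histograms of relative point positions, one per point."""
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]       # diff[i, j] = p_j - p_i
    dist = np.linalg.norm(diff, axis=2)
    dist = dist / dist[np.triu_indices(n, k=1)].mean()   # normalize by mean pairwise distance
    theta = np.arctan2(diff[..., 1], diff[..., 0])       # angles in (-pi, pi]
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_r + 1)
    r_bin = np.digitize(dist, r_edges) - 1               # log-spaced radial bins
    t_bin = np.floor((theta + np.pi) / (2 * np.pi / n_theta)).astype(int) % n_theta
    hists = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i == j or not (0 <= r_bin[i, j] < n_r):
                continue                                 # skip self and out-of-range points
            hists[i, r_bin[i, j] * n_theta + t_bin[i, j]] += 1
    return hists / np.maximum(hists.sum(axis=1, keepdims=True), 1)  # normalized histograms
```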
Consider two points p and q that have normalized K-bin histograms (i.e. shape contexts) g(k) and h(k). As shape contexts are distributions represented as histograms, it is natural to use the χ2 test statistic as the "shape context cost" of matching the two points:
C_S = \frac{1}{2} \sum_{k=1}^{K} \frac{\left[g(k) - h(k)\right]^2}{g(k) + h(k)}
The values of this range from 0 to 1.[1] In addition to the shape context cost, an extra cost based on the appearance can be added. For instance, it could be a measure of tangent angle dissimilarity (particularly useful in digit recognition):
C_A = \frac{1}{2} \left\| \dbinom{\cos\theta_1}{\sin\theta_1} - \dbinom{\cos\theta_2}{\sin\theta_2} \right\|
This is half the length of the chord of the unit circle between the unit vectors with angles \theta_1 and \theta_2. The two costs are then combined into a total cost using a weighting parameter \beta:

C = (1 - \beta) C_S + \beta C_A
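A possible NumPy formulation of the χ² cost between all pairs of histograms; the eps guard for bins empty in both histograms is an implementation choice, not part of the formula:

```python
import numpy as np

def chi2_cost_matrix(G, H, eps=1e-10):
    """C[i, j] = 0.5 * sum_k (G[i, k] - H[j, k])**2 / (G[i, k] + H[j, k])."""
    num = (G[:, None, :] - H[None, :, :]) ** 2
    den = G[:, None, :] + H[None, :, :] + eps   # eps guards bins empty in both histograms
    return 0.5 * (num / den).sum(axis=2)

# The combined cost with an appearance cost matrix CA of the same shape would be
# C = (1 - beta) * chi2_cost_matrix(G, H) + beta * CA.
```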
Now, for each point p_i on the first shape and each point q_j on the second shape, calculate the cost as described above and call it C_{i,j}. This is the cost matrix.
Now, a one-to-one matching \pi(i) that matches each point p_i on the first shape to a point q_{\pi(i)} on the second shape and minimizes the total matching cost

H(\pi) = \sum_i C\left(p_i, q_{\pi(i)}\right)

is needed. This can be done in O(N^3) time, for example with the Hungarian method, although more efficient algorithms exist.
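SciPy's linear_sum_assignment solves exactly this assignment problem; a sketch:

```python
from scipy.optimize import linear_sum_assignment

def match_points(cost):
    """Optimal one-to-one matching minimizing H(pi) = sum_i C(p_i, q_pi(i))."""
    rows, cols = linear_sum_assignment(cost)  # Hungarian-style solver
    return cols                               # cols[i] is pi(i) for point p_i
```

A common refinement is to pad the cost matrix with constant-cost "dummy" rows and columns, so that badly matching points are assigned to dummies rather than distorting the matching.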
Given the set of correspondences between a finite set of points on the two shapes, a transformation T : R^2 \to R^2 can be estimated to map any point from one shape to the other. There are several choices for this transformation, described below.

The affine model is a standard choice: T(p) = Ap + o, for some affine matrix A and translational offset vector o. The least squares solution for A and o is given by
o = \frac{1}{n} \sum_{i=1}^{n} \left(p_i - q_{\pi(i)}\right), \qquad A = \left(Q^{+} P\right)^{t}
where

P = \begin{pmatrix} 1 & p_{11} & p_{12} \\ \vdots & \vdots & \vdots \\ 1 & p_{n1} & p_{n2} \end{pmatrix}

with a similar construction for Q from the matched points, and Q^{+} is the pseudoinverse of Q.
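These formulas translate directly to NumPy (a sketch; p and q are assumed to be already reordered so that q[i] is the match of p[i]):

```python
import numpy as np

def fit_affine(p, q):
    """Least-squares affine fit; q is assumed reordered so q[i] matches p[i]."""
    n = len(p)
    o = (p - q).mean(axis=0)             # o = (1/n) sum (p_i - q_pi(i))
    P = np.hstack([np.ones((n, 1)), p])  # rows (1, p_i1, p_i2)
    Q = np.hstack([np.ones((n, 1)), q])
    A = (np.linalg.pinv(Q) @ P).T        # A = (Q+ P)^t
    return A, o
```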
The thin plate spline (TPS) model is the most widely used model for transformations when working with shape contexts. A 2D transformation can be separated into two TPS functions to model a coordinate transform:
T(x, y) = \left(f_x(x, y), f_y(x, y)\right)
where each of f_x and f_y has the form:
f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{n} \omega_i U\left(\left\|(x_i, y_i) - (x, y)\right\|\right),

and the kernel function U(r) is defined by U(r) = r^2 \log r^2.
The TPS formulation above has an exact matching requirement for the pairs of points on the two shapes. For noisy data, it is best to relax this exact requirement. If we let v_i denote the target function values at the corresponding locations p_i = (x_i, y_i) (in particular, v_i = x'_i when fitting f_x and v_i = y'_i when fitting f_y, where (x'_i, y'_i) is the point matched to p_i), regularized TPS minimizes

H[f] = \sum_{i=1}^{n} \left(v_i - f(x_i, y_i)\right)^2 + \lambda I_f

where I_f is a measure of the bending energy of f and \lambda is a regularization parameter controlling the trade-off between exactly matching the points (x_i, y_i) and (x'_i, y'_i) and obtaining a smooth transformation.
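One way to obtain such a regularized TPS fit off the shelf is SciPy's RBFInterpolator, whose 'thin_plate_spline' kernel is r^2 \log r (note U(r) = r^2 \log r^2 = 2 r^2 \log r, so the constant factor is absorbed into the weights \omega_i) and whose smoothing argument plays the role of \lambda. A sketch, not the authors' implementation:

```python
from scipy.interpolate import RBFInterpolator

def fit_tps(source_pts, target_pts, lam=1.0):
    """Regularized TPS mapping source_pts -> target_pts.

    Fits f_x and f_y at once; `smoothing` plays the role of lambda above.
    """
    return RBFInterpolator(source_pts, target_pts,
                           kernel='thin_plate_spline', smoothing=lam)

# Usage: T = fit_tps(q_matched, p); warped = T(q_matched)
```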
Note that in many cases, regardless of the transformation used, the initial estimate of the correspondences contains some errors, which can reduce the quality of the transformation. If we iterate the steps of finding correspondences and estimating transformations (i.e. repeating steps 2–5 with the newly transformed shape), we can overcome this problem. Typically, three iterations are all that is needed to obtain reasonable results.
Now a shape distance between two shapes P and Q can be defined. This distance is a weighted sum of three potential terms:
Shape context distance: this is the symmetric sum of shape context matching costs over best matching points:
D_{sc}(P, Q) = \frac{1}{n} \sum_{p \in P} \min_{q \in Q} C(p, T(q)) + \frac{1}{m} \sum_{q \in Q} \min_{p \in P} C(p, T(q))

where n and m are the numbers of sampled points on P and Q respectively, and T(·) is the estimated TPS transform that maps the points in Q to those in P.
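Read as the mean best-match cost in each direction, this can be computed directly from the cost matrix between P and the warped Q (a sketch; C_PQ[i, j] is assumed to hold C(p_i, T(q_j))):

```python
import numpy as np

def shape_context_distance(C_PQ):
    """Symmetric mean of best-match costs; C_PQ[i, j] = C(p_i, T(q_j))."""
    return C_PQ.min(axis=1).mean() + C_PQ.min(axis=0).mean()
```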
Appearance cost: After establishing image correspondences and properly warping one image to match the other, one can define an appearance cost as the sum of squared brightness differences in Gaussian windows around corresponding image points:
D_{ac}(P, Q) = \frac{1}{n} \sum_{i=1}^{n} \sum_{\Delta \in Z^2} G(\Delta) \left[I_P(p_i + \Delta) - I_Q\left(T(q_{\pi(i)}) + \Delta\right)\right]^2
where I_P and I_Q are the gray-level images (I_Q being the image after warping) and G is a Gaussian windowing function.
Transformation cost: the final cost D_{be}(P, Q) measures how much transformation is needed to bring the two shapes into alignment; in the case of TPS, it is assigned to be the bending energy.
Now that we have a way of calculating the distance between two shapes, we can use a nearest neighbor classifier (k-NN) with distance defined as the shape distance calculated here. The results of applying this to different situations are given in the following section.
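As a sketch of this last step, scikit-learn's k-NN classifier accepts precomputed distances directly (D_train, D_test, and y_train are assumed to hold pairwise shape distances and class labels, not values defined in the original):

```python
from sklearn.neighbors import KNeighborsClassifier

# D_train: (n_train, n_train) shape distances between training shapes
# D_test:  (n_test, n_train) shape distances from test shapes to training shapes
knn = KNeighborsClassifier(n_neighbors=3, metric='precomputed')
knn.fit(D_train, y_train)        # y_train: class labels, e.g. digit identities
predicted = knn.predict(D_test)
```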
See main article: MNIST database. The authors Serge Belongie and Jitendra Malik tested their approach on the MNIST database. Currently, more than 50 algorithms have been tested on the database. The database has a training set of 60,000 examples, and a test set of 10,000 examples. The error rate for this approach was 0.63% using 20,000 training examples and 3-NN. At the time of publication, this error rate was the lowest. Currently, the lowest error rate is 0.18%.[10]
The authors experimented with the MPEG-7 shape silhouette database, performing Core Experiment CE-Shape-1 part B, which measures performance of similarity-based retrieval.[11] The database has 70 shape categories and 20 images per shape category. Performance of a retrieval scheme is tested by using each image as a query and counting the number of correct images in the top 40 matches. For this experiment, the authors increased the number of points sampled from each shape. Also, since the shapes in the database are sometimes rotated or flipped, the authors defined the distance between a reference shape and a query shape to be the minimum shape distance between the query shape and either the unchanged reference, the vertically flipped reference, or the horizontally flipped reference.[1] [2] [3] [4] With these changes, they obtained a retrieval rate of 76.45%, which in 2002 was the best reported.
The next experiment performed on shape contexts involved the 20 common household objects in the Columbia Object Image Library (COIL-20). Each object has 72 views in the database. In the experiment, the method was trained on a number of equally spaced views for each object and the remaining views were used for testing. A 1-NN classifier was used. The authors also developed an editing algorithm based on shape context similarity and k-medoid clustering that improved on their performance.[4]
Shape contexts were used to retrieve the closest matching trademarks from a database to a query trademark (useful in detecting trademark infringement). No visually similar trademark was missed by the algorithm (verified manually by the authors).[2]