The constellation model is a probabilistic, generative model for category-level object recognition in computer vision. Like other part-based models, the constellation model attempts to represent an object class by a set of N parts under mutual geometric constraints. Because it considers the geometric relationship between different parts, the constellation model differs significantly from appearance-only, or "bag-of-words" representation models, which explicitly disregard the location of image features.
The problem of defining a generative model for object recognition is difficult. The task becomes significantly complicated by factors such as background clutter, occlusion, and variations in viewpoint, illumination, and scale. Ideally, we would like the particular representation we choose to be robust to as many of these factors as possible.
In category-level recognition, the problem is even more challenging because of the fundamental problem of intra-class variation. Even if two objects belong to the same visual category, their appearances may be significantly different. However, for structured objects such as cars, bicycles, and people, separate instances of objects from the same category are subject to similar geometric constraints. For this reason, particular parts of an object such as the headlights or tires of a car still have consistent appearances and relative positions. The Constellation Model takes advantage of this fact by explicitly modeling the relative location, relative scale, and appearance of these parts for a particular object category. Model parameters are estimated using an unsupervised learning algorithm, meaning that the visual concept of an object class can be extracted from an unlabeled set of training images, even if that set contains "junk" images or instances of objects from multiple categories. It can also account for the absence of model parts due to appearance variability, occlusion, clutter, or detector error.
The idea for a "parts and structure" model was originally introduced by Fischler and Elschlager in 1973.[1] This model has since been built upon and extended in many directions. The Constellation Model, as introduced by Pietro Perona and his colleagues, was a probabilistic adaptation of this approach.
In the late '90s, Burl et al.[2] [3] [4] [5] revisited the Fischler and Elschlager model for the purpose of face recognition. In their work, Burl et al. used manual selection of constellation parts in training images to construct a statistical model for a set of detectors and the relative locations at which they should be applied. In 2000, Weber et al. [6] [7] [8] [9] made the significant step of training the model using a more unsupervised learning process, which precluded the necessity for tedious hand-labeling of parts. Their algorithm was particularly remarkable because it performed well even on cluttered and occluded image data. Fergus et al.[10] [11] then improved upon this model by making the learning step fully unsupervised, having both shape and appearance learned simultaneously, and accounting explicitly for the relative scale of parts.
In the first step, a standard interest point detection method, such as Harris corner detection, is used to generate interest points. Image features generated from the vicinity of these points are then clustered using k-means or another appropriate algorithm. In this process of vector quantization, one can think of the centroids of these clusters as being representative of the appearance of distinctive object parts. Appropriate feature detectors are then trained using these clusters, which can be used to obtain a set of candidate parts from images.
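As a rough illustration of this step, the following is a minimal sketch assuming grayscale images stored as NumPy arrays, using OpenCV's Harris detector and scikit-learn's k-means. The patch size, point counts, and number of clusters are illustrative choices, not values from the original papers.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def candidate_patches(image, patch_radius=5, max_points=50):
    """Detect Harris corners and cut a square patch around each of the
    strongest responses."""
    response = cv2.cornerHarris(np.float32(image), blockSize=3, ksize=3, k=0.04)
    order = np.argsort(response.ravel())[::-1]          # strongest first
    ys, xs = np.unravel_index(order, response.shape)
    patches = []
    for y, x in zip(ys, xs):
        if len(patches) == max_points:
            break
        patch = image[y - patch_radius:y + patch_radius + 1,
                      x - patch_radius:x + patch_radius + 1]
        if patch.shape == (2 * patch_radius + 1, 2 * patch_radius + 1):
            patches.append(patch.ravel())               # skip border points
    return patches

def build_part_vocabulary(images, n_types=100):
    """Cluster patches from all training images; the cluster centroids act
    as appearance prototypes for candidate object parts."""
    all_patches = [p for img in images for p in candidate_patches(img)]
    kmeans = KMeans(n_clusters=n_types, n_init=10).fit(np.array(all_patches))
    return kmeans            # kmeans.cluster_centers_: the "part types"
```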
As a result of this process, each image can now be represented as a set of parts. Each part has a type, corresponding to one of the aforementioned appearance clusters, as well as a location in the image space.
Weber & Welling here introduce the concept of foreground and background. Foreground parts correspond to an instance of a target object class, whereas background parts correspond to background clutter or false detections.
Let T be the number of different types of parts. The positions of all parts extracted from an image can then be represented in the following "matrix,"

X^o = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1N_1} \\ x_{21} & x_{22} & \cdots & x_{2N_2} \\ \vdots & & & \vdots \\ x_{T1} & x_{T2} & \cdots & x_{TN_T} \end{pmatrix}

where N_i represents the number of parts of type i \in \{1, \dots, T\} observed in the image. The superscript o indicates that these positions are observable, as opposed to missing. The positions of unobserved object parts can be represented by the vector x^m. Suppose that the object will be composed of F distinct foreground parts. For notational simplicity, we assume here that F = T, although the model can be generalized to F > T. A hypothesis h is then defined as a set of indices, with h_i = j indicating that point x_{ij} is a foreground point in X^o. The generative probabilistic model is defined through the joint probability density p(X^o, x^m, h).
The rest of this section summarizes the details of Weber & Welling's model for a single component model. The formulas for multiple component models are extensions of those described here.
To parametrize the joint probability density, Weber & Welling introduce the auxiliary variables b and n, where b is a binary vector whose entries b_i = 1 if h_i > 0 (part i has a foreground candidate) and b_i = 0 otherwise, and n is a vector whose entries n_i give the number of background candidates included in the ith row of X^o. Since b and n are completely determined by h and the size of X^o, we have p(X^o, x^m, h) = p(X^o, x^m, h, n, b). The joint density can be decomposed as

p(X^o, x^m, h, n, b) = p(X^o, x^m | h, n, b) \, p(h | n, b) \, p(n) \, p(b)
The probability density over the number of background detections can be modeled by a Poisson distribution,

p(n) = \prod_{i=1}^{T} \frac{1}{n_i!} (M_i)^{n_i} e^{-M_i}

where M_i is the average number of background detections of type i per image.
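Because the density factors over part types, it is straightforward to evaluate numerically. A small sketch using SciPy, with illustrative means M_i rather than values from the paper:

```python
from scipy.stats import poisson

M = [2.0, 0.5, 1.0]    # assumed average background detections per part type
n = [1, 0, 2]          # background counts observed in one image

p_n = 1.0
for n_i, M_i in zip(n, M):
    p_n *= poisson.pmf(n_i, M_i)   # (M_i ** n_i / n_i!) * exp(-M_i)
print(p_n)             # p(n) for this image under the background model
```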
Depending on the number of parts F, the probability p(b) can be modeled either as an explicit table of length 2^F or, if F is large, as F independent probabilities, each governing the presence of an individual part.
The density p(h | n, b) is modeled by

p(h | n, b) = \begin{cases} \dfrac{1}{\prod_{f=1}^{F} N_f^{b_f}}, & \text{if } h \in H(b, n) \\ 0, & \text{for other } h \end{cases}

where H(b, n) denotes the set of all hypotheses consistent with b and n, and N_f denotes the total number of detections of parts of type f. In other words, all consistent hypotheses, of which there are \prod_{f=1}^{F} N_f^{b_f}, are equally likely.
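The normalizing count in this density is simple to compute. A tiny illustrative sketch, with assumed variable names:

```python
import numpy as np

def hypothesis_prior(b, N):
    """p(h | n, b) for any consistent h: one over the number of hypotheses
    consistent with b, i.e. prod_f N_f ** b_f."""
    b = np.asarray(b)
    N = np.asarray(N, dtype=float)
    return 1.0 / np.prod(N ** b)

print(hypothesis_prior(b=[1, 1, 0], N=[4, 3, 5]))   # 1 / (4 * 3) ~ 0.083
```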
And finally,

p(X^o, x^m | h, n) = p_{fg}(z) \, p_{bg}(x_{bg})

where z = (x^o x^m) represents the coordinates of all foreground detections, observed and missing, and x_{bg} represents the coordinates of the background detections. Note that foreground detections are assumed to be independent of the background. p_{fg}(z) is modeled as a joint Gaussian with mean \mu and covariance \Sigma.
The ultimate objective of this model is to classify images into the classes "object present" (class C_1) and "object absent" (class C_0), given the observation X^o. The decision is made by considering the ratio

\frac{p(C_1 | X^o)}{p(C_0 | X^o)} \propto \frac{\sum_h p(X^o, h | C_1)}{p(X^o, h_0 | C_0)}

where h_0 denotes the null hypothesis, which explains all parts as background noise. An image is classified as containing an instance of the object if this ratio exceeds a certain threshold.
After the preliminary step of interest point detection, feature generation and clustering, we have a large set of candidate parts over the training images. To learn the model, Weber & Welling first perform a greedy search over possible model configurations, or equivalently, over potential subsets of the candidate parts. This is done in an iterative fashion, starting with random selection. At subsequent iterations, parts in the model are randomly substituted, the model parameters are estimated, and the performance is assessed. The process is complete when further model performance improvements are no longer possible.
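A schematic of this greedy search might look as follows. The train_and_score routine stands in for the EM parameter estimation and performance assessment described here; all names and the stopping rule are illustrative assumptions, not details from the paper.

```python
import random

def greedy_part_selection(candidate_parts, n_model_parts, train_and_score,
                          max_stale_iters=20):
    """Iteratively swap single parts in and out, keeping changes that
    improve the trained model's score."""
    model = random.sample(candidate_parts, n_model_parts)    # random start
    best_score = train_and_score(model)                      # fit via EM, assess
    stale = 0
    while stale < max_stale_iters:
        trial = list(model)
        trial[random.randrange(n_model_parts)] = random.choice(
            [p for p in candidate_parts if p not in trial])  # substitute a part
        score = train_and_score(trial)
        if score > best_score:
            model, best_score = trial, score                 # keep improvement
            stale = 0
        else:
            stale += 1    # count consecutive failures; stop when no progress
    return model, best_score
```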
At each iteration, the model parameters

\Theta = \{\mu, \Sigma, p(b), M\}

are estimated using expectation maximization. Here \mu and \Sigma are the mean and covariance of the foreground density p_{fg}(z), while p(b) governs part presence and M gives the average numbers of background detections per part type.
EM proceeds by maximizing the likelihood of the observed data,

L(X^o | \Theta) = \sum_{i=1}^{I} \log \sum_{h_i} \int p(X_i^o, x_i^m, h_i | \Theta) \, dx_i^m

with respect to the model parameters \Theta. Since this is difficult to achieve analytically, EM iteratively maximizes a sequence of cost functions,

Q(\tilde{\Theta} | \Theta) = \sum_{i=1}^{I} E[\log p(X_i^o, x_i^m, h_i | \tilde{\Theta})]
Taking the derivative of this with respect to the parameters and equating to zero produces the update rules:
\tilde{\mu} = \frac{1}{I} \sum_{i=1}^{I} E[z_i]

\tilde{\Sigma} = \frac{1}{I} \sum_{i=1}^{I} E[z_i z_i^T] - \tilde{\mu}\tilde{\mu}^T

\tilde{p}(\bar{b}) = \frac{1}{I} \sum_{i=1}^{I} E[\delta_{b,\bar{b}}]

\tilde{M} = \frac{1}{I} \sum_{i=1}^{I} E[n_i]
The update rules in the M-step are expressed in terms of sufficient statistics, E[z], E[zz^T], E[\delta_{b,\bar{b}}] and E[n], which are calculated in the E-step by considering the posterior density:

p(h_i, x_i^m | X_i^o, \Theta) = \frac{p(h_i, x_i^m, X_i^o | \Theta)}{\sum_{h_i \in H_b} \int p(h_i, x_i^m, X_i^o | \Theta) \, dx_i^m}
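Once the E-step has produced these expectations for each training image, the M-step updates above reduce to simple averages. A minimal sketch, with assumed array shapes:

```python
import numpy as np

def m_step(E_z, E_zzT, E_delta_b, E_n):
    """M-step updates, given E-step expectations per training image i:
    E_z       (I, d)     -- E[z_i], expected foreground coordinates
    E_zzT     (I, d, d)  -- E[z_i z_i^T], expected outer products
    E_delta_b (I, 2**F)  -- E[delta_{b, b-bar}] for each presence pattern
    E_n       (I, T)     -- E[n_i], expected background counts
    """
    mu = E_z.mean(axis=0)                            # mu~
    Sigma = E_zzT.mean(axis=0) - np.outer(mu, mu)    # Sigma~
    p_b = E_delta_b.mean(axis=0)                     # p~(b-bar)
    M = E_n.mean(axis=0)                             # M~
    return mu, Sigma, p_b, M
```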
In Weber et al., shape and appearance models are constructed separately. Once the set of candidate parts has been selected, shape is learned independently of appearance. The innovation of Fergus et al. is to learn not two, but three model parameters simultaneously: shape, appearance, and relative scale. Each of these parameters is represented by a Gaussian density.
Whereas the preliminary step in the Weber et al. method is to search for the locations of interest points, Fergus et al. use the detector of Kadir and Brady[12] to find salient regions in the image over both location (center) and scale (radius). Thus, in addition to location information
X, we now have scale information S and appearance information A.
Given a particular object class model with parameters \Theta, we must decide whether or not a new image contains an instance of that class. This is accomplished by making a Bayesian decision,

R = \frac{p(\text{Object} | X, S, A)}{p(\text{No object} | X, S, A)} = \frac{p(X, S, A | \text{Object}) \, p(\text{Object})}{p(X, S, A | \text{No object}) \, p(\text{No object})} \approx \frac{p(X, S, A | \Theta) \, p(\text{Object})}{p(X, S, A | \Theta_{bg}) \, p(\text{No object})}

where \Theta_{bg} is the background model. The ratio R is compared against a threshold T to determine object presence or absence.
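In practice this decision is conveniently evaluated in log space for numerical stability. A minimal sketch, with illustrative names and an assumed threshold:

```python
import math

def classify(log_lik_obj, log_lik_bg, log_prior_ratio=0.0, threshold=1.0):
    """log_lik_obj ~ log p(X,S,A | Theta); log_lik_bg ~ log p(X,S,A | Theta_bg).
    log_prior_ratio ~ log [p(Object) / p(No object)]."""
    log_R = log_lik_obj - log_lik_bg + log_prior_ratio
    return log_R > math.log(threshold)    # True => "object present"
```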
The likelihoods are factored as follows:
p(X, S, A | \Theta) = \sum_h p(X, S, A, h | \Theta) = \sum_h \underbrace{p(A | X, S, h, \Theta)}_{\text{Appearance}} \, \underbrace{p(X | S, h, \Theta)}_{\text{Shape}} \, \underbrace{p(S | h, \Theta)}_{\text{Rel. Scale}} \, \underbrace{p(h | \Theta)}_{\text{Other}}
Each part p has an appearance modeled by a Gaussian density in appearance space, with mean and covariance parameters \Theta_p^{app} = \{c_p, V_p\}, independent of the densities of the other parts. The background model has parameters \Theta_{bg}^{app} = \{c_{bg}, V_{bg}\}. Fergus et al. assume that, given detected features, the position and appearance of those features are independent, so that p(A | X, S, h, \Theta) = p(A | h, \Theta). Therefore,

\frac{p(A | X, S, h, \Theta)}{p(A | X, S, h, \Theta_{bg})} = \frac{p(A | h, \Theta)}{p(A | h, \Theta_{bg})} = \prod_{p=1}^{P} \left( \frac{G(A(h_p) | c_p, V_p)}{G(A(h_p) | c_{bg}, V_{bg})} \right)^{b_p}
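A sketch of evaluating this appearance factor with SciPy's multivariate normal; the array shapes and names are assumptions for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_appearance_ratio(A_h, b, c, V, c_bg, V_bg):
    """A_h: (P, d) appearance vectors selected by hypothesis h;
    b: (P,) binary occlusion states; c, V: per-part foreground Gaussian
    means/covariances; c_bg, V_bg: shared background Gaussian parameters."""
    total = 0.0
    for p in range(len(b)):
        if b[p]:    # occluded parts contribute a factor of 1 (0 in log space)
            total += (multivariate_normal.logpdf(A_h[p], mean=c[p], cov=V[p])
                      - multivariate_normal.logpdf(A_h[p], mean=c_bg, cov=V_bg))
    return total
```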
Recall from Weber et al. that h is the hypothesis for the indices of foreground parts, and b is the binary vector giving the occlusion state of each part in the hypothesis.
Shape is represented by a joint Gaussian density of part locations within a particular hypothesis, after those parts have been transformed into a scale-invariant space. This transformation precludes the need to perform an exhaustive search over scale. The Gaussian density has parameters
\Theta^{shape} = \{\mu, \Sigma\}. The background model \Theta_{bg} is assumed to be a uniform distribution over the image, which has area \alpha. Letting f be the number of foreground parts,

\frac{p(X | S, h, \Theta)}{p(X | S, h, \Theta_{bg})} = G(X(h) | \mu, \Sigma) \, \alpha^f
The scale of each part p relative to a reference frame is modeled by a Gaussian density with parameters \Theta^{scale} = \{t_p, U_p\}. Each part is assumed to be independent of the other parts. The background model assumes a uniform distribution over scale, within a range r, so that

\frac{p(S | h, \Theta)}{p(S | h, \Theta_{bg})} = \prod_{p=1}^{P} G(S(h_p) | t_p, U_p)^{d_p} \, r^f
Finally, the hypothesis term factors as

\frac{p(h | \Theta)}{p(h | \Theta_{bg})} = \frac{p_{Poiss}(n | M)}{p_{Poiss}(N | M)} \, \frac{1}{\binom{N}{f}} \, p(b | \Theta)
The first factor models the number of features detected using a Poisson distribution, which has mean M. The second factor serves as a "book-keeping" factor for the hypothesis variable. The last factor is a probability table for all possible occlusion patterns.
The task of learning the model parameters \Theta = \{\mu, \Sigma, c, V, M, p(b | \Theta), t, U\} is accomplished by expectation maximization, in a manner similar to that of Weber et al.
The Constellation Model as conceived by Fergus et al. achieves successful categorization rates consistently above 90% on large datasets of motorbikes, faces, airplanes, and spotted cats.[13] For each of these datasets, the Constellation Model is able to capture the "essence" of the object class in terms of appearance and/or shape. For example, face and motorbike datasets generate very tight shape models because objects in those categories have very well-defined structure, whereas spotted cats vary significantly in pose, but have a very distinctive spotted appearance. Thus, the model succeeds in both cases. It is important to note that the Constellation Model does not generally account for significant changes in orientation. Thus, if the model is trained on images of horizontal airplanes, it will not perform well on, for instance, images of vertically oriented planes unless the model is extended to account for this sort of rotation explicitly.
In terms of computational complexity, the Constellation Model is very expensive. If N is the number of feature detections in the image, and P the number of parts in the object model, then the hypothesis space H has size O(N^P). Because the computation of sufficient statistics in the E-step of expectation maximization requires evaluating the likelihood of every hypothesis, learning becomes a major bottleneck. For this reason, only values of P \le 6 have been used in practical applications, and the number of detections N is usually kept in the range of about 20-30 per image.
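A quick numeric illustration of why P must stay small:

```python
# Number of hypotheses that the E-step must score, per image
for P in (3, 6, 7):
    for N in (20, 30):
        print(f"P={P}, N={N}: ~{float(N)**P:.2e} hypotheses")
# Already at P=6, N=30 there are ~7.29e+08 hypotheses per image.
```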
One variation that attempts to reduce complexity is the star model proposed by Fergus et al.[14] The reduced dependencies of this model allow for learning in O(N^2 P) time instead of O(N^P), which permits a greater number of model parts and image features to be used in training.