Category utility is a measure of "category goodness". It attempts to maximize both the probability that two objects in the same category have attribute values in common and the probability that objects from different categories have different attribute values. It was intended to supersede more limited measures of category goodness such as "cue validity" and the "collocation index". It provides a normative, information-theoretic measure of the predictive advantage gained by an observer who possesses knowledge of the given category structure (i.e., the class labels of instances) over an observer who does not. In this sense the motivation for the category utility measure is similar to that of the information gain metric used in decision tree learning. In certain presentations, it is also formally equivalent to the mutual information, as discussed below.
The probability-theoretic definition of category utility is as follows:
CU(C,F) = \tfrac{1}{p} \sum_{c_j \in C} p(c_j) \left[ \sum_{f_i \in F} \sum_{k=1}^{m} p(f_{ik}|c_j)^2 - \sum_{f_i \in F} \sum_{k=1}^{m} p(f_{ik})^2 \right]
where F = \{f_i\},\ i = 1 \ldots n is a set of n features, each taking one of m values, C = \{c_j\},\ j = 1 \ldots p is a set of p categories, p(f_{ik}) is the probability that feature f_i takes value k, and p(f_{ik}|c_j) is the conditional probability that feature f_i takes value k given that the object belongs to category c_j.
The motivation and development of this expression for category utility, and the role of the multiplicand \tfrac{1}{p} as a crude overfitting control, are given in the sources above. Loosely, the term p(c_j) \textstyle\sum_{f_i \in F} \sum_{k=1}^{m} p(f_{ik}|c_j)^2 is the expected number of feature values that can be correctly guessed by an observer using a probability-matching strategy together with knowledge of the category labels, while p(c_j) \textstyle\sum_{f_i \in F} \sum_{k=1}^{m} p(f_{ik})^2 is the expected number of feature values that can be correctly guessed by an observer using the same strategy but without knowledge of the category labels. Their difference therefore reflects the relative advantage that knowledge of the category structure confers on the observer.
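To make the definition concrete, here is a minimal Python sketch (the `category_utility` helper and the toy data are illustrative assumptions, not from the sources above) that estimates the probabilities from co-occurrence counts and evaluates the expression:

```python
from collections import Counter

def category_utility(data, labels):
    """Probability-theoretic category utility, estimated from counts.

    `data` is a list of equal-length tuples of nominal feature values;
    `labels` gives each tuple's category.  (Illustrative helper.)
    """
    N = len(data)
    n = len(data[0])                      # number of features
    cats = Counter(labels)                # category sizes

    # Unconditional term: sum over features i and values k of p(f_ik)^2.
    uncond = 0.0
    for i in range(n):
        for cnt in Counter(row[i] for row in data).values():
            uncond += (cnt / N) ** 2

    # Weighted difference: p(c_j) * [sum of p(f_ik|c_j)^2 - uncond].
    cu = 0.0
    for c, size in cats.items():
        members = [row for row, lab in zip(data, labels) if lab == c]
        cond = 0.0
        for i in range(n):
            for cnt in Counter(row[i] for row in members).values():
                cond += (cnt / size) ** 2
        cu += (size / N) * (cond - uncond)

    return cu / len(cats)                 # divide by p, the number of categories

# Two pure clusters over two binary features: maximal predictive advantage.
print(category_utility([(0, 0), (0, 0), (1, 1), (1, 1)], ['a', 'a', 'b', 'b']))  # -> 0.5
```

With a single category (no class information), the conditional and unconditional terms coincide and the utility is zero, matching the intuition that an uninformative partition confers no predictive advantage.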
The information-theoretic definition of category utility for a set of entities with a size-n binary feature set F = \{f_i\},\ i = 1 \ldots n, and a binary category C = \{c,\bar{c}\} is as follows:

CU(C,F) = \left[ p(c) \sum_{i=1}^{n} p(f_i|c) \log p(f_i|c) + p(\bar{c}) \sum_{i=1}^{n} p(f_i|\bar{c}) \log p(f_i|\bar{c}) \right] - \sum_{i=1}^{n} p(f_i) \log p(f_i)
where p(c) is the prior probability of an entity belonging to the positive category c (in the absence of any feature information), p(f_i|c) is the conditional probability of an entity having feature f_i given that the entity belongs to category c, p(f_i|\bar{c}) is likewise the conditional probability of an entity having feature f_i given that it belongs to category \bar{c}, and p(f_i) is the prior probability of an entity possessing feature f_i (in the absence of any category information).
The intuition behind the above expression is as follows: the term p(c) \textstyle\sum_{i=1}^{n} p(f_i|c) \log p(f_i|c) represents the cost of encoding (or transmitting) feature information when it is known that the objects to be described belong to category c, and similarly the term p(\bar{c}) \textstyle\sum_{i=1}^{n} p(f_i|\bar{c}) \log p(f_i|\bar{c}) represents that cost when the objects are known to belong to category \bar{c}. The sum of these two terms in the brackets is therefore the expected encoding cost given knowledge of category membership. The second term in the difference, \textstyle\sum_{i=1}^{n} p(f_i) \log p(f_i), represents the corresponding cost when no category information is available. The category utility thus measures the reduction in encoding cost afforded by knowledge of the category structure.
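A minimal sketch of this binary-category formula, assuming the probabilities are supplied directly rather than estimated from data (the helper name `cu_info` and the base-2 logarithm are assumptions for illustration):

```python
import math

def cu_info(p_c, p_f_given_c, p_f_given_notc):
    """Information-theoretic category utility for a binary category.

    p_c: prior probability of the positive category c.
    p_f_given_c / p_f_given_notc: per-feature conditional probabilities
    p(f_i|c) and p(f_i|c-bar).  Logs are base 2 (bits).  (Illustrative.)
    """
    p_notc = 1.0 - p_c
    # Marginal feature probabilities p(f_i) by total probability.
    p_f = [p_c * a + p_notc * b for a, b in zip(p_f_given_c, p_f_given_notc)]

    def plogp(p):                     # convention: 0 log 0 = 0
        return p * math.log2(p) if p > 0 else 0.0

    within = (p_c * sum(plogp(a) for a in p_f_given_c)
              + p_notc * sum(plogp(b) for b in p_f_given_notc))
    baseline = sum(plogp(p) for p in p_f)
    return within - baseline

# A feature perfectly predicted by the category yields positive utility;
# a feature independent of the category yields zero.
print(cu_info(0.5, [1.0], [0.0]))   # -> 0.5 (bits)
print(cu_info(0.5, [0.5], [0.5]))   # -> 0.0
```

The second call illustrates the encoding-cost intuition above: when p(f_i|c) = p(f_i|\bar{c}) = p(f_i), category knowledge saves nothing.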
The original sources note that the category utility is equivalent to the mutual information. Here is a simple demonstration of the nature of this equivalence. Assume a set of entities each having the same n features, i.e., feature set F = \{f_i\},\ i = 1 \ldots n, with each feature variable having cardinality m; that is, each feature adopts one of m possible distinct values (which need not be ordered; for m = 2 the features are binary). For this demonstration, the feature set F is mapped onto a single aggregate variable F_a with cardinality m^n, which adopts a unique value v_i,\ i = 1 \ldots m^n, for each combination of feature values in the Cartesian product \otimes F. Thus p(F_a = v_i) = p(v_i) is the probability of the aggregate variable F_a adopting the value v_i.
For this demonstration, also assume a single category variable C with cardinality p, i.e., a classification scheme with p non-overlapping categories; p = 2 recovers the binary-category case above. From the standard definition for discrete variables, the mutual information I(F_a;C) between the aggregate variable F_a and the category variable C is given by:
I(F_a;C) = \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i,c_j) \log \frac{p(v_i,c_j)}{p(v_i)\,p(c_j)}
where p(v_i) is the marginal probability of the aggregate variable F_a adopting the value v_i, p(c_j) is the marginal probability of the category variable C adopting the value c_j, and p(v_i,c_j) is their joint probability. Rewriting the joint probability in terms of the conditional probability p(v_i|c_j), this becomes:
\begin{align}
I(F_a;C) &= \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i,c_j) \log \frac{p(v_i|c_j)}{p(v_i)} \\
&= \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \left[ \log p(v_i|c_j) - \log p(v_i) \right] \\
&= \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i|c_j) - \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i) \\
&= \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i|c_j) - \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i,c_j) \log p(v_i) \\
&= \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i|c_j) - \sum_{v_i \in F_a} \log p(v_i) \sum_{c_j \in C} p(v_i,c_j) \\
&= \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i|c_j) - \sum_{v_i \in F_a} p(v_i) \log p(v_i)
\end{align}
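The last step of this derivation can be spot-checked numerically. The following Python sketch (the joint distribution is an assumed, illustrative example) confirms that the defining formula for I(F_a;C) agrees with the final line of the derivation:

```python
import math

# Assumed joint distribution p(v_i, c_j) over a 4-value aggregate
# variable F_a and a binary category C (illustrative numbers only).
joint = [[0.20, 0.05],
         [0.15, 0.10],
         [0.05, 0.20],
         [0.10, 0.15]]

p_v = [sum(row) for row in joint]            # marginals p(v_i)
p_c = [sum(col) for col in zip(*joint)]      # marginals p(c_j)

# Defining formula: sum_ij p(v,c) log [ p(v,c) / (p(v) p(c)) ].
mi = sum(joint[i][j] * math.log2(joint[i][j] / (p_v[i] * p_c[j]))
         for i in range(4) for j in range(2))

# Final line of the derivation:
#   sum_ij p(v|c) p(c) log p(v|c)  -  sum_i p(v) log p(v)
first = sum((joint[i][j] / p_c[j]) * p_c[j] * math.log2(joint[i][j] / p_c[j])
            for i in range(4) for j in range(2))
second = sum(p_v[i] * math.log2(p_v[i]) for i in range(4))

assert abs(mi - (first - second)) < 1e-9     # the two forms coincide
print(round(mi, 6))
```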
If the original definition of the category utility from above is rewritten with C = \{c,\bar{c}\}, the result is

CU(C,F) = \sum_{f_i \in F} \sum_{c_j \in C} p(f_i|c_j)\,p(c_j) \log p(f_i|c_j) - \sum_{f_i \in F} p(f_i) \log p(f_i)
This equation clearly has the same form as the final equation above expressing the mutual information between the feature set and the category variable. The difference is that the sum \textstyle\sum_{f_i \in F} in the category utility runs over the n independent binary variables F = \{f_i\},\ i = 1 \ldots n, whereas the sum \textstyle\sum_{v_i \in F_a} in the mutual information runs over the m^n values of the single aggregate variable F_a. The two measures are therefore equivalent only when the features \{f_i\} are independent (and assuming that the terms corresponding to p(\bar{f_i}) are also included in the sum).
Like the mutual information, the category utility is not sensitive to any ordering in the feature or category variable values. That is, as far as the category utility is concerned, the category set {small, medium, large, jumbo} is not qualitatively different from the category set {desk, fish, tree, mop}, since the formulation of the category utility does not account for any ordering of the class variable. Similarly, a feature variable adopting values {1, 2, 3, 4, 5} is not qualitatively different from a feature variable adopting values {fred, joe, bob, sue, elaine}. As far as the category utility or mutual information are concerned, all category and feature variables are nominal variables. For this reason, category utility does not reflect any gestalt aspects of "category goodness" that might be based on such ordering effects. One possible adjustment for this insensitivity to ordinality is the weighting scheme described in the article on mutual information.
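The insensitivity to value ordering can be demonstrated with a short Python check (an illustrative sketch; the `mutual_information` helper and the toy data are assumptions): relabeling the ordered values with arbitrary nominal ones leaves the mutual information unchanged.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (bits) between two nominal sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

sizes = ['small', 'small', 'medium', 'large', 'jumbo', 'jumbo']
cats  = ['a', 'a', 'a', 'b', 'b', 'b']

# Replace the ordered values with unordered nominal labels.
relabel = {'small': 'desk', 'medium': 'fish', 'large': 'tree', 'jumbo': 'mop'}
nominal = [relabel[s] for s in sizes]

# Only the co-occurrence counts matter, not the labels or their order.
assert abs(mutual_information(sizes, cats)
           - mutual_information(nominal, cats)) < 1e-12
```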
This section provides some background on the origins of, and need for, formal measures of "category goodness" such as the category utility, and some of the history that led to the development of this particular metric.
At least since the time of Aristotle there has been a tremendous fascination in philosophy with the nature of concepts and universals. What kind of entity is a concept such as "horse"? Such abstractions do not designate any particular individual in the world, and yet we can scarcely imagine being able to comprehend the world without their use. Does the concept "horse" therefore have an independent existence outside of the mind? If it does, then what is the locus of this independent existence? The question of locus was an important issue on which the classical schools of Plato and Aristotle famously differed. However, they remained in agreement that universals did indeed have a mind-independent existence. There was, therefore, always a fact of the matter about which concepts and universals exist in the world.
In the late Middle Ages (perhaps beginning with Occam, although Porphyry also makes a much earlier remark indicating a certain discomfort with the status quo), however, the certainty that existed on this issue began to erode, and it became acceptable among the so-called nominalists and empiricists to consider concepts and universals as strictly mental entities or conventions of language. On this view of concepts—that they are purely representational constructs—a new question then comes to the fore: "Why do we possess one set of concepts rather than another?" What makes one set of concepts "good" and another set of concepts "bad"? This is a question that modern philosophers, and subsequently machine learning theorists and cognitive scientists, have struggled with for many decades.
One approach to answering such questions is to investigate the "role" or "purpose" of concepts in cognition. Thus the answer to "What are concepts good for in the first place?" offered by many authors is that classification (conception) is a precursor to induction: by imposing a particular categorization on the universe, an organism gains the ability to deal with physically non-identical objects or situations in an identical fashion, thereby gaining substantial predictive leverage. J.S. Mill made essentially this point.
From this base, Mill reached a conclusion that foreshadows much subsequent thinking about category goodness, including the notion of category utility.
One may compare this to the "category utility hypothesis": "A category is useful to the extent that it can be expected to improve the ability of a person to accurately predict the features of instances of that category." Mill here seems to be suggesting that the best category structure is one in which object features (properties) are maximally informative about the object's class, and, simultaneously, the object class is maximally informative about the object's features. In other words, a useful classification scheme is one in which category knowledge can be used to accurately infer object properties, and property knowledge can be used to accurately infer object classes. One may also compare this idea to Aristotle's criterion of counter-predication for definitional predicates, as well as to the notion of concepts described in formal concept analysis.
A variety of different measures have been suggested with the aim of formally capturing this notion of "category goodness," the best known of which is probably the "cue validity." The cue validity of a feature f_i with respect to a category c_j is defined as the conditional probability of the category given the feature, p(c_j|f_i), or sometimes as the deviation of this conditional probability from the category base rate, p(c_j|f_i) - p(c_j). These measures quantify inference only in the direction from features to categories; they take no account of the converse "category validity" p(f_i|c_j), the informativeness of the category about the features.
One attempt to address both problems, by simultaneously maximizing both feature validity and category validity, was to define the "collocation index" as the product p(c_j|f_i)\,p(f_i|c_j).
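As an illustration (the helper name and the numbers are assumptions, not from the sources), both conditional probabilities in the collocation index can be obtained from the joint probability and the two marginals:

```python
def collocation_index(p_joint, p_f, p_c):
    """Collocation index p(c|f) * p(f|c), computed from the joint
    probability p(f, c) and the marginals p(f) and p(c).  (Illustrative.)"""
    cue_validity = p_joint / p_f        # p(c | f)
    category_validity = p_joint / p_c   # p(f | c)
    return cue_validity * category_validity

# Assumed example: a feature present in 40% of objects, a category
# covering 30%, co-occurring 25% of the time.
print(round(collocation_index(0.25, 0.40, 0.30), 4))
```

The index reaches its maximum of 1 only when feature and category coincide exactly, i.e., when each perfectly predicts the other.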