CoBoost is a semi-supervised training algorithm proposed by Collins and Singer in 1999.[1] The original application for the algorithm was the task of named-entity recognition using very weak learners, but it can be used for performing semi-supervised learning in cases where data features may be redundant.
It may be seen as a combination of co-training and boosting. Each example is available in two views (subsets of the feature set), and boosting is applied iteratively while alternating between the views, using the labels predicted by the other view on the previous iteration as pseudo-labels for the unlabeled data. CoBoosting is not a valid boosting algorithm in the PAC learning sense.
CoBoosting was an attempt by Collins and Singer to improve on previous attempts to leverage redundancy in features for training classifiers in a semi-supervised fashion. CoTraining, a seminal work by Blum and Mitchell, was shown to be a powerful framework for learning classifiers given a small number of seed examples by iteratively inducing rules in a decision list. The advantage of CoBoosting over CoTraining is that it generalizes the co-training pattern so that it can be used with any classifier. CoBoosting accomplishes this by borrowing concepts from AdaBoost.
In both CoTraining and CoBoosting the training and testing example sets must satisfy two properties. The first is that the feature space of the examples can be separated into two feature spaces (or views) such that each view is sufficiently expressive for classification. Formally, there exist two functions $f_1(x_1)$ and $f_2(x_2)$ such that for each example $x=(x_1,x_2)$,

$f_1(x_1)=f_2(x_2)=f(x)$

In the original named-entity task, for example, one view consists of the spelling of the candidate phrase itself and the other of its surrounding context. The second property is that the two views should not be highly correlated with one another.
Input: $\{(x_{1,i},x_{2,i})\}_{i=1}^{n}$, $\{y_i\}_{i=1}^{m}$

Initialize: $\forall i,j: g_j^0(\boldsymbol{x}_{j,i})=0$.

For $t=1,\ldots,T$ and for $j=1,2$:

Set pseudo-labels:

$\hat{y}_i=\begin{cases}y_i, & 1\le i\le m\\ \operatorname{sign}\left(g_{3-j}^{t-1}(\boldsymbol{x}_{3-j,i})\right), & m<i\le n\end{cases}$

Set virtual distribution:

$D_t^j(i)=\frac{1}{Z_t^j}e^{-\hat{y}_i g_j^{t-1}(\boldsymbol{x}_{j,i})}$

where $Z_t^j=\sum_{i=1}^{n}e^{-\hat{y}_i g_j^{t-1}(\boldsymbol{x}_{j,i})}$

Find the weak hypothesis $h_t^j$ that minimizes the expanded training error.

Choose a value for $\alpha_t$ that minimizes the expanded training error.

Update the value for the current strong non-thresholded classifier:

$\forall i: g_j^t(\boldsymbol{x}_{j,i})=g_j^{t-1}(\boldsymbol{x}_{j,i})+\alpha_t h_t^j(\boldsymbol{x}_{j,i})$

The final strong classifier output is

$f(\boldsymbol{x})=\operatorname{sign}\left(\sum_{j}g_j^T(\boldsymbol{x}_j)\right)$
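As an illustration of the two "Set ..." steps, the bookkeeping for a single view $j$ might be written as in the following sketch; the array names and the tie-breaking of $\operatorname{sign}(0)$ towards $+1$ are assumptions, not part of the original algorithm.

```python
import numpy as np

def pseudo_labels_and_distribution(g_j, g_other, y_labeled):
    # g_j:       current scores g_j^{t-1}(x_{j,i}) of the view being updated, length n
    # g_other:   scores g_{3-j}^{t-1}(x_{3-j,i}) of the other view, length n
    # y_labeled: true labels in {+1, -1} for the first m (labeled) examples
    m = len(y_labeled)
    # Pseudo-labels: keep the true labels where available, otherwise take the
    # other view's predicted sign (sign(0) broken towards +1 here).
    y_hat = np.where(g_other >= 0, 1.0, -1.0)
    y_hat[:m] = y_labeled
    # Virtual distribution: exponential weight on examples the current view gets
    # wrong under the pseudo-labels, normalized by Z_t^j.
    w = np.exp(-y_hat * g_j)
    return y_hat, w / w.sum()
```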
CoBoosting builds on the AdaBoost algorithm, which gives CoBoosting its generality, since AdaBoost can be used in conjunction with many other learning algorithms. The derivation below assumes a two-class classification task, although it can be adapted to multi-class classification. In the AdaBoost framework, weak classifiers are generated in series, along with a distribution over the examples in the training set. Each weak classifier is given a weight, and the final strong classifier is defined as the sign of the weighted sum of the weak classifiers. (See the AdaBoost article for notation.) In this framework Schapire and Singer have shown that the training error is bounded by the following equation:
$\frac{1}{m}\sum_{i=1}^{m}e^{-y_i\left(\sum_{t}\alpha_t h_t(\boldsymbol{x}_i)\right)}=\prod_t Z_t$
Where $Z_t$ is the normalization factor that makes $D_{t+1}$ a distribution. Solving for $Z_t$ in terms of $D_t(i)$ gives:
$Z_t=\sum_{i:x_t\notin x_i}D_t(i)+\sum_{i:x_t\in x_i}D_t(i)e^{-y_i\alpha_t h_t(\boldsymbol{x}_i)}$
Where $x_t$ denotes the feature on which the current weak hypothesis $h_t$ is based; $h_t$ abstains (outputs 0) on examples that do not contain $x_t$. Three weights are defined:

$W_0=\sum_{i:h_t(x_i)=0}D_t(i)$

$W_+=\sum_{i:h_t(x_i)=y_i}D_t(i)$

$W_-=\sum_{i:h_t(x_i)=-y_i}D_t(i)$
Schapire and Singer have shown that the value of $Z_t$ (and hence the training error) is minimized by choosing $\alpha_t$ to be:

$\alpha_t=\frac{1}{2}\ln\left(\frac{W_+}{W_-}\right)$
This provides a confidence value for the current weak hypothesis based on the weight of correctly versus incorrectly classified examples under the current distribution; in practice the ratio is smoothed to compensate for cases in which $W_-$ (or $W_+$) is very small. Substituting this choice of $\alpha_t$ back into $Z_t=W_0+W_+e^{-\alpha_t}+W_-e^{\alpha_t}$, the form $Z_t$ takes for a weak hypothesis with outputs in $\{-1,0,+1\}$, gives:

$Z_t=W_0+2\sqrt{W_+W_-}$
The training error is thus minimized by selecting, at every iteration, the weak hypothesis that minimizes this quantity.
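The two quantities can be computed directly from the weights $W_0$, $W_+$ and $W_-$; a minimal sketch follows, in which the smoothing constant `eps` is an assumed value rather than the one used by Collins and Singer.

```python
import numpy as np

def alpha_and_z(w0, w_plus, w_minus, eps=1e-3):
    # Smoothed confidence weight 1/2 * ln(W+ / W-) and the value of Z_t at the
    # optimum, for an abstaining weak hypothesis; `eps` guards against W+ or W-
    # being zero.
    alpha = 0.5 * np.log((w_plus + eps) / (w_minus + eps))
    z = w0 + 2.0 * np.sqrt(w_plus * w_minus)
    return alpha, z
```

Here $Z_t$ is the quantity on which candidate weak hypotheses are compared, while the (smoothed) $\alpha_t$ is the weight actually used in the update.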
CoBoosting extends this framework to the case where one has a labeled training set (examples $1\ldots m$) and an unlabeled training set (examples $m+1\ldots n$), with each example satisfying the feature-redundancy condition $x_i=(x_{1,i},x_{2,i})$. The algorithm trains two classifiers, one per view, in the same fashion as AdaBoost: they must agree with the correct labels on the labeled training set, while agreement between the two classifiers is maximized on the unlabeled training set. The final classifier is the sign of the sum of the two strong classifiers. The bounded training error is extended as follows, with $Z_{CO}$ the extension of $Z_t$:
$Z_{CO}=\sum_{i=1}^{m}e^{-y_i g_1(\boldsymbol{x}_{1,i})}+\sum_{i=1}^{m}e^{-y_i g_2(\boldsymbol{x}_{2,i})}+\sum_{i=m+1}^{n}e^{-f_2(\boldsymbol{x}_{2,i})g_1(\boldsymbol{x}_{1,i})}+\sum_{i=m+1}^{n}e^{-f_1(\boldsymbol{x}_{1,i})g_2(\boldsymbol{x}_{2,i})}$
Where $g_j$ is the non-thresholded strong classifier for the $j$th view (the weighted sum of its weak hypotheses) and $f_j=\operatorname{sign}(g_j)$. At iteration $t$, when view $j$ is being updated, $g_j^{t-1}$ denotes the $j$th strong classifier built up through iteration $t-1$, and the pseudo-labels are set to:

$\hat{y}_i=\begin{cases}y_i, & 1\le i\le m\\ \operatorname{sign}\left(g_{3-j}^{t-1}(\boldsymbol{x}_{3-j,i})\right), & m<i\le n\end{cases}$

In which $3-j$ denotes the view other than the one currently being updated, so that each classifier supplies labels for the other.
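For concreteness, the $Z_{CO}$ objective can be evaluated directly from the current non-thresholded scores of the two views. The sketch below assumes numpy score arrays `g1` and `g2` over all $n$ examples and a label array for the first $m$ examples; the names are illustrative only.

```python
import numpy as np

def coboost_objective(g1, g2, y_labeled):
    # g1, g2:    scores g_1(x_{1,i}) and g_2(x_{2,i}) for all n examples
    # y_labeled: labels in {+1, -1} for the first m examples (the rest are unlabeled)
    m = len(y_labeled)
    y = np.asarray(y_labeled, dtype=float)
    f1 = np.where(g1 >= 0, 1.0, -1.0)   # thresholded classifiers f_j = sign(g_j)
    f2 = np.where(g2 >= 0, 1.0, -1.0)
    # Error terms on labeled data plus disagreement terms on unlabeled data.
    return (np.exp(-y * g1[:m]).sum()
            + np.exp(-y * g2[:m]).sum()
            + np.exp(-f2[m:] * g1[m:]).sum()
            + np.exp(-f1[m:] * g2[m:]).sum())
```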
$Z_{CO}$ can then be split into two terms, one per view, $Z_{CO}=Z_{CO}^1+Z_{CO}^2$, where:

$Z_{CO}^j=\sum_{i=1}^{n}e^{-\hat{y}_i\left(g_j^{t-1}(\boldsymbol{x}_{j,i})+\alpha_t h_t^j(\boldsymbol{x}_{j,i})\right)}$
The distribution over examples for each view $j$ at iteration $t$ is set to:

$D_t^j(i)=\frac{1}{Z_t^j}e^{-\hat{y}_i g_j^{t-1}(\boldsymbol{x}_{j,i})}$
At which point $Z_{CO}^j$ can be rewritten (up to the multiplicative constant $Z_t^j$, which does not depend on $\alpha_t$ or $h_t^j$) as:

$Z_{CO}^j=\sum_{i=1}^{n}D_t^j(i)\,e^{-\hat{y}_i\alpha_t h_t^j(\boldsymbol{x}_{j,i})}$
Which is identical to the expression minimized in AdaBoost. The same process can therefore be used to choose the weak hypothesis and the value of $\alpha_t$ for view $j$, simply substituting the pseudo-labels $\hat{y}_i$ and the virtual distribution $D_t^j$ for the true labels and distribution. By alternating between the two views and minimizing $Z_{CO}^1$ and $Z_{CO}^2$ in this fashion, $Z_{CO}$ is minimized in a greedy manner at each iteration.
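Putting the pieces together, one possible end-to-end sketch of CoBoost training is given below, assuming binary indicator features and single-feature weak hypotheses that abstain when their feature is absent (as in the named-entity setting); all function and variable names are illustrative assumptions.

```python
import numpy as np

def best_single_feature_hypothesis(X, y_hat, D, eps=1e-3):
    # AdaBoost-style selection of an abstaining single-feature weak hypothesis:
    # return the (feature, predicted label, alpha) minimizing Z = W0 + 2*sqrt(W+ W-).
    best, best_z = None, np.inf
    for f in range(X.shape[1]):
        fires = X[:, f] == 1
        w0 = D[~fires].sum()
        for label in (1.0, -1.0):
            w_plus = D[fires & (y_hat == label)].sum()
            w_minus = D[fires & (y_hat == -label)].sum()
            z = w0 + 2.0 * np.sqrt(w_plus * w_minus)
            if z < best_z:
                alpha = 0.5 * np.log((w_plus + eps) / (w_minus + eps))
                best, best_z = (f, label, alpha), z
    return best

def coboost_train(X1, X2, y_labeled, T=50):
    # X1, X2: (n, d_j) binary indicator matrices, one per view.
    # y_labeled: labels in {+1, -1} for the first m examples; the rest are unlabeled.
    n, m = X1.shape[0], len(y_labeled)
    views = [X1, X2]
    g = [np.zeros(n), np.zeros(n)]                    # g_1^0 = g_2^0 = 0
    for t in range(T):
        for j in (0, 1):                              # update view j using view 1-j
            # Pseudo-labels: true labels where known, otherwise the other view's sign.
            y_hat = np.where(g[1 - j] >= 0, 1.0, -1.0)
            y_hat[:m] = y_labeled
            # Virtual distribution D_t^j over all n examples.
            w = np.exp(-y_hat * g[j])
            D = w / w.sum()
            f, label, alpha = best_single_feature_hypothesis(views[j], y_hat, D)
            # Weak hypothesis output: `label` where feature f fires, 0 (abstain) otherwise.
            h = np.where(views[j][:, f] == 1, label, 0.0)
            g[j] = g[j] + alpha * h                   # g_j^t = g_j^{t-1} + alpha_t * h_t^j
    return g                                          # predict with sign(g[0] + g[1])
```

Each inner pass updates one view while the frozen scores of the other view supply the pseudo-labels, which is exactly the alternation described above.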