Bayes classifier explained

In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.[1]

Definition

Suppose a pair (X, Y) takes values in \mathbb{R}^d \times \{1, 2, \dots, K\}, where Y is the class label of an element whose features are given by X. Assume that the conditional distribution of X, given that the label Y takes the value r, is given by

(X \mid Y = r) \sim P_r \qquad \text{for } r = 1, 2, \dots, K,

where "\sim" means "is distributed as", and where P_r denotes a probability distribution.

A classifier is a rule that assigns to an observation X = x a guess or estimate of what the unobserved label Y = r actually was. In theoretical terms, a classifier is a measurable function C : \mathbb{R}^d \to \{1, 2, \dots, K\}, with the interpretation that C classifies the point x to the class C(x). The probability of misclassification, or risk, of a classifier C is defined as

\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.

The Bayes classifier is

C^\text{Bayes}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r \mid X = x).
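As an illustration, the following is a minimal Python sketch of this rule for a hypothetical two-class problem in which the priors P(Y = r) and the class-conditional densities P_r are assumed known (one-dimensional Gaussians chosen purely for the example). The classifier returns the class with the largest posterior, dropping the normalising constant that is common to all classes.

    import numpy as np

    # Hypothetical example: priors Pr(Y = r) and 1-D Gaussian class-conditional
    # densities P_r are assumed known.  The Bayes classifier returns the class r
    # maximising Pr(Y = r) * p_r(x), which is proportional to Pr(Y = r | X = x).

    def gaussian_pdf(x, mean, std):
        return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

    priors = {1: 0.6, 2: 0.4}                   # Pr(Y = r), assumed known
    densities = {1: (0.0, 1.0), 2: (2.0, 1.0)}  # (mean, std) of P_r, assumed known

    def bayes_classifier(x):
        # The normalising constant p(x) is the same for every class, so it is dropped.
        return max(priors, key=lambda r: priors[r] * gaussian_pdf(x, *densities[r]))

    print(bayes_classifier(0.3))  # -> 1
    print(bayes_classifier(1.7))  # -> 2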

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively, in this case \operatorname{P}(Y = r \mid X = x). The Bayes classifier is a useful benchmark in statistical classification.

The excess risk of a general classifier C (possibly depending on some training data) is defined as \mathcal{R}(C) - \mathcal{R}(C^\text{Bayes}). Thus this non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.[2]
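Because the excess risk compares an arbitrary classifier against the Bayes classifier, it can be estimated directly whenever the data-generating distribution is known. The sketch below (a made-up two-Gaussian problem with priors 0.6 and 0.4, chosen only for illustration) estimates the excess risk of an ad-hoc threshold rule by Monte Carlo; for equal-variance Gaussians the Bayes rule itself reduces to a threshold on x.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical distribution: Pr(Y=1) = 0.6, Pr(Y=2) = 0.4,
    # X | Y=1 ~ N(0, 1) and X | Y=2 ~ N(2, 1).

    def sample(n):
        y = rng.choice([1, 2], size=n, p=[0.6, 0.4])
        x = rng.normal(np.where(y == 1, 0.0, 2.0), 1.0)
        return x, y

    def risk(classify, n=200_000):
        x, y = sample(n)
        return np.mean(classify(x) != y)   # Monte Carlo estimate of Pr(C(X) != Y)

    # For these equal-variance Gaussians the Bayes rule is a threshold at 1 + log(0.6/0.4)/2.
    bayes_threshold = 1.0 + np.log(0.6 / 0.4) / 2.0
    bayes_rule = lambda x: np.where(x <= bayes_threshold, 1, 2)
    ad_hoc_rule = lambda x: np.where(x <= 0.5, 1, 2)   # some other classifier

    print(risk(ad_hoc_rule) - risk(bayes_rule))   # excess risk, non-negative up to MC error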

Considering the components x_i of x to be mutually independent conditionally on the class label, we get the naive Bayes classifier, where

C^\text{Bayes}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r) \prod_{i=1}^{d} P_r(x_i).
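A minimal sketch of this rule, assuming Gaussian per-feature class-conditional densities whose parameters are estimated from a small made-up training set (all names and numbers below are illustrative, not from the source), might look as follows. Working in log space avoids numerical underflow in the product over features.

    import numpy as np

    # Gaussian naive Bayes sketch: each class-conditional P_r is modelled as a
    # product of independent per-feature Gaussians whose means and variances are
    # estimated from (hypothetical) training data.

    def fit_naive_bayes(X, y):
        params = {}
        for r in np.unique(y):
            Xr = X[y == r]
            params[r] = (len(Xr) / len(X),          # prior Pr(Y = r)
                         Xr.mean(axis=0),           # per-feature means
                         Xr.var(axis=0) + 1e-9)     # per-feature variances (smoothed)
        return params

    def predict_naive_bayes(params, x):
        def log_posterior(r):
            prior, mean, var = params[r]
            # log Pr(Y = r) + sum_i log P_r(x_i)
            log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)
            return np.log(prior) + log_lik
        return max(params, key=log_posterior)

    # Usage with made-up data:
    X_train = np.array([[0.1, 1.0], [0.2, 0.8], [2.0, -1.0], [1.9, -0.9]])
    y_train = np.array([1, 1, 2, 2])
    params = fit_naive_bayes(X_train, y_train)
    print(predict_naive_bayes(params, np.array([0.0, 0.9])))   # -> 1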

Properties

A proof that the Bayes classifier is optimal and that the Bayes error rate is minimal proceeds as follows.

Define the variables: risk R(h), Bayes risk R^*, and the set of possible class labels \{0, 1\}. Let the posterior probability of a point belonging to class 1 be \eta(x) = \Pr(Y = 1 \mid X = x). Define the classifier h^* as

h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geqslant 0.5, \\ 0 & \text{otherwise.} \end{cases}

Then we have the following results:

(a) R(h^*) = R^*, i.e. h^* is a Bayes classifier;
(b) for any classifier h, the excess risk satisfies R(h) - R^* = 2\,\mathbb{E}_X\left[\,|\eta(X) - 0.5|\;\mathbb{I}_{\{h(X) \neq h^*(X)\}}\right];
(c) R^* = \mathbb{E}_X\left[\min(\eta(X), 1 - \eta(X))\right];
(d) R^* = \tfrac{1}{2} - \tfrac{1}{2}\,\mathbb{E}\left[\,|2\eta(X) - 1|\,\right].

Proof of (a): For any classifier h, we have

\begin{aligned}
R(h) &= \mathbb{E}_{XY}\left[\mathbb{I}_{\{h(X) \neq Y\}}\right] \\
&= \mathbb{E}_X \,\mathbb{E}_{Y \mid X}\left[\mathbb{I}_{\{h(X) \neq Y\}}\right] \\
&= \mathbb{E}_X\left[\eta(X)\,\mathbb{I}_{\{h(X) = 0\}} + (1 - \eta(X))\,\mathbb{I}_{\{h(X) = 1\}}\right],
\end{aligned}

where the second line was derived through Fubini's theorem.

Notice that R(h) is minimised by taking, for every x,

h(x) = \begin{cases} 1 & \text{if } \eta(x) \geqslant 1 - \eta(x), \\ 0 & \text{otherwise.} \end{cases}

Therefore the minimum possible risk is the Bayes risk, R^* = R(h^*).
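The pointwise decomposition used in this proof is easy to check numerically. The sketch below uses a made-up discrete feature taking three values with hypothetical posteriors η(x), and compares the formula with a direct simulation of Pr(h(X) ≠ Y) for an arbitrary classifier h.

    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up discrete example: X takes the values 0, 1, 2 with probabilities p_x,
    # and eta[x] = Pr(Y = 1 | X = x).
    p_x = np.array([0.5, 0.3, 0.2])
    eta = np.array([0.9, 0.4, 0.1])
    h = np.array([1, 1, 0])        # an arbitrary classifier h(x) for x = 0, 1, 2

    # Formula from the proof: E_X[ eta(X) 1{h(X)=0} + (1 - eta(X)) 1{h(X)=1} ]
    risk_formula = np.sum(p_x * np.where(h == 0, eta, 1.0 - eta))

    # Direct Monte Carlo estimate of Pr(h(X) != Y)
    x = rng.choice(3, size=500_000, p=p_x)
    y = (rng.random(x.size) < eta[x]).astype(int)   # Y = 1 with probability eta(x)
    risk_mc = np.mean(h[x] != y)

    print(risk_formula, risk_mc)   # agree up to Monte Carlo error (about 0.25)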

Proof of (b):

\begin{aligned}
R(h) - R^* &= R(h) - R(h^*) \\
&= \mathbb{E}_X\left[\eta(X)\,\mathbb{I}_{\{h(X)=0\}} + (1-\eta(X))\,\mathbb{I}_{\{h(X)=1\}} - \eta(X)\,\mathbb{I}_{\{h^*(X)=0\}} - (1-\eta(X))\,\mathbb{I}_{\{h^*(X)=1\}}\right] \\
&= \mathbb{E}_X\left[\,|2\eta(X) - 1|\;\mathbb{I}_{\{h(X) \neq h^*(X)\}}\right] \\
&= 2\,\mathbb{E}_X\left[\,|\eta(X) - 0.5|\;\mathbb{I}_{\{h(X) \neq h^*(X)\}}\right].
\end{aligned}

Proof of (c):

\begin{aligned}
R(h^*) &= \mathbb{E}_X\left[\eta(X)\,\mathbb{I}_{\{h^*(X)=0\}} + (1-\eta(X))\,\mathbb{I}_{\{h^*(X)=1\}}\right] \\
&= \mathbb{E}_X\left[\min(\eta(X),\, 1-\eta(X))\right].
\end{aligned}

Proof of (d):

\begin{aligned}
R(h^*) &= \mathbb{E}_X\left[\min(\eta(X),\, 1-\eta(X))\right] \\
&= \frac{1}{2} - \mathbb{E}_X\left[\max\left(\eta(X) - \tfrac{1}{2},\, \tfrac{1}{2} - \eta(X)\right)\right] \\
&= \frac{1}{2} - \frac{1}{2}\,\mathbb{E}\left[\,|2\eta(X) - 1|\,\right].
\end{aligned}
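Results (c) and (d) can likewise be verified on a small made-up example; with hypothetical posteriors η(x) over three equally likely values of X, both expressions give the same Bayes risk.

    import numpy as np

    # Hypothetical posteriors eta(x) = Pr(Y = 1 | X = x) over three equally likely values of X.
    eta = np.array([0.9, 0.5, 0.2])
    p_x = np.array([1/3, 1/3, 1/3])

    bayes_risk_c = np.sum(p_x * np.minimum(eta, 1 - eta))           # (c): E[min(eta, 1 - eta)]
    bayes_risk_d = 0.5 - 0.5 * np.sum(p_x * np.abs(2 * eta - 1))    # (d): 1/2 - E|2 eta - 1| / 2

    print(bayes_risk_c, bayes_risk_d)   # both 0.2666...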

General case

The general case, that the Bayes classifier minimises the classification error when each element can belong to any of n categories, proceeds by the tower property of expectation as follows.

\begin{aligned}
\mathbb{E}_Y\left[\mathbb{I}_{\{y \neq \hat{y}\}}\right] &= \mathbb{E}_X \,\mathbb{E}_{Y \mid X}\left[\mathbb{I}_{\{y \neq \hat{y}\}} \mid X = x\right] \\
&= \mathbb{E}\left[\Pr(Y=1 \mid X=x)\,\mathbb{I}_{\{\hat{y} \in \{2,3,\dots,n\}\}} + \Pr(Y=2 \mid X=x)\,\mathbb{I}_{\{\hat{y} \in \{1,3,\dots,n\}\}} + \dots + \Pr(Y=n \mid X=x)\,\mathbb{I}_{\{\hat{y} \in \{1,2,3,\dots,n-1\}\}}\right]
\end{aligned}

This is minimised by simultaneously minimising all the terms of the expectation, using the classifier

h(x) = k, \qquad k = \underset{k}{\operatorname{argmax}} \Pr(Y = k \mid X = x),

for each observation x.
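In code, this rule is simply an arg max over the posterior probabilities evaluated at the observation (the numbers below are made up for illustration).

    import numpy as np

    # Made-up posterior probabilities Pr(Y = k | X = x) for k = 1, ..., 4 at one observation x.
    posteriors = np.array([0.10, 0.05, 0.60, 0.25])

    # The Bayes classifier assigns the class with the largest posterior.
    print(int(np.argmax(posteriors)) + 1)   # -> 3 (classes are numbered from 1)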

Notes and References

  1. Devroye, L.; Györfi, L.; Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer. ISBN 0-387-94618-7.
  2. Faragó, A.; Lugosi, G. (1993). "Strong universal consistency of neural network classifiers". IEEE Transactions on Information Theory, 39 (4), 1146–1151. doi:10.1109/18.243433.