0.5 Plug-in classifier and histogram classifier


We return to the topic of classification, and we assume an input (feature) space $X$ and a binary output (label) space $Y = \{0, 1\}$. Recall that the Bayes classifier (which minimizes the probability of misclassification) is defined by

f^{*}(x) = \begin{cases} 1, & P(Y = 1 | X = x) \geq 1/2 \\ 0, & \text{otherwise} \end{cases}.

Throughout this section, we will denote the conditional probability function by

η(x) \equiv P(Y = 1 | X = x).

Plug-in classifiers

One way to construct a classifier using the training data $\{X_{i}, Y_{i}\}_{i = 1}^{n}$ is to estimate $η(x)$ and then plug the estimate into the form of the Bayes classifier. That is, obtain an estimate

{\hat{η}}_{n}(x) = η(x; \{X_{i}, Y_{i}\}_{i = 1}^{n})

and then form the "plug-in" classification rule

\hat{f}(x) = \begin{cases} 1, & {\hat{η}}_{n}(x) \geq 1/2 \\ 0, & \text{otherwise} \end{cases}.
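The plug-in rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `plug_in_classifier` and the toy estimate `eta_hat` are hypothetical and do not come from the original notes.

```python
import numpy as np

def plug_in_classifier(eta_hat):
    """Turn an estimate of eta(x) = P(Y = 1 | X = x) into a 0/1 rule."""
    def f_hat(x):
        # Predict 1 exactly where the estimated conditional probability is >= 1/2.
        return (np.asarray(eta_hat(x)) >= 0.5).astype(int)
    return f_hat

# Hypothetical estimate of eta on [0, 1], chosen only for illustration.
eta_hat = lambda x: np.asarray(x, dtype=float)

f_hat = plug_in_classifier(eta_hat)
labels = f_hat(np.array([0.1, 0.5, 0.9]))
print(labels)
```

Any estimator of $η$ can be passed in place of the toy `eta_hat`; the thresholding step is the same regardless of how the estimate was obtained.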

Remark: The function $η(x)$ is generally more complicated than the ultimate classification rule (which is binary-valued), as we can see from

\begin{aligned} η &: X \to [0, 1] \\ f &: X \to \{0, 1\}. \end{aligned}

Therefore, in this sense plug-in methods are solving a more complicated problem than necessary. However, plug-in methods can perform well, as demonstrated by the next result.

Theorem

Plug-in classifier

Let $\tilde{η}$ be an approximation to $η$ , and consider the plug-in rule

f(x) = \begin{cases} 1, & \tilde{η}(x) \geq 1/2 \\ 0, & \text{otherwise} \end{cases}.

Then,

R(f) - R^{*} \leq 2 E[|η(X) - \tilde{η}(X)|],

where

\begin{aligned} R(f) &= P(f(X) \neq Y) \\ R^{*} &= R(f^{*}) = \inf_{f} R(f). \end{aligned}

Consider any $x \in \mathbb{R}^{d}$. In proving the optimality of the Bayes classifier $f^{*}$ in Lecture 2, we showed that

P(f(x) \neq Y | X = x) - P(f^{*}(x) \neq Y | X = x) = (2 η(x) - 1) [1_{\{f^{*}(x) = 1\}} - 1_{\{f(x) = 1\}}],

which is equivalent to

P(f(x) \neq Y | X = x) - P(f^{*}(x) \neq Y | X = x) = |2 η(x) - 1| \, 1_{\{f^{*}(x) \neq f(x)\}},

since $f^{*} (x) = 1$ whenever $2 η (x) - 1 > 0$ . Thus,

\begin{aligned} P(f(X) \neq Y) - R^{*} &= \int_{\mathbb{R}^{d}} 2 |η(x) - 1/2| \, 1_{\{f^{*}(x) \neq f(x)\}} \, p_{X}(x) \, dx \\ &\leq \int_{\mathbb{R}^{d}} 2 |η(x) - \tilde{η}(x)| \, 1_{\{f^{*}(x) \neq f(x)\}} \, p_{X}(x) \, dx \\ &\leq \int_{\mathbb{R}^{d}} 2 |η(x) - \tilde{η}(x)| \, p_{X}(x) \, dx \\ &= 2 E[|η(X) - \tilde{η}(X)|], \end{aligned}

where $p_{X}(x)$ is the marginal density of $X$. The first inequality follows from the fact that

f(x) \neq f^{*}(x) \Rightarrow |η(x) - \tilde{η}(x)| \geq |η(x) - 1/2|,

which holds because $f(x) \neq f^{*}(x)$ means that exactly one of $η(x)$ and $\tilde{η}(x)$ is at least $1/2$, so $1/2$ lies between them. The second inequality is simply a result of the fact that $1_{\{f^{*}(x) \neq f(x)\}}$ is either 0 or 1.

Pictorial illustration of $|η(x) - \tilde{η}(x)| \geq |η(x) - 1/2|$ when $f(x) \neq f^{*}(x)$.

Note that the inequality $P(f(X) \neq Y) - R^{*} \leq \int_{\mathbb{R}^{d}} 2 |η(x) - \tilde{η}(x)| \, 1_{\{f^{*}(x) \neq f(x)\}} \, p_{X}(x) \, dx$ shows that the excess risk is at most twice the integral of $|η(x) - \tilde{η}(x)|$ over the set where $f^{*}(x) \neq f(x)$. The difference $|η(x) - \tilde{η}(x)|$ may be arbitrarily large away from this set without affecting the error rate of the classifier. This illustrates the fact that estimating $η$ well everywhere (i.e., regression) is unnecessary for the design of a good classifier (we only need to determine where $η$ crosses the $1/2$-level). In other words, "classification is easier than regression."

The theorem shows us that a good estimate of $η$ can produce a good plug-in classification rule. By a "good" estimate, we mean an estimator $\tilde{η}$ that is close to $η$ in expected $L_{1}$-norm.
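As a rough numerical sanity check of the theorem, the sketch below evaluates both sides of the bound on a fine grid. Everything specific here is an assumption made for illustration: $X$ is taken to be uniform on $[0, 1]$ (approximated by grid averages), and the particular `eta` and `eta_tilde` are hypothetical choices, not taken from the notes.

```python
import numpy as np

# Assumption: X is uniform on [0, 1]; grid averages stand in for expectations.
x = np.linspace(0.0, 1.0, 100_001)

eta = x                                     # hypothetical true eta(x)
eta_tilde = np.clip(x**2 + 0.1, 0.0, 1.0)   # hypothetical approximation of eta

f_star = eta >= 0.5      # Bayes classifier
f = eta_tilde >= 0.5     # plug-in classifier built from eta_tilde

# Excess risk R(f) - R* = E[ |2 eta(X) - 1| * 1{f*(X) != f(X)} ]
excess_risk = np.mean(np.abs(2 * eta - 1) * (f_star != f))

# Right-hand side of the theorem: 2 E[ |eta(X) - eta_tilde(X)| ]
bound = 2 * np.mean(np.abs(eta - eta_tilde))

print(f"excess risk = {excess_risk:.4f}, bound = {bound:.4f}")
```

In this example the two classifiers disagree only on a small interval near the point where $η$ crosses $1/2$, so the excess risk is much smaller than the bound, which also charges for the error in $\tilde{η}$ away from that interval.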

The histogram classifier

Let's assume that the (input) features are randomly distributed over the unit hypercube $X = {[0, 1]}^{d}$ (note that by scaling and shifting any set of bounded features we can satisfy this assumption), and assume that the (output) labels are binary, i.e., $Y = \{0, 1\}$. A histogram classifier is based on a partition of the hypercube ${[0, 1]}^{d}$ into $M$ smaller cubes of equal size.

Partition of hypercube in 2 dimensions

Consider the unit square ${[0, 1]}^{2}$ and partition it into $M$ subsquares of equal area (assuming $M$ is a perfect square). Let the subsquares be denoted by $\{Q_{j}\}$, $j = 1, \ldots, M$.

Example of the unit square ${[0, 1]}^{2}$ partitioned into $M$ equally sized subsquares.

Define the following piecewise-constant estimator of $η (x)$ :

{\hat{η}}_{n}(x) = \sum_{j = 1}^{M} {\hat{P}}_{j} \, 1_{\{x \in Q_{j}\}}

where

{\hat{P}}_{j} = \frac{\sum_{i = 1}^{n} 1_{\{X_{i} \in Q_{j}, Y_{i} = 1\}}}{\sum_{i = 1}^{n} 1_{\{X_{i} \in Q_{j}\}}}.

As in our previous denoising examples, we expect that the bias of ${\hat{η}}_{n}$ will decrease as $M$ increases, but the variance will increase as $M$ increases.
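The histogram estimator and the resulting plug-in rule could be implemented roughly as follows. This is a minimal sketch under stated assumptions: the function name `fit_histogram_classifier` and the parameter `m` (number of bins per axis, so that $M = m^{d}$) are hypothetical, the features are assumed to lie in ${[0, 1]}^{d}$, and cells containing no training points are assigned ${\hat{P}}_{j} = 0$, a convention the text does not specify.

```python
import numpy as np

def fit_histogram_classifier(X, Y, m):
    """Histogram estimate of eta on [0, 1]^d with m bins per axis (M = m**d cells)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = X.shape[1]
    M = m ** d

    def cell_index(points):
        # Map each point to the flat index j of the cube Q_j that contains it.
        # Points with a coordinate equal to 1.0 are placed in the last bin.
        bins = np.minimum((np.asarray(points, dtype=float) * m).astype(int), m - 1)
        return np.ravel_multi_index(tuple(bins.T), (m,) * d)

    idx = cell_index(X)
    counts = np.bincount(idx, minlength=M)            # training points in Q_j
    ones = np.bincount(idx, weights=Y, minlength=M)   # those with label 1
    # P_hat_j = ones / counts; assumption: empty cells (0/0) are set to 0.
    p_hat = np.divide(ones, counts, out=np.zeros(M), where=counts > 0)

    def eta_hat(points):
        return p_hat[cell_index(points)]

    def f_hat(points):
        # Plug-in rule: predict 1 where the histogram estimate is >= 1/2.
        return (eta_hat(points) >= 0.5).astype(int)

    return eta_hat, f_hat

# Tiny hand-made training set on the unit square (illustrative only).
X_train = np.array([[0.1, 0.1], [0.2, 0.2], [0.9, 0.9], [0.8, 0.8]])
Y_train = np.array([1, 1, 0, 1])
eta_hat, f_hat = fit_histogram_classifier(X_train, Y_train, m=2)

query = np.array([[0.25, 0.25], [0.75, 0.75], [0.25, 0.75]])
estimates = eta_hat(query)
predictions = f_hat(query)
print(estimates, predictions)
```

Increasing `m` makes the cells smaller, so each ${\hat{P}}_{j}$ is averaged over fewer training points; this is the bias and variance trade-off described above.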


Source: OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3
