
We return to the topic of classification, and we assume an input (feature) space X and a binary output (label) space Y = {0, 1}. Recall that the Bayes classifier (which minimizes the probability of misclassification) is defined by

$$
f^*(x) = \begin{cases} 1, & P(Y = 1 \mid X = x) \ge 1/2 \\ 0, & \text{otherwise.} \end{cases}
$$

Throughout this section, we will denote the conditional probability function by

$$
\eta(x) \equiv P(Y = 1 \mid X = x).
$$

Plug-in classifiers

One way to construct a classifier using the training data {X_i, Y_i}, i = 1, …, n, is to estimate η(x) and then plug the estimate into the form of the Bayes classifier. That is, obtain an estimate

$$
\hat{\eta}_n(x) = \eta\big(x; \{X_i, Y_i\}_{i=1}^n\big)
$$

and then form the “plug-in” classification rule

$$
\hat{f}(x) = \begin{cases} 1, & \hat{\eta}_n(x) \ge 1/2 \\ 0, & \text{otherwise.} \end{cases}
$$
The function η(x) is generally more complicated than the ultimate (binary-valued) classification rule, as we can see by comparing

$$
\eta : \mathcal{X} \to [0,1] \qquad \text{versus} \qquad f : \mathcal{X} \to \{0,1\}.
$$

Therefore, in this sense plug-in methods are solving a more complicated problem than necessary. However, plug-in methods can perform well, as demonstrated by the next result.
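Before turning to that result, here is a minimal Python sketch (not part of the original notes) of how a plug-in rule is formed: any estimate of η taking values in [0, 1] is simply thresholded at 1/2. The logistic-shaped eta_hat below is a hypothetical stand-in for whatever estimator one actually uses.

```python
import numpy as np

def plug_in_classifier(eta_hat):
    """Turn an estimate eta_hat: X -> [0, 1] of P(Y = 1 | X = x)
    into a classification rule by thresholding at 1/2."""
    def f_hat(x):
        return (eta_hat(np.asarray(x)) >= 0.5).astype(int)
    return f_hat

# Hypothetical example: a logistic-shaped estimate of eta on the real line.
eta_hat = lambda x: 1.0 / (1.0 + np.exp(-3.0 * (x - 0.5)))
f_hat = plug_in_classifier(eta_hat)
print(f_hat([0.2, 0.5, 0.9]))   # -> [0 1 1]
```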

Theorem

Plug-in classifier

Let η̃ be an approximation to η, and consider the plug-in rule

$$
f(x) = \begin{cases} 1, & \tilde{\eta}(x) \ge 1/2 \\ 0, & \text{otherwise.} \end{cases}
$$

Then,

$$
R(f) - R^* \le 2\,E\big[\,|\eta(X) - \tilde{\eta}(X)|\,\big],
$$

where

$$
R(f) = P(f(X) \ne Y), \qquad R^* = R(f^*) = \inf_f R(f).
$$

Consider any x ∈ ℝ^d. In proving the optimality of the Bayes classifier f* in Lecture 2, we showed that

$$
P\big(f(x) \ne Y \mid X = x\big) - P\big(f^*(x) \ne Y \mid X = x\big) = \big(2\eta(x) - 1\big)\Big(\mathbf{1}\{f^*(x) = 1\} - \mathbf{1}\{f(x) = 1\}\Big),
$$

which is equivalent to

$$
P\big(f(x) \ne Y \mid X = x\big) - P\big(f^*(x) \ne Y \mid X = x\big) = \big|2\eta(x) - 1\big|\,\mathbf{1}\{f^*(x) \ne f(x)\},
$$

since f*(x) = 1 precisely when 2η(x) − 1 ≥ 0, so the factor 1{f*(x) = 1} − 1{f(x) = 1} is zero when f*(x) = f(x) and otherwise equals ±1 with the same sign as 2η(x) − 1. Thus,

$$
\begin{aligned}
P(f(X) \ne Y) - R^* &= \int_{\mathbb{R}^d} 2\,\big|\eta(x) - \tfrac{1}{2}\big|\,\mathbf{1}\{f^*(x) \ne f(x)\}\,p_X(x)\,dx \\
&\le \int_{\mathbb{R}^d} 2\,\big|\eta(x) - \tilde{\eta}(x)\big|\,\mathbf{1}\{f^*(x) \ne f(x)\}\,p_X(x)\,dx \\
&\le \int_{\mathbb{R}^d} 2\,\big|\eta(x) - \tilde{\eta}(x)\big|\,p_X(x)\,dx = 2\,E\big[\,|\eta(X) - \tilde{\eta}(X)|\,\big],
\end{aligned}
$$

where p_X(x) is the marginal density of X; the first inequality follows from the fact that

$$
f(x) \ne f^*(x) \;\Longrightarrow\; \big|\eta(x) - \tilde{\eta}(x)\big| \ge \big|\eta(x) - \tfrac{1}{2}\big|
$$

(if f(x) ≠ f*(x), then η(x) and η̃(x) lie on opposite sides of the 1/2 threshold, so η̃(x) is at least as far from η(x) as 1/2 is), and the second inequality is simply a result of the fact that 1{f*(x) ≠ f(x)} is either 0 or 1.

Figure: pictorial illustration of |η(x) − η̃(x)| ≥ |η(x) − 1/2| when f(x) ≠ f*(x).

Note that the inequality

$$
P(f(X) \ne Y) - R^* \le \int_{\mathbb{R}^d} 2\,\big|\eta(x) - \tilde{\eta}(x)\big|\,\mathbf{1}\{f^*(x) \ne f(x)\}\,p_X(x)\,dx
$$

shows that the excess risk is at most twice the integral over the set where f*(x) ≠ f(x). The difference |η(x) − η̃(x)| may be arbitrarily large away from this set without affecting the error rate of the classifier. This illustrates the fact that estimating η well everywhere (i.e., regression) is unnecessary for the design of a good classifier; we only need to determine where η crosses the 1/2 level. In other words, “classification is easier than regression.”

The theorem shows us that a good estimate of η can produce a good plug-in classification rule. By a “good” estimate, we mean an estimator η̃ that is close to η in expected L_1 norm.
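As an informal sanity check on the theorem (not from the original notes), the following simulation assumes a made-up η and η̃ on [0, 1] with X uniform, evaluates the exact excess-risk expression from the proof by Monte Carlo, and compares it with the bound 2E[|η(X) − η̃(X)|].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true regression function and an imperfect approximation of it.
eta       = lambda x: 0.3 + 0.4 * x                          # P(Y = 1 | X = x), x in [0, 1]
eta_tilde = lambda x: 0.3 + 0.4 * x + 0.1 * np.sin(8 * x)    # an imperfect estimate

x = rng.uniform(0.0, 1.0, size=200_000)          # X ~ Uniform[0, 1], so p_X(x) = 1
f_star = (eta(x) >= 0.5)                         # Bayes rule
f_plug = (eta_tilde(x) >= 0.5)                   # plug-in rule built from eta_tilde

# Excess risk via the exact expression E[ 2 |eta(X) - 1/2| 1{f(X) != f*(X)} ] from the proof.
excess = np.mean(2.0 * np.abs(eta(x) - 0.5) * (f_plug != f_star))
bound  = 2.0 * np.mean(np.abs(eta(x) - eta_tilde(x)))
print(f"excess risk ~ {excess:.4f}  <=  bound ~ {bound:.4f}")
```

In this example the excess risk is much smaller than the bound, since η̃ only misplaces the 1/2 crossing slightly; the bound also charges for estimation error far from the decision boundary.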

The histogram classifier

Let's assume that the (input) features are randomly distributed over the unit hypercube X = [0,1]^d (note that by scaling and shifting, any set of bounded features can be made to satisfy this assumption), and assume that the (output) labels are binary, i.e., Y = {0, 1}. A histogram classifier is based on a partition of the hypercube [0,1]^d into M smaller cubes of equal size.

Partition of hypercube in 2 dimensions

Consider the unit square [0,1]^2 and partition it into M subsquares of equal area (assuming √M is an integer). Denote the subsquares by Q_j, j = 1, …, M.

Figure: the unit square [0,1]^2 partitioned into M equally sized subsquares.

Define the following piecewise-constant estimator of η ( x ) :

$$
\hat{\eta}_n(x) = \sum_{j=1}^{M} \hat{P}_j\,\mathbf{1}\{x \in Q_j\},
$$

where

$$
\hat{P}_j = \frac{\sum_{i=1}^{n} \mathbf{1}\{X_i \in Q_j,\, Y_i = 1\}}{\sum_{i=1}^{n} \mathbf{1}\{X_i \in Q_j\}}.
$$

Like our previous denoising examples, we expect that the bias of η̂_n will decrease as M increases, while its variance will increase, since fewer training samples fall in each (smaller) cell.
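As a rough illustration (again, not part of the original notes), the sketch below implements the histogram classifier on the unit square: it forms the per-cell estimates P̂_j from synthetic data invented purely for this example and classifies by thresholding at 1/2; empty cells are assigned the arbitrary value 1/2.

```python
import numpy as np

def fit_histogram_classifier(X, Y, m):
    """Histogram classifier on the unit square [0,1]^2 with M = m*m equal cells.
    Returns P_hat[j, k], the estimate of P(Y = 1 | X in cell (j, k))."""
    counts = np.zeros((m, m))                        # training points per cell
    ones = np.zeros((m, m))                          # label-1 points per cell
    idx = np.minimum((X * m).astype(int), m - 1)     # cell index of each point
    np.add.at(counts, (idx[:, 0], idx[:, 1]), 1)
    np.add.at(ones, (idx[:, 0], idx[:, 1]), Y)
    # Empty cells get the (arbitrary) estimate 1/2.
    return np.where(counts > 0, ones / np.maximum(counts, 1), 0.5)

def predict(P_hat, X):
    """Plug-in rule: threshold the per-cell estimates at 1/2."""
    m = P_hat.shape[0]
    idx = np.minimum((X * m).astype(int), m - 1)
    return (P_hat[idx[:, 0], idx[:, 1]] >= 0.5).astype(int)

# Synthetic example: eta(x) = 0.9 above the diagonal x1 + x2 = 1, and 0.1 below it.
rng = np.random.default_rng(1)
X = rng.uniform(size=(5000, 2))
eta = np.where(X.sum(axis=1) > 1.0, 0.9, 0.1)
Y = (rng.uniform(size=5000) < eta).astype(int)

P_hat = fit_histogram_classifier(X, Y, m=8)        # M = 64 cells
print(np.mean(predict(P_hat, X) == Y))             # roughly 0.9 (the Bayes risk here is 0.1)
```

Increasing m refines the partition (lower bias) but leaves fewer samples in each of the M = m^2 cells (higher variance), which is exactly the tradeoff noted above.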





Source: OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009. Download for free at http://cnx.org/content/col10532/1.3
