
Review: PAC bounds

Consider a finite collection of models $F$, and recall the basic PAC bound: for any $\delta > 0$, with probability at least $1 - \delta$,

$$ R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log |F| + \log(1/\delta)}{2n}}, \quad \forall f \in F, $$

where

$$ \hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i), \qquad R(f) = E[\ell(f(X), Y)], $$

and the loss is assumed to be bounded between 0 and 1. Note that we can write the inequality above as:

$$ R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log\frac{|F|}{\delta}}{2n}}. $$

Letting $\delta_f = \frac{\delta}{|F|}$, we have:

$$ R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log(1/\delta_f)}{2n}}. $$

This is precisely the form of Hoeffding's inequality, with $\delta_f$ in place of the usual $\delta$. In effect, in order to have Hoeffding's inequality hold with probability $1 - \delta$ for all $f \in F$, we must distribute the “$\delta$-budget” or “confidence budget” over all $f \in F$ (in this case, evenly distributed):

$$ \sum_{f \in F} \delta_f = \sum_{f \in F} \frac{\delta}{|F|} = \delta. $$
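For a sense of scale (with illustrative numbers, not values from the lecture): if $|F| = 1000$, $n = 10{,}000$ and $\delta = 0.05$, then each model receives $\delta_f = 5 \times 10^{-5}$, and the resulting uniform penalty is

$$ \sqrt{\frac{\log(1/\delta_f)}{2n}} = \sqrt{\frac{\log(20{,}000)}{20{,}000}} \approx 0.022, $$

only about twice the single-model Hoeffding penalty $\sqrt{\log(1/\delta)/(2n)} \approx 0.012$, even though the bound now holds simultaneously for all $1000$ models.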

However, to apply the union bound, we do not need to distribute $\delta$ evenly among the candidate models. We only require:

$$ \sum_{f \in F} \delta_f \le \delta. $$

So, if $p(f)$ are positive numbers satisfying $\sum_{f \in F} p(f) \le 1$, then we can take $\delta_f = p(f)\,\delta$. This provides two advantages:

  1. By choosing $p(f)$ larger for certain $f$, we can preferentially treat those candidates.
  2. We do not need $F$ to be finite; we only require $\sum_{f \in F} p(f) \le 1$.

Prefix codes are one way to achieve this. If we assign a binary prefix codeword of length $c(f)$ to each $f \in F$, then the values $p(f) = 2^{-c(f)}$ satisfy $\sum_{f \in F} p(f) \le 1$ by the Kraft inequality.
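As a quick numerical illustration (a minimal sketch, not from the lecture; the model names and code lengths below are hypothetical), one can verify the Kraft inequality and turn code lengths into a confidence budget and the corresponding penalties:

```python
import math

# Hypothetical binary prefix-code lengths c(f) for a small model collection.
code_lengths = {"f1": 1, "f2": 2, "f3": 3, "f4": 3}

# Kraft inequality: sum_f 2^{-c(f)} <= 1 for any binary prefix code.
kraft_sum = sum(2.0 ** -c for c in code_lengths.values())
assert kraft_sum <= 1.0

delta, n = 0.05, 10_000

def penalty(c_f, n, delta):
    """Bound term sqrt((c(f) log 2 + log(1/delta)) / (2n))."""
    return math.sqrt((c_f * math.log(2) + math.log(1.0 / delta)) / (2 * n))

for name, c_f in code_lengths.items():
    # delta_f = 2^{-c(f)} * delta: shorter codewords get a larger share of the budget.
    delta_f = 2.0 ** -c_f * delta
    print(f"{name}: c(f) = {c_f}, delta_f = {delta_f:.1e}, penalty = {penalty(c_f, n, delta):.4f}")
```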

The main point of this lecture is to examine how PAC bounds of the form: with probability at least $1 - \delta$,

$$ R(f) \le \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}}, \quad \forall f \in F, $$

can be used to select a model that comes close to achieving the best possible performance

$$ \inf_{f \in F} R(f). $$

Let $\hat{f}_n$ be the model selected from $F$ using the training data $\{X_i, Y_i\}_{i=1}^{n}$. We will specify this model in a moment, but keep in mind that it is not necessarily the model with minimum empirical risk as before. We would like to have

$$ E[R(\hat{f}_n)] - \inf_{f \in F} R(f) $$

as small as possible. First, for any $\delta > 0$, define

$$ \hat{f}_n^{\delta} = \arg\min_{f \in F} \left\{ \hat{R}_n(f) + C(f, n, \delta) \right\}, $$

where

$$ C(f, n, \delta) \equiv \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}}. $$

Then, with probability at least $1 - \delta$,

$$ R(f) \le \hat{R}_n(f) + C(f, n, \delta), \quad \forall f \in F, $$

and in particular,

$$ R(\hat{f}_n^{\delta}) \le \hat{R}_n(\hat{f}_n^{\delta}) + C(\hat{f}_n^{\delta}, n, \delta), $$

so, by the definition of $\hat{f}_n^{\delta}$, for all $f \in F$,

$$ R(\hat{f}_n^{\delta}) \le \hat{R}_n(f) + C(f, n, \delta). $$

We will make use of the inequality above in a moment. First note that for all $f \in F$,

$$ E[R(\hat{f}_n^{\delta})] - R(f) = E[R(\hat{f}_n^{\delta}) - \hat{R}_n(f)] + E[\hat{R}_n(f) - R(f)]. $$

The second term is exactly 0, since $E[\hat{R}_n(f)] = R(f)$.

Now consider the first term, $E[R(\hat{f}_n^{\delta}) - \hat{R}_n(f)]$. Let $\Omega$ be the set of events on which

$$ R(\hat{f}_n^{\delta}) \le \hat{R}_n(f) + C(f, n, \delta), \quad \forall f \in F. $$

From the bound above, we know that $P(\Omega) \ge 1 - \delta$. Thus,

$$
\begin{aligned}
E[R(\hat{f}_n^{\delta}) - \hat{R}_n(f)]
&= E[R(\hat{f}_n^{\delta}) - \hat{R}_n(f) \mid \Omega]\, P(\Omega) + E[R(\hat{f}_n^{\delta}) - \hat{R}_n(f) \mid \Omega^c]\, (1 - P(\Omega)) \\
&\le C(f, n, \delta) + \delta \qquad \text{(since } 0 \le R, \hat{R}_n \le 1,\ P(\Omega) \le 1 \text{ and } 1 - P(\Omega) \le \delta\text{)} \\
&= \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}} + \delta \\
&= \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}} \qquad \text{(by setting } \delta = 1/\sqrt{n}\text{)}.
\end{aligned}
$$

We can summarize our analysis with the following theorem.

Theorem

Complexity regularized model selection

Let $F$ be a countable collection of models, and assign a positive number $c(f)$ to each $f \in F$ such that $\sum_{f \in F} 2^{-c(f)} \le 1$. Define the minimum complexity regularized risk model

$$ \hat{f}_n = \arg\min_{f \in F} \left\{ \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} \right\}. $$

Then,

$$ E[R(\hat{f}_n)] \le \inf_{f \in F} \left\{ R(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}. $$

This shows that

$$ \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} $$

is a reasonable surrogate for

$$ R(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}}. $$
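A minimal sketch of how this selection rule could be implemented (assuming a finite list of candidate predictors with known code lengths and the $0/1$ loss; the function names are mine, not from the lecture):

```python
import math

def complexity_penalty(c_f, n):
    """Penalty term sqrt((c(f) log 2 + (1/2) log n) / (2n)) from the theorem."""
    return math.sqrt((c_f * math.log(2) + 0.5 * math.log(n)) / (2 * n))

def select_model(models, X, y):
    """Return the model minimizing empirical risk plus complexity penalty.

    `models` is a list of (predict_fn, c_f) pairs, where predict_fn maps an
    input to a label in {0, 1} and c_f is the model's prefix-code length.
    """
    n = len(y)
    best_model, best_score = None, float("inf")
    for predict_fn, c_f in models:
        emp_risk = sum(predict_fn(x) != yi for x, yi in zip(X, y)) / n  # 0/1 loss
        score = emp_risk + complexity_penalty(c_f, n)
        if score < best_score:
            best_model, best_score = (predict_fn, c_f), score
    return best_model
```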

Histogram classifiers

Let $X = [0,1]^d$ be the input space and $Y = \{0,1\}$ be the output space. Let $F_k$, $k = 1, 2, \dots$, denote the collection of histogram classification rules with $k$ equal-volume bins. One choice of prefix code for this example is: $k = 1 \Rightarrow$ code $= 0$, $k = 2 \Rightarrow$ code $= 10$, $k = 3 \Rightarrow$ code $= 110$, and so on. Then, if the first $k$ bits of the codeword identify the value of $k$ for which $f \in F_k$, followed by $\log_2 |F_k| = k$ bits to indicate which of the $2^k$ histogram rules in $F_k$ is under consideration, we have

$$ \forall f \in F_k, \quad c(f) = 2k \text{ bits}. $$
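As a check (not spelled out in the lecture), this assignment satisfies the summability condition of the theorem:

$$ \sum_{k \ge 1} \sum_{f \in F_k} 2^{-c(f)} = \sum_{k \ge 1} 2^{k} \cdot 2^{-2k} = \sum_{k \ge 1} 2^{-k} = 1 \le 1. $$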

Let $\hat{f}_n$ be the model that solves the minimization

$$ \min_{k \ge 1} \min_{f \in F_k} \left\{ \hat{R}_n(f) + \sqrt{\frac{2k\log 2 + \tfrac{1}{2}\log n}{2n}} \right\}. $$

That is, for each k, let

$$ \hat{f}_n^{(k)} = \arg\min_{f \in F_k} \hat{R}_n(f). $$

Then select the best k according to

$$ \hat{k} = \arg\min_{k \ge 1} \left\{ \hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{2k\log 2 + \tfrac{1}{2}\log n}{2n}} \right\} $$

and set

$$ \hat{f}_n = \hat{f}_n^{(\hat{k})}. $$

Then,

$$ E[R(\hat{f}_n)] \le \inf_{k \ge 1} \left\{ \min_{f \in F_k} R(f) + \sqrt{\frac{2k\log 2 + \tfrac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}. $$
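A sketch of this two-stage selection for the simplest case $d = 1$, so that the bins are intervals of $[0,1]$ (a simplification of the lecture's setting; extending to an equal-volume grid on $[0,1]^d$ is routine, and `k_max` is a practical truncation of the minimization over all $k \ge 1$):

```python
import math

def fit_histogram(X, y, k):
    """Empirical risk minimizer in F_k: majority vote over k equal-width bins on [0, 1]."""
    counts = [[0, 0] for _ in range(k)]
    for x, yi in zip(X, y):
        b = min(int(x * k), k - 1)   # bin index, with x = 1.0 mapped to the last bin
        counts[b][yi] += 1
    labels = [int(c1 >= c0) for c0, c1 in counts]
    return lambda x: labels[min(int(x * k), k - 1)]

def select_histogram(X, y, k_max):
    """Complexity regularized choice of the number of bins k, using c(f) = 2k."""
    n = len(y)
    best_rule, best_k, best_score = None, None, float("inf")
    for k in range(1, k_max + 1):
        rule = fit_histogram(X, y, k)
        emp_risk = sum(rule(x) != yi for x, yi in zip(X, y)) / n
        pen = math.sqrt((2 * k * math.log(2) + 0.5 * math.log(n)) / (2 * n))
        if emp_risk + pen < best_score:
            best_rule, best_k, best_score = rule, k, emp_risk + pen
    return best_rule, best_k
```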

It is a simple exercise to show that if $d = 2$ and the Bayes decision boundary is a 1-d curve, then by setting $k = \sqrt{n}$ and selecting the best $f$ from $F_{\sqrt{n}}$ we have

$$ E[R(\hat{f}_n)] - R^* = O(n^{-1/4}), $$

where $R^*$ denotes the Bayes risk. The complexity regularized classifier $\hat{f}_n$ adaptively achieves this rate, without user intervention.
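Sketching that exercise (heuristically, assuming the boundary has finite length and using the $0/1$ loss): in $d = 2$ the $k$ equal-volume bins are squares of side $k^{-1/2}$, so a one-dimensional boundary intersects only $O(\sqrt{k})$ of them, and the best rule in $F_k$ has excess risk $O(\sqrt{k} \cdot \tfrac{1}{k}) = O(k^{-1/2})$, while the penalty term scales as $O(\sqrt{k/n})$ up to logarithmic factors. Balancing $k^{-1/2} \asymp \sqrt{k/n}$ gives $k \asymp \sqrt{n}$, and both terms are then of order $n^{-1/4}$.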

Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3