
Review: PAC bounds

Consider a finite collection of models $\mathcal{F}$, and recall the basic PAC bound: for any $\delta > 0$, with probability at least $1 - \delta$,

$$R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}, \quad \forall f \in \mathcal{F},$$

where

$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(X_i), Y_i), \qquad R(f) = \mathbb{E}\left[\ell(f(X), Y)\right],$$

and the loss is assumed to be bounded between 0 and 1.
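
For concreteness, one common choice satisfying this boundedness assumption is the 0-1 loss $\ell(f(X), Y) = \mathbf{1}\{f(X) \ne Y\}$. The empirical risk can then be computed as in this minimal sketch (the function name is mine, not from the notes):

```python
import numpy as np

def empirical_risk(f, X, Y):
    """R_hat_n(f): average 0-1 loss of classifier f over the n training pairs."""
    predictions = np.array([f(x) for x in X])
    return float(np.mean(predictions != np.asarray(Y)))
```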

Note that we can write the inequality above as:

$$R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log\left(|\mathcal{F}|/\delta\right)}{2n}}$$

Letting $\delta_f = \delta / |\mathcal{F}|$, we have:

$$R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log(1/\delta_f)}{2n}}$$

This is precisely the form of Hoeffding's inequality, with $\delta_f$ in place of the usual $\delta$. In effect, in order to have Hoeffding's inequality hold with probability $1 - \delta$ for all $f \in \mathcal{F}$, we must distribute the “$\delta$-budget” or “confidence budget” over all $f \in \mathcal{F}$ (in this case, evenly distributed):

$$\sum_{f \in \mathcal{F}} \delta_f = \sum_{f \in \mathcal{F}} \frac{\delta}{|\mathcal{F}|} = \delta$$

However, to apply the union bound, we do not need to distribute $\delta$ evenly among the candidate models. We only require:

$$\sum_{f \in \mathcal{F}} \delta_f = \delta$$

So, if $p(f)$ are positive numbers satisfying $\sum_{f \in \mathcal{F}} p(f) = 1$, then we can take $\delta_f = p(f)\,\delta$. This provides two advantages:

  1. By choosing $p(f)$ larger for certain $f$, we can preferentially treat those candidates.
  2. We do not need $\mathcal{F}$ to be finite; we only require $\sum_{f \in \mathcal{F}} p(f) = 1$ (in fact, $\sum_{f \in \mathcal{F}} p(f) \le 1$ suffices, which is what we use next).

Prefix codes are one way to achieve this. If we assign a binary prefix code of length $c(f)$ to each $f \in \mathcal{F}$, then the values $p(f) = 2^{-c(f)}$ satisfy $\sum_{f \in \mathcal{F}} p(f) \le 1$ according to the Kraft inequality.
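
As a quick numerical sanity check (a sketch, not part of the original notes), the unary-style prefix code $0, 10, 110, 1110, \dots$ assigns lengths $c = 1, 2, 3, \dots$ to successive models, and the corresponding weights $2^{-c}$ indeed sum to at most one:

```python
def kraft_sum(code_lengths):
    """Sum of 2^(-c(f)) over the given codeword lengths."""
    return sum(2.0 ** (-c) for c in code_lengths)

# Unary-style prefix code "0", "10", "110", ... with lengths 1, 2, ..., 30.
print(kraft_sum(range(1, 31)))  # 1 - 2**(-30), i.e. at most 1, as Kraft guarantees
```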

The main point of this lecture is to examine how PAC bounds of the form (holding with probability at least $1 - \delta$)

$$R(f) \le \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}}, \quad \forall f \in \mathcal{F},$$

can be used to select a model that comes close to achieving the best possible performance,

$$\inf_{f \in \mathcal{F}} R(f).$$

Let $\hat{f}_n$ be the model selected from $\mathcal{F}$ using the training data $\{X_i, Y_i\}_{i=1}^{n}$. We will specify this model in a moment, but keep in mind that it is not necessarily the model with minimum empirical risk as before. We would like to have

$$\mathbb{E}[R(\hat{f}_n)] - \inf_{f \in \mathcal{F}} R(f)$$

as small as possible. First, for any $\delta > 0$, define

$$\hat{f}_n^\delta = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + C(f, n, \delta) \right\}$$

where

$$C(f, n, \delta) \equiv \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}}$$
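
In code, this penalty is simply the following (a minimal sketch; the function name is mine):

```python
import math

def penalty(c_f, n, delta):
    """C(f, n, delta) = sqrt((c(f) log 2 + log(1/delta)) / (2n)), with c(f) in bits."""
    return math.sqrt((c_f * math.log(2) + math.log(1.0 / delta)) / (2 * n))
```

Setting $\delta = 1/\sqrt{n}$, as we do further below, turns $\log(1/\delta)$ into $\tfrac{1}{2}\log n$, which is the form of the penalty used in the theorem.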

Then, with probability at least $1 - \delta$,

$$R(f) \le \hat{R}_n(f) + C(f, n, \delta), \quad \forall f \in \mathcal{F},$$

and in particular,

$$R(\hat{f}_n^\delta) \le \hat{R}_n(\hat{f}_n^\delta) + C(\hat{f}_n^\delta, n, \delta),$$

so, by the definition of $\hat{f}_n^\delta$, for all $f \in \mathcal{F}$,

$$R(\hat{f}_n^\delta) \le \hat{R}_n(f) + C(f, n, \delta).$$

We will make use of the inequality above in a moment. First note that, for all $f \in \mathcal{F}$,

$$\mathbb{E}[R(\hat{f}_n^\delta)] - R(f) = \mathbb{E}[R(\hat{f}_n^\delta) - \hat{R}_n(f)] + \mathbb{E}[\hat{R}_n(f) - R(f)]$$

The second term is exactly 0, since $\mathbb{E}[\hat{R}_n(f)] = R(f)$.

Now consider the first term, $\mathbb{E}[R(\hat{f}_n^\delta) - \hat{R}_n(f)]$. Let $\Omega$ be the event on which

$$R(\hat{f}_n^\delta) - \hat{R}_n(f) \le C(f, n, \delta), \quad \forall f \in \mathcal{F}.$$

From the bound above, we know that $P(\Omega) \ge 1 - \delta$. Thus,

$$
\begin{aligned}
\mathbb{E}[R(\hat{f}_n^\delta) - \hat{R}_n(f)]
  &= \mathbb{E}[R(\hat{f}_n^\delta) - \hat{R}_n(f) \mid \Omega]\,P(\Omega)
   + \mathbb{E}[R(\hat{f}_n^\delta) - \hat{R}_n(f) \mid \Omega^c]\,(1 - P(\Omega)) \\
  &\le C(f, n, \delta) + \delta
   \qquad \text{(since } 0 \le R, \hat{R} \le 1,\ P(\Omega) \le 1 \text{ and } 1 - P(\Omega) \le \delta\text{)} \\
  &= \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}} + \delta \\
  &= \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}
   \qquad \text{(by setting } \delta = 1/\sqrt{n}\text{)}
\end{aligned}
$$

We can summarize our analysis with the following theorem.

Theorem

Complexity regularized model selection

Let $\mathcal{F}$ be a countable collection of models, and assign a positive number $c(f)$ to each $f \in \mathcal{F}$ such that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. Define the minimum complexity regularized risk model

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} \right\}$$

Then,

$$\mathbb{E}[R(\hat{f}_n)] \le \inf_{f \in \mathcal{F}} \left\{ R(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}$$

This shows that

$$\hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}}$$

is a reasonable surrogate for

$$R(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}}$$
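
Since the first of these two quantities is computable from the data, one way to read the theorem is as a recipe: evaluate the penalized empirical risk of every candidate and keep the minimizer. Below is a minimal sketch of that recipe; the interface (models passed as classifier/codelength pairs) and the function name are assumptions of this example, not part of the original notes.

```python
import math

def select_model(models, X, Y):
    """Complexity regularized model selection (sketch).

    models : list of (f, c_f) pairs, where f is a classifier (callable
             x -> label in {0, 1}) and c_f is its codelength in bits,
             chosen so that sum_f 2**(-c_f) <= 1.
    Returns the f minimizing R_hat_n(f) + sqrt((c_f log 2 + 0.5 log n) / (2n)).
    """
    n = len(X)

    def penalized_risk(pair):
        f, c_f = pair
        emp_risk = sum(f(x) != y for x, y in zip(X, Y)) / n   # 0-1 empirical risk
        return emp_risk + math.sqrt((c_f * math.log(2) + 0.5 * math.log(n)) / (2 * n))

    best_f, _ = min(models, key=penalized_risk)
    return best_f
```

The histogram example that follows instantiates this with $c(f) = 2k$.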

Histogram classifiers

Let $\mathcal{X} = [0,1]^d$ be the input space and $\mathcal{Y} = \{0,1\}$ be the output space. Let $\mathcal{F}_k$, $k = 1, 2, \dots$, denote the collection of histogram classification rules with $k$ equal-volume bins. One choice of prefix code for this example is: $k = 1 \mapsto$ code $0$, $k = 2 \mapsto$ code $10$, $k = 3 \mapsto$ code $110$, and so on. Then, if the codeword for each $f \in \mathcal{F}_k$ begins with the code for $k$, followed by $k = \log_2 |\mathcal{F}_k|$ bits to indicate which of the $2^k$ histogram rules in $\mathcal{F}_k$ is under consideration, we have

$$c(f) = 2k \text{ bits}, \quad \forall f \in \mathcal{F}_k$$

Let $\hat{f}_n$ be the model that solves the minimization, i.e.,

$$\min_{k \ge 1}\ \min_{f \in \mathcal{F}_k} \left\{ \hat{R}_n(f) + \sqrt{\frac{2k\log 2 + \tfrac{1}{2}\log n}{2n}} \right\}$$

That is, for each k, let

$$\hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k} \hat{R}_n(f)$$

Then select the best k according to

$$\hat{k} = \arg\min_{k \ge 1} \left\{ \hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{2k\log 2 + \tfrac{1}{2}\log n}{2n}} \right\}$$

and set

$$\hat{f}_n = \hat{f}_n^{(\hat{k})}$$

Then,

$$\mathbb{E}[R(\hat{f}_n)] \le \inf_{k \ge 1} \left\{ \min_{f \in \mathcal{F}_k} R(f) + \sqrt{\frac{2k\log 2 + \tfrac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}$$

It is a simple exercise to show that if $d = 2$ and the Bayes decision boundary is a 1-dimensional curve, then by setting $k = \sqrt{n}$ and selecting the best $f$ from $\mathcal{F}_{\sqrt{n}}$ we have

$$\mathbb{E}[R(\hat{f}_n)] = O(n^{-1/4})$$
The complexity regularized classifier $\hat{f}_n$ adaptively achieves this rate, without user intervention.
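
To make the procedure concrete, here is a minimal sketch in Python/NumPy for $d = 2$. It restricts attention to square $m \times m$ grids (so $k = m^2$ equal-volume bins), which is a simplification of the setup above rather than part of the original notes; the function names are mine.

```python
import math
import numpy as np

def histogram_erm(X, Y, m):
    """Empirical risk minimizer over histogram rules on an m x m grid of
    equal-volume bins in [0,1]^2: take the majority label in each bin."""
    n = len(Y)
    # Bin index of each point; clip so coordinates equal to 1.0 fall in the last bin.
    idx = np.minimum((X * m).astype(int), m - 1)
    flat = idx[:, 0] * m + idx[:, 1]
    ones = np.bincount(flat, weights=Y, minlength=m * m)   # label-1 points per bin
    counts = np.bincount(flat, minlength=m * m)             # total points per bin
    labels = (2 * ones > counts).astype(int)                # majority vote per bin
    errors = np.minimum(ones, counts - ones).sum()          # misclassified training points
    return labels, errors / n

def select_histogram(X, Y, max_m=20):
    """Choose the grid size by minimizing the penalized empirical risk,
    with k = m*m bins and penalty sqrt((2k log 2 + 0.5 log n) / (2n))."""
    n = len(Y)
    best = None
    for m in range(1, max_m + 1):
        k = m * m
        labels, emp_risk = histogram_erm(X, Y, m)
        score = emp_risk + math.sqrt((2 * k * math.log(2) + 0.5 * math.log(n)) / (2 * n))
        if best is None or score < best[0]:
            best = (score, m, labels)
    _, m_hat, labels_hat = best
    return m_hat, labels_hat
```

Calling `select_histogram(X, Y)` on training inputs in $[0,1]^2$ with labels in $\{0,1\}$ returns the selected grid size and the per-bin labels of the corresponding classifier; no smoothing parameter or bin count needs to be supplied by the user, in keeping with the adaptivity noted above.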

Source: OpenStax, Statistical learning theory. OpenStax CNX, Apr 10, 2009. Download for free at http://cnx.org/content/col10532/1.3