
Introduction

In the last lecture, we studied bounds of the following form: for any δ > 0, with probability at least 1 - δ,

$$ R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log|\mathcal{F}| + \log\frac{1}{\delta}}{2n}}, \quad \forall\, f \in \mathcal{F}, $$

which led to upper bounds on the estimation error of the form

$$ \mathbb{E}[R(\hat{f}_n)] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{\log|\mathcal{F}| + \log(n) + 2}{n}}. $$

The key assumptions made in deriving the error bounds were:

  • bounded loss function
  • finite collection of candidate functions

The bounds are valid for every distribution $P_{XY}$ and are therefore called distribution-free.
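For concreteness, here is a minimal numerical sketch of the finite-class bound above, assuming a loss bounded in [0, 1]; the function name and the example numbers are illustrative and not part of the lecture.

```python
import numpy as np

def finite_class_bound(empirical_risk, num_candidates, n, delta):
    """Hoeffding/union-bound risk bound for a finite class and a loss in [0, 1]:
    R(f) <= R_hat_n(f) + sqrt((log|F| + log(1/delta)) / (2n)) w.p. >= 1 - delta."""
    penalty = np.sqrt((np.log(num_candidates) + np.log(1.0 / delta)) / (2.0 * n))
    return empirical_risk + penalty

# Example: 1000 candidates, 500 training points, 95% confidence.
print(finite_class_bound(empirical_risk=0.12, num_candidates=1000, n=500, delta=0.05))
```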

Deriving bounds for countably infinite spaces

In this lecture we will generalize the previous results in a powerful way by developing bounds applicable to possibly infinite collections of candidates. To start, let us suppose that F is a countable, possibly infinite, collection of candidate functions. Assign a positive number c(f) to each f ∈ F such that

$$ \sum_{f \in \mathcal{F}} e^{-c(f)} < \infty. $$

The numbers c( f ) can be interpreted as

  • measures of complexity
  • -log of prior probabilities
  • codelengths

In particular, if p(f) is the prior probability of f, then

$$ e^{-(-\log p(f))} = p(f), $$

so c(f) = -log p(f) produces

$$ \sum_{f \in \mathcal{F}} e^{-c(f)} = \sum_{f \in \mathcal{F}} p(f) = 1. $$
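As a quick illustration of this prior-based choice, the sketch below assumes a countably infinite class indexed by k = 1, 2, ... with the illustrative prior p(f_k) = 6/(π² k²), so that c(f_k) = -log p(f_k) and the partial sums of e^{-c(f)} approach 1; the indexing and the prior are assumptions for the example, not part of the lecture.

```python
import numpy as np

# Illustrative prior over a countably infinite class {f_1, f_2, ...}:
# p(f_k) = 6 / (pi^2 * k^2), which sums to 1, so set c(f_k) = -log p(f_k).
def c(k):
    p_k = 6.0 / (np.pi ** 2 * k ** 2)
    return -np.log(p_k)

# Partial sums of exp(-c(f_k)) approach 1, confirming sum_f e^{-c(f)} < infinity.
ks = np.arange(1, 100001)
print(np.sum(np.exp(-c(ks))))   # ~0.99999, approaching 1 as more terms are added
```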

Now recall Hoeffding's inequality. For each f and every ϵ > 0

$$ P\left( R(f) - \hat{R}_n(f) \ge \epsilon \right) \le e^{-2n\epsilon^2}, $$

or for every δ > 0

$$ P\left( R(f) - \hat{R}_n(f) \ge \sqrt{\frac{\log\frac{1}{\delta}}{2n}} \right) \le \delta. $$
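Hoeffding's inequality can be sanity-checked by simulation. The sketch below (an illustration, not part of the lecture) fixes one candidate f with 0/1 losses of mean R(f) = 0.3 and compares the empirical frequency of the deviation event with the bound e^{-2nε²}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fix one candidate f with true risk R(f) = 0.3 under a 0/1 loss (Bernoulli losses).
R, n, eps, trials = 0.3, 200, 0.1, 100000
losses = rng.binomial(1, R, size=(trials, n))   # i.i.d. 0/1 losses per trial
R_hat = losses.mean(axis=1)                     # empirical risk in each trial

# Empirical frequency of the deviation event vs. the Hoeffding bound exp(-2 n eps^2).
print((R - R_hat >= eps).mean(), np.exp(-2 * n * eps ** 2))
```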

Suppose δ > 0 is specified. Using the values c(f) for f ∈ F, define

$$ \delta(f) = e^{-c(f)}\,\delta. $$

Then we have

$$ P\left( R(f) - \hat{R}_n(f) \ge \sqrt{\frac{\log\frac{1}{\delta(f)}}{2n}} \right) \le \delta(f). $$

Furthermore, we can apply the union bound as follows:

$$ \begin{aligned} P\left( \sup_{f \in \mathcal{F}} \left\{ R(f) - \hat{R}_n(f) - \sqrt{\frac{\log(1/\delta(f))}{2n}} \right\} \ge 0 \right) &= P\left( \bigcup_{f \in \mathcal{F}} \left\{ R(f) - \hat{R}_n(f) \ge \sqrt{\frac{\log\frac{1}{\delta(f)}}{2n}} \right\} \right) \\ &\le \sum_{f \in \mathcal{F}} P\left( R(f) - \hat{R}_n(f) \ge \sqrt{\frac{\log\frac{1}{\delta(f)}}{2n}} \right) \\ &\le \sum_{f \in \mathcal{F}} \delta(f) = \sum_{f \in \mathcal{F}} e^{-c(f)}\,\delta = \delta. \end{aligned} $$

So for any δ > 0, with probability at least 1 - δ, we have, for all f ∈ F,

$$ R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log\frac{1}{\delta(f)}}{2n}} = \hat{R}_n(f) + \sqrt{\frac{c(f) + \log\frac{1}{\delta}}{2n}}. $$
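This bound suggests a complexity-penalized selection rule: pick the candidate minimizing the right-hand side. The sketch below computes these simultaneous upper bounds for a few hypothetical candidates; the empirical risks and complexity values c(f) are made up for illustration.

```python
import numpy as np

def complexity_penalized_bound(emp_risks, complexities, n, delta):
    """Upper bounds R(f) <= R_hat_n(f) + sqrt((c(f) + log(1/delta)) / (2n)),
    one per candidate, valid simultaneously with probability >= 1 - delta."""
    emp_risks = np.asarray(emp_risks, dtype=float)
    complexities = np.asarray(complexities, dtype=float)
    return emp_risks + np.sqrt((complexities + np.log(1.0 / delta)) / (2.0 * n))

# Illustrative candidates: lower empirical risk often comes with higher complexity c(f).
emp_risks = [0.20, 0.15, 0.11]
complexities = [1.0, 4.0, 9.0]   # c(f), e.g. -log prior probability (in nats)
bounds = complexity_penalized_bound(emp_risks, complexities, n=500, delta=0.05)
print(bounds, "-> select candidate", int(np.argmin(bounds)))
```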

Special case

Suppose F is finite and c(f) = log|F| for all f ∈ F. Then

$$ \sum_{f \in \mathcal{F}} e^{-c(f)} = \sum_{f \in \mathcal{F}} e^{-\log|\mathcal{F}|} = \sum_{f \in \mathcal{F}} \frac{1}{|\mathcal{F}|} = 1 $$

and

$$ \delta(f) = \frac{\delta}{|\mathcal{F}|}, $$

which implies that for any δ > 0, with probability at least 1 - δ, we have

$$ R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log|\mathcal{F}| + \log\frac{1}{\delta}}{2n}}, \quad \forall\, f \in \mathcal{F}. $$

Note that this is precisely the bound we derived in the last lecture.

Choosing c( f )

The generalized bounds allow us to handle countably infinite collections of candidate functions, but we require that

$$ \sum_{f \in \mathcal{F}} e^{-c(f)} < \infty. $$

Of course, if c(f) = -log p(f), where p(f) is a proper prior probability distribution, then we have

$$ \sum_{f \in \mathcal{F}} e^{-c(f)} = 1. $$

However, it may be difficult to design a probability distribution over an infinite class of candidates. The coding perspective provides a very practical means to this end.

Assume that we have assigned a uniquely decodable binary code to each f ∈ F, and let c(f) denote the codelength for f. That is, the code for f is c(f) bits long. A very useful class of uniquely decodable codes is the class of prefix codes.

Prefix Code
A code is called a prefix code if no codeword is a prefix of any other codeword.

The Kraft inequality

For any binary prefix code, the codeword lengths $c_1, c_2, \ldots$ satisfy

$$ \sum_{i=1}^{\infty} 2^{-c_i} \le 1. $$

Conversely, given any $c_1, c_2, \ldots$ satisfying the inequality above, we can construct a prefix code with these codeword lengths. We will prove this result a bit later, but for now let's see how it is useful in our learning problem.
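As a small sanity check (with a made-up codebook, not from the lecture), the sketch below verifies that a simple binary prefix code satisfies the Kraft inequality.

```python
# A small binary prefix code (no codeword is a prefix of another) and its Kraft sum.
codebook = {"f1": "0", "f2": "10", "f3": "110", "f4": "111"}

def is_prefix_free(words):
    words = list(words)
    return not any(a != b and b.startswith(a) for a in words for b in words)

kraft_sum = sum(2.0 ** -len(w) for w in codebook.values())
print(is_prefix_free(codebook.values()), kraft_sum)   # True, 0.5 + 0.25 + 0.125 + 0.125 = 1.0
```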

Assume that we have assigned a binary prefix codeword to each f ∈ F, and let c(f) denote the bit-length of the codeword for f. Set δ(f) = 2^{-c(f)} δ. Then

$$ P\left( \bigcup_{f \in \mathcal{F}} \left\{ R(f) - \hat{R}_n(f) \ge \sqrt{\frac{\log\frac{1}{\delta(f)}}{2n}} \right\} \right) \le \sum_{f \in \mathcal{F}} P\left( R(f) - \hat{R}_n(f) \ge \sqrt{\frac{\log\frac{1}{\delta(f)}}{2n}} \right) \le \sum_{f \in \mathcal{F}} \delta(f) = \sum_{f \in \mathcal{F}} 2^{-c(f)}\,\delta \le \delta, $$

where the last step follows from the Kraft inequality.
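Since log(1/δ(f)) = c(f) log 2 + log(1/δ), the resulting bound reads R(f) ≤ R̂_n(f) + sqrt((c(f) log 2 + log(1/δ))/(2n)) for all f ∈ F, with probability at least 1 - δ. The sketch below evaluates this codelength-based penalty; the numbers are illustrative.

```python
import numpy as np

def codelength_bound(emp_risk, codelength_bits, n, delta):
    """R(f) <= R_hat_n(f) + sqrt((c(f)*log 2 + log(1/delta)) / (2n)),
    where c(f) is the prefix codeword length of f in bits."""
    return emp_risk + np.sqrt((codelength_bits * np.log(2.0) + np.log(1.0 / delta)) / (2.0 * n))

# Illustrative: a candidate with a 3-bit codeword, n = 500 samples, 95% confidence.
print(codelength_bound(emp_risk=0.11, codelength_bits=3, n=500, delta=0.05))
```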

Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3
