
This implies that for any $\delta > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,

$$ R(f) \;\le\; \hat{R}_n(f) + \sqrt{\frac{\log \frac{1}{\delta(f)}}{2n}} \;=\; \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \log \frac{1}{\delta}}{2n}} \, . $$
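As a quick numerical illustration, the codelength penalty above can be computed directly; this is a minimal sketch with an illustrative helper name and made-up values of $c(f)$, $n$, and $\delta$:

```python
import math

def codelength_penalty(c_f, n, delta):
    """Penalty sqrt((c(f) log 2 + log(1/delta)) / (2n)) for a function
    with prefix codelength c_f bits, given n samples and confidence delta."""
    return math.sqrt((c_f * math.log(2) + math.log(1.0 / delta)) / (2 * n))

# Example: a function described by 20 bits, n = 1000 samples, delta = 0.05.
print(codelength_penalty(20, 1000, 0.05))  # ~0.092
```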

Application

Let $\mathcal{F}_1, \mathcal{F}_2, \ldots$ be a sequence of finite sets of candidate functions with $|\mathcal{F}_1| < |\mathcal{F}_2| < \cdots$. We can design prefix codes as follows. Use the codes 0, 10, 110, 1110, ... to encode the subscript $i$ in $\mathcal{F}_i$. For each class $\mathcal{F}_i$, construct a set of binary codewords of length $\lceil \log_2 |\mathcal{F}_i| \rceil$ to uniquely encode each function in $\mathcal{F}_i$. Then, encode any given function $f$ by first using the code for the smallest index $i$ such that $f \in \mathcal{F}_i$, followed by the length-$\lceil \log_2 |\mathcal{F}_i| \rceil$ codeword for $f$ within $\mathcal{F}_i$. This is a prefix code.
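A minimal sketch of the resulting codelength (the class sizes and the helper name here are illustrative assumptions, not part of the construction itself):

```python
import math

def codelength(i, class_sizes):
    """Bits needed for a function whose smallest containing class is F_i
    (1-indexed): i bits for the unary-style index code 0, 10, 110, ...,
    plus ceil(log2 |F_i|) bits to identify f within F_i."""
    return i + math.ceil(math.log2(class_sizes[i - 1]))

# Example: nested classes with |F_i| = 2**i (as with the histogram rules below).
sizes = [2**i for i in range(1, 11)]
print(codelength(3, sizes))  # 3 + 3 = 6 bits
```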

Histogram classifiers

$\mathcal{X} = [0,1]^d$, $\mathcal{Y} = \{0,1\}$. Let $\mathcal{F}_k$, $k = 1, 2, \ldots$, denote the collection of histogram classification rules with $k$ equal-volume bins. We can use the codebook above for the index $k$: encode $k$ with the $k$-bit codeword from the sequence 0, 10, 110, 1110, ... .

We follow this codeword with $k = \log_2 |\mathcal{F}_k|$ bits to indicate which of the $2^k$ possible histogram rules is under consideration. Thus for any $f \in \mathcal{F}_k$, for some $k \ge 1$, there is a prefix code of length

$$ c(f) = k + k = 2k \ \text{bits}. $$
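As a concrete illustration of this $2k$-bit code, here is a minimal sketch; representing a histogram rule by its vector of $k$ binary bin labels is our own simplification:

```python
def encode_histogram_rule(bin_labels):
    """Prefix-encode a k-bin histogram classifier given its k binary bin labels:
    a k-bit unary-style code for k (k-1 ones then a zero), then the k labels."""
    k = len(bin_labels)
    index_code = "1" * (k - 1) + "0"                  # k bits identifying F_k
    rule_code = "".join(str(b) for b in bin_labels)   # k bits identifying f in F_k
    return index_code + rule_code                     # total length 2k bits

print(encode_histogram_rule([1, 0, 1]))  # '110' + '101' -> '110101' (6 bits)
```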

It follows that for any $\delta > 0$, with probability at least $1 - \delta$, for all $f \in \bigcup_{k \ge 1} \mathcal{F}_k$,

$$ R(f) \;\le\; \hat{R}_n(f) + \sqrt{\frac{2 k_f \log 2 + \log \frac{1}{\delta}}{2n}} $$

where $k_f$ is the number of bins in the histogram corresponding to $f$. Contrast this with the bound we had for the class of $m$-bin histograms alone: with probability at least $1 - \delta$, for all $f \in \mathcal{F}_m$,

$$ R(f) \;\le\; \hat{R}_n(f) + \sqrt{\frac{m \log 2 + \log \frac{1}{\delta}}{2n}} \, . $$

Notice the bound for all histogram rules is almost as good as the bound for only the $m$-bin rules. That is, when $k_f = m$ the bounds are within a factor of 2. On the other hand, the new bound is a big improvement, since it holds simultaneously over all $k$ and therefore gives us a guide for selecting the number of bins.
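To make the bin-selection idea concrete, here is a minimal sketch: the empirical risks and sample size are made up for illustration, and the rule simply minimizes the penalized empirical risk over $k$.

```python
import math

def penalty(k, n, delta):
    """Codelength penalty for a k-bin histogram rule, using c(f) = 2k bits."""
    return math.sqrt((2 * k * math.log(2) + math.log(1.0 / delta)) / (2 * n))

def select_bins(empirical_risks, n, delta=0.05):
    """Pick k minimizing empirical risk plus penalty.
    empirical_risks[k-1] is the best empirical risk over F_k."""
    scores = {k: r + penalty(k, n, delta)
              for k, r in enumerate(empirical_risks, start=1)}
    return min(scores, key=scores.get)

# Hypothetical empirical risks for k = 1, ..., 6 with n = 500 samples.
print(select_bins([0.40, 0.25, 0.18, 0.17, 0.165, 0.163], n=500))  # -> 4
```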

Proof

Proof of the Kraft inequality

We will prove that for any binary prefix code, the codeword lengths $c_1, c_2, \ldots$ satisfy $\sum_{k \ge 1} 2^{-c_k} \le 1$. The converse is also easy to prove, but it is not central to our purposes here (for a proof, see Cover & Thomas '91). Consider a binary tree whose branches are labeled 0 and 1.

The sequence of bit values leading from the root to a leaf of the tree represents a codeword. The prefix condition implies that no codeword is a descendant of any other codeword in the tree. Let $c_{\max}$ be the length of the longest codeword (also the number of branches to the deepest leaf) in the tree.

Consider a leaf $i$ in the tree at level $c_i$. This leaf would have $2^{c_{\max} - c_i}$ descendants at level $c_{\max}$. Furthermore, for each leaf the set of possible descendants at level $c_{\max}$ is disjoint (since no codeword can be a prefix of another). Therefore, since the total number of possible leaves at level $c_{\max}$ is $2^{c_{\max}}$, we have

$$ \sum_{i \,\in\, \text{leaves}} 2^{c_{\max} - c_i} \;\le\; 2^{c_{\max}} \quad\Longrightarrow\quad \sum_{i \,\in\, \text{leaves}} 2^{-c_i} \;\le\; 1, $$

which proves the case when the number of codewords is finite.
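A quick numerical sanity check of the finite case (the example codewords are our own; the first set follows the 0, 10, 110, ... scheme used above):

```python
def kraft_sum(codewords):
    """Sum of 2^(-length) over a set of codewords; <= 1 for any prefix code."""
    return sum(2.0 ** -len(w) for w in codewords)

# The unary-style prefix code used for the class index above.
print(kraft_sum(["0", "10", "110", "1110"]))        # 0.9375 <= 1
# A complete prefix code attains equality.
print(kraft_sum(["00", "01", "10", "110", "111"]))  # 1.0
```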

Suppose now that we have a countably infinite number of codewords. Let $b_1 b_2 \ldots b_{c_i}$ be the $i$th codeword and let

$$ r_i = \sum_{j=1}^{c_i} b_j 2^{-j} $$

be the real number corresponding to the binary expansion of the codeword. We can associate the interval $[r_i, r_i + 2^{-c_i})$ with the $i$th codeword. This is the set of all real numbers whose binary expansion begins with $b_1 b_2 \ldots b_{c_i}$. Since this is a subinterval of $[0, 1]$, and all such subintervals corresponding to prefix codewords are disjoint, the sum of their lengths must be less than or equal to 1. This proves the case where the number of codewords is countably infinite.
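The interval argument can also be checked numerically; this is a minimal sketch with our own helper name, applied to the same example codewords as above:

```python
from fractions import Fraction

def codeword_interval(word):
    """Interval [r_i, r_i + 2^-c_i) of reals whose binary expansion begins with word."""
    r = sum(Fraction(int(b), 2 ** (j + 1)) for j, b in enumerate(word))
    return r, r + Fraction(1, 2 ** len(word))

# Intervals for a prefix code are disjoint subintervals of [0, 1),
# so the sum of their lengths (which equals the Kraft sum) is at most 1.
intervals = [codeword_interval(w) for w in ["0", "10", "110", "1110"]]
print(sum(b - a for a, b in intervals))  # 15/16
```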

Source: OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009. Download for free at http://cnx.org/content/col10532/1.3