Theorem

Consistency of histogram classifiers

If $M \to \infty$ and $n/M \to \infty$ as $n \to \infty$, then the histogram classifier risk converges to the Bayes risk for every distribution $P_{XY}$ with marginal density $p_X(x) \ge c$, for some constant $c > 0$. Actually, the result holds for every distribution $P_{XY}$; for the more general theorem, refer to Theorem 6.1 in A Probabilistic Theory of Pattern Recognition by Luc Devroye, László Györfi and Gábor Lugosi.

What the theorem tells us is that we need the number of partition cells to tend to infinity (to ensure that the bias tends to zero), but it cannot grow faster than the number of samples (i.e., we want the number of samples per cell to tend to infinity, to drive the variance to zero).
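
To make the roles of $n$ and $M$ concrete, here is a minimal sketch (not part of the original notes) of a histogram classifier for a one-dimensional feature on $[0,1]$ with $M$ equal-width bins; the function names and the simulated distribution are illustrative assumptions only.

```python
import numpy as np

def fit_histogram_classifier(X, Y, M):
    """Estimate eta(x) = P(Y=1 | X=x) on [0, 1] with M equal-width bins.

    Returns the per-bin estimates (# samples in bin j with Y = 1) / (# samples in bin j);
    empty bins default to 0.
    """
    bins = np.minimum((X * M).astype(int), M - 1)     # bin index Q(x) of each sample
    counts = np.bincount(bins, minlength=M)           # N_j: number of samples per bin
    ones = np.bincount(bins, weights=Y, minlength=M)  # B_j: number of label-1 samples per bin
    return np.divide(ones, counts, out=np.zeros(M), where=counts > 0)

def predict(eta_hat, x):
    """Plug-in rule: predict 1 when the estimated eta in x's bin is at least 1/2."""
    M = len(eta_hat)
    j = min(int(x * M), M - 1)
    return int(eta_hat[j] >= 0.5)

# Illustrative use: eta(x) = x, so the Bayes rule is 1{x >= 1/2}.
rng = np.random.default_rng(0)
n, M = 10_000, 50          # consistency requires M -> infinity and n/M -> infinity
X = rng.uniform(size=n)
Y = (rng.uniform(size=n) < X).astype(float)
eta_hat = fit_histogram_classifier(X, Y, M)
print(predict(eta_hat, 0.3), predict(eta_hat, 0.8))  # typically prints 0 1
```

Growing $M$ with $n$ so that $n/M$ still diverges mirrors the two conditions in the theorem: many cells for low bias, many samples per cell for low variance.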

Let $P_j \equiv \dfrac{\int_{Q_j} \eta(x)\, p_X(x)\, dx}{\int_{Q_j} p_X(x)\, dx}$ (the theoretical analog of $\hat{P}_j$) and define

$$\bar{\eta}(x) = \sum_{j=1}^{M} P_j\, \mathbf{1}\{x \in Q_j\}$$

The function $\bar{\eta}$ is the theoretical analog of $\hat{\eta}_n$ (i.e., the function obtained by averaging $\eta$ over the partition cells). By the triangle inequality,

$$E\big|\hat{\eta}_n(X) - \eta(X)\big| \;\le\; \underbrace{E\big[|\hat{\eta}_n(X) - \bar{\eta}(X)|\big]}_{\text{estimation error}} \;+\; \underbrace{E\big[|\bar{\eta}(X) - \eta(X)|\big]}_{\text{approximation error}}$$

Let's first bound the estimation error. For any $x \in [0,1]^d$, let $Q(x)$ denote the histogram bin in which $x$ falls. Define the random variable

$$N(x) = \sum_{i=1}^{n} \mathbf{1}\{X_i \in Q(x)\}$$

If $Q(x) = Q_j$, then this random variable is simply $n\hat{P}_j$. Note that

$$\hat{\eta}_n(x) = \frac{1}{N(x)}\, B(x)$$

where $B(x) = \sum_{i=1}^{n} \mathbf{1}\{X_i \in Q(x),\, Y_i = 1\} = \sum_{i:\, X_i \in Q(x)} Y_i$; that is, $B(x)$ is simply the number of samples in cell $Q(x)$ labelled 1. Now $\hat{\eta}_n(x)$ is a fairly complicated random variable, but the conditional distribution of $B(x)$ given $N(x)$ is relatively simple. Note that

$$B(x) \,\big|\, N(x) = k \;\sim\; \text{Binomial}\big(k,\, \bar{\eta}(x)\big)$$

since $\bar{\eta}(x)$ is the probability that a sample in $Q(x)$ has the label 1, and we are conditioning on the event of observing $k$ samples in $Q(x)$.
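
To spell this out (a short justification added here for completeness): given that a sample lands in $Q(x)$, its label equals 1 with probability

$$P\big(Y = 1 \,\big|\, X \in Q(x)\big) \;=\; \frac{\int_{Q(x)} \eta(x')\, p_X(x')\, dx'}{\int_{Q(x)} p_X(x')\, dx'} \;=\; \bar{\eta}(x),$$

and the labels of the $k$ samples falling in $Q(x)$ are conditionally independent, so their sum $B(x)$ is $\text{Binomial}(k, \bar{\eta}(x))$.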

Now consider the conditional expectation

$$E\Big[\big|\hat{\eta}_n(x) - \bar{\eta}(x)\big| \;\Big|\; N(x) = k\Big] \;\le\;
\begin{cases}
E\Big[\big|\tfrac{B(x)}{N(x)} - \bar{\eta}(x)\big| \;\Big|\; N(x) = k\Big], & k > 0 \\[4pt]
1, & k = 0 \quad (\text{since } 0 \le \bar{\eta}(x) \le 1)
\end{cases}$$

Next note that

$$E\Big[\Big|\tfrac{B(x)}{N(x)} - \bar{\eta}(x)\Big| \;\Big|\; N(x) = k\Big]
= E\Big[\Big|\tfrac{B(x)}{k} - \bar{\eta}(x)\Big| \;\Big|\; N(x) = k\Big]
= E\Big[\tfrac{1}{k}\,\big|B(x) - \underbrace{k\,\bar{\eta}(x)}_{E[B(x)]}\big| \;\Big|\; N(x) = k\Big]
\;\le\; \frac{1}{k}\Big(\underbrace{E\big[\,|B(x) - k\,\bar{\eta}(x)|^2 \;\big|\; N(x) = k\,\big]}_{\text{conditional variance of } B(x)}\Big)^{1/2}$$

by Jensen's inequality, $E[|Z|] \le \big(E[|Z|^2]\big)^{1/2}$.
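
For completeness (this step is not spelled out above), the inequality follows by applying Jensen's inequality to the concave function $\sqrt{\cdot}$:

$$E[|Z|] \;=\; E\big[\sqrt{Z^2}\big] \;\le\; \sqrt{E[Z^2]} \;=\; \big(E[|Z|^2]\big)^{1/2},$$

since $E[\varphi(W)] \le \varphi(E[W])$ for any concave function $\varphi$.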

Therefore,

$$E\Big[\Big|\tfrac{B(x)}{N(x)} - \bar{\eta}(x)\Big| \;\Big|\; N(x) = k\Big] \;\le\; \frac{1}{k}\,\big(k\,\bar{\eta}(x)(1 - \bar{\eta}(x))\big)^{1/2} \;=\; \sqrt{\frac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{k}}$$

since the conditional variance of $B(x)$, a $\text{Binomial}(k, \bar{\eta}(x))$ random variable, is $k\,\bar{\eta}(x)(1 - \bar{\eta}(x))$,

and

$$E\Big[\big|\hat{\eta}_n(x) - \bar{\eta}(x)\big| \;\Big|\; N(x) = k\Big] \;\le\;
\begin{cases}
\sqrt{\dfrac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{k}}, & k > 0 \\[6pt]
1, & k = 0
\end{cases}$$

or in other words,

$$E\Big[\big|\hat{\eta}_n(x) - \bar{\eta}(x)\big| \;\Big|\; N(x)\Big] \;\le\; \sqrt{\frac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{N(x)}}\;\mathbf{1}\{N(x) > 0\} \;+\; \mathbf{1}\{N(x) = 0\}$$

Now taking the expectation with respect to $N(x)$, and using the fact that $\bar{\eta}(x)(1 - \bar{\eta}(x)) \le 1/4$,

$$E_N\Big[E\big[\,|\hat{\eta}_n(x) - \bar{\eta}(x)| \;\big|\; N(x) = k\,\big]\Big] \;\le\; E_N\left[\sqrt{\frac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{N(x)}}\;\mathbf{1}\{N(x) > 0\}\right] + P(N(x) = 0)$$

$$\le\; E\left[\frac{1}{2\sqrt{N(x)}}\,\mathbf{1}\{N(x) > 0\}\right] + P(N(x) = 0) \;\le\; \frac{1}{2}\,P(N(x) \le k) \;+\; \frac{1}{2\sqrt{k}}\,\underbrace{P(N(x) > k)}_{\le\, 1} \;+\; P(N(x) = 0)$$

for any fixed $k > 0$, where the last step splits the expectation according to whether $N(x) \le k$ or $N(x) > k$.

Now a key fact is that for any $k > 0$, $P(N(x) \le k) \to 0$ as $n \to \infty$. This follows from the assumption that the marginal density satisfies $p_X(x) \ge c$, for some constant $c > 0$, and that $n/M \to \infty$ as $n \to \infty$. The result is easily verified by contradiction: if $P(N(x) \le k) \to q > 0$ as $n \to \infty$, then the assumption $p_X(x) \ge c > 0$ would be contradicted (a quantitative version of this argument is sketched below, after the estimation error bound is completed). Thus, for any $\epsilon > 0$ there exists a $k > 0$ such that $\frac{1}{2\sqrt{k}} < \epsilon$ and, for $n$ sufficiently large, $P(N(x) \le k) < \epsilon$. Therefore, for $n$ sufficiently large and every $x \in [0,1]^d$,

$$E\big|\hat{\eta}_n(x) - \bar{\eta}(x)\big| \;<\; 3\epsilon$$

where the expectation is with respect to the distribution of the sample $\{X_i, Y_i\}_{i=1}^{n}$. Thus,

$$E\big|\hat{\eta}_n(X) - \bar{\eta}(X)\big| \;<\; 3\epsilon$$

where the expectation is now with respect to the distribution of the sample and the marginal distribution of $X$.
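
Returning to the key fact used above, that $P(N(x) \le k) \to 0$ for any fixed $k$, here is one way to make it quantitative (a sketch added here, assuming the $M$ cells are congruent cubes of volume $1/M$). Since $p_X(x) \ge c$, the count $N(x)$ is $\text{Binomial}(n, p)$ with $p = P(X \in Q(x)) \ge c/M$, so $E[N(x)] = np \ge cn/M \to \infty$, and by Chebyshev's inequality, once $np > k$,

$$P\big(N(x) \le k\big) \;\le\; P\big(|N(x) - np| \ge np - k\big) \;\le\; \frac{np(1 - p)}{(np - k)^2} \;\longrightarrow\; 0 \quad \text{as } n \to \infty.$$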

Next consider the approximation error $E|\bar{\eta}(X) - \eta(X)|$, where the expectation is over $X$ alone. The function $\eta$ may not itself be continuous, but there is another function $\eta_\epsilon$ that is uniformly continuous and such that $E|\eta_\epsilon(X) - \eta(X)| < \epsilon$ (such an $\eta_\epsilon$ exists because continuous functions are dense in $L^1(P_X)$, and a continuous function on the compact set $[0,1]^d$ is automatically uniformly continuous). Recall that uniformly continuous functions can be well approximated by piecewise constant functions.

By the triangle inequality,

$$E\big|\bar{\eta} - \eta\big| \;\le\; \underbrace{E\big|\bar{\eta} - \bar{\eta}_\epsilon\big|}_{\le\,\epsilon} \;+\; E\big|\bar{\eta}_\epsilon - \eta_\epsilon\big| \;+\; \underbrace{E\big|\eta_\epsilon - \eta\big|}_{\le\,\epsilon \text{ by design}}$$

where $\bar{\eta}_\epsilon(x) = \sum_{j=1}^{M} \dfrac{\int_{Q_j} \eta_\epsilon(x')\, p_X(x')\, dx'}{\int_{Q_j} p_X(x')\, dx'}\; \mathbf{1}\{x \in Q_j\}$, i.e., the analog of $\bar{\eta}$ with $\eta$ replaced by $\eta_\epsilon$. The first term is bounded as

$$E\big|\bar{\eta}(X) - \bar{\eta}_\epsilon(X)\big| \;\le\; \sum_{j=1}^{M} \int_{Q_j} \big|\eta(x) - \eta_\epsilon(x)\big|\, p_X(x)\, dx \;=\; E\big|\eta(X) - \eta_\epsilon(X)\big| \;\le\; \epsilon$$

and since $\eta_\epsilon$ is uniformly continuous,

$$E\big|\bar{\eta}_\epsilon(X) - \eta_\epsilon(X)\big| \;=\; \sum_{j=1}^{M} \int_{Q_j} \big|\bar{\eta}_\epsilon(x) - \eta_\epsilon(x)\big|\, p_X(x)\, dx \;\le\; \sum_{j=1}^{M} \delta\, P(X \in Q_j) \;=\; \delta,$$

where $\delta$ depends on $M$, since $\sum_{j=1}^{M} P(X \in Q_j) = 1$.

By taking $M$ sufficiently large, $\delta$ can be made arbitrarily small. So for $M$ large enough, $\delta \le \epsilon$.
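
To see how $\delta$ depends on $M$ (a sketch added here, assuming the cells $Q_j$ are cubes of side length $M^{-1/d}$): uniform continuity of $\eta_\epsilon$ means that for every $\delta > 0$ there is a $\tau > 0$ such that $|\eta_\epsilon(x) - \eta_\epsilon(x')| \le \delta$ whenever $\|x - x'\| \le \tau$. Each cell has diameter $\sqrt{d}\, M^{-1/d}$, which tends to $0$ as $M \to \infty$, and $\bar{\eta}_\epsilon(x)$ is a weighted average of $\eta_\epsilon$ over the cell containing $x$, so

$$\big|\bar{\eta}_\epsilon(x) - \eta_\epsilon(x)\big| \;\le\; \sup_{x' \in Q(x)} \big|\eta_\epsilon(x') - \eta_\epsilon(x)\big| \;\le\; \delta \quad \text{once } \sqrt{d}\, M^{-1/d} \le \tau.$$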

Thus, we have shown

$$E\big|\bar{\eta}(X) - \eta(X)\big| \;<\; 3\epsilon$$

for $M$ sufficiently large. Since $\epsilon > 0$ was arbitrary, we have shown that taking

$$\hat{f}_n(x) = \begin{cases} 1, & \hat{\eta}_n(x) \ge 1/2 \\ 0, & \text{otherwise} \end{cases}$$

satisfies

$$P\big(\hat{f}_n(X) \ne Y\big) - P\big(f^*(X) \ne Y\big) \;\le\; 2\, E\big|\hat{\eta}_n(X) - \eta(X)\big| \;\longrightarrow\; 0$$

if

$M \to \infty$ and $n/M \to \infty$ as $n \to \infty$. Here $P(\hat{f}_n(X) \ne Y) = E\big[\mathbf{1}\{\hat{f}_n(X) \ne Y\}\big]$ is the expected risk of $\hat{f}_n$, with the expectation taken over the distributions of $(X, Y)$ and $\{X_i, Y_i\}_{i=1}^{n}$.
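
The inequality $P(\hat{f}_n(X) \ne Y) - P(f^*(X) \ne Y) \le 2\,E|\hat{\eta}_n(X) - \eta(X)|$ is the standard excess-risk bound for plug-in classifiers; a short sketch of the argument (added here) is as follows. Conditioning on the training sample and on $X = x$,

$$P\big(\hat{f}_n(x) \ne Y \mid X = x\big) - P\big(f^*(x) \ne Y \mid X = x\big) \;=\; |2\eta(x) - 1|\; \mathbf{1}\{\hat{f}_n(x) \ne f^*(x)\},$$

and whenever $\hat{f}_n(x) \ne f^*(x)$, the value $1/2$ lies between $\eta(x)$ and $\hat{\eta}_n(x)$, so $|\eta(x) - 1/2| \le |\hat{\eta}_n(x) - \eta(x)|$. Hence the right-hand side is at most $2\,|\hat{\eta}_n(x) - \eta(x)|$, and taking expectations gives the bound.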

Source: OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009. Download for free at http://cnx.org/content/col10532/1.3