
The Vapnik-Chervonenkis inequality

The VC inequality is a powerful generalization of the bounds we obtained for the hyperplane classifier in the previous lecture. The basic idea of the proof is quite similar. Before stating the inequality, we need to introduce the concepts of shatter coefficients and VC dimension.

Shatter coefficients

Let $\mathcal{A}$ be a collection of subsets of $\mathbb{R}^d$. Definition: the $n$th shatter coefficient of $\mathcal{A}$ is defined by

$$S_{\mathcal{A}}(n) = \max_{x_1, \dots, x_n \in \mathbb{R}^d} \left| \left\{ \{x_1, \dots, x_n\} \cap A \,:\, A \in \mathcal{A} \right\} \right| .$$

The shatter coefficients are a measure of the richness of the collection $\mathcal{A}$. $S_{\mathcal{A}}(n)$ is the largest number of different subsets of a set of $n$ points that can be generated by intersecting the set with elements of $\mathcal{A}$.

In 1-d, let $\mathcal{A} = \{(-\infty, t] : t \in \mathbb{R}\}$. Taking the points to be distinct and ordered as $x_1 < x_2 < \cdots < x_n$, the possible subsets of $\{x_1, \dots, x_n\}$ generated by intersecting with sets of the form $(-\infty, t]$ are $\{x_1, \dots, x_n\}, \{x_1, \dots, x_{n-1}\}, \dots, \{x_1\}, \emptyset$. Hence $S_{\mathcal{A}}(n) = n + 1$.
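To make the 1-d example concrete, here is a minimal Python sketch (my own illustration, not part of the original notes; the function name `halfline_shatter_coefficient` is just a placeholder) that enumerates the distinct subsets produced by half-lines $(-\infty, t]$ and confirms there are $n + 1$ of them.

```python
def halfline_shatter_coefficient(points):
    """Count the distinct subsets {x in points : x <= t} over all thresholds t."""
    xs = sorted(set(points))
    # Only thresholds at the points themselves (plus one below the minimum,
    # which yields the empty set) produce distinct intersections.
    candidates = [xs[0] - 1.0] + xs
    subsets = {frozenset(x for x in points if x <= t) for t in candidates}
    return len(subsets)

points = [0.3, 1.7, 2.2, 5.0]
print(halfline_shatter_coefficient(points))  # 5, i.e. n + 1 for n = 4 distinct points
```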

In 2-d, let $\mathcal{A} = \{\text{all rectangles in } \mathbb{R}^2\}$.

Consider a set $\{x_1, x_2, x_3, x_4\}$ of training points. If we arrange the four points at the corners of a diamond shape, it is easy to see that we can find a rectangle in $\mathbb{R}^2$ whose intersection with the set is any desired subset of the four points, i.e., $S_{\mathcal{A}}(4) = 2^4 = 16$.

Clearly, $S_{\mathcal{A}}(n) = 2^n$ for $n = 1, 2, 3$ as well.

However, for $n = 5$, $S_{\mathcal{A}}(n) < 2^5$. This is because, out of any five points, we can always select four of them such that the smallest rectangle containing those four also contains the fifth point. Consequently, we cannot find a rectangle that contains the four outer points but does not contain the inner point.
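The rectangle argument can also be checked mechanically. Below is a brute-force sketch (my own, with hypothetical helper names) that tests whether axis-aligned rectangles shatter a given 2-d point set: for each subset it suffices to check whether the subset's bounding box avoids all remaining points, since any rectangle containing the subset must contain its bounding box.

```python
from itertools import chain, combinations

def rectangles_shatter(points):
    """Return True if axis-aligned rectangles can pick out every subset of `points`."""
    subsets = chain.from_iterable(combinations(points, r) for r in range(1, len(points) + 1))
    for subset in subsets:
        # The smallest rectangle containing the subset is its bounding box.
        xmin = min(p[0] for p in subset); xmax = max(p[0] for p in subset)
        ymin = min(p[1] for p in subset); ymax = max(p[1] for p in subset)
        others = [p for p in points if p not in subset]
        if any(xmin <= p[0] <= xmax and ymin <= p[1] <= ymax for p in others):
            return False  # every rectangle covering the subset also grabs an extra point
    return True  # (the empty set is always realizable by a small rectangle far away)

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # four points in a diamond: shattered
five = diamond + [(0, 0)]                       # the centre point spoils the shattering
print(rectangles_shatter(diamond), rectangles_shatter(five))  # True False
```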

Note that $S_{\mathcal{A}}(n) \le 2^n$.

If $\left| \left\{ \{x_1, \dots, x_n\} \cap A \,:\, A \in \mathcal{A} \right\} \right| = 2^n$, then we say that $\mathcal{A}$ shatters $\{x_1, \dots, x_n\}$.

VC dimension

The VC dimension $V_{\mathcal{A}}$ of a collection of sets $\mathcal{A}$ is defined as the largest integer $n$ such that $S_{\mathcal{A}}(n) = 2^n$.

Sauer's lemma:

Let $\mathcal{A}$ be a collection of sets with VC dimension $V_{\mathcal{A}} < \infty$. Then for all $n$, $S_{\mathcal{A}}(n) \le \sum_{i=0}^{V_{\mathcal{A}}} \binom{n}{i}$; in particular, $S_{\mathcal{A}}(n) \le (n+1)^{V_{\mathcal{A}}}$ for all $n$.
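To see how much Sauer's lemma buys, the short sketch below (an illustration of mine, not from the notes) compares $2^n$ with the binomial-sum bound and the looser $(n+1)^{V_{\mathcal{A}}}$ bound for a class of VC dimension 3.

```python
from math import comb

def sauer_bound(n, vc_dim):
    """Sauer's lemma: S_A(n) <= sum_{i=0}^{V_A} C(n, i)."""
    return sum(comb(n, i) for i in range(vc_dim + 1))

V = 3
for n in [5, 10, 50, 100]:
    print(n, 2 ** n, sauer_bound(n, V), (n + 1) ** V)
# Once n exceeds V, the shatter coefficient grows only polynomially in n,
# far more slowly than 2^n.
```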

VC dimension and classifiers

Let $\mathcal{F}$ be a collection of classifiers of the form $f : \mathbb{R}^d \to \{0, 1\}$. Define $\mathcal{A} = \left\{ \{x : f(x) = 1\} \times \{0\} \,\cup\, \{x : f(x) = 0\} \times \{1\} \,:\, f \in \mathcal{F} \right\}$. In words, this is the collection of subsets of $\mathcal{X} \times \mathcal{Y}$ on which an $f \in \mathcal{F}$ maps the feature $x$ to a label opposite of $y$. The size of $\mathcal{A}$ expresses the richness of $\mathcal{F}$: the larger $\mathcal{A}$ is, the more likely it is that there exists an $f \in \mathcal{F}$ for which $R(f) = P(f(X) \ne Y)$ is close to the Bayes risk $R^* = P(f^*(X) \ne Y)$, where $f^*$ is the Bayes classifier. The $n$th shatter coefficient of $\mathcal{F}$ is defined as $S_{\mathcal{F}}(n) = S_{\mathcal{A}}(n)$, and the VC dimension of $\mathcal{F}$ is defined as $V_{\mathcal{F}} = V_{\mathcal{A}}$.

Linear (hyperplane) classifiers in $\mathbb{R}^d$

Consider $d = 2$, let $n$ be the number of training points, and let $\mathcal{A}$ be as above. When $n = 1$, using linear classifiers in $\mathbb{R}^2$ it is easy to see that we can assign 1 to each of the possible subsets $\{x_1\}$ and $\emptyset$ and 0 to their complements. Hence $S_{\mathcal{F}}(1) = 2$.

When $n = 2$, we can also assign 1 to each of the possible subsets $\{x_1, x_2\}, \{x_1\}, \{x_2\}, \emptyset$ and 0 to their complements (and vice versa). Hence $S_{\mathcal{F}}(2) = 4 = 2^2$.

When $n = 3$, we can arrange the points $x_1, x_2, x_3$ (non-collinear) so that the set of linear classifiers shatters the three points; hence $S_{\mathcal{F}}(3) = 8 = 2^3$.

When $n = 4$, no matter where the points $x_1, x_2, x_3, x_4$ are placed and which binary values $y_1, y_2, y_3, y_4$ are designated, $\mathcal{A}$ does not shatter the four points. To see the claim, first observe that the four points must form a quadrilateral (if all four points, or any three of them, are collinear, then clearly linear classifiers cannot shatter the points). The two pairs of points lying on opposite diagonals of the quadrilateral form two groups, and no linear classifier can assign the label 1 to one group and the label 0 to the other. Hence $S_{\mathcal{F}}(4) < 16 = 2^4$ and $V_{\mathcal{F}} = 3$.

We state here, without proof, that in general the class of linear classifiers in $\mathbb{R}^d$ has $V_{\mathcal{F}} = d + 1$.
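The claims for $d = 2$ can also be verified computationally. The sketch below (my own check, not part of the lecture; labels are encoded as $\pm 1$ so that separability becomes a margin condition) tests every labelling of a point set for linear separability via a linear-programming feasibility problem, confirming that three non-collinear points are shattered while four points are not.

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasibility of y_i * (w . x_i + b) >= 1 for all i (equivalent to strict separation)."""
    n, d = X.shape
    # Variables are (w, b); constraints: -y_i * (w . x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered_by_hyperplanes(X):
    return all(linearly_separable(X, np.array(labels))
               for labels in product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # non-collinear
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # the XOR labelling fails
print(shattered_by_hyperplanes(three), shattered_by_hyperplanes(four))  # True False
```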

The VC inequality

Let $X_1, \dots, X_n$ be i.i.d. $\mathbb{R}^d$-valued random variables. Denote the common distribution of $X_i$, $1 \le i \le n$, by $\mu(A) = P(X_1 \in A)$ for any subset $A \subseteq \mathbb{R}^d$. Similarly, define the empirical distribution $\mu_n(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \in A\}}$.

Theorem

(VC '71)

For any probability measure $\mu$ and collection of subsets $\mathcal{A}$, and for any $\epsilon > 0$,

$$P\left( \sup_{A \in \mathcal{A}} \left| \mu_n(A) - \mu(A) \right| > \epsilon \right) \le 8\, S_{\mathcal{A}}(n)\, e^{-n\epsilon^2/32}$$

and

$$E\left[ \sup_{A \in \mathcal{A}} \left| \mu_n(A) - \mu(A) \right| \right] \le 2 \sqrt{\frac{\log\!\big(2\, S_{\mathcal{A}}(n)\big)}{n}} .$$
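As a rough numerical illustration (my addition, plugging the Sauer estimate $S_{\mathcal{A}}(n) \le (n+1)^{V_{\mathcal{A}}}$ into the two bounds), the sketch below shows how the tail bound and the expectation bound behave as $n$ grows for a class with VC dimension 3.

```python
from math import exp, log, sqrt

def vc_tail_bound(n, V, eps):
    """8 * S_A(n) * exp(-n * eps^2 / 32), with S_A(n) bounded by (n + 1)^V."""
    return 8 * (n + 1) ** V * exp(-n * eps ** 2 / 32)

def vc_expectation_bound(n, V):
    """2 * sqrt(log(2 * S_A(n)) / n), with S_A(n) bounded by (n + 1)^V."""
    return 2 * sqrt((log(2) + V * log(n + 1)) / n)

V = 3
for n in [10**3, 10**4, 10**5, 10**6]:
    print(n, vc_tail_bound(n, V, eps=0.1), vc_expectation_bound(n, V))
# The tail bound is vacuous for small n but collapses once n * eps^2 dominates the
# log of the shatter coefficient; the expectation bound decays like sqrt(V log n / n).
```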

Before giving a proof of the theorem, we present a corollary.

Corollary

Let $\mathcal{F}$ be a collection of classifiers of the form $f : \mathbb{R}^d \to \{0, 1\}$ with VC dimension $V_{\mathcal{F}} < \infty$. Let $R(f) = P(f(X) \ne Y)$ and $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}}$, where $(X_i, Y_i)$, $1 \le i \le n$, are i.i.d. with joint distribution $P_{XY}$.

Define

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f) .$$

Then

$$E[R(\hat{f}_n)] - \inf_{f \in \mathcal{F}} R(f) \le 4 \sqrt{\frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n}} .$$
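For concreteness, here is a toy sketch (my own, not from the notes) of the empirical risk minimizer $\hat{f}_n$ for a simple class: 1-d threshold classifiers $f_t(x) = \mathbf{1}_{\{x \ge t\}}$ with $t$ on a grid; all data-generating choices below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 1, size=n)
Y = (X >= 0.4).astype(int)                      # labels from a true threshold at 0.4 ...
Y ^= (rng.uniform(size=n) < 0.1).astype(int)    # ... with 10% of the labels flipped

thresholds = np.linspace(0, 1, 101)             # F = {f_t : t on a grid of 101 values}
emp_risk = [np.mean((X >= t).astype(int) != Y) for t in thresholds]
t_hat = thresholds[int(np.argmin(emp_risk))]
print("ERM threshold:", t_hat, " empirical risk:", min(emp_risk))
```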

To prove the corollary, let $\mathcal{A} = \left\{ \{x : f(x) = 1\} \times \{0\} \,\cup\, \{x : f(x) = 0\} \times \{1\} \,:\, f \in \mathcal{F} \right\}$.

Note that

$$P(f(X) \ne Y) = P((X, Y) \in A) := \mu(A)$$

where $A = \{x : f(x) = 1\} \times \{0\} \,\cup\, \{x : f(x) = 0\} \times \{1\}$.

Similarly,

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{(X_i, Y_i) \in A\}} := \mu_n(A) .$$

Therefore, according to the VC theorem,

$$E\left[ \sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| \right] = E\left[ \sup_{A \in \mathcal{A}} \left| \mu_n(A) - \mu(A) \right| \right] \le 2 \sqrt{\frac{\log\!\big(2\, S_{\mathcal{A}}(n)\big)}{n}} = 2 \sqrt{\frac{\log\!\big(2\, S_{\mathcal{F}}(n)\big)}{n}} .$$

Since $V_{\mathcal{F}} < \infty$, $S_{\mathcal{F}}(n) \le (n+1)^{V_{\mathcal{F}}}$ and

$$E\left[ \sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| \right] \le 2 \sqrt{\frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n}} .$$

Next, note that

$$\begin{aligned}
R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f)
&= R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \hat{R}_n(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \\
&= R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \sup_{f \in \mathcal{F}} \big( \hat{R}_n(\hat{f}_n) - R(f) \big) \\
&\le R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \sup_{f \in \mathcal{F}} \big( \hat{R}_n(f) - R(f) \big) \\
&\le 2 \sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| ,
\end{aligned}$$

where the third line uses the fact that $\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f)$ for every $f \in \mathcal{F}$, by the definition of $\hat{f}_n$.

Therefore,

$$E\left[ R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \right] \le 2\, E\left[ \sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| \right] \le 4 \sqrt{\frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n}} .$$
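As a closing numerical note (my addition), the corollary's bound is easy to evaluate; the sketch below does so for linear classifiers in $\mathbb{R}^d$, using $V_{\mathcal{F}} = d + 1$.

```python
from math import log, sqrt

def excess_risk_bound(n, d):
    """4 * sqrt((V_F * log(n + 1) + log 2) / n) with V_F = d + 1 (linear classifiers)."""
    V = d + 1
    return 4 * sqrt((V * log(n + 1) + log(2)) / n)

for n in [10**3, 10**4, 10**5, 10**6]:
    print(n, round(excess_risk_bound(n, d=10), 4))
# The expected excess risk of empirical risk minimization decays like sqrt(d log n / n).
```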

Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3