
Complexity regularization in regression

Recall the classification problem. In Lecture 6, where we assumed that $\min_{f \in \mathcal{F}} R(f) = 0$, we obtained the PAC bound

$$P\left\{ R(\hat{f}_n) > \epsilon \right\} \le |\mathcal{F}| \, e^{-n\epsilon}.$$

From Corollary 1 in Lecture 6,

$$\mathbb{E}[R(\hat{f}_n)] \le \frac{1 + \log|\mathcal{F}|}{n}.$$

In Lectures 7 and 8, we dropped the assumption that $\min_{f \in \mathcal{F}} R(f) = 0$ and obtained

$$P\left\{ R(\hat{f}_n) > \epsilon \right\} \le |\mathcal{F}| \, e^{-2n\epsilon^2}.$$

This led to

$$\mathbb{E}\left[ R(\hat{f}_n) - \min_{f \in \mathcal{F}} R(f) \right] \le \sqrt{\frac{\log|\mathcal{F}| + \log n + 2}{n}}.$$
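As a quick numerical illustration (not part of the original notes), the following sketch evaluates the two expectation bounds above for a hypothetical model class of size $|\mathcal{F}| = 1000$ and a range of sample sizes, showing the $n^{-1}$ decay in the zero-risk case versus the slower $n^{-1/2}$ decay in the agnostic case.

```python
import numpy as np

# Hypothetical illustration: evaluate both expectation bounds for |F| = 1000.
card_F = 1000
for n in [100, 1_000, 10_000, 100_000]:
    realizable = (1 + np.log(card_F)) / n                     # O(1/n) rate
    agnostic = np.sqrt((np.log(card_F) + np.log(n) + 2) / n)  # O(n^{-1/2}) rate
    print(f"n = {n:7d}:  realizable bound {realizable:.5f},  agnostic bound {agnostic:.5f}")
```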

Hoeffding's inequality was central to our analysis of learning under bounded loss functions. In many regression and signal estimation problems it is natural to consider squared error loss functions (rather than $0/1$ or absolute error). In such cases, we will need to derive bounds using different techniques.

To illustrate the distinction between classification and regression, consider a simple, scalar signal plus noise problem. Consider $Y_i = \theta + W_i$, $i = 1, \ldots, n$, where $\theta$ is a fixed unknown scalar parameter and the $W_i$ are independent, zero-mean, unit-variance random variables. Let $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$. Then, according to the Central Limit Theorem, $\bar{Y}$ is distributed approximately $N(\theta, 1/n)$. A simple tail bound on the Gaussian distribution gives us

$$P(\bar{Y} - \theta > \epsilon) = P\left(\frac{1}{n}\sum_{i=1}^n W_i > \epsilon\right) \le \frac{1}{2} e^{-n\epsilon^2/2},$$

which implies that

$$P(|\bar{Y} - \theta|^2 > \epsilon) \le e^{-n\epsilon/2}.$$

This is a bound on the deviations of the squared error $\mathrm{err}^2 = |\bar{Y} - \theta|^2$. Notice that the exponential decay rate is a function of $\epsilon$ rather than $\epsilon^2$, as in Hoeffding's inequality. The squared error concentration inequality implies that $\mathbb{E}[|\bar{Y} - \theta|^2] = O\left(\frac{1}{n}\right)$: simply write $\mathbb{E}[\mathrm{err}^2] = \int_0^\infty P(\mathrm{err}^2 > t)\, dt \le \int_0^\infty e^{-nt/2}\, dt = \frac{2}{n}$. Therefore, in regression with a squared error loss, we can hope to get a rate of convergence as fast as $n^{-1}$ instead of $n^{-1/2}$. The reason is simply that we are using a squared error loss instead of the $0/1$ or absolute error loss.
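To make this concrete, here is a small Monte Carlo sketch (an illustration under the assumption of Gaussian noise; the values of $\theta$, $\epsilon$, and the sample sizes are arbitrary choices, not from the notes) that estimates $\mathbb{E}[|\bar{Y} - \theta|^2]$ and checks the tail bound above.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, trials, eps = 2.0, 200_000, 0.05

for n in [10, 100, 1000]:
    # With Gaussian noise, Ybar - theta is exactly N(0, 1/n), so sample it directly.
    err = rng.standard_normal(trials) / np.sqrt(n)
    sq_err = err ** 2
    print(f"n = {n:4d}:  E[err^2] ~ {sq_err.mean():.5f}  (1/n = {1/n:.5f}),  "
          f"P(err^2 > eps) ~ {(sq_err > eps).mean():.4f}  <=  {np.exp(-n * eps / 2):.2e}")
```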

To begin our investigation into regression and function estimation, let us consider the following setting. Let $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}$. Take $\mathcal{F}$ to be a collection of functions $f: \mathbb{R}^d \to \mathbb{R}$. We have training data $\{X_i, Y_i\}_{i=1}^n \stackrel{i.i.d.}{\sim} P_{XY}$. As our loss function, we take the squared error, i.e.,

$$\ell(f(X_i), Y_i) = (f(X_i) - Y_i)^2.$$

The risk is then the mean squared error (MSE):

$$R(f) = \mathbb{E}[(f(X) - Y)^2].$$

We know that the function $f^*$ that minimizes the MSE is just the conditional expectation of $Y$ given $X$:

$$f^*(x) = \mathbb{E}[Y \mid X = x].$$
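As an aside, it is easy to verify numerically that the conditional mean has the smallest MSE. The sketch below uses a hypothetical toy model, $X$ uniform on $[0, 2\pi]$ and $Y = \sin(X)$ plus Gaussian noise (none of these choices appear in the notes), and compares the estimated risk of $f^*$ with that of a shifted candidate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical toy model: f*(x) = sin(x), noise standard deviation 0.5.
x = rng.uniform(0.0, 2.0 * np.pi, n)
y = np.sin(x) + 0.5 * rng.standard_normal(n)

def risk(f):
    # Monte Carlo estimate of R(f) = E[(f(X) - Y)^2]
    return np.mean((f(x) - y) ** 2)

print("R(f*)      ~", risk(np.sin))                     # about 0.25, the noise variance
print("R(f* + .3) ~", risk(lambda t: np.sin(t) + 0.3))  # larger, by about 0.3^2 = 0.09
```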

Now let $R^* = R(f^*)$. We would like to select an $\hat{f}_n \in \mathcal{F}$ using the training data $\{X_i, Y_i\}_{i=1}^n$ such that the excess risk

$$\mathbb{E}[R(\hat{f}_n)] - R^* \ge 0$$

is small. Let's consider the difference between the empirical risks:

$$\hat{R}(f) - \hat{R}(f^*) = \frac{1}{n}\sum_{i=1}^n (f(X_i) - Y_i)^2 - \frac{1}{n}\sum_{i=1}^n (f^*(X_i) - Y_i)^2.$$

Note that $\mathbb{E}[\hat{R}(f) - \hat{R}(f^*)] = R(f) - R(f^*)$. Hence, by the Strong Law of Large Numbers (SLLN), we know that

$$\hat{R}(f) - \hat{R}(f^*) \to R(f) - R(f^*)$$

as $n \to \infty$. But how fast is this convergence?
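The following sketch (reusing the hypothetical toy model from above, with candidate $f = f^* + 0.3$, so the true excess risk is $0.3^2 = 0.09$) shows the empirical difference $\hat{R}(f) - \hat{R}(f^*)$ settling around $R(f) - R(f^*)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)

fstar = np.sin                   # hypothetical true regression function
f = lambda t: np.sin(t) + 0.3    # candidate with true excess risk 0.3^2 = 0.09

for n in [100, 1_000, 10_000, 100_000]:
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    y = fstar(x) + 0.5 * rng.standard_normal(n)
    # empirical excess risk: R_hat(f) - R_hat(f*)
    r_hat = np.mean((f(x) - y) ** 2) - np.mean((fstar(x) - y) ** 2)
    print(f"n = {n:7d}:  R_hat(f) - R_hat(f*) = {r_hat:.4f}   (true excess risk 0.0900)")
```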

We will derive a PAC-style bound for the difference $\hat{R}(f) - \hat{R}(f^*) - (R(f) - R(f^*))$. The following derivation is from Barron 1991. The excess risk and its empirical counterpart will be denoted by

$$r(f, f^*) = R(f) - R(f^*), \qquad \hat{r}(f, f^*) = \hat{R}(f) - \hat{R}(f^*).$$

Note that $\hat{r}(f, f^*)$ is a sum of independent random variables:

$$\hat{r}(f, f^*) = -\frac{1}{n}\sum_{i=1}^n U_i,$$

where $U_i = -(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2$. Therefore, $r(f, f^*) - \hat{r}(f, f^*) = \frac{1}{n}\sum_{i=1}^n (U_i - \mathbb{E}[U_i])$.
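Note that in a typical regression setting the $U_i$ are unbounded. In the hypothetical toy model used above, for example, each $U_i$ is a Gaussian random variable, so no fixed bound applies; the quick sketch below (same assumed model, candidate $f = f^* + 0.3$) illustrates this and checks that the sample mean of the $U_i$ is close to $\mathbb{E}[U_i] = -r(f, f^*)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Same hypothetical toy model: Gaussian noise, f* = sin, candidate f = f* + 0.3.
x = rng.uniform(0.0, 2.0 * np.pi, n)
y = np.sin(x) + 0.5 * rng.standard_normal(n)
u = -(y - (np.sin(x) + 0.3)) ** 2 + (y - np.sin(x)) ** 2

print("mean U_i  ~", u.mean())         # close to E[U_i] = -(R(f) - R(f*)) = -0.09
print("max |U_i| ~", np.abs(u).max())  # the U_i are Gaussian here, hence unbounded
```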

We are looking for a PAC bound of the form

$$P\left( r(f, f^*) - \hat{r}(f, f^*) > \epsilon \right) < \delta.$$

If the variables $U_i$ are bounded, then we can apply Hoeffding's inequality. However, a more useful bound for our regression problem can be derived if the variables $U_i$ satisfy the following moment condition:

Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3