
Consider the following setting. Let

$$Y = f^*(X) + W,$$

where $X$ is a random variable (r.v.) on $\mathcal{X} = [0,1]$, and $W$ is a r.v. on $\mathcal{Y} = \mathbb{R}$, independent of $X$ and satisfying

$$\mathbb{E}[W] = 0 \quad \text{and} \quad \mathbb{E}[W^2] = \sigma^2 < \infty.$$

Finally, let $f^* : [0,1] \to \mathbb{R}$ be a function satisfying

$$|f^*(t) - f^*(s)| \le L\,|t - s|, \quad \forall\, t, s \in [0,1],$$

where $L > 0$ is a constant. A function satisfying this condition is said to be Lipschitz on $[0,1]$. Notice that such a function must be continuous, but it is not necessarily differentiable. An example of such a function is depicted in the figure below (panel (a)).

Figure: Example of a Lipschitz function, and our observation setting. (a) Random sampling of $f^*$; the points correspond to $(X_i, Y_i)$, $i = 1, \ldots, n$. (b) Deterministic sampling of $f^*$; the points correspond to $(i/n, Y_i)$, $i = 1, \ldots, n$.

Note that

$$\mathbb{E}[Y \mid X = x] = \mathbb{E}[f^*(X) + W \mid X = x] = \mathbb{E}[f^*(x) + W \mid X = x] = f^*(x) + \mathbb{E}[W] = f^*(x).$$

Consider our usual setup: estimate $f^*$ using $n$ training examples

$$\{X_i, Y_i\}_{i=1}^{n} \overset{i.i.d.}{\sim} P_{XY}, \qquad Y_i = f^*(X_i) + W_i, \quad i \in \{1, \ldots, n\},$$

where i.i.d. means independently and identically distributed. Panel (a) of the figure above illustrates this setup.

In many applications we can sample $\mathcal{X} = [0,1]$ as we like, not necessarily at random. For example, we can take $n$ samples uniformly spaced on $[0,1]$:

$$x_i = \frac{i}{n}, \quad i = 1, \ldots, n, \qquad Y_i = f^*(x_i) + W_i = f^*\!\left(\frac{i}{n}\right) + W_i.$$

We will proceed with this setup (as in panel (b) of the figure above) in the rest of the lecture.
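To make the sampling model concrete, here is a minimal Python sketch (not part of the original lecture) of the deterministic-design setup; the triangle-shaped $f^*$, the noise level, and the sample size are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative Lipschitz function on [0, 1] (Lipschitz constant L = 2).
def f_star(t):
    return 1.0 - 2.0 * np.abs(t - 0.5)

n = 60                                   # number of samples
sigma = 0.3                              # noise level, E[W^2] = sigma^2
rng = np.random.default_rng(0)

x = np.arange(1, n + 1) / n              # deterministic design points x_i = i/n
Y = f_star(x) + sigma * rng.standard_normal(n)   # observations Y_i = f*(i/n) + W_i
```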

Our goal is to find $\hat{f}_n$ such that $\mathbb{E}[\|f^* - \hat{f}_n\|^2] \to 0$ as $n \to \infty$ (here $\|\cdot\|$ is the usual $L_2$-norm; i.e., $\|f^* - \hat{f}_n\|^2 = \int_0^1 |f^*(t) - \hat{f}_n(t)|^2\, dt$).

Let

$$\mathcal{F} = \{f : f \text{ is Lipschitz with constant } L\}.$$

The Risk is defined as

$$R(f) = \|f^* - f\|^2 = \int_0^1 |f^*(t) - f(t)|^2\, dt.$$

The Expected Risk (recall that our estimator $\hat{f}_n$ is based on $\{x_i, Y_i\}$ and hence is a r.v.) is defined as

$$\mathbb{E}[R(\hat{f}_n)] = \mathbb{E}[\|f^* - \hat{f}_n\|^2].$$

Finally, the Empirical Risk is defined as

$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n}\left(f\!\left(\frac{i}{n}\right) - Y_i\right)^2.$$

Let $0 < m_1 \le m_2 \le m_3 \le \cdots$ be a sequence of integers satisfying $m_n \to \infty$ as $n \to \infty$, and $k_n m_n = n$ for some integer $k_n > 0$. That is, for each value of $n$ there is an associated integer value $m_n$. Define the Sieve $\mathcal{F}_1, \mathcal{F}_2, \mathcal{F}_3, \ldots$,

$$\mathcal{F}_n = \left\{ f : f(t) = \sum_{j=1}^{m_n} c_j\, \mathbf{1}\!\left\{\tfrac{j-1}{m_n} \le t < \tfrac{j}{m_n}\right\},\ c_j \in \mathbb{R} \right\}.$$

$\mathcal{F}_n$ is the space of functions that are constant on the intervals

$$I_{j,m_n} \equiv \left[\frac{j-1}{m_n}, \frac{j}{m_n}\right), \quad j = 1, \ldots, m_n.$$

From here on we will use $m$ and $k$ instead of $m_n$ and $k_n$ (dropping the subscript $n$) for notational ease. Define

$$\bar{f}_n(t) = \sum_{j=1}^{m} c_j^*\, \mathbf{1}\{t \in I_{j,m}\}, \quad \text{where} \quad c_j^* = \frac{1}{k}\sum_{i\,:\, i/n \in I_{j,m}} f^*\!\left(\frac{i}{n}\right).$$

Note that $\bar{f}_n \in \mathcal{F}_n$.
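As a sketch of how the coefficients $c_j^*$ and the approximation $\bar{f}_n$ can be computed in practice (a hypothetical example; the function f_star and the values of n and m are illustrative choices, with $k = n/m$ an integer so that each interval receives exactly $k$ design points):

```python
import numpy as np

def f_star(t):                           # illustrative Lipschitz function, L = 2
    return 1.0 - 2.0 * np.abs(t - 0.5)

n, m = 60, 6
k = n // m                               # k = n/m design points per interval
x = np.arange(1, n + 1) / n

# c_j^* = average of f*(i/n) over the k design points in I_{j,m}
# (grouping consecutive blocks of k points, i = (j-1)k+1, ..., jk)
c_star = f_star(x).reshape(m, k).mean(axis=1)

def f_bar(t):                            # the piecewise-constant approximation \bar f_n
    j = np.minimum((np.asarray(t) * m).astype(int), m - 1)  # interval index containing t
    return c_star[j]
```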

Exercise 1

Upper bound $\|f^* - \bar{f}_n\|^2$.

$$
\begin{aligned}
\|f^* - \bar{f}_n\|^2 &= \int_0^1 |f^*(t) - \bar{f}_n(t)|^2\, dt = \sum_{j=1}^{m} \int_{I_{j,m}} |f^*(t) - \bar{f}_n(t)|^2\, dt = \sum_{j=1}^{m} \int_{I_{j,m}} |f^*(t) - c_j^*|^2\, dt \\
&= \sum_{j=1}^{m} \int_{I_{j,m}} \Bigl| f^*(t) - \frac{1}{k}\sum_{i:\, i/n \in I_{j,m}} f^*\!\bigl(\tfrac{i}{n}\bigr) \Bigr|^2 dt
 = \sum_{j=1}^{m} \int_{I_{j,m}} \Bigl| \frac{1}{k}\sum_{i:\, i/n \in I_{j,m}} \Bigl( f^*(t) - f^*\!\bigl(\tfrac{i}{n}\bigr) \Bigr) \Bigr|^2 dt \\
&\le \sum_{j=1}^{m} \int_{I_{j,m}} \Bigl( \frac{1}{k}\sum_{i:\, i/n \in I_{j,m}} \bigl| f^*(t) - f^*\!\bigl(\tfrac{i}{n}\bigr) \bigr| \Bigr)^2 dt
 \le \sum_{j=1}^{m} \int_{I_{j,m}} \Bigl( \frac{1}{k}\sum_{i:\, i/n \in I_{j,m}} \frac{L}{m} \Bigr)^2 dt \\
&= \sum_{j=1}^{m} \int_{I_{j,m}} \Bigl(\frac{L}{m}\Bigr)^2 dt = \sum_{j=1}^{m} \frac{1}{m}\Bigl(\frac{L}{m}\Bigr)^2 = \Bigl(\frac{L}{m}\Bigr)^2,
\end{aligned}
$$
where the last inequality uses the fact that both $t$ and $i/n$ lie in $I_{j,m}$, so $|t - i/n| \le 1/m$ and the Lipschitz condition gives $|f^*(t) - f^*(i/n)| \le L/m$.

The above implies that $\|f^* - \bar{f}_n\|^2 \to 0$ as $n \to \infty$, since $m = m_n \to \infty$ as $n \to \infty$. In words, with $n$ sufficiently large we can approximate $f^*$ to arbitrary accuracy using models in $\mathcal{F}_n$ (even if the functions we are using to approximate $f^*$ are not Lipschitz!).
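A quick numerical check of the bound from Exercise 1, using the same illustrative f_star, n, and m as above (the $L_2$ norm is approximated on a fine grid; this is a sketch, not part of the original lecture):

```python
import numpy as np

def f_star(t):                           # illustrative Lipschitz function, L = 2
    return 1.0 - 2.0 * np.abs(t - 0.5)

L, n, m = 2.0, 60, 6
k = n // m
x = np.arange(1, n + 1) / n
c_star = f_star(x).reshape(m, k).mean(axis=1)        # coefficients of \bar f_n

t = np.linspace(0.0, 1.0, 100001)                    # fine grid to approximate the integral
j = np.minimum((t * m).astype(int), m - 1)
approx_err = np.mean((f_star(t) - c_star[j]) ** 2)   # ~ ||f* - \bar f_n||^2

print(approx_err, (L / m) ** 2)          # the approximation error should not exceed (L/m)^2
```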

For any $f \in \mathcal{F}_n$, $f(t) = \sum_{j=1}^{m} c_j\, \mathbf{1}\{t \in I_{j,m}\}$, we have

$$\hat{R}_n(f) = \frac{1}{n}\sum_{j=1}^{m}\ \sum_{i:\, i/n \in I_{j,m}} (c_j - Y_i)^2.$$

Let $\hat{f}_n = \arg\min_{f \in \mathcal{F}_n} \hat{R}_n(f)$. Then

$$\hat{f}_n(t) = \sum_{j=1}^{m} \hat{c}_j\, \mathbf{1}\{t \in I_{j,m}\}, \quad \text{where} \quad \hat{c}_j = \frac{1}{k}\sum_{i:\, i/n \in I_{j,m}} Y_i.$$
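In code, the minimizer of the empirical risk over $\mathcal{F}_n$ is simply an interval-by-interval average of the observations; a sketch using the same illustrative setup as before:

```python
import numpy as np

def f_star(t):                           # illustrative Lipschitz function, L = 2
    return 1.0 - 2.0 * np.abs(t - 0.5)

n, m, sigma = 60, 6, 0.3
k = n // m
rng = np.random.default_rng(0)

x = np.arange(1, n + 1) / n
Y = f_star(x) + sigma * rng.standard_normal(n)

# \hat c_j = average of the k observations Y_i with i/n in I_{j,m};
# this minimizes the empirical risk over the piecewise-constant class F_n
c_hat = Y.reshape(m, k).mean(axis=1)

def f_hat(t):                            # the sieve estimator \hat f_n
    j = np.minimum((np.asarray(t) * m).astype(int), m - 1)
    return c_hat[j]
```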

Exercise 2

Show the expression above for $\hat{f}_n$.

Note that $\mathbb{E}[\hat{c}_j] = c_j^*$ and therefore $\mathbb{E}[\hat{f}_n(t)] = \bar{f}_n(t)$. Let's now analyze the expected risk of $\hat{f}_n$:

$$
\begin{aligned}
\mathbb{E}[\|f^* - \hat{f}_n\|^2] &= \mathbb{E}[\|f^* - \bar{f}_n + \bar{f}_n - \hat{f}_n\|^2] \\
&= \|f^* - \bar{f}_n\|^2 + \mathbb{E}[\|\bar{f}_n - \hat{f}_n\|^2] + 2\,\mathbb{E}[\langle f^* - \bar{f}_n,\ \bar{f}_n - \hat{f}_n\rangle] \\
&= \|f^* - \bar{f}_n\|^2 + \mathbb{E}[\|\bar{f}_n - \hat{f}_n\|^2] + 2\,\langle f^* - \bar{f}_n,\ \mathbb{E}[\bar{f}_n - \hat{f}_n]\rangle \\
&= \|f^* - \bar{f}_n\|^2 + \mathbb{E}[\|\bar{f}_n - \hat{f}_n\|^2],
\end{aligned}
$$

where the final step follows from the fact that $\mathbb{E}[\hat{f}_n(t)] = \bar{f}_n(t)$. A couple of important remarks pertaining to the right-hand side of the equation above: The first term, $\|f^* - \bar{f}_n\|^2$, corresponds to the approximation error, and indicates how well we can approximate the function $f^*$ with a function from $\mathcal{F}_n$. Clearly, the larger the class $\mathcal{F}_n$ is, the smaller we can make this term. This term is precisely the squared bias of the estimator $\hat{f}_n$. The second term, $\mathbb{E}[\|\bar{f}_n - \hat{f}_n\|^2]$, is the estimation error, the variance of our estimator. We will see that the estimation error is small if the class of possible estimators $\mathcal{F}_n$ is also small.

The behavior of the first term in the decomposition above was already studied. Consider the other term:

$$
\begin{aligned}
\mathbb{E}[\|\bar{f}_n - \hat{f}_n\|^2] &= \mathbb{E}\left[\int_0^1 |\bar{f}_n(t) - \hat{f}_n(t)|^2\, dt\right] = \mathbb{E}\left[\sum_{j=1}^{m} \int_{I_{j,m}} |c_j^* - \hat{c}_j|^2\, dt\right] = \sum_{j=1}^{m} \int_{I_{j,m}} \mathbb{E}[|c_j^* - \hat{c}_j|^2]\, dt \\
&= \sum_{j=1}^{m} \int_{I_{j,m}} \frac{\mathbb{E}[W^2]}{k}\, dt = \sum_{j=1}^{m} \int_{I_{j,m}} \frac{\sigma^2}{k}\, dt = \sum_{j=1}^{m} \frac{1}{m}\,\frac{\sigma^2}{k} = \frac{\sigma^2}{k} = \frac{m}{n}\,\sigma^2,
\end{aligned}
$$
where we used the fact that $\hat{c}_j - c_j^* = \frac{1}{k}\sum_{i:\, i/n \in I_{j,m}} W_i$, so $\mathbb{E}[|c_j^* - \hat{c}_j|^2] = \mathbb{E}[W^2]/k$ since the $W_i$ are independent with mean zero.
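The identity $\mathbb{E}[\|\bar{f}_n - \hat{f}_n\|^2] = \frac{m}{n}\sigma^2$ can be checked by simulation; here is a sketch using the same illustrative setup (the Monte Carlo average should be close to $m\sigma^2/n$):

```python
import numpy as np

def f_star(t):                           # illustrative Lipschitz function, L = 2
    return 1.0 - 2.0 * np.abs(t - 0.5)

n, m, sigma, trials = 60, 6, 0.3, 2000
k = n // m
rng = np.random.default_rng(0)

x = np.arange(1, n + 1) / n
c_star = f_star(x).reshape(m, k).mean(axis=1)    # coefficients of \bar f_n

est_err = 0.0
for _ in range(trials):
    Y = f_star(x) + sigma * rng.standard_normal(n)
    c_hat = Y.reshape(m, k).mean(axis=1)
    # ||\bar f_n - \hat f_n||^2 = (1/m) * sum_j (c_j^* - \hat c_j)^2 for piecewise constants
    est_err += np.sum((c_star - c_hat) ** 2) / m

print(est_err / trials, m * sigma**2 / n)        # Monte Carlo estimate vs m*sigma^2/n
```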

Combining all the facts derived we have

$$\mathbb{E}[\|f^* - \hat{f}_n\|^2] \le \frac{L^2}{m^2} + \frac{m}{n}\,\sigma^2 = O\!\left(\max\left\{\frac{1}{m^2},\ \frac{m}{n}\right\}\right).$$

The last equality uses Big-O notation.

What is the best choice of $m$? If $m$ is small then the approximation error (i.e., $O(1/m^2)$) is going to be large, but the estimation error (i.e., $O(m/n)$) is going to be small, and vice versa. These two conflicting goals provide a tradeoff that directs our choice of $m$ (as a function of $n$). The figure below depicts this tradeoff. In panel (a) we considered a large $m_n$ value, and we see that the approximation of $f^*$ by a function in the class $\mathcal{F}_n$ can be very accurate (that is, our estimate will have a small bias), but when we use the measured data our estimate looks very bad (high variance). On the other hand, as illustrated in panel (b), using a very small $m_n$ allows our estimator to get very close to the best approximating function in the class $\mathcal{F}_n$, so we have a low-variance estimator, but the bias of our estimator (i.e., the difference between $\bar{f}_n$ and $f^*$) is quite considerable.

Figure: Approximation and estimation of $f^*$ (in blue) for $n = 60$. The function $\bar{f}_n$ is depicted in green and the function $\hat{f}_n$ is depicted in red. In (a) we have $m = 60$ and in (b) we have $m = 6$.
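The tradeoff can also be seen numerically. The following sketch (with the same illustrative f_star and noise level, and values of m chosen to divide n) estimates the expected risk by Monte Carlo for several sieve sizes:

```python
import numpy as np

def f_star(t):                           # illustrative Lipschitz function, L = 2
    return 1.0 - 2.0 * np.abs(t - 0.5)

n, sigma, trials = 60, 0.3, 500
rng = np.random.default_rng(0)
x = np.arange(1, n + 1) / n
grid = np.linspace(0.0, 1.0, 10001)      # grid used to approximate the L2 norm

for m in (1, 2, 3, 6, 10, 20, 60):       # divisors of n, so k = n/m is an integer
    k = n // m
    j = np.minimum((grid * m).astype(int), m - 1)
    mse = 0.0
    for _ in range(trials):
        Y = f_star(x) + sigma * rng.standard_normal(n)
        c_hat = Y.reshape(m, k).mean(axis=1)
        mse += np.mean((f_star(grid) - c_hat[j]) ** 2)    # ~ ||f* - \hat f_n||^2
    # small m: approximation error (bias) dominates; large m: estimation error (variance) dominates
    print(m, mse / trials)
```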

We need to balance the two terms in the right-hand side of the bound above in order to maximize the rate of decay (with $n$) of the expected risk. This implies that $\frac{1}{m^2} = \frac{m}{n}$, therefore $m_n = n^{1/3}$, and the Mean Squared Error (MSE) is

$$\mathbb{E}[\|f^* - \hat{f}_n\|^2] = O\!\left(n^{-2/3}\right).$$
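The $n^{-2/3}$ rate can likewise be checked empirically by letting $m_n \approx n^{1/3}$ grow with $n$; a sketch (the sample sizes are chosen so that $n^{1/3}$ is an integer dividing $n$):

```python
import numpy as np

def f_star(t):                           # illustrative Lipschitz function, L = 2
    return 1.0 - 2.0 * np.abs(t - 0.5)

sigma, trials = 0.3, 500
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 10001)      # grid used to approximate the L2 norm

for n in (64, 512, 4096):                # powers of 8, so m = n^{1/3} divides n
    m = round(n ** (1.0 / 3.0))          # sieve size m_n ~ n^{1/3}
    k = n // m
    x = np.arange(1, n + 1) / n
    j = np.minimum((grid * m).astype(int), m - 1)
    mse = 0.0
    for _ in range(trials):
        Y = f_star(x) + sigma * rng.standard_normal(n)
        c_hat = Y.reshape(m, k).mean(axis=1)
        mse += np.mean((f_star(grid) - c_hat[j]) ** 2)
    print(n, mse / trials, n ** (-2.0 / 3.0))   # the MSE should track the n^{-2/3} rate
```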

So the sieve $\mathcal{F}_1, \mathcal{F}_2, \ldots$, with

$$\mathcal{F}_n = \left\{ f : f(t) = \sum_{j=1}^{m_n} c_j\, \mathbf{1}\!\left\{\tfrac{j-1}{m_n} \le t < \tfrac{j}{m_n}\right\},\ c_j \in \mathbb{R} \right\},$$

produces an $\mathcal{F}$-consistent estimator for $f^* = \mathbb{E}[Y \mid X = x] \in \mathcal{F}$.

It is interesting to note that the rate of decay of the MSE we obtain with this strategy cannot be further improved by using more sophisticated estimation techniques (that is, $n^{-2/3}$ is the minimax MSE rate for this problem). Also, rather surprisingly, we are considering classes of models $\mathcal{F}_n$ that are actually not Lipschitz; therefore our estimator of $f^*$ is not a Lipschitz function, unlike $f^*$ itself.

Source: OpenStax, Statistical learning theory. OpenStax CNX, Apr 10, 2009. Download for free at http://cnx.org/content/col10532/1.3