$$F_k = \left\{ f : |f| \le C, \ f \text{ is piecewise linear on } \left[0,\tfrac{1}{k}\right), \left[\tfrac{1}{k},\tfrac{2}{k}\right), \ldots, \left[\tfrac{k-1}{k},1\right], \text{ and the coefficients of each line segment are quantized to } \tfrac{1}{2}\log n \text{ bits} \right\}.$$
Figure: Example of the quantization of $f$ on the interval $\left[\tfrac{i-1}{k},\tfrac{i}{k}\right)$.

The start and end points of each line segment each take one of $\sqrt{n}$ discrete values, as indicated in the figure above. Since each line may start at any of the $\sqrt{n}$ levels and terminate at any of the $\sqrt{n}$ levels, there are a total of $n$ possible lines for each segment.

Given that there are $k$ intervals, we have

$$|F_k| = n^k, \qquad \log |F_k| = k \log n.$$

Therefore we can use $k \log n$ bits to describe a function $f \in F_k$.
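As a concrete check of the counting (with an illustrative choice of $n$ not taken from the notes), take $n = 1024$:

$$\tfrac{1}{2}\log_2 1024 = 5 \text{ bits per endpoint} \;\Rightarrow\; 2^5 = 32 = \sqrt{n} \text{ levels per endpoint} \;\Rightarrow\; 32 \times 32 = 1024 = n \text{ lines per segment} \;\Rightarrow\; |F_k| = 1024^k.$$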

Let

$$F = \bigcup_{k \ge 1} F_k.$$

Construct a prefix code for every $f \in F$ by:

(i) Use $\underbrace{0\,0\,\cdots\,0}_{k-1}\,1$ ($k$ bits) to encode the smallest $k$ such that $f \in F_k$.
(ii) Use $k \log n$ bits to encode which element of $F_k$ we are considering.

Thus, if $f \in F_k$, then the prefix code associated with $f$ has codeword length

$$c(f) = k + k \log n = k(1 + \log n),$$

which satisfies the Kraft Inequality

$$\sum_{f \in F} 2^{-c(f)} \le 1.$$
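As a quick sanity check (a step not written out in the notes), the counts above confirm this: every $f$ is encoded using its smallest $k$, and, with logarithms taken base 2 so that $2^{-k\log n} = n^{-k}$,

$$\sum_{f \in F} 2^{-c(f)} \le \sum_{k \ge 1} \sum_{f \in F_k} 2^{-k(1+\log n)} = \sum_{k \ge 1} n^k \, 2^{-k} \, n^{-k} = \sum_{k \ge 1} 2^{-k} = 1.$$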

Now we will apply our complexity regularization result to select a function $\hat{f}_n$ from $F$ and bound its risk. We are assuming Gaussian errors, so

$$-\log p_f(Y_i) = \frac{\left(Y_i - f\left(\tfrac{i}{n}\right)\right)^2}{2\sigma^2} + \text{constant}.$$

We can ignore the constant term, and so our empirical selection is

$$\hat{f}_n = \arg\min_{f \in F} \left\{ \frac{1}{n} \sum_{i=1}^n \frac{\left(Y_i - f\left(\tfrac{i}{n}\right)\right)^2}{2\sigma^2} + \frac{2\, c(f)\, \log 2}{n} \right\}.$$

We can compute $\hat{f}_n$ as follows. For $k = 1, \ldots, n$, let

$$\hat{f}_n^{(k)} = \arg\min_{f \in F_k} \hat{R}_n(f) = \arg\min_{f \in F_k} \frac{1}{n} \sum_{i=1}^n \frac{\left(Y_i - f\left(\tfrac{i}{n}\right)\right)^2}{2\sigma^2},$$

then select

$$\hat{k} = \arg\min_{k = 1,\ldots,n} \left\{ \hat{R}_n\left(\hat{f}_n^{(k)}\right) + \frac{2 k (1 + \log n) \log 2}{n} \right\},$$

and finally

$$\hat{f}_n = \hat{f}_n^{(\hat{k})}.$$
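In code, the procedure above amounts to $n$ rounds of per-interval least-squares line fits followed by a penalized model selection. The following Python sketch (not from the original notes; the function name `complexity_regularized_fit` and all variable names are illustrative) implements this under the simplifying assumption that the line coefficients are not quantized, since the quantization only matters for the coding argument.

```python
import numpy as np

def complexity_regularized_fit(y, sigma, k_max=None):
    """Sketch of the penalized least-squares estimator described above.

    y     : observations Y_1, ..., Y_n taken at the points i/n, i = 1, ..., n
    sigma : assumed noise standard deviation
    Returns (k_hat, f_hat), where f_hat holds the fitted values at the points i/n.
    """
    n = len(y)
    t = np.arange(1, n + 1) / n              # design points i/n
    k_max = k_max or n
    best = None

    for k in range(1, k_max + 1):
        f_hat = np.empty(n)
        # Least-squares line fit on each interval ((j-1)/k, j/k]
        for j in range(1, k + 1):
            idx = (t > (j - 1) / k) & (t <= j / k)
            if idx.sum() == 0:
                continue
            if idx.sum() == 1:
                f_hat[idx] = y[idx]          # a single point is fit exactly
            else:
                slope, intercept = np.polyfit(t[idx], y[idx], 1)
                f_hat[idx] = slope * t[idx] + intercept
        # Empirical risk: Gaussian negative log-likelihood with the constant dropped
        risk = np.mean((y - f_hat) ** 2) / (2 * sigma ** 2)
        # Penalty 2 c(f) log 2 / n with c(f) = k (1 + log2 n) bits
        penalty = 2 * k * (1 + np.log2(n)) * np.log(2) / n
        if best is None or risk + penalty < best[0]:
            best = (risk + penalty, k, f_hat.copy())

    _, k_hat, f_hat = best
    return k_hat, f_hat
```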

Because the KL divergence and $-2\log$ affinity simply reduce to squared error in the Gaussian case (Lecture 14), we arrive at a relatively simple bound on the mean square error of $\hat{f}_n$:

$$\frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[\left(\hat{f}_n\left(\tfrac{i}{n}\right) - f^*\left(\tfrac{i}{n}\right)\right)^2\right] \le \min_{f \in F} \left\{ \frac{2}{n} \sum_{i=1}^n \left(f\left(\tfrac{i}{n}\right) - f^*\left(\tfrac{i}{n}\right)\right)^2 + \frac{8 \sigma^2 c(f) \log 2}{n} \right\}.$$

The first term in the brackets above is related to the error incurred by approximating $f^*$ by an element of $F$. The second term is related to the estimation error involved with the model selection process.

Let's focus on the approximation error. First, suppose $f^* \in H^\alpha(C_\alpha)$ for $1 < \alpha \le 2$. Let $f_k^*$ be the “best" piecewise linear approximation to $f^*$, with $k$ pieces on the intervals $\left[0,\tfrac{1}{k}\right), \left[\tfrac{1}{k},\tfrac{2}{k}\right), \ldots, \left[\tfrac{k-1}{k},1\right]$. Consider the difference between $f^*$ and $f_k^*$ on one such interval, say $\left[\tfrac{i-1}{k},\tfrac{i}{k}\right)$. By applying Taylor's theorem with remainder we have

$$f^*(t) = f^*\left(\tfrac{i}{k}\right) + \frac{\partial f^*}{\partial x}(t')\left(t - \tfrac{i}{k}\right)$$

for $t \in \left[\tfrac{i-1}{k}, \tfrac{i}{k}\right)$ and some $t' \in \left[t, \tfrac{i}{k}\right]$. Define

$$f_k^*(t) \triangleq f^*\left(\tfrac{i}{k}\right) + \frac{\partial f^*}{\partial x}\left(\tfrac{i}{k}\right)\left(t - \tfrac{i}{k}\right).$$

Note that $f_k^*(t)$ is not necessarily the best piecewise linear approximation to $f^*$, just good enough for our purposes. Then, using the fact that $f^* \in H^\alpha(C_\alpha)$, for $t \in \left[\tfrac{i-1}{k}, \tfrac{i}{k}\right)$ we have

$$\left|f^*(t) - f_k^*(t)\right| = \left|\frac{\partial f^*}{\partial x}(t')\left(t - \tfrac{i}{k}\right) - \frac{\partial f^*}{\partial x}\left(\tfrac{i}{k}\right)\left(t - \tfrac{i}{k}\right)\right| \le \frac{1}{k}\left|\frac{\partial f^*}{\partial x}(t') - \frac{\partial f^*}{\partial x}\left(\tfrac{i}{k}\right)\right| \le \frac{1}{k}\, C_\alpha \left|t' - \tfrac{i}{k}\right|^{\alpha-1} \le \frac{1}{k}\, C_\alpha \left(\frac{1}{k}\right)^{\alpha-1} = C_\alpha k^{-\alpha}.$$

So, for all $t \in [0,1]$,

$$\left|f^*(t) - f_k^*(t)\right| \le C_\alpha k^{-\alpha}.$$
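As a numerical illustration (not part of the original notes; the test function $f^*(t) = t^{3/2}$ and the constants are arbitrary choices), the following Python sketch measures the worst-case error of this Taylor-anchored piecewise linear approximation and compares it against the $C_\alpha k^{-\alpha}$ bound.

```python
import numpy as np

# Illustrative test function f*(t) = t^(3/2): its derivative 1.5*sqrt(t) is
# Holder-1/2 with constant 1.5, so f* lies in H^alpha(C_alpha) with
# alpha = 1.5 and C_alpha = 1.5 (an assumed example, not from the notes).
alpha, C_alpha = 1.5, 1.5
f = lambda t: t ** 1.5
df = lambda t: 1.5 * np.sqrt(t)

t = np.linspace(0.0, 1.0, 100001)
for k in [4, 16, 64, 256]:
    # Index i of the interval [(i-1)/k, i/k) containing each t; anchor at i/k
    i = np.clip(np.ceil(t * k), 1, k)
    approx = f(i / k) + df(i / k) * (t - i / k)
    err = np.max(np.abs(f(t) - approx))
    print(f"k={k:4d}  max error={err:.2e}  bound C*k^(-alpha)={C_alpha * k ** -alpha:.2e}")
```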

Now let $f_k$ be the element of $F_k$ closest to $f_k^*$ ($f_k$ is the quantized version of $f_k^*$):

$$\left|f^*(t) - f_k(t)\right| = \left|f^*(t) - f_k^*(t) + f_k^*(t) - f_k(t)\right| \le \left|f^*(t) - f_k^*(t)\right| + \left|f_k^*(t) - f_k(t)\right| \le C_\alpha k^{-\alpha} + \frac{1}{\sqrt{n}},$$

since we used $\tfrac{1}{2}\log n$ bits to quantize the endpoints of each line segment. Consequently,

$$\left|f^*(t) - f_k(t)\right|^2 \le \left|f^*(t) - f_k^*(t)\right|^2 + 2\left|f^*(t) - f_k^*(t)\right|\left|f_k^*(t) - f_k(t)\right| + \left|f_k^*(t) - f_k(t)\right|^2 \le C_\alpha^2 k^{-2\alpha} + \frac{2 C_\alpha k^{-\alpha}}{\sqrt{n}} + \frac{1}{n}.$$

Thus it follows that

$$\min_{f \in F_k} \left\{ \frac{2}{n} \sum_{i=1}^n \left(f\left(\tfrac{i}{n}\right) - f^*\left(\tfrac{i}{n}\right)\right)^2 + \frac{8 \sigma^2 c(f) \log 2}{n} \right\} \le 2 C_\alpha^2 k^{-2\alpha} + \frac{4 C_\alpha k^{-\alpha}}{\sqrt{n}} + \frac{2}{n} + \frac{8 \sigma^2 k (\log n + 1) \log 2}{n}.$$

The first and last terms dominate the above expression. Therefore, the upper bound is minimized when $k^{-2\alpha}$ and $\tfrac{k}{n}$ are balanced (ignoring the logarithmic factor). This is accomplished by choosing $k = n^{\frac{1}{2\alpha+1}}$. Then it follows that

$$\min_{f \in F_k} \left\{ \frac{2}{n} \sum_{i=1}^n \left(f\left(\tfrac{i}{n}\right) - f^*\left(\tfrac{i}{n}\right)\right)^2 + \frac{8 \sigma^2 c(f) \log 2}{n} \right\} = O\left(n^{-\frac{2\alpha}{2\alpha+1}} \log n\right).$$
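To see why this choice of $k$ balances the dominant terms (a step left implicit above), note that

$$k = n^{\frac{1}{2\alpha+1}} \;\Rightarrow\; k^{-2\alpha} = n^{-\frac{2\alpha}{2\alpha+1}} \quad \text{and} \quad \frac{k}{n} = n^{\frac{1}{2\alpha+1}-1} = n^{-\frac{2\alpha}{2\alpha+1}},$$

so the penalty term $\frac{8\sigma^2 k(\log n + 1)\log 2}{n} = O\!\left(n^{-\frac{2\alpha}{2\alpha+1}} \log n\right)$, which gives the stated rate.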

If $\alpha = 2$ then we have

$$\frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[\left(\hat{f}_n\left(\tfrac{i}{n}\right) - f^*\left(\tfrac{i}{n}\right)\right)^2\right] = O\left(n^{-\frac{4}{5}} \log n\right).$$

If $f^* \in H^\alpha(C_\alpha)$ for $0 < \alpha \le 1$, let $f_k^*$ be the following piecewise constant approximation to $f^*$. Let

$$f_k^*(t) \triangleq f^*\left(\tfrac{i}{k}\right) \quad \text{on the interval } \left[\tfrac{i-1}{k}, \tfrac{i}{k}\right).$$

Then

$$\left|f^*(t) - f_k^*(t)\right| = \left|f^*(t) - f^*\left(\tfrac{i}{k}\right)\right| \le C_\alpha \left|t - \tfrac{i}{k}\right|^\alpha \le C_\alpha k^{-\alpha}.$$

Repeating the same reasoning as in the $1 < \alpha \le 2$ case, we arrive at

$$\frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[\left(\hat{f}_n\left(\tfrac{i}{n}\right) - f^*\left(\tfrac{i}{n}\right)\right)^2\right] = O\left(n^{-\frac{2\alpha}{2\alpha+1}} \log n\right)$$

for $0 < \alpha \le 1$. In particular, for $\alpha = 1$ we get

$$\frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[\left(\hat{f}_n\left(\tfrac{i}{n}\right) - f^*\left(\tfrac{i}{n}\right)\right)^2\right] = O\left(n^{-\frac{2}{3}} \log n\right),$$

which is within a logarithmic factor of the rate we had before (in Lecture 4) for that case!

Summary

  1. $\hat{f}_n$ can be computed by finding least-squares line fits to the data on partitions of the form $\left[0,\tfrac{1}{k}\right), \left[\tfrac{1}{k},\tfrac{2}{k}\right), \ldots, \left[\tfrac{k-1}{k},1\right]$ for $k = 1, \ldots, n$, and then selecting the best fit via the $\hat{k}$ that minimizes the complexity regularization criterion (a usage sketch follows this list).
  2. If $f^* \in H^\alpha(C_\alpha)$ for some $0 < \alpha \le 2$, then
     $$MSE\left(\hat{f}_n\right) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[\left(\hat{f}_n\left(\tfrac{i}{n}\right) - f^*\left(\tfrac{i}{n}\right)\right)^2\right] = O\left(n^{-\frac{2\alpha}{2\alpha+1}} \log n\right).$$
  3. $\hat{f}_n$ automatically picks the optimal number of bins. Essentially, $\hat{f}_n$ (indirectly) estimates the smoothness of $f^*$ and produces a rate which is near minimax optimal! ($n^{-\frac{2\alpha}{2\alpha+1}}$ is the best possible.)
  4. The larger $\alpha$ is, the faster the convergence and the better the denoising!
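For illustration only (synthetic data and an arbitrary test function, not from the notes), the `complexity_regularized_fit` sketch given earlier can be exercised as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 512, 0.25
t = np.arange(1, n + 1) / n
f_star = np.abs(t - 0.5) ** 1.5              # a Holder-smooth test function
y = f_star + sigma * rng.normal(size=n)      # noisy samples Y_i = f*(i/n) + W_i

# Cap k at 64 purely to keep the demo fast; the full estimator searches k = 1, ..., n
k_hat, f_hat = complexity_regularized_fit(y, sigma, k_max=64)
print(f"selected k_hat = {k_hat}, empirical MSE = {np.mean((f_hat - f_star) ** 2):.4f}")
```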

Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3