
This is a reasonable measure of $x$ and $z$'s similarity: it is close to 1 when $x$ and $z$ are close, and near 0 when $x$ and $z$ are far apart. Can we use this definition of $K$ as the kernel in an SVM? In this particular example, the answer is yes. (This kernel is called the Gaussian kernel, and corresponds to an infinite-dimensional feature mapping $\Phi$.) But more broadly, given some function $K$, how can we tell if it's a valid kernel; i.e., can we tell if there is some feature mapping $\Phi$ so that $K(x,z) = \Phi(x)^T \Phi(z)$ for all $x$, $z$?
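As a concrete illustration, here is a minimal sketch of the Gaussian kernel in Python, assuming the usual form $K(x,z) = \exp(-\|x-z\|^2 / (2\sigma^2))$; the bandwidth `sigma` is an illustrative parameter, not something fixed by this section:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2)).

    Returns a value near 1 when x and z are close,
    and near 0 when they are far apart.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```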

Suppose for now that $K$ is indeed a valid kernel corresponding to some feature mapping $\Phi$. Now, consider some finite set of $m$ points (not necessarily the training set) $\{x^{(1)}, \ldots, x^{(m)}\}$, and let a square, $m$-by-$m$ matrix $K$ be defined so that its $(i,j)$-entry is given by $K_{ij} = K(x^{(i)}, x^{(j)})$. This matrix is called the kernel matrix. Note that we've overloaded the notation and used $K$ to denote both the kernel function $K(x,z)$ and the kernel matrix $K$, due to their obvious close relationship.
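A short sketch of how such a kernel matrix might be formed from a set of points (the helper names and sample data below are illustrative, not from the original notes):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # Assumed Gaussian kernel form; sigma is an illustrative bandwidth.
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def kernel_matrix(points, kernel):
    """Form the m-by-m matrix whose (i, j)-entry is kernel(x_i, x_j)."""
    m = len(points)
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(points[i], points[j])
    return K

# Example: three points in R^2 (arbitrary illustrative data).
points = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([5.0, 5.0])]
K = kernel_matrix(points, gaussian_kernel)
```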

Now, if $K$ is a valid kernel, then $K_{ij} = K(x^{(i)}, x^{(j)}) = \Phi(x^{(i)})^T \Phi(x^{(j)}) = \Phi(x^{(j)})^T \Phi(x^{(i)}) = K(x^{(j)}, x^{(i)}) = K_{ji}$, and hence $K$ must be symmetric. Moreover, letting $\Phi_k(x)$ denote the $k$-th coordinate of the vector $\Phi(x)$, we find that for any vector $z$, we have

\[
z^T K z \;=\; \sum_i \sum_j z_i K_{ij} z_j
\;=\; \sum_i \sum_j z_i \, \Phi(x^{(i)})^T \Phi(x^{(j)}) \, z_j
\;=\; \sum_i \sum_j z_i \sum_k \Phi_k(x^{(i)}) \, \Phi_k(x^{(j)}) \, z_j
\;=\; \sum_k \sum_i \sum_j z_i \, \Phi_k(x^{(i)}) \, \Phi_k(x^{(j)}) \, z_j
\;=\; \sum_k \Bigl( \sum_i z_i \, \Phi_k(x^{(i)}) \Bigr)^2
\;\geq\; 0.
\]

The second-to-last step above used the same trick as you saw in Problem set 1 Q1. Since $z$ was arbitrary, this shows that $K$ is positive semi-definite ($K \succeq 0$).
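For reference, the trick in question is just the expansion of a square, restated here in the notation of this section with $a_i$ standing in for $\Phi_k(x^{(i)})$:

\[
\sum_i \sum_j z_i \, a_i \, a_j \, z_j \;=\; \Bigl( \sum_i z_i a_i \Bigr)^2 \;\geq\; 0,
\qquad \text{where } a_i = \Phi_k(x^{(i)}).
\]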

Hence, we've shown that if $K$ is a valid kernel (i.e., if it corresponds to some feature mapping $\Phi$), then the corresponding kernel matrix $K \in \mathbb{R}^{m \times m}$ is symmetric positive semidefinite. More generally, this turns out to be not only a necessary, but also a sufficient, condition for $K$ to be a valid kernel (also called a Mercer kernel). The following result is due to Mercer. Many texts present Mercer's theorem in a slightly more complicated form involving $L^2$ functions, but when the input attributes take values in $\mathbb{R}^n$, the version given here is equivalent.

Theorem (Mercer). Let $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ be given. Then for $K$ to be a valid (Mercer) kernel, it is necessary and sufficient that for any $\{x^{(1)}, \ldots, x^{(m)}\}$ ($m < \infty$), the corresponding kernel matrix is symmetric positive semi-definite.
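In the spirit of the theorem, one way to sanity-check a candidate kernel numerically is to draw a few finite point sets and verify that each resulting kernel matrix is symmetric with (numerically) nonnegative eigenvalues. A minimal sketch follows; the random sampling and tolerance are arbitrary choices for illustration, not part of the theorem, and such a check can only find counterexamples, never prove validity for all point sets:

```python
import numpy as np

def looks_like_mercer_kernel(kernel, n_dims=3, n_points=20, n_trials=5, tol=1e-8, seed=0):
    """Heuristic check: symmetric PSD kernel matrices on random finite point sets.

    A negative eigenvalue is a counterexample showing the kernel is invalid;
    passing the check is merely evidence, not a proof of validity.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        X = rng.normal(size=(n_points, n_dims))
        K = np.array([[kernel(x, z) for z in X] for x in X])
        if not np.allclose(K, K.T, atol=tol):
            return False
        eigvals = np.linalg.eigvalsh(K)  # eigenvalues of the symmetric matrix
        if eigvals.min() < -tol:
            return False
    return True
```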

Given a function $K$, apart from trying to find a feature mapping $\Phi$ that corresponds to it, this theorem therefore gives another way of testing whether it is a valid kernel. You'll also have a chance to play with these ideas more in problem set 2.

In class, we also briefly talked about a couple of other examples of kernels. For instance, consider the digit recognition problem, in which, given an image (16×16 pixels) of a handwritten digit (0–9), we have to figure out which digit it was. Using either a simple polynomial kernel $K(x,z) = (x^T z)^d$ or the Gaussian kernel, SVMs were able to obtain extremely good performance on this problem. This was particularly surprising since the input attributes $x$ were just a 256-dimensional vector of the image pixel intensity values, and the system had no prior knowledge about vision, or even about which pixels are adjacent to which other ones.

Another example that we briefly talked about in lecture was that if the objects $x$ that we are trying to classify are strings (say, $x$ is a list of amino acids, which strung together form a protein), then it seems hard to construct a reasonable, "small" set of features for most learning algorithms, especially if different strings have different lengths. However, consider letting $\Phi(x)$ be a feature vector that counts the number of occurrences of each length-$k$ substring in $x$. If we're considering strings of English letters, then there are $26^k$ such strings. Hence, $\Phi(x)$ is a $26^k$-dimensional vector; even for moderate values of $k$, this is probably too big for us to work with efficiently. (E.g., $26^4 \approx 460{,}000$.) However, using (dynamic programming-ish) string matching algorithms, it is possible to efficiently compute $K(x,z) = \Phi(x)^T \Phi(z)$, so that we can now implicitly work in this $26^k$-dimensional feature space, but without ever explicitly computing feature vectors in this space.
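To make the idea concrete, here is a simple counting sketch that computes $\Phi(x)^T \Phi(z)$ for the length-$k$ substring features without ever materializing the $26^k$-dimensional vectors. (This is just a plain hashing/counting approach for illustration; it is not the dynamic-programming string kernels the notes allude to, and the function name and sample strings are made up.)

```python
from collections import Counter

def substring_kernel(x, z, k=4):
    """Compute Phi(x)^T Phi(z), where Phi counts each length-k substring.

    The full feature vector would have 26^k coordinates, but the inner
    product only involves substrings that actually occur in x and z,
    so we never build that vector explicitly.
    """
    counts_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_z = Counter(z[i:i + k] for i in range(len(z) - k + 1))
    # Sum over the (few) substrings common to both strings.
    return sum(c * counts_z[s] for s, c in counts_x.items() if s in counts_z)

# Example usage with short amino-acid-like strings (illustrative data).
print(substring_kernel("MKTAYIAKQR", "MKTAYIQKQR", k=3))
```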

Source: OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4