This is a reasonable measure of $x$ and $z$'s similarity, and is close to 1 when $x$ and $z$ are close, and near 0 when $x$ and $z$ are far apart. Can we use this definition of $K$ as the kernel in an SVM? In this particular example, the answer is yes. (This kernel is called the Gaussian kernel, and corresponds to an infinite-dimensional feature mapping $\phi$.) But more broadly, given some function $K$, how can we tell if it is a valid kernel; i.e., can we tell if there is some feature mapping $\phi$ so that $K(x,z) = \phi(x)^T \phi(z)$ for all $x$, $z$?
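To make this concrete, here is a minimal sketch of evaluating the Gaussian kernel numerically, assuming the usual form $K(x,z) = \exp(-\|x-z\|^2/(2\sigma^2))$; the bandwidth $\sigma$ and the example points are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# Nearby points give a value close to 1; far-apart points give a value near 0.
print(gaussian_kernel([1.0, 2.0], [1.0, 2.1]))    # ~0.995
print(gaussian_kernel([1.0, 2.0], [10.0, -5.0]))  # ~0 (about 1e-29)
```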
Suppose for now that $K$ is indeed a valid kernel corresponding to some feature mapping $\phi$. Now, consider some finite set of $m$ points (not necessarily the training set) $\{x^{(1)}, \ldots, x^{(m)}\}$, and let a square, $m$-by-$m$ matrix $K$ be defined so that its $(i,j)$-entry is given by $K_{ij} = K(x^{(i)}, x^{(j)})$. This matrix is called the kernel matrix. Note that we've overloaded the notation and used $K$ to denote both the kernel function $K(x,z)$ and the kernel matrix $K$, due to their obvious close relationship.
Now, if $K$ is a valid kernel, then $K_{ij} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)}) = \phi(x^{(j)})^T \phi(x^{(i)}) = K(x^{(j)}, x^{(i)}) = K_{ji}$, and hence $K$ must be symmetric. Moreover, letting $\phi_k(x)$ denote the $k$-th coordinate of the vector $\phi(x)$, we find that for any vector $z$, we have

$$z^T K z = \sum_i \sum_j z_i K_{ij} z_j = \sum_i \sum_j z_i \phi(x^{(i)})^T \phi(x^{(j)}) z_j = \sum_i \sum_j z_i \sum_k \phi_k(x^{(i)}) \phi_k(x^{(j)}) z_j = \sum_k \Big( \sum_i z_i \phi_k(x^{(i)}) \Big)^2 \geq 0.$$
The second-to-last step above used the same trick as you saw in Problem Set 1 Q1. Since $z$ was arbitrary, this shows that $K$ is positive semi-definite ($K \succeq 0$).
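To illustrate this calculation numerically, the following sketch uses an arbitrary explicit feature map $\phi$ and a few arbitrary points, builds the kernel matrix $K_{ij} = \phi(x^{(i)})^T \phi(x^{(j)})$, and checks that $z^T K z$ equals the sum-of-squares expression above (and hence is non-negative) for a random $z$.

```python
import numpy as np

def phi(x):
    # An explicit feature map chosen only for illustration:
    # phi(x) = (x1, x2, x1*x2, x1^2, x2^2).
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# A few points (not necessarily a training set).
X = [np.array([1.0, 2.0]), np.array([0.5, -1.0]), np.array([3.0, 0.0])]
m = len(X)

# Kernel matrix: K[i, j] = phi(x_i)^T phi(x_j).
K = np.array([[phi(X[i]) @ phi(X[j]) for j in range(m)] for i in range(m)])
assert np.allclose(K, K.T)  # symmetric, as argued above

z = np.random.randn(m)
quad_form = z @ K @ z
# The same quantity, written as the sum of squares from the derivation above.
sum_of_squares = np.sum(sum(z[i] * phi(X[i]) for i in range(m)) ** 2)
assert np.isclose(quad_form, sum_of_squares) and quad_form >= -1e-12
```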
Hence, we've shown that if $K$ is a valid kernel (i.e., if it corresponds to some feature mapping $\phi$), then the corresponding kernel matrix $K \in \mathbb{R}^{m \times m}$ is symmetric positive semi-definite. More generally, this turns out to be not only a necessary, but also a sufficient, condition for $K$ to be a valid kernel (also called a Mercer kernel). The following result is due to Mercer. Many texts present Mercer's theorem in a slightly more complicated form involving $L^2$ functions, but when the input attributes take values in $\mathbb{R}^n$, the version given here is equivalent.
Theorem (Mercer). Let $K : \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$ be given. Then for $K$ to be a valid (Mercer) kernel, it is necessary and sufficient that for any $\{x^{(1)}, \ldots, x^{(m)}\}$ ($m < \infty$), the corresponding kernel matrix is symmetric positive semi-definite.
Given a function $K$, apart from trying to find a feature mapping $\phi$ that corresponds to it, this theorem therefore gives another way of testing whether it is a valid kernel. You'll also have a chance to play with these ideas more in Problem Set 2.
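One rough numerical sketch of such a test (with the point dimension, set size, and tolerance chosen arbitrarily) is to sample a few finite sets of points, form the corresponding kernel matrices, and check symmetry and the sign of the eigenvalues. Finding a set whose kernel matrix has a clearly negative eigenvalue refutes validity; passing on a handful of random sets is only weak evidence, since the theorem quantifies over all finite sets of points.

```python
import numpy as np

def kernel_matrix(K, points):
    """The m-by-m matrix whose (i, j)-entry is K(x_i, x_j)."""
    m = len(points)
    return np.array([[K(points[i], points[j]) for j in range(m)] for i in range(m)])

def violates_mercer_condition(K, n=3, m=20, trials=50, tol=1e-8):
    """Return True if some sampled kernel matrix is asymmetric or has a
    significantly negative eigenvalue; False means no violation was found."""
    for _ in range(trials):
        pts = [np.random.randn(n) for _ in range(m)]
        G = kernel_matrix(K, pts)
        if not np.allclose(G, G.T):
            return True
        if np.min(np.linalg.eigvalsh(G)) < -tol:
            return True
    return False

gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
not_a_kernel = lambda x, z: -np.sum((x - z) ** 2)

print(violates_mercer_condition(gaussian))      # expected: False
print(violates_mercer_condition(not_a_kernel))  # expected: True
```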
In class, we also briefly talked about a couple of other examples of kernels. For instance, consider the digit recognition problem, in which, given an image (16x16 pixels) of a handwritten digit (0-9), we have to figure out which digit it was. Using either a simple polynomial kernel or the Gaussian kernel, SVMs were able to obtain extremely good performance on this problem. This was particularly surprising since the input attributes $x$ were just a 256-dimensional vector of the image pixel intensity values, and the system had no prior knowledge about vision, or even about which pixels are adjacent to which other ones.

Another example that we briefly talked about in lecture was that if the objects $x$ that we are trying to classify are strings (say, $x$ is a list of amino acids, which strung together form a protein), then it seems hard to construct a reasonable, "small" set of features for most learning algorithms, especially if different strings have different lengths. However, consider letting $\phi(x)$ be a feature vector that counts the number of occurrences of each length-$k$ substring in $x$. If we're considering strings of English letters, then there are $26^k$ such strings. Hence, $\phi(x)$ is a $26^k$-dimensional vector; even for moderate values of $k$, this is probably too big for us to efficiently work with. (e.g., $26^4 \approx 460000$.) However, using (dynamic programming-ish) string matching algorithms, it is possible to efficiently compute $K(x,z) = \phi(x)^T \phi(z)$, so that we can now implicitly work in this $26^k$-dimensional feature space, but without ever explicitly computing feature vectors in this space.
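A minimal sketch of this idea, using a naive hashing/counting approach rather than the dynamic-programming algorithms mentioned above (the sample strings and the choice $k = 3$ are arbitrary): each string is summarized by a dictionary of its length-$k$ substring counts, so the inner product touches only substrings that actually occur, and the $26^k$-dimensional vectors are never formed explicitly.

```python
from collections import Counter

def substring_counts(s, k):
    """Counts of each length-k substring occurring in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def string_kernel(x, z, k=3):
    """K(x, z) = phi(x)^T phi(z), where phi(x) counts length-k substrings of x."""
    cx, cz = substring_counts(x, k), substring_counts(z, k)
    if len(cx) > len(cz):        # iterate over the smaller dictionary
        cx, cz = cz, cx
    return sum(count * cz[sub] for sub, count in cx.items())

# Two (made-up) amino-acid strings of different lengths.
print(string_kernel("MKVLAAGIVQ", "MKVLAGGIV"))
```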