MachineLearning-Lecture08
Instructor (Andrew Ng): Okay. Good morning. Welcome back. If you haven’t given me the homework yet, you can just give it to me at the end of class. That’s fine. Let’s see. And also just a quick reminder – I’ve actually seen project proposals start to trickle in already, which is great. As a reminder, project proposals are due this Friday, and if any of you want to meet and chat more about project ideas, I also have office hours immediately after lecture today. Are there any questions about any of that before I get started today? Great.
Okay. Welcome back. What I want to do today is wrap up our discussion on support vector machines and in particular we’ll also talk about the idea of kernels and then talk about [inaudible] and then I’ll talk about the SMO algorithm, which is an algorithm for solving the optimization problem that we posed last time.
To recap, we wrote down the following convex optimization problem. All this is assuming that the data is linearly separable, which is an assumption that I’ll fix later, and so with this optimization problem, given a training set, this will find the optimal margin classifier for the data set, the one that maximizes the geometric margin from your training examples.
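The optimization problem itself was written on the board and is not captured in the transcript. Assuming it is the standard optimal margin classifier problem, with m training examples (x_i, y_i) and labels y_i in {-1, +1} (the subscript notation is chosen here to match the spoken "XI" and "YI"), it would read roughly as follows:

```latex
\begin{aligned}
\min_{w,\,b}\quad & \frac{1}{2}\lVert w \rVert^2 \\
\text{s.t.}\quad & y_i \left( w^T x_i + b \right) \ge 1, \qquad i = 1, \dots, m
\end{aligned}
```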
And so in the previous lecture, we also derived the dual of this problem, which was to maximize this. And this is the dual of our primal [inaudible] optimization problem. Here, I’m using these angle brackets to denote inner product, so this is just XI transpose XJ for vectors XI and XJ. We also worked out that W would be given by sum over I alpha I YI XI.
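The dual was also written on the board rather than spoken. Assuming the standard dual of the optimal margin classifier, with one multiplier alpha_i per training example and m examples in total (m is a symbol assumed here), it would look like this, together with the expression for W quoted above:

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \\
\text{s.t.}\quad & \alpha_i \ge 0, \qquad i = 1, \dots, m \\
& \sum_{i=1}^{m} \alpha_i y_i = 0
\end{aligned}

w = \sum_{i=1}^{m} \alpha_i y_i x_i
```

Note that the training inputs appear in the objective only inside the inner products.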
Therefore, when you need to make a prediction at classification time, you need to compute the value of the hypothesis applied to an [inaudible], which is G of W transpose X plus B, where G is that threshold function that outputs plus one and minus one. And so, substituting in the expression for W, this is G of sum over I alpha I YI times the inner product of XI and X, plus B. So that can also be written in terms of inner products between input vectors X.
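The equivalence between the two forms of the prediction can be checked numerically. The following is a minimal sketch, not code from the lecture: the function names, the toy data, and the values of alpha and B are all made up for illustration, and ties at a score of exactly zero are assigned to plus one by assumption.

```python
import numpy as np

def predict_primal(w, b, x_new):
    """g(w^T x + b), where g thresholds to +1 or -1."""
    return 1 if w @ x_new + b >= 0 else -1

def predict_dual(alpha, y, X, b, x_new):
    """g(sum_i alpha_i y_i <x_i, x> + b), using only inner products with x."""
    inner = X @ x_new                      # <x_i, x_new> for every training example
    score = np.sum(alpha * y * inner) + b
    return 1 if score >= 0 else -1

# Hypothetical toy values (not from the lecture); rows of X are training inputs.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.3, 0.0, 0.3])          # chosen so that sum_i alpha_i y_i = 0
b = -0.1

w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
x_new = np.array([0.5, 1.0])

print(predict_primal(w, b, x_new), predict_dual(alpha, y, X, b, x_new))
```

The dual form never builds W explicitly. It only needs the inner products between the new input and the training inputs, which is the property the kernel idea relies on.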
So what I want to do now is talk about the idea of kernels, which will make use of this property, because it turns out that the only dependence of the algorithm on X is through these inner products. In fact, you can write the entire algorithm without ever explicitly referring to an X vector [inaudible] between input feature vectors. And the idea of a kernel is as follows: let’s say that you have an input attribute. Let’s just say for now it’s a real number. Maybe this is the living area of a house that you’re trying to make a prediction on, like whether it will be sold in the next six months.
Quite often, we’ll take this feature X and we’ll map it to a richer set of features. So for example, we will take X and map it to these four polynomial features, and let me actually call this mapping Phi. So we’ll let Phi of X denote the mapping from your original features to some higher dimensional set of features.
So if you do this and you want to use the features Phi of X, then all you need to do is go back to the learning algorithm, and everywhere you see the inner product between XI and XJ, we’ll replace it with the inner product between Phi of XI and Phi of XJ. So this corresponds to running a support vector machine with the features given by Phi of X rather than with your original one-dimensional input feature X.
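The substitution can be sketched in a few lines. The transcript does not record which four polynomial features were written on the board, so the map below (X, X squared, X cubed, X to the fourth) is an assumption, and the function names and living-area values are made up for illustration.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map: scalar x -> (x, x^2, x^3, x^4)."""
    return np.array([x, x**2, x**3, x**4])

def gram(features):
    """Matrix of all pairwise inner products <f_i, f_j>; one row per example."""
    F = np.asarray(features, dtype=float)
    return F @ F.T

# Made-up one-dimensional inputs (e.g. living areas in arbitrary units).
x_train = np.array([1.0, 2.0, 3.0])

G_original = gram(x_train.reshape(-1, 1))     # entries <x_i, x_j>
G_mapped = gram([phi(x) for x in x_train])    # entries <phi(x_i), phi(x_j)>

print(G_original)
print(G_mapped)
```

Since the algorithm only touches the inputs through these inner products, swapping `G_original` for `G_mapped` is all it takes to train on the mapped features.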