
So the very first thing we're gonna talk about is something that you've probably already seen on the first homework, and something that I alluded to previously, which is the bias-variance trade-off. So take ordinary least squares, the first learning algorithm we learned about. If you fit a straight line through this data, this is not a very good model. Right. And if this happens, we say it has underfit the data, or we say that this is a learning algorithm with very high bias, because it is failing to fit the evident quadratic structure in the data. And for the purposes of [inaudible], you can informally think of the bias of the learning algorithm as representing the fact that even if you had an infinite amount of training data, even if you had tons of training data, this algorithm would still fail to fit the quadratic structure in the data. And so we think of this as a learning algorithm with high bias. Then there's the opposite problem. So here's the same dataset. If you fit a fourth-order polynomial to this dataset, then you'll be able to interpolate the five data points exactly, but clearly this is also not a great model for the structure that you and I probably see in the data.
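Here is a minimal sketch of that contrast, assuming NumPy and a made-up dataset with quadratic structure (the data, random seed, and coefficients are hypothetical, purely for illustration):

    import numpy as np

    # Hypothetical data: quadratic structure plus a little noise
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    y = 1.0 + 2.0 * x + 3.0 * x**2 + 0.05 * rng.standard_normal(x.size)

    # Ordinary least squares fit of a straight line (degree 1) and a quadratic (degree 2)
    line = np.polyfit(x, y, deg=1)   # high bias: misses the curvature no matter how much data you add
    quad = np.polyfit(x, y, deg=2)   # captures the quadratic structure

    # The straight line leaves a systematically larger training error
    print(np.mean((np.polyval(line, x) - y) ** 2))
    print(np.mean((np.polyval(quad, x) - y) ** 2))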

And we say that this algorithm has a problem – excuse me, is overfitting the data, or alternatively that this algorithm has high variance. Okay? And the intuition behind overfitting and high variance is that the algorithm is fitting spurious patterns in the data, or is fitting idiosyncratic properties of this specific dataset, be it a dataset of housing prices or whatever. And quite often, there'll be some happy medium of fitting a quadratic function that maybe won't interpolate your data points perfectly, but also captures more of the structure in your data than a simple model which underfits. And you can have exactly the same picture for classification problems as well. So let's say this is my training set, right, of positive and negative examples, and you can fit logistic regression with a very high-order polynomial [inaudible], so H of X equals the sigmoid function of – whatever, the sigmoid function applied to that high-order polynomial. And if you do that, maybe you get a decision boundary like this. Right. That does indeed perfectly separate the positive and negative classes, but this is another example of overfitting. And in contrast, if you fit logistic regression to this dataset with just the linear features, with none of the quadratic features, then maybe you get a decision boundary like that, which can also underfit. Okay.
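As a rough sketch of both failure modes, assuming NumPy and scikit-learn on made-up data (the points, degrees, and regularization settings below are hypothetical choices, not anything from the lecture):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Overfitting in regression: a fourth-order polynomial interpolates five points exactly
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.2, 2.3, 2.1, 3.8, 3.9])
    p4 = np.polyfit(x, y, deg=4)
    print(np.allclose(np.polyval(p4, x), y))  # True: zero training error, but a wiggly, implausible curve

    # The same picture for classification: linear vs. high-order polynomial decision boundaries
    rng = np.random.default_rng(1)
    X = rng.standard_normal((40, 2))
    labels = (X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(40) > 0).astype(int)

    linear_clf = LogisticRegression().fit(X, labels)  # straight boundary, may underfit
    poly_clf = make_pipeline(PolynomialFeatures(degree=10),
                             LogisticRegression(C=1e6, max_iter=5000)).fit(X, labels)  # may overfit
    print(linear_clf.score(X, labels), poly_clf.score(X, labels))  # training accuracy only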

So what I want to do now is understand this problem of overfitting versus underfitting, of high bias versus high variance, more formally. I'll do that by posing a more formal model of machine learning and trying to prove when each of these two twin problems comes up. And as a motivating example for our initial foray into learning theory, I want to talk about learning classification, in which H of X is equal to G of theta transpose X. Okay? So we're learning a linear classifier. And for this class I'm gonna use G as the indicator of Z greater than or equal to zero. With apologies in advance for changing the notation yet again: for the support vector machine lectures we used Y equals minus one or plus one; for the learning theory lectures, it turns out it'll be a bit cleaner if I switch back to Y equals zero-one again, so I'm gonna switch back to my original notation. And so you can think of this model as something like logistic regression, except that now we're going to force the algorithm to output labels that are either zero or one. Okay? So you can think of this as a classifier that outputs labels zero or one directly, rather than probabilities. And so as usual, let's say we're given a training set of M examples. That's just my notation for writing a set of M examples ranging from I equals one through M. And I'm going to assume that the training examples XI, YI are drawn IID from some distribution, script D. Okay? Independently and identically distributed. And if you're running a classification problem on houses, with features of the house comma whether the house will be sold in the next six months, then this is just the probability distribution over features of houses and whether or not they'll be sold. Okay?
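Written out symbolically (my own summary of the spoken notation, in standard LaTeX), the model and the sampling assumption are:

    h_\theta(x) = g(\theta^T x), \qquad g(z) = \mathbf{1}\{ z \ge 0 \}, \qquad y \in \{0, 1\}

    S = \{ (x^{(i)}, y^{(i)}) \;;\; i = 1, \dots, m \}, \qquad (x^{(i)}, y^{(i)}) \overset{\text{iid}}{\sim} \mathcal{D}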





Source: OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4
