<< Chapter < Page Chapter >> Page >

The best quadratic function – by best I mean in the sense of generalization error – the hypothesis function – the quadratic function with the lowest generalization error has to have equal or more likely lower generalization error than the best linear function. So by switching to a more complex hypothesis class you can get this first term as you go down. But what I pay for then is that K will increase. By switching to a larger hypothesis class, the first term will go down, but the second term will increase because I now have a larger class of hypotheses and so the second term K will increase.

And so this is sometimes called the bias – this is usually called the bias variance tradeoff. Whereby going to larger hypothesis class maybe I have the hope for finding a better function, that my risk of sort of not fitting my model so accurately also increases, and that’s because – illustrated by the second term going up when the size of your hypothesis, when K goes up. And so speaking very loosely, we can think of this first term as corresponding maybe to the bias of the learning algorithm, or the bias of the hypothesis class. And you can – again speaking very loosely, think of the second term as corresponding to the variance in your hypothesis, in other words how well you can actually fit a hypothesis in the – how well you actually fit this hypothesis class to the data. And by switching to a more complex hypothesis class, your variance increases and your bias decreases.

As a note of warning, it turns out that if you take like a statistics class you’ve seen definitions of bias and variance, which are often defined in terms of squared error or something. It turns out that for classification problems, there actually is no universally accepted formal definition of bias and variance for classification problems. For regression problems, there is this square error definition. For classification problems it turns out there’ve been several competing proposals for definitions of bias and variance. So when I say bias and variance here, think of these as very loose, informal, intuitive definitions, and not formal definitions. Okay. The cartoon associated with intuition I just said would be as follows: Let’s say – and everything about the plot will be for a fixed value of M, for a fixed training set size M. Vertical axis I’ll plot ever and on the horizontal axis I’ll plot model complexity. And by model complexity I mean sort of degree of polynomial, size of your hypothesis class script H etc. It actually turns out, you remember the bandwidth parameter from locally weighted linear regression, that also has a similar effect in controlling how complex your model is. Model complexity [inaudible] polynomial I guess. So the more complex your model, the better your training error, and so your training error will tend to [inaudible]zero as you increase the complexity of your model because the more complete your model the better you can fit your training set.

But because of this bias variance tradeoff, you find that generalization error will come down for a while and then it will go back up. And this regime on the left is when you’re underfitting the data or when you have high bias. And this regime on the right is when you have high variance or you’re overfitting the data. Okay? And this is why a model of sort of intermediate complexity, somewhere here if often preferable to if [inaudible] and minimize generalization error. Okay? So that’s just a cartoon. In the next lecture we’ll actually talk about the number of algorithms for trying to automatically select model complexities, say to get you as close as possible to this minimum – to this area of minimized generalization error. The last thing I want to do is actually going back to the theorem I wrote out, I just want to take that theorem – well, so the theorem I wrote out was an error bound theorem this says for fixed M and delta where probability one minus delta, I get a bound on gamma, which is what this term is. So the very last thing I wanna do today is just come back to this theorem and write out a corollary where I’m gonna fix gamma, I’m gonna fix my error bound, and fix delta and solve for M. And if you do that, you get the following corollary: Let H be fixed with K hypotheses and let any delta and gamma be fixed.

Then in order to guarantee that, let’s say I want a guarantee that the generalization error of the hypothesis I choose with empirical risk minimization, that this is at most two times gamma worse than the best possible error I could obtain with this hypothesis class. Lets say I want this to hold true with probability at least one minus delta, then it suffices that M is [inaudible] to that. Okay? And this is sort of solving for the error bound for M. One thing we’re going to convince yourselves of the easy part of this is if you set that term [inaudible]gamma and solve for M you will get this. One thing I want you to go home and sort of convince yourselves of is that this result really holds true. That this really logically follows from the theorem we’ve proved. In other words, you can take that formula we wrote and solve for M and – because this is the formula you get for M, that’s just – that’s the easy part. That once you go back and convince yourselves that this theorem is a true fact and that it does indeed logically follow from the other one. In particular, make sure that if you solve for that you really get M grading equals this, and why is this M grading that and not M less equal two, and just make sure – I can write this down and it sounds plausible why don’t you just go back and convince yourself this is really true. Okay?

And it turns out that when we prove these bounds in learning theory it turns out that very often the constants are sort of loose. So it turns out that when we prove these bounds usually we’re interested – usually we’re not very interested in the constants, and so I write this as big O of one over gamma squared, log K over delta, and again, the key step in this is that the dependence on M with the size of the hypothesis class is logarithmic. And this will be very important later when we talk about infinite hypothesis classes. Okay? Any questions about this? No? Okay, cool. So next lecture we’ll come back, we’ll actually start from this result again. Remember this. I’ll write this down as the first thing I do in the next lecture and we’ll generalize these to infinite hypothesis classes and then talk about practical algorithms for model spectrum. So I’ll see you guys in a couple days.

[End of Audio]

Duration: 75 minutes

Questions & Answers

are nano particles real
Missy Reply
Hello, if I study Physics teacher in bachelor, can I study Nanotechnology in master?
Lale Reply
no can't
where we get a research paper on Nano chemistry....?
Maira Reply
nanopartical of organic/inorganic / physical chemistry , pdf / thesis / review
what are the products of Nano chemistry?
Maira Reply
There are lots of products of nano chemistry... Like nano coatings.....carbon fiber.. And lots of others..
Even nanotechnology is pretty much all about chemistry... Its the chemistry on quantum or atomic level
no nanotechnology is also a part of physics and maths it requires angle formulas and some pressure regarding concepts
Preparation and Applications of Nanomaterial for Drug Delivery
Hafiz Reply
Application of nanotechnology in medicine
has a lot of application modern world
what is variations in raman spectra for nanomaterials
Jyoti Reply
ya I also want to know the raman spectra
I only see partial conversation and what's the question here!
Crow Reply
what about nanotechnology for water purification
RAW Reply
please someone correct me if I'm wrong but I think one can use nanoparticles, specially silver nanoparticles for water treatment.
yes that's correct
I think
Nasa has use it in the 60's, copper as water purification in the moon travel.
nanocopper obvius
what is the stm
Brian Reply
is there industrial application of fullrenes. What is the method to prepare fullrene on large scale.?
industrial application...? mmm I think on the medical side as drug carrier, but you should go deeper on your research, I may be wrong
How we are making nano material?
what is a peer
What is meant by 'nano scale'?
What is STMs full form?
scanning tunneling microscope
how nano science is used for hydrophobicity
Do u think that Graphene and Fullrene fiber can be used to make Air Plane body structure the lightest and strongest. Rafiq
what is differents between GO and RGO?
what is simplest way to understand the applications of nano robots used to detect the cancer affected cell of human body.? How this robot is carried to required site of body cell.? what will be the carrier material and how can be detected that correct delivery of drug is done Rafiq
analytical skills graphene is prepared to kill any type viruses .
Any one who tell me about Preparation and application of Nanomaterial for drug Delivery
what is Nano technology ?
Bob Reply
write examples of Nano molecule?
The nanotechnology is as new science, to scale nanometric
nanotechnology is the study, desing, synthesis, manipulation and application of materials and functional systems through control of matter at nanoscale
Is there any normative that regulates the use of silver nanoparticles?
Damian Reply
what king of growth are you checking .?
how did you get the value of 2000N.What calculations are needed to arrive at it
Smarajit Reply
Privacy Information Security Software Version 1.1a
Got questions? Join the online conversation and get instant answers!
Jobilize.com Reply

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now

Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Machine learning' conversation and receive update notifications?