Then in terms of notation, I guess, I define this term here to be the likelihood of theta. And the likelihood of theta is just the probability of the data Y, right? Given X and parameterized by theta. The terms likelihood and probability are often confused. So the likelihood of theta is the same thing as the probability of the data you saw. So likelihood and probability are, sort of, the same thing. Except that when I use the term likelihood I’m trying to emphasize that I’m taking this thing and viewing it as a function of theta. Okay? So likelihood and probability, they’re really the same thing, except that when I want to view this thing as a function of theta, holding X and Y fixed, I call it the likelihood. Okay? So hopefully you’ll hear me say the likelihood of the parameters and the probability of the data, right? Rather than the likelihood of the data or the probability of the parameters. So try to be consistent in that terminology.
So given that the probability of the data is this, and this is also the likelihood of the parameters, how do you estimate the parameters theta? So given a training set, what parameters theta do you want to choose for your model? Well, the principle of maximum likelihood estimation says that you should choose the value of theta that makes the data as probable as possible, right? So choose theta to maximize the likelihood. Or in other words, choose the parameters that make the data as probable as possible, right? So this is maximum likelihood estimation. So it’s choose the parameters that make it as probable as possible for me to have seen the data I just did.
So for mathematical convenience, let me define lower case l of theta. This is called the log likelihood function and it’s just log of capital L of theta. So this is log of a product over i of one over root two pi sigma, e to that. I won’t bother to write out what’s in the exponent for now. It’s just the same thing from the previous board. The log of a product is the same as the sum of the logs, right? So it’s a sum of the logs, which simplifies to m times log of one over root two pi sigma, plus, and then the log and the exponentiation cancel each other, right? So log of e to the something is just whatever’s inside the exponent. So, you know what, let me write this on the next board.
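The board itself is not part of this transcript, so here is a sketch of what the derivation presumably looks like. It assumes the usual linear regression setup with m training examples, y^(i) = theta^T x^(i) + epsilon^(i), and Gaussian errors with variance sigma squared. The symbols m, x^(i), y^(i) and the exact form of the exponent are assumptions taken from that standard setup, not from the text above.

```latex
\begin{align*}
L(\theta) &= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
             \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{T} x^{(i)}\bigr)^{2}}{2\sigma^{2}}\right) \\
\ell(\theta) &= \log L(\theta)
              = \sum_{i=1}^{m} \log\!\left[\frac{1}{\sqrt{2\pi}\,\sigma}
                \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{T} x^{(i)}\bigr)^{2}}{2\sigma^{2}}\right)\right] \\
             &= m \log\frac{1}{\sqrt{2\pi}\,\sigma}
                \;-\; \frac{1}{\sigma^{2}} \cdot \frac{1}{2}
                \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{T} x^{(i)}\bigr)^{2}
\end{align*}
```

The first term does not depend on theta, and the second term is a negative constant times the sum of squared errors.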
Okay. So maximizing the likelihood, or maximizing the log likelihood, is the same as minimizing that term over there. Well, you get it, right? Because there’s a minus sign. So maximizing this, because of the minus sign, is the same as minimizing this as a function of theta. And this is, of course, just the same quadratic cost function that we had last time, J of theta, right? So what we’ve just shown is that the ordinary least squares algorithm that we worked on in the previous lecture is just maximum likelihood assuming this probabilistic model, assuming IID Gaussian errors on our data. Okay?
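To see this numerically, here is a minimal sketch in Python. It assumes a one-parameter model y = theta * x with Gaussian noise, a tiny made-up training set, and a simple grid search over candidate values of theta. The function names `cost` and `log_likelihood`, the data, and the choice of sigma are all hypothetical and only there for illustration.

```python
import math

def cost(theta, xs, ys):
    """Quadratic cost J(theta) = 1/2 * sum_i (y_i - theta * x_i)^2."""
    return 0.5 * sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

def log_likelihood(theta, xs, ys, sigma):
    """Log likelihood of theta, assuming y_i = theta * x_i + Gaussian noise."""
    m = len(xs)
    constant = m * math.log(1.0 / (math.sqrt(2.0 * math.pi) * sigma))
    return constant - cost(theta, xs, ys) / sigma ** 2

# Made-up training set, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]

# Candidate values of theta on a grid from 0.0 to 2.0 in steps of 0.001.
candidates = [i / 1000.0 for i in range(2001)]

# Least squares: pick the candidate that minimizes J(theta).
theta_ls = min(candidates, key=lambda t: cost(t, xs, ys))

# Maximum likelihood: pick the candidate that maximizes the log likelihood.
theta_ml = max(candidates, key=lambda t: log_likelihood(t, xs, ys, sigma=0.5))

print(theta_ls, theta_ml)
```

Both searches pick the same grid point, because the log likelihood is just a constant minus a positive multiple of J of theta.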
One thing that we’ll actually use in the next lecture: notice that the value of sigma squared doesn’t matter, right? That somehow no matter what the value of sigma squared is, I mean, sigma squared has to be a positive number. It’s the variance of a Gaussian. So no matter what sigma squared is, since it’s a positive number, the value of theta we end up with will be the same, right? Because minimizing this, you get the same value of theta no matter what sigma squared is. So it’s as if in this model the value of sigma squared doesn’t really matter. Just remember that for the next lecture. We’ll come back to this again. Any questions about this? Actually, let me clean up another couple of boards and then I’ll see what questions you have.
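A quick way to check the point about sigma squared is to maximize the log likelihood for a few different values of sigma and compare the resulting theta. The sketch below assumes a one-parameter model y = theta * x, a made-up training set, and a grid search. The function name `log_likelihood` and the values of sigma are hypothetical.

```python
import math

def log_likelihood(theta, xs, ys, sigma):
    """Log likelihood of theta, assuming y_i = theta * x_i + Gaussian noise."""
    m = len(xs)
    constant = m * math.log(1.0 / (math.sqrt(2.0 * math.pi) * sigma))
    squared_error = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return constant - squared_error / (2.0 * sigma ** 2)

# Made-up training set, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]

# Candidate values of theta on a grid from 0.0 to 2.0 in steps of 0.001.
candidates = [i / 1000.0 for i in range(2001)]

# Maximize the log likelihood separately for each assumed noise level.
best = {}
for sigma in (0.1, 1.0, 10.0):
    best[sigma] = max(candidates, key=lambda t: log_likelihood(t, xs, ys, sigma))

print(best)
```

Changing sigma shifts and rescales the log likelihood, but the theta that maximizes it stays the same.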