
So when your spam classifier goes to compute P(y = 1 | x), it will compute this product right here times P(y = 1). And so if you look at those terms, say, this will be the product from i = 1 to 50,000 of P(x_i | y), and one of those probabilities will be equal to zero, because P(x_30,000 = 1 | y = 1) is equal to zero. So you have a zero in this product, and so the numerator is zero, and in the same way, it turns out the denominator will also be zero, so actually all of these terms end up being zero. So you end up with P(y = 1 | x) being 0 over 0 + 0, okay, which is undefined.

And the problem with this is that it’s just statistically a bad idea to say that P(x_30,000 = 1 | y) is zero. Just because you haven’t seen the word NIPS in your last two months’ worth of email, it’s statistically not sound to conclude that, therefore, the chance of ever seeing this word is zero, right?

And so the idea is that just because you haven’t seen something before, that may mean the event is unlikely, but it doesn’t mean it’s impossible. It isn’t right to say that if you’ve never seen the word NIPS before, then it is impossible to ever see the word NIPS in future emails, that the chance of that is just zero.
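To make the failure concrete, here is a minimal sketch (not from the lecture itself) of what happens numerically when one per-word probability is estimated as zero; the word probabilities and priors are made-up illustrative values:

```python
# Illustrative sketch of the zero-probability failure in Naive Bayes.
# Suppose word 30,000 ("nips") never appeared in the training emails,
# so its estimated conditional probability is 0 under both classes.
p_x_given_spam = [0.8, 0.1, 0.0]  # last factor plays the role of P(x_30000=1 | y=1) = 0
p_x_given_ham  = [0.2, 0.3, 0.0]  # likewise 0 under y = 0
p_spam, p_ham = 0.5, 0.5          # class priors

def product(ps):
    out = 1.0
    for p in ps:
        out *= p
    return out

num = product(p_x_given_spam) * p_spam       # numerator of P(y=1 | x): 0.0
den = num + product(p_x_given_ham) * p_ham   # denominator: 0.0 + 0.0
# P(y=1 | x) = num / den would be 0/0, which is undefined
print(num, den)
```

A single zero factor wipes out every other piece of evidence in the product, which is exactly why the posterior collapses to 0/0.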

So we’re gonna fix this, and to motivate the fix I’ll talk about – the example we’re gonna use is let’s say that you’ve been following the Stanford basketball team for all of their away games, and been, sort of, tracking their wins and losses to gather statistics, and, maybe – I don’t know, form a betting pool about whether they’re likely to win or lose the next game, okay?

So these are some of the statistics. So on, I guess, the 8th of February last season they played Washington State, and they did not win. On the 11th of February they played Washington, on the 22nd they played USC, then they played UCLA, then played USC again, and now you want to estimate what’s the chance that they’ll win or lose against Louisville, right?

So they’ve lost all five of their away games, but it seems awfully harsh to say that – so it seems awfully harsh to say there’s zero chance that they’ll win the next game. So here’s the idea behind Laplace smoothing, which is that when we estimate the probability of Y being equal to one, normally the maximum likelihood estimate is the number of ones divided by the number of zeros plus the number of ones, okay?

I hope this informal notation makes sense, right? The maximum likelihood estimate for, sort of, a win or loss, for a Bernoulli random variable, is just the number of ones you saw divided by the total number of examples, which is the number of zeros you saw plus the number of ones you saw.

So in Laplace smoothing we’re going to just take each of these terms, the number of ones, and, sort of, add one to that, and the number of zeros, and add one to that. And so in our example, instead of estimating the probability of winning the next game to be 0/(5 + 0), we’ll add one to each of these counts, and so we say that the chance of their winning the next game is (0 + 1)/(5 + 2), which is 1/7, okay? Which is to say, having seen them lose, you know, five away games in a row, we don’t think it’s terribly likely they’ll win the next game, but at least we’re not saying it’s impossible.
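The arithmetic above can be sketched in a few lines; the `wins` list encodes the five away losses from the example:

```python
# Laplace smoothing for a Bernoulli (win/loss) estimate, per the lecture.
wins = [0, 0, 0, 0, 0]  # five away games, all losses

# Maximum likelihood: ones / (zeros + ones) = 0/5 = 0
mle = sum(wins) / len(wins)

# Laplace: add one to the count of ones and one to the count of zeros,
# i.e. (ones + 1) / (total + 2) = (0 + 1)/(5 + 2) = 1/7
laplace = (sum(wins) + 1) / (len(wins) + 2)

print(mle, laplace)
```

Note that the smoothed estimate is never exactly zero or one, no matter what the data looks like.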

As a historical side note, Laplace actually came up with this method, which is why it’s called Laplace smoothing after him, when he was trying to estimate the probability that the sun will rise tomorrow. His rationale was that we’ve seen the sun rise for a lot of days now, but that doesn’t mean we can be absolutely certain the sun will rise tomorrow. He was using this to estimate that probability. This is, kind of, cool.

So, more generally, if y takes on k possible values, and you’re trying to estimate the parameters of the multinomial, then the maximum likelihood estimate of P(y = j) will be the sum from i = 1 to m of the indicator y^(i) = j, divided by m. That’s the maximum likelihood estimate for the probability of y = j, and so when you apply Laplace smoothing to that, you add one to the numerator and add k to the denominator, if y can take on k possible values, okay?
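A short sketch of that multinomial version, assuming class labels are integers 0 through k − 1 (the function name here is my own, not from the lecture):

```python
from collections import Counter

def laplace_multinomial(ys, k):
    """Smoothed estimate of P(y = j) for each class j in 0..k-1:
    (count of j + 1) / (m + k), where m = number of examples."""
    counts = Counter(ys)
    m = len(ys)
    return [(counts[j] + 1) / (m + k) for j in range(k)]

# m = 4 observations over k = 3 classes; class 2 is never observed,
# yet its smoothed probability stays strictly positive.
probs = laplace_multinomial([0, 0, 1, 1], 3)
print(probs)  # [3/7, 3/7, 1/7]
```

Adding 1 to each of the k numerators and k to the denominator keeps the estimates summing to one while ruling out zeros.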

So for Naive Bayes, what that gives us is the following. That was the maximum likelihood estimate, and what you end up doing is adding one to the numerator and adding two to the denominator, since each feature is binary, and this solves the problem of the zero probabilities. And when your friend sends you email about the NIPS conference, your spam filter will still be able to make a meaningful prediction, all right? Okay. Any questions about this? Yeah?
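The lecture only states the smoothed Naive Bayes estimate verbally (add one to the numerator, two to the denominator), so here is a hedged sketch of one way to write it, assuming binary features x_j and labels y in {0, 1}; the function name and the toy data are my own:

```python
def phi_j_given_y1(X, y, j):
    """Smoothed estimate of P(x_j = 1 | y = 1):
    (1 + #{i : x_j^(i) = 1 and y^(i) = 1}) / (2 + #{i : y^(i) = 1})."""
    num = 1 + sum(1 for xi, yi in zip(X, y) if yi == 1 and xi[j] == 1)
    den = 2 + sum(1 for yi in y if yi == 1)
    return num / den

# Tiny made-up training set: 2 features, 3 emails, labels 1 = spam.
X = [[1, 0], [1, 1], [0, 0]]
y = [1, 1, 0]

print(phi_j_given_y1(X, y, 0))  # word 0 in both spam emails: (1+2)/(2+2) = 3/4
print(phi_j_given_y1(X, y, 1))  # word 1 in one spam email:   (1+1)/(2+2) = 1/2
```

A word that never appears in spam would get (1 + 0)/(2 + count) rather than zero, which is exactly the fix being described.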

Student: So that’s what doesn’t make sense, because, for instance, if you look at those games, it’s a reasonable assumption that the probability of winning is very close to zero, so, I mean, shouldn’t the prediction be close to zero?

Instructor (Andrew Ng): Right. I would say that in this case the prediction is 1/7, right? If you see somebody lose five games in a row, you may not have a lot of faith in them, but as an extreme example, suppose you saw them lose just one game. It’s just not reasonable to say that the chance of winning the next game is zero, but that’s what the maximum likelihood estimate will say.

Student: Yes.

Instructor (Andrew Ng): And –

Student: In such a case anywhere the learning algorithm [inaudible] or –

Instructor (Andrew Ng): So the question is, you know, given just five training examples, what’s a reasonable estimate for the chance of winning the next game, and 1/7, I think, is actually pretty reasonable. It’s less than 1/5, for instance; we’re saying the chances of winning the next game are less than 1/5.

It turns out that, under a certain set of Bayesian assumptions about the prior and posterior that I won’t go into, this Laplace smoothing actually gives the optimal estimate, in a certain sense I won’t go into, of the chance of winning the next game. So it actually seems like a pretty reasonable estimate to me. Although, I should say, it actually turned out –

No, I’m just being mean. We actually are a pretty good basketball team, but I chose a losing streak because it’s funnier that way. Let’s see. Shoot. Does someone want to – are there other questions about this? No, yeah. Okay. So there’s more that I want to say about Naive Bayes, but we’ll do that in the next lecture. So let’s wrap it for today.

[End of Audio]

Duration: 76 minutes





Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4