
So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector – let's see, let me just write it as x subscript j – will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values, corresponding to which word appeared in the j-th position of my email, okay?
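To make that representation concrete, here is a minimal Python sketch, assuming a tiny made-up dictionary standing in for the 50,000-word one; each email is mapped to a list of dictionary indices, one per word position.

```python
# A minimal sketch of the representation described above: each email becomes
# a list of dictionary indices, one entry per word. The six-word vocabulary
# here is made up purely for illustration; a real dictionary would have on
# the order of 50,000 entries.
dictionary = ["a", "buy", "discount", "hello", "meeting", "viagra"]
word_to_index = {word: idx for idx, word in enumerate(dictionary)}

def email_to_features(email_text):
    """Map each word of the email to its index in the dictionary.

    Words that are not in the dictionary are simply skipped in this sketch.
    """
    return [word_to_index[w] for w in email_text.lower().split() if w in word_to_index]

print(email_to_features("Buy discount viagra"))  # -> [1, 2, 5]
print(email_to_features("hello a meeting"))      # -> [3, 0, 4]
```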

So, in other words, I’m gonna take all the words in my email, and the feature vector just says which word in my dictionary each word of the email was, okay? So there’s a different definition for n subscript i now: n subscript i now varies and is different for every training example, and x subscript j is now an index into the dictionary. The components of the feature vector are no longer binary random variables; they’re these indices into the dictionary that take on a much larger set of values.

And so our generative model for this will be that the joint distribution over x and y will be that, where again n is now the length of the email, all right? So the way to think about this formula is you imagine that there was some probability distribution over emails. There’s some random process that generates the emails, and that process proceeds as follows: first, y is chosen – first the class label, whether someone is gonna send you spam email or non-spam email, is chosen for us.
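The formula being pointed to on the board is, presumably, the usual multinomial event model factorization of the joint distribution:

```latex
p(x, y) \;=\; p(y) \prod_{j=1}^{n} p(x_j \mid y)
```

where n is the number of words in the email and each x subscript j ranges over the indices of the dictionary.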

So first y, the random variable y, the class label of spam or not spam, is generated, and then, having decided whether to send you spam or not spam, someone iterates over all 300 positions of the email – or the 300 words that are going to compose the email – and generates words from some distribution that depends on whether they chose to send you spam or not spam. So if they’re sending you spam, they’ll tend to generate words like, you know, buy, and Viagra, and discounts, sale, whatever. And if somebody chose to send you non-spam, then they’ll send you, sort of, the more normal words you get in an email, okay?
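As an illustration of that generative story, here is a small Python sketch with made-up parameters: first the class label y is drawn, and then each word of the email is drawn independently from a distribution over the dictionary that depends on y.

```python
import random

# Made-up parameters for illustration only; in practice these would be
# estimated from a training set.
dictionary  = ["buy", "discount", "viagra", "hello", "meeting", "report"]
phi_y       = 0.3                                    # P(y = 1), i.e. P(spam)
phi_spam    = [0.30, 0.25, 0.25, 0.10, 0.05, 0.05]   # P(word k | y = 1)
phi_nonspam = [0.02, 0.03, 0.01, 0.30, 0.32, 0.32]   # P(word k | y = 0)

def generate_email(n_words=5):
    """Sample (y, words): choose spam/not-spam, then draw each word i.i.d."""
    y = 1 if random.random() < phi_y else 0
    probs = phi_spam if y == 1 else phi_nonspam
    words = random.choices(dictionary, weights=probs, k=n_words)
    return y, words

print(generate_email())
```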

So, sort of, just be careful, right? x subscript j here has a very different definition from the previous event model, and n has a very different definition from the previous event model. And so the parameters of this model are – let’s see. Phi subscript k given y equals one, which is the probability that, you know, conditioned on someone deciding to send you spam, the next word they choose to put in that spam email is going to be word k; and similarly, phi subscript k given y equals zero is, sort of, the same thing – well, I’ll just write it out, I guess – and phi subscript y is just the same as before, okay?
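Written out, the parameters being described are presumably:

```latex
\phi_{k \mid y=1} = p(x_j = k \mid y = 1), \qquad
\phi_{k \mid y=0} = p(x_j = k \mid y = 0), \qquad
\phi_{y} = p(y = 1)
```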

So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. So the maximum likelihood estimate of the parameters will be equal to – and now I’m gonna write one of these, you know, big indicator function things again – a sum over your training set, and a sum over all the words in that email, where n subscript i is the number of words in email i in your training set, of the indicator that email i was spam and that x subscript ij equals k, okay?
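Written as an equation, the maximum likelihood estimate being dictated here is presumably the standard one for the multinomial event model, where m is the number of training examples:

```latex
\phi_{k \mid y=1} \;=\;
\frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{ x_{ij} = k \,\wedge\, y_i = 1 \}}
     {\sum_{i=1}^{m} 1\{ y_i = 1 \}\, n_i}
```

with the analogous expression for phi subscript k given y equals zero, and phi subscript y estimated as the fraction of training emails that are spam, as before.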

Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4