
The EM algorithm

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables. We begin our discussion with a very useful result called Jensen's inequality.

Jensen's inequality

Let f be a function whose domain is the set of real numbers. Recall that f is a convex function if $f''(x) \geq 0$ (for all $x \in \mathbb{R}$). In the case of f taking vector-valued inputs, this is generalized to the condition that its Hessian H is positive semi-definite ($H \geq 0$). If $f''(x) > 0$ for all x, then we say f is strictly convex (in the vector-valued case, the corresponding statement is that H must be positive definite, written $H > 0$). Jensen's inequality can then be stated as follows:

Theorem. Let f be a convex function, and let X be a random variable. Then:

$$E[f(X)] \geq f(EX).$$

Moreover, if f is strictly convex, then $E[f(X)] = f(EX)$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).

Recall our convention of occasionally dropping the parentheses when writing expectations, so in the theorem above, $f(EX) = f(E[X])$.

For an interpretation of the theorem, consider the figure below.

[Figure: graphical representation of the theorem]

Here, f is a convex function shown by the solid line. Also, X is a random variable that has a 0.5 chance of taking the value a, and a 0.5 chance of taking the value b (indicated on the x-axis). Thus, the expected value of X is given by the midpoint between a and b.

We also see the values f(a), f(b), and f(E[X]) indicated on the y-axis. Moreover, the value E[f(X)] is now the midpoint on the y-axis between f(a) and f(b). From our example, we see that because f is convex, it must be the case that $E[f(X)] \geq f(EX)$.
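To make the picture concrete, here is a minimal Python sketch (our own illustration, not part of the notes) that checks the inequality for the convex function f(x) = x² and a two-point random variable; the particular values of a and b are arbitrary choices.

```python
# Numeric illustration of Jensen's inequality for a two-point random variable.
# f(x) = x**2 is our (assumed) convex example function; a and b are arbitrary.

def f(x):
    return x ** 2          # convex: f''(x) = 2 > 0

a, b = 1.0, 5.0            # the two values X can take, each with probability 0.5

E_X = 0.5 * a + 0.5 * b                # E[X], the midpoint between a and b
E_fX = 0.5 * f(a) + 0.5 * f(b)         # E[f(X)], midpoint between f(a) and f(b)

print(f"f(E[X]) = {f(E_X)}")           # 9.0
print(f"E[f(X)] = {E_fX}")             # 13.0
assert E_fX >= f(E_X)                  # Jensen: E[f(X)] >= f(E[X]) for convex f
```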

Incidentally, quite a lot of people have trouble remembering which way the inequality goes, and remembering a picture like this is a good way to quickly figure out the answer.

Remark. Recall that f is [strictly] concave if and only if −f is [strictly] convex (i.e., $f''(x) \leq 0$ or $H \leq 0$). Jensen's inequality also holds for concave functions f, but with the direction of all the inequalities reversed ($E[f(X)] \leq f(EX)$, etc.).

The EM algorithm

Suppose we have an estimation problem in which we have a training set $\{x^{(1)}, \ldots, x^{(m)}\}$ consisting of m independent examples. We wish to fit the parameters of a model $p(x, z)$ to the data, where the likelihood is given by

$$\ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta).$$

But, explicitly finding the maximum likelihood estimates of the parameters θ may be hard. Here, the $z^{(i)}$'s are the latent random variables; and it is often the case that if the $z^{(i)}$'s were observed, then maximum likelihood estimation would be easy.
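As a concrete (and entirely illustrative) instance of such a model, the sketch below evaluates this log-likelihood for a toy two-component univariate Gaussian mixture, where $z \in \{0, 1\}$ plays the role of the latent variable; the parameter names phi, mu, and sigma and the data values are our own choices, not anything fixed by the notes.

```python
import numpy as np
from scipy.stats import norm

# Toy two-component Gaussian mixture: z in {0, 1} is the latent component label.
# p(x, z; theta) = p(z; phi) * p(x | z; mu, sigma). Parameter names are illustrative.

def log_likelihood(x, phi, mu, sigma):
    """l(theta) = sum_i log sum_z p(x_i, z; theta), with the latent z summed out."""
    total = 0.0
    for xi in x:
        # joint p(x_i, z; theta) for each of the two values of z
        joint = [phi[z] * norm.pdf(xi, mu[z], sigma[z]) for z in (0, 1)]
        total += np.log(sum(joint))        # log of the marginal p(x_i; theta)
    return total

x = np.array([-1.2, 0.3, 4.8, 5.1])         # made-up data
phi = [0.5, 0.5]                            # mixing proportions p(z)
mu, sigma = [0.0, 5.0], [1.0, 1.0]          # component means and standard deviations
print(log_likelihood(x, phi, mu, sigma))
```

If the component labels z were observed, each component's parameters could be fit directly from the points assigned to it; summing z out is exactly what makes maximizing $\ell(\theta)$ hard.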

In such a setting, the EM algorithm gives an efficient method for maximum likelihood estimation. Maximizing $\ell(\theta)$ explicitly might be difficult, and our strategy will be to instead repeatedly construct a lower-bound on $\ell$ (E-step), and then optimize that lower-bound (M-step).

For each i, let $Q_i$ be some distribution over the z's ($\sum_z Q_i(z) = 1$, $Q_i(z) \geq 0$). (If z were continuous, then $Q_i$ would be a density, and the summations over z in our discussion would be replaced with integrals over z.) Consider the following:

$$\ell(\theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})},$$

where the last step uses Jensen's inequality with the concave function log, so that any valid choice of the $Q_i$ gives a lower bound on $\ell(\theta)$.
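Continuing the toy mixture sketch above (again purely illustrative), the following code evaluates this lower bound for a particular choice of the $Q_i$ and checks that it sits below $\ell(\theta)$; it also shows that choosing $Q_i(z) = p(z \mid x^{(i)}; \theta)$, the posterior over z, makes the bound tight.

```python
import numpy as np
from scipy.stats import norm

# Lower-bound check for the toy mixture above (same illustrative parameters).

def joint(xi, z, phi, mu, sigma):
    """p(x_i, z; theta) for the two-component mixture."""
    return phi[z] * norm.pdf(xi, mu[z], sigma[z])

def lower_bound(x, Q, phi, mu, sigma):
    """sum_i sum_z Q_i(z) * log( p(x_i, z; theta) / Q_i(z) )."""
    total = 0.0
    for i, xi in enumerate(x):
        for z in (0, 1):
            if Q[i][z] > 0:
                total += Q[i][z] * np.log(joint(xi, z, phi, mu, sigma) / Q[i][z])
    return total

x = np.array([-1.2, 0.3, 4.8, 5.1])
phi, mu, sigma = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]

# Any valid distributions Q_i give a lower bound on l(theta)...
Q_uniform = [[0.5, 0.5] for _ in x]
print(lower_bound(x, Q_uniform, phi, mu, sigma))

# ...and choosing Q_i(z) = p(z | x_i; theta), the posterior, makes the bound tight,
# i.e. equal to the log-likelihood computed in the previous sketch.
Q_post = []
for xi in x:
    p = np.array([joint(xi, z, phi, mu, sigma) for z in (0, 1)])
    Q_post.append(p / p.sum())
print(lower_bound(x, Q_post, phi, mu, sigma))   # equals l(theta) up to rounding
```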

Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4