Let's work out exactly what distribution our model defines. Our random variables $z$ and $x$ have a joint Gaussian distribution

$$\begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}(\mu_{zx}, \Sigma).$$

We will now find $\mu_{zx}$ and $\Sigma$.
We know that $\mathrm{E}[z] = 0$, from the fact that $z \sim \mathcal{N}(0, I)$. Also, we have that

$$\mathrm{E}[x] = \mathrm{E}[\mu + \Lambda z + \epsilon] = \mu + \Lambda\,\mathrm{E}[z] + \mathrm{E}[\epsilon] = \mu.$$

Putting these together, we obtain

$$\mu_{zx} = \begin{bmatrix} \vec{0} \\ \mu \end{bmatrix}.$$
Next, to find $\Sigma$, we need to calculate $\Sigma_{zz} = \mathrm{E}[(z - \mathrm{E}[z])(z - \mathrm{E}[z])^T]$ (the upper-left block of $\Sigma$), $\Sigma_{zx} = \mathrm{E}[(z - \mathrm{E}[z])(x - \mathrm{E}[x])^T]$ (upper-right block), and $\Sigma_{xx} = \mathrm{E}[(x - \mathrm{E}[x])(x - \mathrm{E}[x])^T]$ (lower-right block).
Now, since $z \sim \mathcal{N}(0, I)$, we easily find that $\Sigma_{zz} = \mathrm{Cov}(z) = I$. Also,

$$\mathrm{E}[(z - \mathrm{E}[z])(x - \mathrm{E}[x])^T] = \mathrm{E}[z(\mu + \Lambda z + \epsilon - \mu)^T] = \mathrm{E}[z z^T]\Lambda^T + \mathrm{E}[z \epsilon^T] = \Lambda^T.$$
In the last step, we used the fact that $\mathrm{E}[z z^T] = \mathrm{Cov}(z) = I$ (since $z$ has zero mean), and $\mathrm{E}[z \epsilon^T] = \mathrm{E}[z]\,\mathrm{E}[\epsilon^T] = 0$ (since $z$ and $\epsilon$ are independent, and hence the expectation of their product is the product of their expectations). Similarly, we can find $\Sigma_{xx}$ as follows:

$$\begin{aligned} \mathrm{E}[(x - \mathrm{E}[x])(x - \mathrm{E}[x])^T] &= \mathrm{E}[(\Lambda z + \epsilon)(\Lambda z + \epsilon)^T] \\ &= \Lambda\,\mathrm{E}[z z^T]\Lambda^T + \mathrm{E}[\epsilon z^T]\Lambda^T + \Lambda\,\mathrm{E}[z \epsilon^T] + \mathrm{E}[\epsilon \epsilon^T] \\ &= \Lambda\Lambda^T + \Psi. \end{aligned}$$
Putting everything together, we therefore have that

$$\begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \vec{0} \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^T \\ \Lambda & \Lambda\Lambda^T + \Psi \end{bmatrix} \right).$$
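As a quick numerical sanity check of this block structure, here is a short sketch (assuming NumPy; the parameter values and variable names are my own illustrative choices, not from the text) that samples from the generative model and compares the empirical covariance of the stacked vector $[z; x]$ with the matrix above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 3, 2, 200_000          # observed dim, latent dim, number of samples

# Arbitrary illustrative parameters.
mu = np.array([1.0, -2.0, 0.5])
Lam = rng.normal(size=(n, k))
Psi = np.diag(rng.uniform(0.5, 1.5, size=n))   # diagonal noise covariance

# Sample from the generative model: z ~ N(0, I), x = mu + Lam z + eps.
z = rng.normal(size=(m, k))
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)
x = mu + z @ Lam.T + eps

# Empirical covariance of the stacked vector [z; x].
emp_cov = np.cov(np.hstack([z, x]), rowvar=False)

# Theoretical block covariance [[I, Lam^T], [Lam, Lam Lam^T + Psi]].
theo_cov = np.block([[np.eye(k), Lam.T],
                     [Lam, Lam @ Lam.T + Psi]])

print(np.max(np.abs(emp_cov - theo_cov)))      # small, and shrinking as m grows
```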
Hence, we also see that the marginal distribution of $x$ is given by $x \sim \mathcal{N}(\mu, \Lambda\Lambda^T + \Psi)$. Thus, given a training set $\{x^{(i)};\, i = 1, \ldots, m\}$, we can write down the log likelihood of the parameters:

$$\ell(\mu, \Lambda, \Psi) = \log \prod_{i=1}^m \frac{1}{(2\pi)^{n/2} |\Lambda\Lambda^T + \Psi|^{1/2}} \exp\left( -\frac{1}{2}(x^{(i)} - \mu)^T (\Lambda\Lambda^T + \Psi)^{-1} (x^{(i)} - \mu) \right).$$
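For concreteness, this log likelihood can be evaluated numerically, for example as in the following sketch (assuming NumPy; the helper name `marginal_loglik` and the row-wise data layout are my own conventions):

```python
import numpy as np

def marginal_loglik(X, mu, Lam, Psi):
    """Log likelihood of the data under x ~ N(mu, Lam Lam^T + Psi).

    X: (m, n) data matrix, one example per row; mu: (n,);
    Lam: (n, k); Psi: (n, n) diagonal noise covariance.
    """
    m, n = X.shape
    S = Lam @ Lam.T + Psi                      # marginal covariance of x
    diff = X - mu                              # residuals x^(i) - mu, row-wise
    _, logdet = np.linalg.slogdet(S)
    # quadratic form (x^(i) - mu)^T S^{-1} (x^(i) - mu) for every i
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    return -0.5 * quad.sum() - 0.5 * m * (logdet + n * np.log(2.0 * np.pi))
```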
To perform maximum likelihood estimation, we would like to maximize this quantity with respect to the parameters. But maximizing this formula explicitly is hard (try it yourself), and we are aware of no algorithm that does so in closed form. So, we will instead use the EM algorithm. In the next section, we derive EM for factor analysis.
The derivation for the E-step is easy. We need to compute $Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \mu, \Lambda, \Psi)$. By substituting the distribution given in Equation [link] into the formulas [link] - [link] used for finding the conditional distribution of a Gaussian, we find that $z^{(i)} \mid x^{(i)}; \mu, \Lambda, \Psi \sim \mathcal{N}(\mu_{z^{(i)}|x^{(i)}}, \Sigma_{z^{(i)}|x^{(i)}})$, where

$$\mu_{z^{(i)}|x^{(i)}} = \Lambda^T (\Lambda\Lambda^T + \Psi)^{-1} (x^{(i)} - \mu), \qquad \Sigma_{z^{(i)}|x^{(i)}} = I - \Lambda^T (\Lambda\Lambda^T + \Psi)^{-1} \Lambda.$$
So, using these definitions for $\mu_{z^{(i)}|x^{(i)}}$ and $\Sigma_{z^{(i)}|x^{(i)}}$, we have

$$Q_i(z^{(i)}) = \frac{1}{(2\pi)^{k/2} |\Sigma_{z^{(i)}|x^{(i)}}|^{1/2}} \exp\left( -\frac{1}{2} (z^{(i)} - \mu_{z^{(i)}|x^{(i)}})^T \Sigma_{z^{(i)}|x^{(i)}}^{-1} (z^{(i)} - \mu_{z^{(i)}|x^{(i)}}) \right).$$
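Here is a minimal E-step sketch in NumPy, assuming the examples are stored as rows of a matrix `X` (the helper name `e_step` and the variable names are mine). It returns the posterior mean for every example and the shared posterior covariance, which does not depend on $x^{(i)}$.

```python
import numpy as np

def e_step(X, mu, Lam, Psi):
    """Posterior moments of z^(i) | x^(i) for every example.

    Returns:
      Mu_post:  (m, k) matrix whose i-th row is mu_{z^(i)|x^(i)}
      Sig_post: (k, k) posterior covariance (the same for every i)
    """
    n, k = Lam.shape
    S_inv = np.linalg.inv(Lam @ Lam.T + Psi)     # (Lam Lam^T + Psi)^{-1}
    Mu_post = (X - mu) @ S_inv @ Lam             # row i equals Lam^T S^{-1} (x^(i) - mu)
    Sig_post = np.eye(k) - Lam.T @ S_inv @ Lam
    return Mu_post, Sig_post
```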
Let's now work out the M-step. Here, we need to maximize

$$\sum_{i=1}^m \int_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \mu, \Lambda, \Psi)}{Q_i(z^{(i)})} \, dz^{(i)}$$
with respect to the parameters $\mu, \Lambda, \Psi$. We will work out only the optimization with respect to $\Lambda$, and leave the derivations of the updates for $\mu$ and $\Psi$ as an exercise to the reader.
We can simplify Equation [link] as follows:

$$\begin{aligned} \sum_{i=1}^m \int_{z^{(i)}} Q_i(z^{(i)}) \left[ \log p(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi) + \log p(z^{(i)}) - \log Q_i(z^{(i)}) \right] dz^{(i)} \\ = \sum_{i=1}^m \mathrm{E}_{z^{(i)} \sim Q_i}\left[ \log p(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi) + \log p(z^{(i)}) - \log Q_i(z^{(i)}) \right] \end{aligned}$$
Here, the “$z^{(i)} \sim Q_i$” subscript indicates that the expectation is with respect to $z^{(i)}$ drawn from $Q_i$. In the subsequent development, we will omit this subscript when there is no risk of ambiguity. Dropping terms that do not depend on the parameters, we find that we need to maximize:

$$\begin{aligned} \sum_{i=1}^m \mathrm{E}\left[ \log p(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi) \right] &= \sum_{i=1}^m \mathrm{E}\left[ \log \frac{1}{(2\pi)^{n/2}|\Psi|^{1/2}} \exp\left( -\frac{1}{2}(x^{(i)} - \mu - \Lambda z^{(i)})^T \Psi^{-1} (x^{(i)} - \mu - \Lambda z^{(i)}) \right) \right] \\ &= \sum_{i=1}^m \mathrm{E}\left[ -\frac{1}{2}\log|\Psi| - \frac{n}{2}\log(2\pi) - \frac{1}{2}(x^{(i)} - \mu - \Lambda z^{(i)})^T \Psi^{-1} (x^{(i)} - \mu - \Lambda z^{(i)}) \right] \end{aligned}$$
Let's maximize this with respect to $\Lambda$. Only the last term above depends on $\Lambda$. Taking derivatives, and using the facts that $\mathrm{tr}\, a = a$ (for $a \in \mathbb{R}$), $\mathrm{tr}\, AB = \mathrm{tr}\, BA$, and $\nabla_A \mathrm{tr}\, ABA^TC = CAB + C^TAB^T$, we get:

$$\begin{aligned} \nabla_\Lambda \sum_{i=1}^m -\mathrm{E}\left[ \frac{1}{2}(x^{(i)} - \mu - \Lambda z^{(i)})^T \Psi^{-1} (x^{(i)} - \mu - \Lambda z^{(i)}) \right] &= \sum_{i=1}^m \nabla_\Lambda \mathrm{E}\left[ -\mathrm{tr}\, \frac{1}{2} z^{(i)T}\Lambda^T\Psi^{-1}\Lambda z^{(i)} + \mathrm{tr}\, z^{(i)T}\Lambda^T\Psi^{-1}(x^{(i)} - \mu) \right] \\ &= \sum_{i=1}^m \mathrm{E}\left[ -\Psi^{-1}\Lambda z^{(i)} z^{(i)T} + \Psi^{-1}(x^{(i)} - \mu) z^{(i)T} \right] \end{aligned}$$
Setting this to zero and simplifying, we get:

$$\sum_{i=1}^m \Lambda\, \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)} z^{(i)T} \right] = \sum_{i=1}^m (x^{(i)} - \mu)\, \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)T} \right].$$
Hence, solving for $\Lambda$, we obtain

$$\Lambda = \left( \sum_{i=1}^m (x^{(i)} - \mu)\, \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)T} \right] \right) \left( \sum_{i=1}^m \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)} z^{(i)T} \right] \right)^{-1}.$$
It is interesting to note the close relationship between this equation and the normal equation that we'd derived for least squares regression,

$$\theta^T = (y^T X)(X^T X)^{-1}.$$
The analogy is that here, the $x$'s are a linear function of the $z$'s (plus noise). Given the “guesses” for $z^{(i)}$ that the E-step has found, we will now try to estimate the unknown linearity $\Lambda$ relating the $x$'s and $z$'s. It is therefore no surprise that we obtain something similar to the normal equation. There is, however, one important difference between this and an algorithm that performs least squares using just the “best guesses” of the $z$'s; we will see this difference shortly.
To complete our M-step update, let's work out the values of the expectations in Equation [link]. From our definition of $Q_i$ being Gaussian with mean $\mu_{z^{(i)}|x^{(i)}}$ and covariance $\Sigma_{z^{(i)}|x^{(i)}}$, we easily find

$$\mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)T} \right] = \mu_{z^{(i)}|x^{(i)}}^T, \qquad \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)} z^{(i)T} \right] = \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}.$$
The latter comes from the fact that, for a random variable $Y$, $\mathrm{Cov}(Y) = \mathrm{E}[YY^T] - \mathrm{E}[Y]\mathrm{E}[Y]^T$, and hence $\mathrm{E}[YY^T] = \mathrm{E}[Y]\mathrm{E}[Y]^T + \mathrm{Cov}(Y)$. Substituting this back into Equation [link], we get the M-step update for $\Lambda$:

$$\Lambda = \left( \sum_{i=1}^m (x^{(i)} - \mu)\, \mu_{z^{(i)}|x^{(i)}}^T \right) \left( \sum_{i=1}^m \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}} \right)^{-1}.$$
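Below is a minimal NumPy sketch of this $\Lambda$ update, reusing the posterior moments `Mu_post` and `Sig_post` from the E-step sketch above (all names are mine). The `m * Sig_post` term carries exactly the posterior covariance contribution that the next paragraph emphasizes.

```python
import numpy as np

def m_step_Lambda(X, mu, Mu_post, Sig_post):
    """M-step update for Lam, using the posterior moments from the E-step.

    E[z^(i)]        = Mu_post[i]
    E[z^(i) z^(i)T] = Mu_post[i] Mu_post[i]^T + Sig_post
    """
    m, k = Mu_post.shape
    # sum_i (x^(i) - mu) E[z^(i)]^T                          -> (n, k)
    A = (X - mu).T @ Mu_post
    # sum_i E[z^(i) z^(i)T] = sum_i Mu_post[i] Mu_post[i]^T + m * Sig_post   -> (k, k)
    B = Mu_post.T @ Mu_post + m * Sig_post
    return A @ np.linalg.inv(B)
```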
It is important to note the presence of the $\Sigma_{z^{(i)}|x^{(i)}}$ on the right hand side of this equation. This is the covariance in the posterior distribution $p(z^{(i)} \mid x^{(i)})$ of $z^{(i)}$ given $x^{(i)}$, and the M-step must take into account this uncertainty about $z^{(i)}$ in the posterior. A common mistake in deriving EM is to assume that in the E-step, we need to calculate only the expectation $\mathrm{E}[z]$ of the latent random variable $z$, and then plug that into the optimization in the M-step everywhere $z$ occurs. While this worked for simple problems such as the mixture of Gaussians, in our derivation for factor analysis, we needed $\mathrm{E}[z z^T]$ as well as $\mathrm{E}[z]$; and as we saw, $\mathrm{E}[z z^T]$ and $\mathrm{E}[z]\mathrm{E}[z]^T$ differ by the quantity $\Sigma_{z|x}$. Thus, the M-step update must take into account the covariance of $z$ in the posterior distribution $p(z^{(i)} \mid x^{(i)})$.
Lastly, we can also find the M-step optimizations for the parameters $\mu$ and $\Psi$. It is not hard to show that the first is given by

$$\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}.$$
Since this doesn't change as the parameters are varied (i.e., unlike the update for $\Lambda$, the right hand side does not depend on $Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \mu, \Lambda, \Psi)$, which in turn depends on the parameters), it can be calculated just once and need not be further updated as the algorithm is run. Similarly, the diagonal $\Psi$ can be found by calculating

$$\Phi = \frac{1}{m} \sum_{i=1}^m x^{(i)} x^{(i)T} - x^{(i)} \mu_{z^{(i)}|x^{(i)}}^T \Lambda^T - \Lambda \mu_{z^{(i)}|x^{(i)}} x^{(i)T} + \Lambda\left( \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}} \right) \Lambda^T,$$
and setting $\Psi_{ii} = \Phi_{ii}$ (i.e., letting $\Psi$ be the diagonal matrix containing only the diagonal entries of $\Phi$).
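A minimal NumPy transcription of these two updates, again reusing the posterior moments from the E-step sketch (helper and variable names are mine; `Phi` directly mirrors the formula above, and only its diagonal is kept):

```python
import numpy as np

def m_step_mu_Psi(X, Lam, Mu_post, Sig_post):
    """M-step updates for mu and the diagonal Psi."""
    m, n = X.shape
    mu = X.mean(axis=0)                               # (1/m) sum_i x^(i); data-only
    Ezz_sum = Mu_post.T @ Mu_post + m * Sig_post      # sum_i E[z^(i) z^(i)T]
    Phi = (X.T @ X                                    # sum_i x^(i) x^(i)T
           - X.T @ Mu_post @ Lam.T                    # - sum_i x^(i) E[z^(i)]^T Lam^T
           - Lam @ Mu_post.T @ X                      # - Lam sum_i E[z^(i)] x^(i)T
           + Lam @ Ezz_sum @ Lam.T) / m               # + Lam (sum_i E[z z^T]) Lam^T
    Psi = np.diag(np.diag(Phi))                       # keep only the diagonal of Phi
    return mu, Psi
```

One full EM iteration then alternates the E-step with the $\Lambda$, $\mu$, and $\Psi$ updates, repeating until the log likelihood stops increasing.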