Let's work out exactly what distribution our model defines. Our random variables $z$ and $x$ have a joint Gaussian distribution

$$\begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}(\mu_{zx}, \Sigma).$$

We will now find $\mu_{zx}$ and $\Sigma$.
We know that $\mathrm{E}[z] = 0$, from the fact that $z \sim \mathcal{N}(0, I)$. Also, we have that

$$\mathrm{E}[x] = \mathrm{E}[\mu + \Lambda z + \epsilon] = \mu + \Lambda\,\mathrm{E}[z] + \mathrm{E}[\epsilon] = \mu.$$

Putting these together, we obtain

$$\mu_{zx} = \begin{bmatrix} \vec{0} \\ \mu \end{bmatrix}.$$
Next, to find $\Sigma$, we need to calculate $\Sigma_{zz} = \mathrm{E}[(z - \mathrm{E}[z])(z - \mathrm{E}[z])^T]$ (the upper-left block of $\Sigma$), $\Sigma_{zx} = \mathrm{E}[(z - \mathrm{E}[z])(x - \mathrm{E}[x])^T]$ (upper-right block), and $\Sigma_{xx} = \mathrm{E}[(x - \mathrm{E}[x])(x - \mathrm{E}[x])^T]$ (lower-right block).
Now, since $z \sim \mathcal{N}(0, I)$, we easily find that $\Sigma_{zz} = \mathrm{Cov}(z) = I$. Also,

$$\mathrm{E}[(z - \mathrm{E}[z])(x - \mathrm{E}[x])^T] = \mathrm{E}[z(\mu + \Lambda z + \epsilon - \mu)^T] = \mathrm{E}[z z^T]\Lambda^T + \mathrm{E}[z \epsilon^T] = \Lambda^T.$$
In the last step, we used the fact that $\mathrm{E}[z z^T] = \mathrm{Cov}(z) = I$ (since $z$ has zero mean), and $\mathrm{E}[z \epsilon^T] = \mathrm{E}[z]\,\mathrm{E}[\epsilon^T] = 0$ (since $z$ and $\epsilon$ are independent, and hence the expectation of their product is the product of their expectations). Similarly, we can find $\Sigma_{xx}$ as follows:

$$\begin{aligned} \mathrm{E}[(x - \mathrm{E}[x])(x - \mathrm{E}[x])^T] &= \mathrm{E}[(\Lambda z + \epsilon)(\Lambda z + \epsilon)^T] \\ &= \Lambda\,\mathrm{E}[z z^T]\Lambda^T + \mathrm{E}[\epsilon z^T]\Lambda^T + \Lambda\,\mathrm{E}[z \epsilon^T] + \mathrm{E}[\epsilon \epsilon^T] \\ &= \Lambda\Lambda^T + \Psi. \end{aligned}$$
Putting everything together, we therefore have that

$$\begin{bmatrix} z \\ x \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \vec{0} \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^T \\ \Lambda & \Lambda\Lambda^T + \Psi \end{bmatrix} \right).$$
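As a quick numerical sanity check of this block structure, here is a short sketch (assuming NumPy; the parameter values and variable names are my own illustrative choices, not from the text) that samples from the generative model and compares the empirical covariance of the stacked vector $[z; x]$ with the matrix above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 3, 2, 200_000          # observed dim, latent dim, number of samples

# Arbitrary illustrative parameters.
mu = np.array([1.0, -2.0, 0.5])
Lam = rng.normal(size=(n, k))
Psi = np.diag(rng.uniform(0.5, 1.5, size=n))   # diagonal noise covariance

# Sample from the generative model: z ~ N(0, I), x = mu + Lam z + eps.
z = rng.normal(size=(m, k))
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)
x = mu + z @ Lam.T + eps

# Empirical covariance of the stacked vector [z; x].
emp_cov = np.cov(np.hstack([z, x]), rowvar=False)

# Theoretical block covariance [[I, Lam^T], [Lam, Lam Lam^T + Psi]].
theo_cov = np.block([[np.eye(k), Lam.T],
                     [Lam, Lam @ Lam.T + Psi]])

print(np.max(np.abs(emp_cov - theo_cov)))      # small, and shrinking as m grows
```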
Hence, we also see that the marginal distribution of $x$ is given by $x \sim \mathcal{N}(\mu, \Lambda\Lambda^T + \Psi)$. Thus, given a training set $\{x^{(i)};\, i = 1, \ldots, m\}$, we can write down the log likelihood of the parameters:

$$\ell(\mu, \Lambda, \Psi) = \log \prod_{i=1}^m \frac{1}{(2\pi)^{n/2} |\Lambda\Lambda^T + \Psi|^{1/2}} \exp\left( -\frac{1}{2}(x^{(i)} - \mu)^T (\Lambda\Lambda^T + \Psi)^{-1} (x^{(i)} - \mu) \right).$$
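For concreteness, this log likelihood can be evaluated numerically, for example as in the following sketch (assuming NumPy; the helper name `marginal_loglik` and the row-wise data layout are my own conventions):

```python
import numpy as np

def marginal_loglik(X, mu, Lam, Psi):
    """Log likelihood of the data under x ~ N(mu, Lam Lam^T + Psi).

    X: (m, n) data matrix, one example per row; mu: (n,);
    Lam: (n, k); Psi: (n, n) diagonal noise covariance.
    """
    m, n = X.shape
    S = Lam @ Lam.T + Psi                      # marginal covariance of x
    diff = X - mu                              # residuals x^(i) - mu, row-wise
    _, logdet = np.linalg.slogdet(S)
    # quadratic form (x^(i) - mu)^T S^{-1} (x^(i) - mu) for every i
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    return -0.5 * quad.sum() - 0.5 * m * (logdet + n * np.log(2.0 * np.pi))
```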
To perform maximum likelihood estimation, we would like to maximize this quantity with respect to the parameters. But maximizing this formula explicitly is hard (try it yourself), and we are aware of no algorithm that does so in closed form. So, we will instead use the EM algorithm. In the next section, we derive EM for factor analysis.
The derivation for the E-step is easy. We need to compute $Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \mu, \Lambda, \Psi)$. By substituting the distribution given in Equation [link] into the formulas [link] - [link] used for finding the conditional distribution of a Gaussian, we find that $z^{(i)} \mid x^{(i)}; \mu, \Lambda, \Psi \sim \mathcal{N}(\mu_{z^{(i)}|x^{(i)}}, \Sigma_{z^{(i)}|x^{(i)}})$, where

$$\mu_{z^{(i)}|x^{(i)}} = \Lambda^T (\Lambda\Lambda^T + \Psi)^{-1} (x^{(i)} - \mu), \qquad \Sigma_{z^{(i)}|x^{(i)}} = I - \Lambda^T (\Lambda\Lambda^T + \Psi)^{-1} \Lambda.$$
So, using these definitions for $\mu_{z^{(i)}|x^{(i)}}$ and $\Sigma_{z^{(i)}|x^{(i)}}$, we have

$$Q_i(z^{(i)}) = \frac{1}{(2\pi)^{k/2} |\Sigma_{z^{(i)}|x^{(i)}}|^{1/2}} \exp\left( -\frac{1}{2} (z^{(i)} - \mu_{z^{(i)}|x^{(i)}})^T \Sigma_{z^{(i)}|x^{(i)}}^{-1} (z^{(i)} - \mu_{z^{(i)}|x^{(i)}}) \right).$$
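Here is a minimal E-step sketch in NumPy, assuming the examples are stored as rows of a matrix `X` (the helper name `e_step` and the variable names are mine). It returns the posterior mean for every example and the shared posterior covariance, which does not depend on $x^{(i)}$.

```python
import numpy as np

def e_step(X, mu, Lam, Psi):
    """Posterior moments of z^(i) | x^(i) for every example.

    Returns:
      Mu_post:  (m, k) matrix whose i-th row is mu_{z^(i)|x^(i)}
      Sig_post: (k, k) posterior covariance (the same for every i)
    """
    n, k = Lam.shape
    S_inv = np.linalg.inv(Lam @ Lam.T + Psi)     # (Lam Lam^T + Psi)^{-1}
    Mu_post = (X - mu) @ S_inv @ Lam             # row i equals Lam^T S^{-1} (x^(i) - mu)
    Sig_post = np.eye(k) - Lam.T @ S_inv @ Lam
    return Mu_post, Sig_post
```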
Let's now work out the M-step. Here, we need to maximize

$$\sum_{i=1}^m \int_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \mu, \Lambda, \Psi)}{Q_i(z^{(i)})} \, dz^{(i)}$$
with respect to the parameters $\mu, \Lambda, \Psi$. We will work out only the optimization with respect to $\Lambda$, and leave the derivations of the updates for $\mu$ and $\Psi$ as an exercise to the reader.
We can simplify Equation [link] as follows:

$$\begin{aligned} \sum_{i=1}^m \int_{z^{(i)}} Q_i(z^{(i)}) \left[ \log p(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi) + \log p(z^{(i)}) - \log Q_i(z^{(i)}) \right] dz^{(i)} \\ = \sum_{i=1}^m \mathrm{E}_{z^{(i)} \sim Q_i}\left[ \log p(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi) + \log p(z^{(i)}) - \log Q_i(z^{(i)}) \right] \end{aligned}$$
Here, the “$z^{(i)} \sim Q_i$” subscript indicates that the expectation is with respect to $z^{(i)}$ drawn from $Q_i$. In the subsequent development, we will omit this subscript when there is no risk of ambiguity. Dropping terms that do not depend on the parameters, we find that we need to maximize:

$$\begin{aligned} \sum_{i=1}^m \mathrm{E}\left[ \log p(x^{(i)} \mid z^{(i)}; \mu, \Lambda, \Psi) \right] &= \sum_{i=1}^m \mathrm{E}\left[ \log \frac{1}{(2\pi)^{n/2}|\Psi|^{1/2}} \exp\left( -\frac{1}{2}(x^{(i)} - \mu - \Lambda z^{(i)})^T \Psi^{-1} (x^{(i)} - \mu - \Lambda z^{(i)}) \right) \right] \\ &= \sum_{i=1}^m \mathrm{E}\left[ -\frac{1}{2}\log|\Psi| - \frac{n}{2}\log(2\pi) - \frac{1}{2}(x^{(i)} - \mu - \Lambda z^{(i)})^T \Psi^{-1} (x^{(i)} - \mu - \Lambda z^{(i)}) \right] \end{aligned}$$
Let's maximize this with respect to $\Lambda$. Only the last term above depends on $\Lambda$. Taking derivatives, and using the facts that $\mathrm{tr}\, a = a$ (for $a \in \mathbb{R}$), $\mathrm{tr}\, AB = \mathrm{tr}\, BA$, and $\nabla_A \mathrm{tr}\, ABA^TC = CAB + C^TAB^T$, we get:

$$\begin{aligned} \nabla_\Lambda \sum_{i=1}^m -\mathrm{E}\left[ \frac{1}{2}(x^{(i)} - \mu - \Lambda z^{(i)})^T \Psi^{-1} (x^{(i)} - \mu - \Lambda z^{(i)}) \right] &= \sum_{i=1}^m \nabla_\Lambda \mathrm{E}\left[ -\mathrm{tr}\, \frac{1}{2} z^{(i)T}\Lambda^T\Psi^{-1}\Lambda z^{(i)} + \mathrm{tr}\, z^{(i)T}\Lambda^T\Psi^{-1}(x^{(i)} - \mu) \right] \\ &= \sum_{i=1}^m \mathrm{E}\left[ -\Psi^{-1}\Lambda z^{(i)} z^{(i)T} + \Psi^{-1}(x^{(i)} - \mu) z^{(i)T} \right] \end{aligned}$$
Setting this to zero and simplifying, we get:

$$\sum_{i=1}^m \Lambda\, \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)} z^{(i)T} \right] = \sum_{i=1}^m (x^{(i)} - \mu)\, \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)T} \right].$$
Hence, solving for $\Lambda$, we obtain

$$\Lambda = \left( \sum_{i=1}^m (x^{(i)} - \mu)\, \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)T} \right] \right) \left( \sum_{i=1}^m \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)} z^{(i)T} \right] \right)^{-1}.$$
It is interesting to note the close relationship between this equation and the normal equation that we'd derived for least squares regression,

$$\theta^T = (y^T X)(X^T X)^{-1}.$$
The analogy is that here, the $x$'s are a linear function of the $z$'s (plus noise). Given the “guesses” for $z^{(i)}$ that the E-step has found, we will now try to estimate the unknown linearity $\Lambda$ relating the $x$'s and $z$'s. It is therefore no surprise that we obtain something similar to the normal equation. There is, however, one important difference between this and an algorithm that performs least squares using just the “best guesses” of the $z$'s; we will see this difference shortly.
To complete our M-step update, let's work out the values of the expectations in Equation [link]. From our definition of $Q_i$ being Gaussian with mean $\mu_{z^{(i)}|x^{(i)}}$ and covariance $\Sigma_{z^{(i)}|x^{(i)}}$, we easily find

$$\mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)T} \right] = \mu_{z^{(i)}|x^{(i)}}^T, \qquad \mathrm{E}_{z^{(i)} \sim Q_i}\left[ z^{(i)} z^{(i)T} \right] = \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}.$$
The latter comes from the fact that, for a random variable $Y$, $\mathrm{Cov}(Y) = \mathrm{E}[YY^T] - \mathrm{E}[Y]\mathrm{E}[Y]^T$, and hence $\mathrm{E}[YY^T] = \mathrm{E}[Y]\mathrm{E}[Y]^T + \mathrm{Cov}(Y)$. Substituting this back into Equation [link], we get the M-step update for $\Lambda$:

$$\Lambda = \left( \sum_{i=1}^m (x^{(i)} - \mu)\, \mu_{z^{(i)}|x^{(i)}}^T \right) \left( \sum_{i=1}^m \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}} \right)^{-1}.$$
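Below is a minimal NumPy sketch of this $\Lambda$ update, reusing the posterior moments `Mu_post` and `Sig_post` from the E-step sketch above (all names are mine). The `m * Sig_post` term carries exactly the posterior covariance contribution that the next paragraph emphasizes.

```python
import numpy as np

def m_step_Lambda(X, mu, Mu_post, Sig_post):
    """M-step update for Lam, using the posterior moments from the E-step.

    E[z^(i)]        = Mu_post[i]
    E[z^(i) z^(i)T] = Mu_post[i] Mu_post[i]^T + Sig_post
    """
    m, k = Mu_post.shape
    # sum_i (x^(i) - mu) E[z^(i)]^T                          -> (n, k)
    A = (X - mu).T @ Mu_post
    # sum_i E[z^(i) z^(i)T] = sum_i Mu_post[i] Mu_post[i]^T + m * Sig_post   -> (k, k)
    B = Mu_post.T @ Mu_post + m * Sig_post
    return A @ np.linalg.inv(B)
```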
It is important to note the presence of the $\Sigma_{z^{(i)}|x^{(i)}}$ on the right hand side of this equation. This is the covariance in the posterior distribution $p(z^{(i)} \mid x^{(i)})$ of $z^{(i)}$ given $x^{(i)}$, and the M-step must take into account this uncertainty about $z^{(i)}$ in the posterior. A common mistake in deriving EM is to assume that in the E-step, we need to calculate only the expectation $\mathrm{E}[z]$ of the latent random variable $z$, and then plug that into the optimization in the M-step everywhere $z$ occurs. While this worked for simple problems such as the mixture of Gaussians, in our derivation for factor analysis, we needed $\mathrm{E}[z z^T]$ as well as $\mathrm{E}[z]$; and as we saw, $\mathrm{E}[z z^T]$ and $\mathrm{E}[z]\mathrm{E}[z]^T$ differ by the quantity $\Sigma_{z|x}$. Thus, the M-step update must take into account the covariance of $z$ in the posterior distribution $p(z^{(i)} \mid x^{(i)})$.
Lastly, we can also find the M-step optimizations for the parameters $\mu$ and $\Psi$. It is not hard to show that the first is given by

$$\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}.$$
Since this doesn't change as the parameters are varied (i.e., unlike the update for $\Lambda$, the right hand side does not depend on $Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \mu, \Lambda, \Psi)$, which in turn depends on the parameters), it can be calculated just once and need not be further updated as the algorithm is run. Similarly, the diagonal $\Psi$ can be found by calculating

$$\Phi = \frac{1}{m} \sum_{i=1}^m x^{(i)} x^{(i)T} - x^{(i)} \mu_{z^{(i)}|x^{(i)}}^T \Lambda^T - \Lambda \mu_{z^{(i)}|x^{(i)}} x^{(i)T} + \Lambda\left( \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}} \right) \Lambda^T,$$
and setting $\Psi_{ii} = \Phi_{ii}$ (i.e., letting $\Psi$ be the diagonal matrix containing only the diagonal entries of $\Phi$).
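A minimal NumPy transcription of these two updates, again reusing the posterior moments from the E-step sketch (helper and variable names are mine; `Phi` directly mirrors the formula above, and only its diagonal is kept):

```python
import numpy as np

def m_step_mu_Psi(X, Lam, Mu_post, Sig_post):
    """M-step updates for mu and the diagonal Psi."""
    m, n = X.shape
    mu = X.mean(axis=0)                               # (1/m) sum_i x^(i); data-only
    Ezz_sum = Mu_post.T @ Mu_post + m * Sig_post      # sum_i E[z^(i) z^(i)T]
    Phi = (X.T @ X                                    # sum_i x^(i) x^(i)T
           - X.T @ Mu_post @ Lam.T                    # - sum_i x^(i) E[z^(i)]^T Lam^T
           - Lam @ Mu_post.T @ X                      # - Lam sum_i E[z^(i)] x^(i)T
           + Lam @ Ezz_sum @ Lam.T) / m               # + Lam (sum_i E[z z^T]) Lam^T
    Psi = np.diag(np.diag(Phi))                       # keep only the diagonal of Phi
    return mu, Psi
```

One full EM iteration then alternates the E-step with the $\Lambda$, $\mu$, and $\Psi$ updates, repeating until the log likelihood stops increasing.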