<< Chapter < Page Chapter >> Page >

Review : maximum likelihood estimation

In the last lecture , we have n i.i.d observations drawn from an unknown distribution

Y i i . i . d . p θ * , i = { 1 , ... , n }
where θ * Θ .

With loss function defined as l ( θ , Y i ) = - log p θ ( Y i ) , the empirical risk is

R n ^ = - 1 n i = 1 n log p θ ( Y i ) .

Essentially, we want to choose a distribution from the collection of distributions within the parameter space that minimizes the empirical risk, i.e., we would like to select

p θ n ^ P = { p θ } θ Θ

where

θ n ^ = arg min θ Θ - i = 1 n log p θ ( Y i ) .

The risk is defined as

R ( θ ) = E [ l ( θ , Y ) ] = - E [ log p θ ( Y ) ] .

Note that θ * minimizes R ( θ ) over Θ .

θ * = arg min θ Θ - E [ log p θ ( Y ) ] = arg min θ Θ - log p θ ( y ) · p θ * ( y ) d y .

Finally, the excess risk of θ is defined as

R ( θ ) - R ( θ * ) = log p θ * ( y ) p θ ( y ) p θ * ( y ) d y K ( p θ , p θ * ) .

We recognized that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) Divergence or Relative Entropy , denoted by K ( p θ 1 , p θ 2 ) . It is easy to see that K ( p θ 1 , p θ 2 ) is always non-negative and is zero if and only if p θ 1 = p θ 2 . KL divergence measures how different two probability distributions are and therefore is natural to measure convergence of the maximum likelihood procedures. However, K ( p θ 1 , p θ 2 ) is not a distance metric because it is not symmetric and does not satisfy the triangle inequality. For this reason, two other quantities play a key role in maximum likelihood estimation, namely Hellinger Distance and Affinity .

The Hellinger distance is defined as

H ( p θ 1 , p θ 2 ) = p θ 1 ( y ) - p θ 2 ( y ) 2 d y 1 2 .

We proved that the squared Hellinger distance lower bounds the KL divergence:

H 2 ( p θ 1 , p θ 2 ) K ( p θ 1 , p θ 2 ) H 2 ( p θ 1 , p θ 2 ) K ( p θ 2 , p θ 1 ) .

The affinity is defined as

A ( p θ 1 , p θ 2 ) = p θ 1 · p θ 2 ( y ) d y .

we also proved that

H 2 ( p θ 1 , p θ 2 ) - 2 log A ( p θ 1 , p θ 2 ) .

Gaussian distribution

Y is Gaussian with mean θ and variance σ 2 .

p θ ( y ) = 1 2 π σ 2 e - ( y - θ ) 2 2 σ 2 .

First, look at

log p θ 2 p θ 1 = 1 2 σ 2 [ ( θ 1 2 - θ 2 2 ) - 2 ( θ 1 - θ 2 ) y ] .

Then,

K ( p θ 1 , p θ 2 ) = E θ 2 log p θ 2 p θ 1 = θ 1 2 - θ 2 2 2 σ 2 - 2 ( θ 1 - θ 2 ) 2 σ 2 y · p θ 2 ( y ) d y E [ Y ] = θ 2 = 1 2 σ 2 ( θ 1 2 + θ 2 2 - 2 θ 1 θ 2 ) = ( θ 1 2 - θ 2 ) 2 2 σ 2 . - 2 log A ( p θ 1 , p θ 2 ) = - 2 log 1 2 π σ 2 e - ( y - θ 1 ) 2 2 σ 2 1 / 2 · 1 2 π σ 2 e - ( y - θ 2 ) 2 2 σ 2 1 / 2 d y = - 2 log 1 2 π σ 2 e - ( y - θ 1 ) 2 4 σ 2 - ( y - θ 2 ) 2 4 σ 2 d y = - 2 log 1 2 π σ 2 e - 1 2 σ 2 y - θ 1 + θ 2 2 2 + θ 1 - θ 2 2 2 d y = - 2 log e - θ 1 - θ 2 2 2 2 σ 2 = ( θ 1 - θ 2 ) 2 4 σ 2 = 1 2 K ( p θ 1 , p θ 2 ) H 2 ( p θ 1 , p θ 2 ) .

Maximum likelihood estimation and complexity regularization

Suppose that we have n i.i.d training samples, { X i , Y i } i = 1 n i . i . d . p X Y  .

Using conditional probability, p X Y can be written as

p X Y ( x , y ) = p X ( x ) · p Y | X = x ( y ) .

Let's assume for the moment that p X is completely unknown, but p Y | X = x ( y ) has a special form:

p Y | X = x ( y ) = p f * ( x ) ( y )

where p Y | X = x ( y ) is a known parametric density function with parameter f * ( x ) .

Signal-plus-noise observation model

Y i = f * ( X i ) + W i , i = 1 , ... , n

where W i i . i . d . N ( 0 , σ 2 ) and X i i . i . d . p X  .

p f * ( x ) ( y ) = 1 2 π σ 2 e - ( y - f * ( x ) ) 2 2 σ 2

Y | X = x Poisson ( f * ( x ) )

p f * ( x ) ( y ) = e - f * ( x ) [ f * ( x ) ] y y ! .

The likelihood loss function is

l ( f ( x ) , y ) = - log p X Y ( X , Y ) = - log p X ( X ) - log p Y | X ( Y | X ) = - log p X ( X ) - log p f ( X ) ( Y ) .

The expected loss is

E [ l ( f ( X ) , Y ) ] = E X E Y | X [ l ( f ( X ) , Y ) | X = x ] = E X [ E Y | X [ - log p X ( x ) - log p f ( x ) ( Y ) | X = x ] ] = - E X [ log p X ( X ) ] - E X [ E Y | X [ log p f ( x ) ( Y ) | X = x ] ] = - E X [ log p X ( X ) ] - E [ log p f ( X ) ( Y ) ] .

Notice that the first term is a constant with respect to f .

Hence, we define our risk to be

R ( f ) = - E [ log p f ( X ) ( Y ) ] = - E X [ E Y | X [ log p f ( x ) ( Y ) | X = x ] ] = - log p f ( x ) ( y ) · p f * ( x ) ( y ) d y p X ( x ) d x .

The function f * minimizes this risk since f ( x ) = f * ( x ) minimizes the integrand.

Our empirical risk is the negative log-likelihood of the training samples:

R n ^ ( f ) = 1 n i = 1 n - log p f ( X i ) ( Y i ) .

The value 1 n is the empirical probability of observing X = X i .

Often in function estimation, we have control over where we sample X. Let's assume that X = [ 0 , 1 ] d and Y = R . Suppose we sample X uniformly with n = m d samples for some positive integer m ( i.e., ,take m evenly spaced samples in each coordinate).

Let x i , i = 1 , ... , n denote these sample points, and assume that Y i p f * ( x i ) ( y ) . Then, our empirical risk is

R n ^ ( f ) = 1 n i = 1 n l ( f ( x i ) , Y i ) = 1 n i = 1 n - log p f ( x i ) ( Y i ) .

Note that x i is now a deterministic quantity.

Our risk is

R ( f ) = - 1 n i = 1 n E log p f ( x i ) ( Y i ) = - 1 n i = 1 n log p f ( x i ) ( y i ) · p f * ( x i ) ( y i ) d y i .

The risk is minimized by f * . However, f * is not a unique minimizer. Any f that agrees with f * at the point { x i , Y i } also minimizes this risk.

Now, we will make use of the following vector and shorthand notation. The uppercase Y denotes a random variable, while the lowercase y and x denote deterministic quantities.

Y = Y 1 Y 2 Y n y = y 1 y 2 y n x = x 1 x 2 x n

Then,

p f ( Y ) = i = 1 n p Y i | f ( x i )    (random)

p f ( y ) = i = 1 n p y i | f ( x i )    (deterministic) .

With this notation, the empirical risk and the true risk can be written as

R n ^ ( f ) = - 1 n log p f ( Y ) . R ( f ) = - 1 n E [ log p f ( Y ) ] = - 1 n log p f ( y ) · p f * ( y ) d y .

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Statistical learning theory' conversation and receive update notifications?

Ask