$$\theta := \theta - H^{-1} \nabla_\theta \ell(\theta).$$

Here, $\nabla_\theta \ell(\theta)$ is, as usual, the vector of partial derivatives of $\ell(\theta)$ with respect to the $\theta_i$'s; and $H$ is an $n$-by-$n$ matrix (actually, $(n+1)$-by-$(n+1)$, assuming that we include the intercept term) called the Hessian, whose entries are given by

$$H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \, \partial \theta_j}.$$

Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting an $n$-by-$n$ Hessian; but so long as $n$ is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood function $\ell(\theta)$, the resulting method is also called Fisher scoring.
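To make the update concrete, here is a minimal NumPy sketch of Newton's method for maximizing the logistic regression log likelihood; the helper names, the fixed iteration count, and the assumption that the design matrix already carries an intercept column are our own illustration, not from the text:

```python
# A minimal sketch of Newton's method (Fisher scoring) for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=10):
    """Maximize the log likelihood l(theta) via theta := theta - H^{-1} grad.

    X: (m, n) design matrix, assumed to include a column of ones for the
    intercept; y: (m,) array of labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                    # predicted probabilities
        grad = X.T @ (y - h)                      # gradient of l(theta)
        H = -(X.T * (h * (1.0 - h))) @ X          # Hessian: -X^T diag(h(1-h)) X
        theta = theta - np.linalg.solve(H, grad)  # theta := theta - H^{-1} grad
    return theta
```

Note that each iteration solves an $n$-by-$n$ linear system rather than explicitly inverting $H$, which is the usual way to carry out the $H^{-1} \nabla_\theta \ell(\theta)$ update in practice.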

Generalized linear models

The presentation of the material in this section takes inspiration from Michael I. Jordan, Learning in graphical models (unpublished book draft), and also from McCullagh and Nelder, Generalized Linear Models (2nd ed.).

So far, we've seen a regression example, and a classification example. In the regression example, we had $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$, and in the classification one, $y \mid x; \theta \sim \text{Bernoulli}(\Phi)$, for some appropriate definitions of $\mu$ and $\Phi$ as functions of $x$ and $\theta$. In this section, we will show that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs). We will also show how other models in the GLM family can be derived and applied to other classification and regression problems.

The exponential family

To work our way up to GLMs, we will begin by defining exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

$$p(y; \eta) = b(y) \exp\left(\eta^T T(y) - a(\eta)\right).$$

Here, $\eta$ is called the natural parameter (also called the canonical parameter) of the distribution; $T(y)$ is the sufficient statistic (for the distributions we consider, it will often be the case that $T(y) = y$); and $a(\eta)$ is the log partition function. The quantity $e^{-a(\eta)}$ essentially plays the role of a normalization constant, that makes sure the distribution $p(y; \eta)$ sums/integrates over $y$ to 1.

A fixed choice of $T$, $a$ and $b$ defines a family (or set) of distributions that is parameterized by $\eta$; as we vary $\eta$, we then get different distributions within this family.
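As a small illustration (a sketch with names of our own choosing, not notation fixed by the text), such a family can be expressed in code by holding $b$, $T$ and $a$ fixed and letting $\eta$ vary:

```python
# Sketch: evaluate an exponential family density p(y; eta) for a fixed
# choice of b, T, and a; eta is the free parameter of the family.
import numpy as np

def exp_family_pmf(y, eta, b, T, a):
    """p(y; eta) = b(y) exp(eta^T T(y) - a(eta)).

    eta and T(y) may be scalars or same-length vectors.
    """
    return b(y) * np.exp(np.dot(eta, T(y)) - a(eta))
```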

We now show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions. The Bernoulli distribution with mean $\Phi$, written $\text{Bernoulli}(\Phi)$, specifies a distribution over $y \in \{0, 1\}$, so that $p(y = 1; \Phi) = \Phi$ and $p(y = 0; \Phi) = 1 - \Phi$. As we vary $\Phi$, we obtain Bernoulli distributions with different means. We now show that this class of Bernoulli distributions, ones obtained by varying $\Phi$, is in the exponential family; i.e., that there is a choice of $T$, $a$ and $b$ so that the exponential family form above becomes exactly the class of Bernoulli distributions.

We write the Bernoulli distribution as:

$$
\begin{aligned}
p(y; \Phi) &= \Phi^y (1 - \Phi)^{1-y} \\
&= \exp\big(y \log \Phi + (1 - y) \log(1 - \Phi)\big) \\
&= \exp\Big(\Big(\log \tfrac{\Phi}{1 - \Phi}\Big) y + \log(1 - \Phi)\Big).
\end{aligned}
$$

Thus, the natural parameter is given by $\eta = \log(\Phi / (1 - \Phi))$. Interestingly, if we invert this definition for $\eta$ by solving for $\Phi$ in terms of $\eta$, we obtain $\Phi = 1/(1 + e^{-\eta})$. This is the familiar sigmoid function! This will come up again when we derive logistic regression as a GLM. To complete the formulation of the Bernoulli distribution as an exponential family distribution, reading off the rewritten density above, we also have

$$T(y) = y, \qquad a(\eta) = -\log(1 - \Phi) = \log(1 + e^{\eta}), \qquad b(y) = 1.$$
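As a quick numeric sanity check, the following sketch (our own illustration; the variable names are assumptions) plugs these choices of $b$, $T$ and $a$ into the exponential family form and confirms that it reproduces the Bernoulli probabilities, and that the sigmoid recovers $\Phi$ from $\eta$:

```python
# Sketch: verify the Bernoulli distribution in exponential family form.
import numpy as np

phi = 0.3                                   # Bernoulli mean (arbitrary choice)
eta = np.log(phi / (1.0 - phi))             # natural parameter eta = log(phi/(1-phi))
assert np.isclose(1.0 / (1.0 + np.exp(-eta)), phi)  # sigmoid inverts eta back to phi

b = lambda y: 1.0                           # b(y) = 1
T = lambda y: y                             # T(y) = y
a = lambda e: np.log(1.0 + np.exp(e))       # a(eta) = log(1 + e^eta) = -log(1 - phi)

for y in (0, 1):
    p = b(y) * np.exp(eta * T(y) - a(eta))  # p(y; eta) = b(y) exp(eta T(y) - a(eta))
    assert np.isclose(p, phi**y * (1.0 - phi)**(1 - y))  # matches Bernoulli(phi)
```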
