
Softmax regression

Let's look at one more example of a GLM. Consider a classification problem in which the response variable $y$ can take on any one of $k$ values, so $y \in \{1, 2, \ldots, k\}$. For example, rather than classifying email into the two classes spam or not-spam (which would have been a binary classification problem), we might want to classify it into three classes, such as spam, personal mail, and work-related mail. The response variable is still discrete, but can now take on more than two values. We will thus model it as distributed according to a multinomial distribution.

Let's derive a GLM for modelling this type of multinomial data. To do so, we will begin by expressing the multinomial as an exponential family distribution.

To parameterize a multinomial over $k$ possible outcomes, one could use $k$ parameters $\phi_1, \ldots, \phi_k$ specifying the probability of each of the outcomes. However, these parameters would be redundant, or more formally, they would not be independent (since knowing any $k-1$ of the $\phi_i$'s uniquely determines the last one, as they must satisfy $\sum_{i=1}^k \phi_i = 1$). So, we will instead parameterize the multinomial with only $k-1$ parameters, $\phi_1, \ldots, \phi_{k-1}$, where $\phi_i = p(y = i; \phi)$ and $p(y = k; \phi) = 1 - \sum_{i=1}^{k-1} \phi_i$. For notational convenience, we will also let $\phi_k = 1 - \sum_{i=1}^{k-1} \phi_i$, but we should keep in mind that this is not a parameter, and that it is fully specified by $\phi_1, \ldots, \phi_{k-1}$.

To express the multinomial as an exponential family distribution, we will define $T(y) \in \mathbb{R}^{k-1}$ as follows:

\[
T(1) = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
T(2) = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
T(3) = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad
T(k-1) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}, \quad
T(k) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
\]

Unlike our previous examples, here we do not have $T(y) = y$; also, $T(y)$ is now a $(k-1)$-dimensional vector, rather than a real number. We will write $(T(y))_i$ to denote the $i$-th element of the vector $T(y)$.

We introduce one more very useful piece of notation. An indicator function $1\{\cdot\}$ takes on a value of 1 if its argument is true, and 0 otherwise ($1\{\text{True}\} = 1$, $1\{\text{False}\} = 0$). For example, $1\{2 = 3\} = 0$, and $1\{3 = 5 - 2\} = 1$. So, we can also write the relationship between $T(y)$ and $y$ as $(T(y))_i = 1\{y = i\}$. (Before you continue reading, please make sure you understand why this is true!) Further, we have that $E[(T(y))_i] = P(y = i) = \phi_i$.
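To make this concrete in code, here is a minimal sketch (the function name `T` and the use of NumPy are my own choices, not from the text) of the mapping from a label $y$ to its $(k-1)$-dimensional indicator vector:

```python
import numpy as np

def T(y, k):
    """Map a label y in {1, ..., k} to the (k-1)-dimensional vector
    with (T(y))_i = 1{y == i}; T(k) is the all-zeros vector."""
    t = np.zeros(k - 1)
    if y < k:
        t[y - 1] = 1.0  # labels are 1-indexed, arrays 0-indexed
    return t

k = 4
print(T(2, k))  # [0. 1. 0.]
print(T(4, k))  # [0. 0. 0.]  (T(k) carries no 1)
```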

We are now ready to show that the multinomial is a member of the exponential family. We have:

\begin{align*}
p(y; \phi) &= \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}} \\
&= \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}} \\
&= \phi_1^{(T(y))_1} \phi_2^{(T(y))_2} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} (T(y))_i} \\
&= \exp\Big( (T(y))_1 \log(\phi_1) + (T(y))_2 \log(\phi_2) + \cdots + \Big( 1 - \sum_{i=1}^{k-1} (T(y))_i \Big) \log(\phi_k) \Big) \\
&= \exp\Big( (T(y))_1 \log(\phi_1/\phi_k) + (T(y))_2 \log(\phi_2/\phi_k) + \cdots + (T(y))_{k-1} \log(\phi_{k-1}/\phi_k) + \log(\phi_k) \Big) \\
&= b(y) \exp\big( \eta^T T(y) - a(\eta) \big)
\end{align*}

where

\[
\eta = \begin{bmatrix} \log(\phi_1/\phi_k) \\ \log(\phi_2/\phi_k) \\ \vdots \\ \log(\phi_{k-1}/\phi_k) \end{bmatrix}, \qquad
a(\eta) = -\log(\phi_k), \qquad
b(y) = 1.
\]

This completes our formulation of the multinomial as an exponential family distribution.
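As a quick sanity check (my own, not part of the original derivation), the following sketch verifies numerically that $b(y)\exp(\eta^T T(y) - a(\eta))$, with $\eta$ and $a(\eta)$ defined as above, reproduces the multinomial probabilities:

```python
import numpy as np

phi = np.array([0.5, 0.3, 0.2])      # phi_1, ..., phi_k for k = 3
k = len(phi)
eta = np.log(phi[:-1] / phi[-1])     # eta_i = log(phi_i / phi_k)
a = -np.log(phi[-1])                 # a(eta) = -log(phi_k)

for y in range(1, k + 1):
    Ty = np.zeros(k - 1)
    if y < k:
        Ty[y - 1] = 1.0              # (T(y))_i = 1{y = i}
    p = np.exp(eta @ Ty - a)         # b(y) = 1
    print(y, round(p, 6), phi[y - 1])  # the two values should match
```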

The link function is given (for $i = 1, \ldots, k$) by

\[
\eta_i = \log \frac{\phi_i}{\phi_k}.
\]

For convenience, we have also defined $\eta_k = \log(\phi_k/\phi_k) = 0$. To invert the link function and derive the response function, we therefore have that

\begin{align*}
e^{\eta_i} &= \frac{\phi_i}{\phi_k} \\
\phi_k e^{\eta_i} &= \phi_i \\
\phi_k \sum_{i=1}^k e^{\eta_i} &= \sum_{i=1}^k \phi_i = 1
\end{align*}

This implies that $\phi_k = 1 / \sum_{i=1}^k e^{\eta_i}$, which can be substituted back into $\phi_i = \phi_k e^{\eta_i}$ to give the response function

\[
\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^k e^{\eta_j}}
\]

This function mapping from the $\eta$'s to the $\phi$'s is called the softmax function.
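In code, the softmax function is only a few lines. The sketch below is one common formulation; the max-subtraction is a standard numerical-stability trick that I have added, and it is not required by the derivation above (it leaves the result unchanged because it rescales numerator and denominator by the same constant):

```python
import numpy as np

def softmax(eta):
    """Map eta = (eta_1, ..., eta_k) to phi_i = e^{eta_i} / sum_j e^{eta_j}.
    Subtracting max(eta) cancels in the ratio but prevents overflow."""
    z = np.exp(eta - np.max(eta))
    return z / z.sum()

eta = np.array([2.0, 1.0, 0.0])  # eta_k = 0, matching the convention above
print(softmax(eta))              # probabilities that sum to 1
```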

To complete our model, we use Assumption 3, given earlier, that the $\eta_i$'s are linearly related to the $x$'s. So, we have $\eta_i = \theta_i^T x$ (for $i = 1, \ldots, k-1$), where $\theta_1, \ldots, \theta_{k-1} \in \mathbb{R}^{n+1}$ are the parameters of our model. For notational convenience, we can also define $\theta_k = 0$, so that $\eta_k = \theta_k^T x = 0$, as given previously. Hence, our model assumes that the conditional distribution of $y$ given $x$ is given by

\[
p(y = i \mid x; \theta) = \phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^k e^{\eta_j}} = \frac{e^{\theta_i^T x}}{\sum_{j=1}^k e^{\theta_j^T x}}
\]

This model, which applies to classification problems where $y \in \{1, \ldots, k\}$, is called softmax regression. It is a generalization of logistic regression.

Our hypothesis will output

\[
h_\theta(x) = E[T(y) \mid x; \theta]
= E\left[ \begin{array}{c} 1\{y = 1\} \\ 1\{y = 2\} \\ \vdots \\ 1\{y = k-1\} \end{array} \,\middle|\, x; \theta \right]
= \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix}
= \begin{bmatrix} \dfrac{\exp(\theta_1^T x)}{\sum_{j=1}^k \exp(\theta_j^T x)} \\[2ex] \dfrac{\exp(\theta_2^T x)}{\sum_{j=1}^k \exp(\theta_j^T x)} \\[2ex] \vdots \\[1ex] \dfrac{\exp(\theta_{k-1}^T x)}{\sum_{j=1}^k \exp(\theta_j^T x)} \end{bmatrix}.
\]

In other words, our hypothesis will output the estimated probability $p(y = i \mid x; \theta)$ for every value of $i = 1, \ldots, k$. (Even though $h_\theta(x)$ as defined above is only $(k-1)$-dimensional, clearly $p(y = k \mid x; \theta)$ can be obtained as $1 - \sum_{i=1}^{k-1} \phi_i$.)
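Putting the pieces together, here is a sketch of the hypothesis; the parameter values are illustrative assumptions, and unlike the $(k-1)$-dimensional $h_\theta(x)$ above, this version returns all $k$ probabilities, with $p(y = k \mid x; \theta)$ included explicitly:

```python
import numpy as np

def h(theta, x):
    """Estimated p(y = i | x; theta) for i = 1, ..., k.
    theta has shape (k, n+1); its last row stays 0 so that theta_k = 0."""
    eta = theta @ x                  # eta_i = theta_i^T x
    z = np.exp(eta - np.max(eta))    # stability shift, cancels in the ratio
    return z / z.sum()

k, n = 3, 2
theta = np.zeros((k, n + 1))
theta[0] = [0.5, -1.0, 2.0]          # theta_1 (illustrative values)
theta[1] = [1.0, 0.3, -0.2]          # theta_2; theta_3 stays 0
x = np.array([1.0, 0.7, -1.2])       # x_0 = 1 is the intercept term
print(h(theta, x))                   # k probabilities summing to 1
```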

Lastly, let's discuss parameter fitting. Similar to our original derivation of ordinary least squares and logistic regression, if we have a training set of $m$ examples $\{(x^{(i)}, y^{(i)}); i = 1, \ldots, m\}$ and would like to learn the parameters $\theta_i$ of this model, we would begin by writing down the log-likelihood

\begin{align*}
\ell(\theta) &= \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \sum_{i=1}^m \log \prod_{l=1}^k \left( \frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^k e^{\theta_j^T x^{(i)}}} \right)^{1\{y^{(i)} = l\}}
\end{align*}

To obtain the second line above, we used the definition of $p(y \mid x; \theta)$ given earlier. We can now obtain the maximum likelihood estimate of the parameters by maximizing $\ell(\theta)$ in terms of $\theta$, using a method such as gradient ascent or Newton's method.
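As a final sketch, here is one way to compute $\ell(\theta)$ and take a plain gradient-ascent step; the gradient expression $\nabla_{\theta_l} \ell(\theta) = \sum_{i=1}^m \big( 1\{y^{(i)} = l\} - \phi_l(x^{(i)}) \big) x^{(i)}$ follows from differentiating $\ell(\theta)$ above, and the learning rate and toy data are illustrative choices:

```python
import numpy as np

def log_likelihood(theta, X, y):
    """l(theta) = sum_i log p(y^(i) | x^(i); theta).
    X has shape (m, n+1), y holds labels in {1, ..., k}."""
    eta = X @ theta.T                           # (m, k) matrix of scores
    eta = eta - eta.max(axis=1, keepdims=True)  # stability shift
    log_phi = eta - np.log(np.exp(eta).sum(axis=1, keepdims=True))
    return log_phi[np.arange(len(y)), y - 1].sum()

def gradient_ascent_step(theta, X, y, lr=0.1):
    """One step of gradient ascent on l(theta)."""
    eta = X @ theta.T
    phi = np.exp(eta - eta.max(axis=1, keepdims=True))
    phi = phi / phi.sum(axis=1, keepdims=True)  # (m, k) softmax rows
    onehot = np.eye(theta.shape[0])[y - 1]      # (m, k) indicators 1{y^(i) = l}
    grad = (onehot - phi).T @ X                 # (k, n+1) gradient
    grad[-1] = 0.0                              # keep theta_k pinned at 0
    return theta + lr * grad

# toy data: m = 4 examples, n = 1 feature (plus intercept), k = 3 classes
X = np.array([[1.0, 0.2], [1.0, -1.5], [1.0, 3.0], [1.0, 0.8]])
y = np.array([1, 2, 3, 1])
theta = np.zeros((3, 2))
for _ in range(100):
    theta = gradient_ascent_step(theta, X, y)
print(log_likelihood(theta, X, y))  # should increase across iterations
```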

Source: OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4