Let's look at one more example of a GLM. Consider a classification problem in which the response variable $y$ can take on any one of $k$ values, so $y\in \{1,2,...,k\}$ . For example, rather than classifying email into the two classes spam or not-spam—which would have been a binary classification problem—we might want to classify it into three classes, such as spam, personal mail, and work-related mail. The response variable is still discrete, but can now take on more than two values. We will thus model it as distributed according to a multinomial distribution.
Let's derive a GLM for modelling this type of multinomial data. To do so, we will begin by expressing the multinomial as an exponential family distribution.
To parameterize a multinomial over $k$ possible outcomes, one could use $k$ parameters ${\Phi}_{1},...,{\Phi}_{k}$ specifying the probability of each of the outcomes. However, these parameters would be redundant, or more formally, they would not be independent (since knowing any $k-1$ of the ${\Phi}_{i}$ 's uniquely determines the last one, as they must satisfy ${\sum}_{i=1}^{k}{\Phi}_{i}=1$ ). So, we will instead parameterize the multinomial with only $k-1$ parameters, ${\Phi}_{1},...,{\Phi}_{k-1}$ , where ${\Phi}_{i}=p(y=i;\Phi )$ , and $p(y=k;\Phi )=1-{\sum}_{i=1}^{k-1}{\Phi}_{i}$ . For notational convenience, we will also let ${\Phi}_{k}=1-{\sum}_{i=1}^{k-1}{\Phi}_{i}$ , but we should keep in mind that this is not a parameter, and that it is fully specified by ${\Phi}_{1},...,{\Phi}_{k-1}$ .
To express the multinomial as an exponential family distribution, we will define $T\left(y\right)\in {\mathbb{R}}^{k-1}$ as follows:
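Concretely, consistent with the indicator relationship ${\left(T\left(y\right)\right)}_{i}=1\{y=i\}$ introduced below, $T\left(y\right)$ picks out one standard basis vector for each of the first $k-1$ outcomes, and the zero vector for outcome $k$ :

$$
T(1)=\begin{bmatrix}1\\0\\0\\\vdots\\0\end{bmatrix},\quad
T(2)=\begin{bmatrix}0\\1\\0\\\vdots\\0\end{bmatrix},\quad\ldots,\quad
T(k-1)=\begin{bmatrix}0\\0\\0\\\vdots\\1\end{bmatrix},\quad
T(k)=\begin{bmatrix}0\\0\\0\\\vdots\\0\end{bmatrix}.
$$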
Unlike our previous examples, here we do not have $T\left(y\right)=y$ ; also, $T\left(y\right)$ is now a $k-1$ dimensional vector, rather than a real number. We will write ${\left(T\left(y\right)\right)}_{i}$ to denote the $i$ -th element of the vector $T\left(y\right)$ .
We introduce one more very useful piece of notation. An indicator function $1\{\cdot \}$ takes on a value of 1 if its argument is true, and 0 otherwise ( $1\left\{\mathrm{True}\right\}=1$ , $1\left\{\mathrm{False}\right\}=0$ ). For example, $1\{2=3\}=0$ , and $1\{3=5-2\}=1$ . So, we can also write the relationship between $T\left(y\right)$ and $y$ as ${\left(T\left(y\right)\right)}_{i}=1\{y=i\}$ . (Before you continue reading, please make sure you understand why this is true!) Further, we have that $\mathrm{E}\left[{\left(T\left(y\right)\right)}_{i}\right]=P(y=i)={\Phi}_{i}$ .
We are now ready to show that the multinomial is a member of the exponential family. We have:
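Writing the probability mass function using indicators and then rewriting in terms of $T\left(y\right)$ (a standard derivation, reconstructed here step by step):

$$
\begin{aligned}
p(y;\Phi) &= \Phi_1^{1\{y=1\}}\,\Phi_2^{1\{y=2\}}\cdots\Phi_k^{1\{y=k\}} \\
&= \Phi_1^{1\{y=1\}}\,\Phi_2^{1\{y=2\}}\cdots\Phi_k^{\,1-\sum_{i=1}^{k-1}1\{y=i\}} \\
&= \Phi_1^{(T(y))_1}\,\Phi_2^{(T(y))_2}\cdots\Phi_k^{\,1-\sum_{i=1}^{k-1}(T(y))_i} \\
&= \exp\!\Big((T(y))_1\log\Phi_1+\cdots+(T(y))_{k-1}\log\Phi_{k-1}+\Big(1-\sum_{i=1}^{k-1}(T(y))_i\Big)\log\Phi_k\Big) \\
&= \exp\!\Big((T(y))_1\log\big(\Phi_1/\Phi_k\big)+\cdots+(T(y))_{k-1}\log\big(\Phi_{k-1}/\Phi_k\big)+\log\Phi_k\Big) \\
&= b(y)\exp\!\big(\eta^{T}T(y)-a(\eta)\big)
\end{aligned}
$$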
where
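$$
\eta=\begin{bmatrix}\log(\Phi_1/\Phi_k)\\\log(\Phi_2/\Phi_k)\\\vdots\\\log(\Phi_{k-1}/\Phi_k)\end{bmatrix},\qquad
a(\eta)=-\log\Phi_k,\qquad
b(y)=1.
$$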
This completes our formulation of the multinomial as an exponential family distribution.
The link function is given (for $i=1,...,k$ ) by
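$$
\eta_i=\log\frac{\Phi_i}{\Phi_k}.
$$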
For convenience, we have also defined ${\eta}_{k}=\log ({\Phi}_{k}/{\Phi}_{k})=0$ . To invert the link function and derive the response function, we therefore have that
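$$
e^{\eta_i}=\frac{\Phi_i}{\Phi_k},\qquad
\Phi_k\,e^{\eta_i}=\Phi_i,\qquad
\Phi_k\sum_{i=1}^{k}e^{\eta_i}=\sum_{i=1}^{k}\Phi_i=1.
$$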
This implies that ${\Phi}_{k}=1/{\sum}_{i=1}^{k}{e}^{{\eta}_{i}}$ , which can be substituted back into ${\Phi}_{i}={\Phi}_{k}{e}^{{\eta}_{i}}$ to give the response function
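$$
\Phi_i=\frac{e^{\eta_i}}{\sum_{j=1}^{k}e^{\eta_j}}.
$$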
This function mapping from the $\eta $ 's to the $\Phi $ 's is called the softmax function.
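As a concrete illustration, here is a minimal softmax sketch in Python (the function name and the use of NumPy are our own choices for illustration, not part of the original notes):

```python
import numpy as np

def softmax(eta):
    """Map natural parameters eta_1, ..., eta_k to probabilities Phi_1, ..., Phi_k.

    Subtracting max(eta) before exponentiating leaves the result unchanged
    (numerator and denominator are scaled by the same factor) but avoids
    floating-point overflow for large eta values.
    """
    z = np.exp(eta - np.max(eta))
    return z / z.sum()

# With eta_k fixed to 0 as in the text, for k = 3 classes:
phi = softmax(np.array([1.0, 2.0, 0.0]))
```

The resulting `phi` is a valid probability vector: every entry is positive and the entries sum to 1, with the largest mass on the class with the largest natural parameter.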
To complete our model, we use Assumption 3, given earlier, that the ${\eta}_{i}$ 's are linearly related to the $x$ 's. So, we have ${\eta}_{i}={\theta}_{i}^{T}x$ (for $i=1,...,k-1$ ), where ${\theta}_{1},...,{\theta}_{k-1}\in {\mathbb{R}}^{n+1}$ are the parameters of our model. For notational convenience, we can also define ${\theta}_{k}=0$ , so that ${\eta}_{k}={\theta}_{k}^{T}x=0$ , as given previously. Hence, our model assumes that the conditional distribution of $y$ given $x$ is given by
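$$
p(y=i\,|\,x;\theta)=\Phi_i=\frac{e^{\theta_i^{T}x}}{\sum_{j=1}^{k}e^{\theta_j^{T}x}}.
$$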
This model, which applies to classification problems where $y\in \{1,...,k\}$ , is called softmax regression. It is a generalization of logistic regression.
Our hypothesis will output
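$$
h_\theta(x)=\mathrm{E}\big[T(y)\,\big|\,x;\theta\big]
=\begin{bmatrix}\Phi_1\\\Phi_2\\\vdots\\\Phi_{k-1}\end{bmatrix}
=\begin{bmatrix}\dfrac{\exp(\theta_1^{T}x)}{\sum_{j=1}^{k}\exp(\theta_j^{T}x)}\\[2ex]\vdots\\[1ex]\dfrac{\exp(\theta_{k-1}^{T}x)}{\sum_{j=1}^{k}\exp(\theta_j^{T}x)}\end{bmatrix}.
$$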
In other words, our hypothesis will output the estimated probability $p(y=i|x;\theta )$ , for every value of $i=1,...,k$ . (Even though ${h}_{\theta}\left(x\right)$ as defined above is only $k-1$ dimensional, clearly $p(y=k|x;\theta )$ can be obtained as $1-{\sum}_{i=1}^{k-1}{\Phi}_{i}$ .)
Lastly, let's discuss parameter fitting. Similar to our original derivation of ordinary least squares and logistic regression, if we have a training set of $m$ examples $\{({x}^{\left(i\right)},{y}^{\left(i\right)});i=1,...,m\}$ and would like to learn the parameters ${\theta}_{i}$ of this model, we would begin by writing down the log-likelihood
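$$
\begin{aligned}
\ell(\theta) &= \sum_{i=1}^{m}\log p\big(y^{(i)}\,\big|\,x^{(i)};\theta\big) \\
&= \sum_{i=1}^{m}\log\prod_{l=1}^{k}\left(\frac{e^{\theta_l^{T}x^{(i)}}}{\sum_{j=1}^{k}e^{\theta_j^{T}x^{(i)}}}\right)^{1\{y^{(i)}=l\}}
\end{aligned}
$$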
To obtain the second line above, we used the definition for $p(y|x;\theta )$ given above. We can now obtain the maximum likelihood estimate of the parameters by maximizing $\ell \left(\theta \right)$ in terms of $\theta $ , using a method such as gradient ascent or Newton's method.
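As a sketch of what batch gradient ascent might look like here (names, learning rate, and iteration count are illustrative assumptions): differentiating the log-likelihood with respect to ${\theta}_{l}$ gives the gradient ${\sum}_{i=1}^{m}\left(1\{{y}^{\left(i\right)}=l\}-p(y=l|{x}^{\left(i\right)};\theta )\right){x}^{\left(i\right)}$ , which the code below ascends.

```python
import numpy as np

def fit_softmax(X, y, k, lr=0.1, n_iters=1000):
    """Fit softmax regression by batch gradient ascent on the log-likelihood.

    X: (m, n) design matrix (include a column of ones for an intercept term).
    y: (m,) integer labels in {0, ..., k-1}; class k-1 plays the role of
       class k in the text, with its parameter vector held at zero.
    Returns Theta of shape (k, n), one parameter vector theta_l per row.
    """
    m, n = X.shape
    Theta = np.zeros((k, n))
    for _ in range(n_iters):
        logits = X @ Theta.T                        # (m, k): eta_l = theta_l^T x
        logits -= logits.max(axis=1, keepdims=True) # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)           # softmax: p(y=l | x; theta)
        Y = np.eye(k)[y]                            # one-hot rows: (T(y))_l = 1{y=l}
        grad = (Y - P).T @ X / m                    # gradient of mean log-likelihood
        Theta += lr * grad                          # ascent step (maximizing)
        Theta[k - 1] = 0.0                          # keep theta_k fixed at zero
    return Theta
```

Note that fixing the last parameter vector at zero mirrors the text's convention ${\theta}_{k}=0$ and removes the redundancy in the parameterization.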