To make our discussion of SVMs easier, we'll first need to introduce a new notation for talking about classification. We will be considering a linear classifier for a binary classification problem with labels $y$ and features $x$. From now on, we'll use $y \in \{-1, 1\}$ (instead of $\{0, 1\}$) to denote the class labels. Also, rather than parameterizing our linear classifier with the vector $\theta$, we will use parameters $w, b$, and write our classifier as

$$h_{w,b}(x) = g(w^T x + b).$$
Here, $g\left(z\right)=1$ if $z\ge 0$ , and $g\left(z\right)=-1$ otherwise. This “ $w,b$ ” notation allows us to explicitly treat the intercept term $b$ separately from the other parameters. (We also drop the convention we had previously of letting ${x}_{0}=1$ be an extra coordinate in the input feature vector.) Thus, $b$ takes the role of what was previously ${\theta}_{0}$ , and $w$ takes the role of ${[{\theta}_{1}...{\theta}_{n}]}^{T}$ .
Note also that, from our definition of $g$ above, our classifier will directly predict either 1 or $-1$ (cf. the perceptron algorithm), without first going through the intermediate step of estimating the probability of $y$ being 1 (which was what logistic regression did).
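To make this notation concrete, here is a minimal sketch of the classifier $h_{w,b}(x) = g(w^Tx + b)$; the function and variable names below are illustrative, not part of the notes:

```python
import numpy as np

def g(z):
    # g(z) = 1 if z >= 0, and g(z) = -1 otherwise
    return 1 if z >= 0 else -1

def h(w, b, x):
    # Linear classifier h_{w,b}(x) = g(w^T x + b); predicts 1 or -1 directly
    return g(np.dot(w, x) + b)
```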
Let's formalize the notions of the functional and geometric margins. Given a training example $({x}^{(i)}, {y}^{(i)})$, we define the functional margin of $(w,b)$ with respect to the training example to be

$$\hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)} + b).$$
Note that if ${y}^{(i)}=1$, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need ${w}^{T}x^{(i)}+b$ to be a large positive number. Conversely, if ${y}^{(i)}=-1$, then for the functional margin to be large, we need ${w}^{T}x^{(i)}+b$ to be a large negative number. Moreover, if ${y}^{(i)}({w}^{T}x^{(i)}+b)>0$, then our prediction on this example is correct. (Check this yourself.) Hence, a large functional margin represents a confident and correct prediction.
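Continuing the sketch above (again, the names are ours), the functional margin of $(w,b)$ with respect to a single training example is just this product:

```python
def functional_margin(w, b, x_i, y_i):
    # Functional margin of (w, b) w.r.t. one example:
    #   gamma_hat^(i) = y^(i) * (w^T x^(i) + b)
    # Positive exactly when the example is classified correctly;
    # larger values correspond to more confident correct predictions.
    return y_i * (np.dot(w, x_i) + b)
```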
For a linear classifier with the choice of $g$ given above (taking values in $\{-1,1\}$), there's one property of the functional margin that makes it not a very good measure of confidence, however. Given our choice of $g$, we note that if we replace $w$ with $2w$ and $b$ with $2b$, then since $g({w}^{T}x+b)=g(2{w}^{T}x+2b)$, this would not change ${h}_{w,b}(x)$ at all. I.e., $g$, and hence also ${h}_{w,b}(x)$, depends only on the sign, but not on the magnitude, of ${w}^{T}x+b$. However, replacing $(w,b)$ with $(2w,2b)$ also results in multiplying our functional margin by a factor of 2. Thus, it seems that by exploiting our freedom to scale $w$ and $b$, we can make the functional margin arbitrarily large without really changing anything meaningful. Intuitively, it might therefore make sense to impose some sort of normalization condition such as $\|w\|_2 = 1$; i.e., we might replace $(w,b)$ with $(w/\|w\|_2, b/\|w\|_2)$, and instead consider the functional margin of $(w/\|w\|_2, b/\|w\|_2)$. We'll come back to this later.
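This rescaling freedom is easy to check numerically with the sketches above; the particular numbers below are made up purely for illustration:

```python
w = np.array([2.0, -1.0])
b = 0.5
x_i, y_i = np.array([1.0, 0.0]), 1

# Doubling (w, b) leaves the prediction unchanged...
print(h(w, b, x_i) == h(2 * w, 2 * b, x_i))              # True
# ...but doubles the functional margin:
print(functional_margin(w, b, x_i, y_i))                  # 2.5
print(functional_margin(2 * w, 2 * b, x_i, y_i))          # 5.0

# Normalizing by ||w||_2 removes this rescaling freedom:
norm = np.linalg.norm(w)
print(functional_margin(w / norm, b / norm, x_i, y_i))    # ~1.118
```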
Given a training set $S=\{({x}^{(i)},{y}^{(i)}); i=1,\ldots,m\}$, we also define the functional margin of $(w,b)$ with respect to $S$ to be the smallest of the functional margins of the individual training examples. Denoted by $\widehat{\gamma}$, this can therefore be written:

$$\hat{\gamma} = \min_{i=1,\ldots,m} \hat{\gamma}^{(i)}.$$
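A small sketch of this training-set functional margin, assuming the training set is stored as an array `X` of shape $(m, n)$ and a label vector `y` of shape $(m,)$ with entries in $\{-1, 1\}$ (these array conventions are ours):

```python
def functional_margin_set(w, b, X, y):
    # gamma_hat = min_i  y^(i) * (w^T x^(i) + b)
    # X has shape (m, n); y has shape (m,) with entries in {-1, 1}.
    return np.min(y * (X @ w + b))
```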
Next, let's talk about geometric margins. Consider the picture below:
The decision boundary corresponding to $(w,b)$ is shown, along with the vector $w$ . Note that $w$ is orthogonal (at ${90}^{\circ}$ ) to the separating hyperplane. (You should convince yourself that this must be the case.) Consider the point at A, which represents the input ${x}^{\left(i\right)}$ of some training example with label ${y}^{\left(i\right)}=1$ . Its distance to the decision boundary, ${\gamma}^{\left(i\right)}$ , is given by the line segment AB.
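For reference, the length of the segment AB, i.e., the Euclidean distance from a point $x^{(i)}$ to the hyperplane $w^Tx + b = 0$, works out to $|w^Tx^{(i)} + b| / \|w\|_2$; this anticipates the geometric margin derived next. A small sketch under the same conventions as above:

```python
def distance_to_hyperplane(w, b, x_i):
    # Euclidean distance from the point x_i to the hyperplane {x : w^T x + b = 0}
    return abs(np.dot(w, x_i) + b) / np.linalg.norm(w)
```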