
The application of kernels to support vector machines should already be clear, so we won't dwell on it much longer here. Keep in mind, however, that the idea of kernels has significantly broader applicability than SVMs. Specifically, if you have any learning algorithm that you can write in terms of only inner products $\langle x, z \rangle$ between input attribute vectors, then by replacing these with $K(x, z)$, where $K$ is a kernel, you can “magically” allow your algorithm to work efficiently in the high-dimensional feature space corresponding to $K$. For instance, this trick can be applied to the perceptron to derive a kernel perceptron algorithm. Many of the algorithms that we'll see later in this class will also be amenable to this method, which has come to be known as the “kernel trick.”
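To make the kernel trick concrete for an algorithm other than the SVM, here is a minimal sketch of a kernel perceptron in Python. It is an illustration only, not taken from the text: the function names (`kernel_perceptron`, `rbf_kernel`, `predict`) and the choice of a Gaussian kernel are assumptions.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, K=rbf_kernel, epochs=10):
    """Kernel perceptron: X is (m, n), y holds labels in {-1, +1}.

    Since w would live in the (possibly infinite-dimensional) feature
    space, we never form it; we keep one dual coefficient alpha[i] per
    training example and use only kernel evaluations K(x_j, x_i).
    """
    m = X.shape[0]
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            f = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(m))
            if y[i] * f <= 0:      # mistake: bump this example's coefficient
                alpha[i] += 1.0
    return alpha

def predict(X_train, y_train, alpha, x, K=rbf_kernel):
    """Predict sign(sum_j alpha_j y_j K(x_j, x)) for a new input x."""
    score = sum(alpha[j] * y_train[j] * K(X_train[j], x) for j in range(len(alpha)))
    return np.sign(score)
```

Every inner product the ordinary perceptron would compute has simply been replaced by a kernel evaluation, which is the whole content of the trick.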

Regularization and the non-separable case

The derivation of the SVM as presented so far assumed that the data is linearly separable. While mapping the data to a high-dimensional feature space via Φ does generally increase the likelihood that the data is separable, we can't guarantee that it always will be. Also, in some cases it is not clear that finding a separating hyperplane is exactly what we'd want to do, since such a classifier might be susceptible to outliers. For instance, the left figure below shows an optimal margin classifier; when a single outlier is added in the upper-left region (right figure), it causes the decision boundary to make a dramatic swing, and the resulting classifier has a much smaller margin.

[Figure: left, the original dataset with the optimal-margin decision boundary drawn through it; right, the same dataset after a single outlier is added, showing both the old boundary and the sharply swung new one.]

To make the algorithm work for non-linearly separable datasets, as well as to make it less sensitive to outliers, we reformulate our optimization (using $\ell_1$ regularization) as follows:

$$
\begin{aligned}
\min_{\gamma, w, b}\quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i \\
\text{s.t.}\quad & y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, m \\
& \xi_i \ge 0, \quad i = 1, \ldots, m.
\end{aligned}
$$

Thus, examples are now permitted to have (functional) margin less than 1, and if an example has functional margin $1 - \xi_i$ (with $\xi_i > 0$), we pay a cost of the objective function being increased by $C\xi_i$. The parameter $C$ controls the relative weighting between the twin goals of making $\|w\|^2$ small (which we saw earlier makes the margin large) and of ensuring that most examples have functional margin at least 1.
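This optimization is a quadratic program, so small instances can be handed directly to a general-purpose convex solver. The sketch below is one way to do that, using the cvxpy library and a synthetic dataset — both assumptions, not part of the text — to solve the $\ell_1$-regularized primal and count how many examples need a positive slack $\xi_i$.

```python
import numpy as np
import cvxpy as cp

# Synthetic data (assumption): rows of X are the x^(i), y holds labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
m, n = X.shape
C = 1.0                      # weight on the slack penalty

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(m)          # slack variables xi_i

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,   # functional margin >= 1 - xi_i
               xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("examples with positive slack:", int(np.sum(xi.value > 1e-6)))
```

Raising C penalizes slack more heavily and pushes the solution back toward the hard-margin classifier; lowering it lets more examples violate the margin in exchange for a larger margin on the rest.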

As before, we can form the Lagrangian:

$$
\mathcal{L}(w, b, \xi, \alpha, r) = \frac{1}{2} w^T w + C \sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i \left[ y^{(i)}\left(x^{(i)T} w + b\right) - 1 + \xi_i \right] - \sum_{i=1}^m r_i \xi_i.
$$

Here, the $\alpha_i$'s and $r_i$'s are our Lagrange multipliers (constrained to be $\ge 0$). We won't go through the derivation of the dual again in detail, but after setting the derivatives with respect to $w$ and $b$ to zero as before, substituting them back in, and simplifying, we obtain the following dual form of the problem:

$$
\begin{aligned}
\max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \\
\text{s.t.}\quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, m \\
& \sum_{i=1}^m \alpha_i y^{(i)} = 0.
\end{aligned}
$$
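For completeness, here is a sketch of the step being skipped. Setting the derivatives of the Lagrangian with respect to $w$, $b$, and the $\xi_i$'s to zero gives

$$
\frac{\partial \mathcal{L}}{\partial w} = w - \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} = 0, \qquad
\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^m \alpha_i y^{(i)} = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \xi_i} = C - \alpha_i - r_i = 0.
$$

The first condition again gives $w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}$, the second is the equality constraint of the dual, and the third, combined with $r_i \ge 0$, forces $\alpha_i \le C$; together with $\alpha_i \ge 0$ this is exactly the box constraint $0 \le \alpha_i \le C$ above.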

As before, we also have that $w$ can be expressed in terms of the $\alpha_i$'s as given in [link], so that after solving the dual problem, we can continue to use [link] to make our predictions. Note that, somewhat surprisingly, in adding $\ell_1$ regularization, the only change to the dual problem is that what was originally a constraint that $0 \le \alpha_i$ has now become $0 \le \alpha_i \le C$. The calculation for $b^*$ also has to be modified ([link] is no longer valid); see the comments in the next section/Platt's paper.
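To tie these pieces together, here is a minimal sketch (again assuming cvxpy and, for simplicity, a linear kernel; none of this is prescribed by the text) that solves the dual with the box constraint $0 \le \alpha_i \le C$, recovers $w = \sum_i \alpha_i y^{(i)} x^{(i)}$, and computes an intercept in one common way, by averaging over support vectors with $0 < \alpha_i < C$.

```python
import numpy as np
import cvxpy as cp

def train_soft_margin_dual(X, y, C=1.0):
    """Solve the l1-regularized SVM dual with the linear kernel K(x, z) = <x, z>.

    X is (m, n); y holds labels in {-1, +1}. Returns (alpha, w, b).
    """
    m, n = X.shape
    alpha = cp.Variable(m)
    v = cp.multiply(alpha, y)                       # elementwise alpha_i * y^(i)
    # The quadratic term sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>
    # equals ||X^T (alpha * y)||^2, which keeps the problem in DCP form.
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(v @ X))
    constraints = [alpha >= 0, alpha <= C,          # the new box constraint
                   cp.sum(v) == 0]                  # sum_i alpha_i y^(i) = 0
    cp.Problem(objective, constraints).solve()

    a = alpha.value
    w = X.T @ (a * y)                               # w = sum_i alpha_i y^(i) x^(i)
    # Intercept from support vectors strictly inside the box (0 < alpha_i < C),
    # whose functional margin is exactly 1 (one common choice; see Platt's paper).
    sv = np.where((a > 1e-6) & (a < C - 1e-6))[0]
    b = float(np.mean(y[sv] - X[sv] @ w)) if len(sv) else 0.0
    return a, w, b
```

With a nonlinear kernel one would not form $w$ at all, but keep the $\alpha_i$'s and predict with $\operatorname{sign}\big(\sum_i \alpha_i y^{(i)} K(x^{(i)}, x) + b\big)$, exactly as in the kernel discussion above.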

Source: OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4