
In the original linear regression algorithm, to make a prediction at a query point $x$ (i.e., to evaluate $h(x)$), we would:

  1. Fit $\theta$ to minimize $\sum_i (y^{(i)} - \theta^T x^{(i)})^2$.
  2. Output $\theta^T x$.

In contrast, the locally weighted linear regression algorithm does the following:

  1. Fit $\theta$ to minimize $\sum_i w^{(i)} (y^{(i)} - \theta^T x^{(i)})^2$.
  2. Output $\theta^T x$.

Here, the $w^{(i)}$'s are non-negative valued weights. Intuitively, if $w^{(i)}$ is large for a particular value of $i$, then in picking $\theta$, we'll try hard to make $(y^{(i)} - \theta^T x^{(i)})^2$ small. If $w^{(i)}$ is small, then the $(y^{(i)} - \theta^T x^{(i)})^2$ error term will be pretty much ignored in the fit.

A fairly standard choice for the weights is

$$w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right).$$

(If $x$ is vector-valued, this is generalized to $w^{(i)} = \exp\left(-(x^{(i)} - x)^T (x^{(i)} - x) / (2\tau^2)\right)$, or $w^{(i)} = \exp\left(-(x^{(i)} - x)^T \Sigma^{-1} (x^{(i)} - x) / 2\right)$, for an appropriate choice of $\tau$ or $\Sigma$.)

Note that the weights depend on the particular point $x$ at which we're trying to evaluate $h(x)$. Moreover, if $|x^{(i)} - x|$ is small, then $w^{(i)}$ is close to 1; and if $|x^{(i)} - x|$ is large, then $w^{(i)}$ is small. Hence, $\theta$ is chosen giving a much higher “weight” to the (errors on) training examples close to the query point $x$. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the $w^{(i)}$'s do not directly have anything to do with Gaussians, and in particular the $w^{(i)}$ are not random variables, normally distributed or otherwise.) The parameter $\tau$ controls how quickly the weight of a training example falls off with the distance of its $x^{(i)}$ from the query point $x$; $\tau$ is called the bandwidth parameter, and is also something that you'll get to experiment with in your homework.
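
To make the two steps above concrete, here is a minimal NumPy sketch of a single prediction. The function name `lwr_predict`, its argument names, and the use of the weighted normal equations $(X^T W X)\theta = X^T W y$ as the solver are illustrative choices, not prescribed by the text:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """Locally weighted linear regression prediction at a single query point.

    X       : (m, n) matrix of training inputs (include a column of ones
              if an intercept term is wanted)
    y       : (m,) vector of training targets
    x_query : (n,) query point
    tau     : bandwidth parameter controlling how fast the weights fall off
    """
    # Gaussian-shaped weights: examples near the query point get weight
    # close to 1, distant examples get weight close to 0.
    diff = X - x_query                                        # (m, n)
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * tau ** 2))   # (m,)

    # Fit theta to minimize sum_i w^(i) (y^(i) - theta^T x^(i))^2,
    # here via the weighted normal equations (X^T W X) theta = X^T W y.
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    # Output theta^T x for the query point.
    return x_query @ theta
```

Note that the whole fit is redone for each query point $x$, since the weights $w^{(i)}$ depend on $x$.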

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$'s), which are fit to the data. Once we've fit the $\theta_i$'s and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term “non-parametric” (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis $h$ grows linearly with the size of the training set.
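
As a rough illustration of the contrast (reusing the hypothetical `lwr_predict` sketch above, and assuming `X`, `y`, and `x_query` are already defined):

```python
# Parametric: fit theta once; afterwards the training data can be discarded.
theta = np.linalg.solve(X.T @ X, X.T @ y)      # ordinary least squares
y_hat = x_query @ theta

# Non-parametric: every new prediction needs the entire training set (X, y).
y_hat_lwr = lwr_predict(X, y, x_query, tau=0.5)
```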

Classification and logistic regression

Let's now talk about the classification problem. This is just like the regression problem, except that the values $y$ we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which $y$ can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and $y$ may be 1 if it is a piece of spam mail, and 0 otherwise. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols “-” and “+”. Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.
