  1. Repeat until convergence {
    1. $\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$ (for every $j$).
  2. }
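
Before interpreting this rule, it may help to spell out the gradient computation it relies on. Assuming the earlier definitions $J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ and $h_\theta(x) = \sum_j \theta_j x_j$:

\[
\frac{\partial J(\theta)}{\partial \theta_j}
= \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
= \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
= -\sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)},
\]

so the update above is exactly $\theta_j := \theta_j - \alpha \, \partial J(\theta) / \partial \theta_j$.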

As the calculation above shows, the quantity in the summation in the update rule is just $-\partial J(\theta) / \partial \theta_j$ (for the original definition of $J$). So, this is simply gradient descent on the original cost function $J$. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges (assuming the learning rate $\alpha$ is not too large) to the global minimum. Indeed, $J$ is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.

[Figure: contour plot of a quadratic function, with the trajectory taken by gradient descent marked.]

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30). The $x$'s in the figure (joined by straight lines) mark the successive values of $\theta$ that gradient descent went through.

When we run batch gradient descent to fit $\theta$ on our previous dataset, to learn to predict housing price as a function of living area, we obtain $\theta_0 = 71.27$, $\theta_1 = 0.1345$. If we plot $h_\theta(x)$ as a function of $x$ (area), along with the training data, we obtain the following figure:

[Figure: housing prices plotted against living area, with the fitted line $h_\theta(x)$ overlaid.]

If the number of bedrooms were included as one of the input features as well, we get $\theta_0 = 89.60$, $\theta_1 = 0.1392$, $\theta_2 = -8.738$.
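
To make the batch procedure concrete, here is a minimal Python sketch of batch gradient descent for linear regression. Everything in it is illustrative: the synthetic data, the learning rate, and the iteration count are arbitrary choices for a toy problem, not the housing dataset or settings used above.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha, num_iters):
    """Fit linear regression by batch gradient descent.

    X: (m, n) design matrix whose first column is all ones (intercept term).
    y: (m,) target vector.
    Every iteration uses the entire training set, as in the text.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        residuals = y - X @ theta                  # y^(i) - h_theta(x^(i)), all i at once
        theta = theta + alpha * (X.T @ residuals)  # theta_j += alpha * sum_i residual_i * x_j^(i)
    return theta

# Toy usage on synthetic data (values are illustrative, not the housing data):
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=50)  # true intercept 1, slope 2
X = np.column_stack([np.ones_like(x), x])
theta = batch_gradient_descent(X, y, alpha=0.01, num_iters=5000)
print(theta)  # should land close to [1.0, 2.0]
```

Note that each update touches every training example, which is precisely what makes a single step expensive when $m$ is large; this is the cost the next algorithm avoids.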

The above results were obtained with batch gradient descent. There is an alternative to batch gradient descent that also works very well. Consider the following algorithm:

  1. Loop {
    1. for $i = 1$ to $m$, {
      1. $\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$ (for every $j$).
    2. }
  2. }

In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent). Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if $m$ is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets $\theta$ "close" to the minimum much faster than batch gradient descent. (Note, however, that it may never "converge" to the minimum, and the parameters $\theta$ will keep oscillating around the minimum of $J(\theta)$; but in practice most of the values near the minimum will be reasonably good approximations to the true minimum. Also, while it is more common to run stochastic gradient descent as we have described it, with a fixed learning rate $\alpha$, it is possible to ensure that the parameters converge to the global minimum, rather than merely oscillate around it, by slowly letting the learning rate $\alpha$ decrease to zero as the algorithm runs.) For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
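
For comparison, here is a matching sketch of stochastic gradient descent under the same toy setup as the batch example above. The decay schedule is one arbitrary way to let $\alpha$ decrease toward zero as just discussed, not a prescribed choice.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha0, num_epochs, decay=False):
    """Fit linear regression by stochastic (incremental) gradient descent.

    The parameters are updated after each individual training example,
    so progress starts immediately rather than after a full pass over the data.
    """
    theta = np.zeros(X.shape[1])
    t = 0  # total number of updates performed so far
    for _ in range(num_epochs):
        for i in range(X.shape[0]):  # for i = 1 to m, as in the text
            t += 1
            # Optionally let the learning rate shrink toward zero so that
            # theta converges rather than oscillating around the minimum.
            alpha = alpha0 / (1.0 + t / 1000.0) if decay else alpha0
            error = y[i] - X[i] @ theta          # y^(i) - h_theta(x^(i))
            theta = theta + alpha * error * X[i]  # update every theta_j at once
    return theta

# Same synthetic setup as the batch example:
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=50)
X = np.column_stack([np.ones_like(x), x])
theta = stochastic_gradient_descent(X, y, alpha0=0.05, num_epochs=100, decay=True)
print(theta)  # should also land near [1.0, 2.0]
```

Here each epoch performs $m$ cheap updates instead of one expensive one, which is why $\theta$ typically gets close to the minimum after far less computation than batch gradient descent on large training sets.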

Source: OpenStax, Machine learning. OpenStax CNX, Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4