
Optionally, step 3 in the algorithm may also be replaced with selecting the model $M_i$ according to $\arg\min_i \hat{\varepsilon}_{S_{\mathrm{cv}}}(h_i)$, and then retraining $M_i$ on the entire training set $S$. (This is often a good idea, with one exception being learning algorithms that are very sensitive to perturbations of the initial conditions and/or data. For these methods, $M_i$ doing well on $S_{\mathrm{train}}$ does not necessarily mean it will also do well on $S_{\mathrm{cv}}$, and it might be better to forgo this retraining step.)
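As a minimal sketch of the procedure just described, the following Python code performs hold-out cross validation with the optional retraining step. The helpers `train(M, data)` and `error(h, data)`, as well as the list `models`, are hypothetical placeholders standing in for whatever learning algorithm and empirical-error computation are actually in use.

```python
import random

def holdout_cv(models, S, holdout_frac=0.3, retrain=True):
    """Select a model by hold-out cross validation.

    models : list of candidate model specifications M_1, ..., M_d
    S      : list of (x, y) training examples
    """
    S = S[:]                                 # copy so shuffling does not disturb the caller's data
    random.shuffle(S)
    n_cv = int(holdout_frac * len(S))
    S_cv, S_train = S[:n_cv], S[n_cv:]       # hold out ~30% of the data by default

    # Train each candidate model on S_train and measure its error on S_cv.
    errors = []
    for M in models:
        h = train(M, S_train)                # hypothetical training routine
        errors.append(error(h, S_cv))        # hypothetical empirical-error routine

    best = min(range(len(models)), key=lambda i: errors[i])
    if retrain:
        # Optional step: retrain the winning model on all of S.
        return train(models[best], S)
    return train(models[best], S_train)
```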

The disadvantage of using hold-out cross validation is that it “wastes” about 30% of the data. Even if we were to take the optional step of retraining the model on the entire training set, it's still as if we're trying to find a good model for a learning problem in which we had $0.7m$ training examples, rather than $m$ training examples, since we're testing models that were trained on only $0.7m$ examples each time. While this is fine if data is abundant and/or cheap, in learning problems in which data is scarce (consider a problem with $m = 20$, say), we'd like to do something better.

Here is a method, called $k$-fold cross validation, that holds out less data each time (a code sketch follows the list below):

  1. Randomly split $S$ into $k$ disjoint subsets of $m/k$ training examples each. Let's call these subsets $S_1, \ldots, S_k$.
  2. For each model $M_i$, we evaluate it as follows:
    1. For $j = 1, \ldots, k$
      1. Train the model $M_i$ on $S_1 \cup \cdots \cup S_{j-1} \cup S_{j+1} \cup \cdots \cup S_k$ (i.e., train on all the data except $S_j$) to get some hypothesis $h_{ij}$.
      2. Test the hypothesis $h_{ij}$ on $S_j$, to get $\hat{\varepsilon}_{S_j}(h_{ij})$.
    2. The estimated generalization error of model $M_i$ is then calculated as the average of the $\hat{\varepsilon}_{S_j}(h_{ij})$'s (averaged over $j$).
  3. Pick the model $M_i$ with the lowest estimated generalization error, and retrain that model on the entire training set $S$. The resulting hypothesis is then output as our final answer.
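The sketch below implements these steps, again treating `train` and `error` as hypothetical placeholders for the learning algorithm and the empirical-error computation.

```python
import random

def k_fold_cv(models, S, k=10):
    """Return the index of the model with the lowest estimated generalization
    error, together with that model retrained on all of S."""
    S = S[:]
    random.shuffle(S)
    folds = [S[j::k] for j in range(k)]           # k disjoint, roughly equal subsets

    est_errors = []
    for M in models:
        fold_errors = []
        for j in range(k):
            held_out = folds[j]
            train_set = [x for i, f in enumerate(folds) if i != j for x in f]
            h = train(M, train_set)                # train on all the data except fold j
            fold_errors.append(error(h, held_out)) # test on the held-out fold
        est_errors.append(sum(fold_errors) / k)    # average over the k folds

    best = min(range(len(models)), key=lambda i: est_errors[i])
    return best, train(models[best], S)            # retrain the winner on the entire set
```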

A typical choice for the number of folds to use here would be $k = 10$. While the fraction of data held out each time is now $1/k$ (much smaller than before), this procedure may also be more computationally expensive than hold-out cross validation, since we now need to train each model $k$ times.

While $k = 10$ is a commonly used choice, in problems in which data is really scarce, sometimes we will use the extreme choice of $k = m$ in order to leave out as little data as possible each time. In this setting, we would repeatedly train on all but one of the training examples in $S$, and test on that held-out example. The resulting $m = k$ errors are then averaged together to obtain our estimate of the generalization error of a model. This method has its own name; since we're holding out one training example at a time, this method is called leave-one-out cross validation.
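Since leave-one-out cross validation is just the $k = m$ special case, a sketch of it can simply reuse the `k_fold_cv` routine above (with the same hypothetical `train` and `error` helpers):

```python
# Leave-one-out cross validation: k-fold cross validation with k = m,
# so each fold holds exactly one training example.
def leave_one_out_cv(models, S):
    return k_fold_cv(models, S, k=len(S))
```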

Finally, even though we have described the different versions of cross validation as methods for selecting a model, they can also be used more simply to evaluate a single model or algorithm. For example, if you have implemented some learning algorithm and want to estimate how well it performs for your application (or if you have invented a novel learning algorithm and want to report in a technical paper how well it performs on various test sets), cross validation would give a reasonable way of doing so.
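As a small illustration of that simpler use, the same folding scheme can estimate the generalization error of one fixed model rather than choose among several; `train` and `error` remain hypothetical placeholders.

```python
import random

def estimate_generalization_error(M, S, k=10):
    """Estimate the generalization error of a single model M via k-fold cross validation."""
    S = S[:]
    random.shuffle(S)
    folds = [S[j::k] for j in range(k)]
    errs = []
    for j in range(k):
        train_set = [x for i, f in enumerate(folds) if i != j for x in f]
        h = train(M, train_set)          # train on everything except fold j
        errs.append(error(h, folds[j]))  # test on the held-out fold
    return sum(errs) / k
```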

Source: OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4