
Feature selection

One special and important case of model selection is called feature selection. To motivate this, imagine that you have a supervised learning problem where the number of features n is very large (perhaps n ≫ m), but you suspect that there is only a small number of features that are “relevant” to the learning task. Even if you use a simple linear classifier (such as the perceptron) over the n input features, the VC dimension of your hypothesis class would still be O(n), and thus overfitting would be a potential problem unless the training set is fairly large.

In such a setting, you can apply a feature selection algorithm to reduce the number of features. Given n features, there are 2^n possible feature subsets (since each of the n features can either be included in or excluded from the subset), and thus feature selection can be posed as a model selection problem over 2^n possible models. For large values of n, it's usually too expensive to explicitly enumerate over and compare all 2^n models, and so typically some heuristic search procedure is used to find a good feature subset. The following search procedure is called forward search:

  1. Initialize F = ∅.
  2. Repeat {
    (a) For i = 1, ..., n: if i ∉ F, let F_i = F ∪ {i}, and use some version of cross validation to evaluate the features F_i. (I.e., train your learning algorithm using only the features in F_i, and estimate its generalization error.)
    (b) Set F to be the best feature subset found in step (a).
  }
  3. Select and output the best feature subset that was evaluated during the entire search procedure.

The outer loop of the algorithm can be terminated either when F = {1, ..., n} is the set of all features, or when |F| exceeds some pre-set threshold (corresponding to the maximum number of features that you want the algorithm to consider using).
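As a concrete illustration, here is a minimal sketch of forward search in Python. It uses scikit-learn's cross_val_score to estimate generalization error; the estimator, the data arrays X and y, and the max_features cap are assumptions made for this example rather than part of the text.

    import numpy as np
    from sklearn.model_selection import cross_val_score

    def forward_search(estimator, X, y, max_features=None, cv=5):
        n = X.shape[1]
        max_features = n if max_features is None else min(max_features, n)
        F = []                              # current feature subset, initially empty
        best_overall = ([], -np.inf)        # best (subset, score) seen during the search
        while len(F) < max_features:
            best_step = (None, -np.inf)
            for i in range(n):
                if i in F:
                    continue
                Fi = F + [i]                # candidate subset F ∪ {i}
                # Cross-validation score (higher is better) for the candidate subset.
                score = cross_val_score(estimator, X[:, Fi], y, cv=cv).mean()
                if score > best_step[1]:
                    best_step = (Fi, score)
            F = best_step[0]                # greedily keep the best single-feature addition
            if best_step[1] > best_overall[1]:
                best_overall = best_step
        return best_overall

For instance, forward_search(LogisticRegression(max_iter=1000), X, y, max_features=10) (with sklearn.linear_model.LogisticRegression) would grow a subset of at most ten features and return the best subset evaluated at any point during the search.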

The algorithm described above is one instantiation of wrapper model feature selection, since it is a procedure that “wraps” around your learning algorithm, and repeatedly makes calls to the learning algorithm to evaluate how well it does using different feature subsets. Aside from forward search, other search procedures can also be used. For example, backward search starts off with F = {1, ..., n} as the set of all features, and repeatedly deletes features one at a time (evaluating single-feature deletions in a similar manner to how forward search evaluates single-feature additions) until F = ∅.
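Under the same assumptions as the forward search sketch above (a scikit-learn style estimator and numpy arrays), a corresponding sketch of backward search might look like the following; it greedily removes the single feature whose deletion yields the best cross-validation score.

    import numpy as np
    from sklearn.model_selection import cross_val_score

    def backward_search(estimator, X, y, cv=5):
        n = X.shape[1]
        F = list(range(n))                   # start with the set of all features
        best_overall = (list(F), cross_val_score(estimator, X, y, cv=cv).mean())
        while len(F) > 1:
            best_step = (None, -np.inf)
            for i in F:
                Fi = [j for j in F if j != i]    # candidate subset with feature i deleted
                score = cross_val_score(estimator, X[:, Fi], y, cv=cv).mean()
                if score > best_step[1]:
                    best_step = (Fi, score)
            F = best_step[0]                 # keep the best single-feature deletion
            if best_step[1] > best_overall[1]:
                best_overall = (list(F), best_step[1])
        return best_overall                  # best (subset, score) seen during the search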

Wrapper feature selection algorithms often work quite well, but can be computationally expensive given that they need to make many calls to the learning algorithm. Indeed, complete forward search (terminating when F = {1, ..., n}) would take about O(n^2) calls to the learning algorithm, since each pass through the outer loop evaluates up to n candidate subsets and there are n passes in total, for roughly n + (n - 1) + ... + 1 ≈ n^2/2 evaluations.

Filter feature selection methods give heuristic, but computationally much cheaper, ways of choosing a feature subset. The idea here is to compute some simple score S(i) that measures how informative each feature x_i is about the class labels y. Then, we simply pick the k features with the largest scores S(i).

One possible choice of the score would be to define S(i) to be (the absolute value of) the correlation between x_i and y, as measured on the training data. This would result in our choosing the features that are the most strongly correlated with the class labels. In practice, it is more common (particularly for discrete-valued features x_i) to choose S(i) to be the mutual information MI(x_i, y) between x_i and y:

MI(x_i, y) = Σ_{x_i} Σ_{y} p(x_i, y) log [ p(x_i, y) / ( p(x_i) p(y) ) ],

where the sums are over the possible values of x_i and y, and the probabilities p(x_i, y), p(x_i) and p(y) can be estimated according to their empirical distributions on the training set.
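As an illustrative sketch (assuming discrete-valued features stored in a numpy array X of shape m x n and labels y; the names are chosen for this example), the mutual-information score and the top-k selection could be computed as follows, with the probabilities estimated from empirical counts on the training set.

    import numpy as np

    def mutual_information(xi, y):
        # Empirical mutual information between one discrete feature column and the labels.
        mi = 0.0
        for a in np.unique(xi):
            for b in np.unique(y):
                p_xy = np.mean((xi == a) & (y == b))   # empirical p(x_i = a, y = b)
                p_x = np.mean(xi == a)                 # empirical p(x_i = a)
                p_y = np.mean(y == b)                  # empirical p(y = b)
                if p_xy > 0:
                    mi += p_xy * np.log(p_xy / (p_x * p_y))
        return mi

    def top_k_features(X, y, k):
        # Score every feature, then keep the indices of the k highest-scoring ones.
        scores = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
        return np.argsort(scores)[::-1][:k]

In practice, library routines such as scikit-learn's mutual_info_classif compute similar scores without writing the estimator by hand.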

Source: OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4