Introduction to feature selection (part 2)

This post continues the previous one about feature selection. I now give some examples of wrapper-based techniques for feature selection.

In wrapper techniques, using the classification algorithm as a black box, any search strategy can be used in combination. This makes wrapper approaches universal. The accuracy of the classification algorithm may be used as the objective function of the search strategy. As with any classification algorithm, wrapper feature selection techniques face the overfitting problem that may happen while training. One way to reduce the overfitting problem is to use a k-fold cross-validation strategy. Correlation coefficient and mutual information are cheaper to evaluate than cross-validation. However, when the number of features is high, they become difficult to evaluate.

Several search algorithms that select a subset of m features out of d exist in the literature. Although exhaustive search is guaranteed to find the optimal feature subset, it is not feasible when d is not small. The branch and bound procedure explores only a part of all possible feature combinations. It is guaranteed to find the optimal feature subset in less time than needed for exhaustive search. Its main drawback concerns the monotony assumption of the feature selection objective function. That is, if a variable is added to the feature set, it should never decrease the value of the objective function, which is not always the case.

Individual ranking procedures can be named naive methods. The idea is to individually rank each feature at a time, according to its prediction power. This technique is valid only if every feature is independent, which is usually not the case in practice. Sequential forward selection (SFS) starts with a single feature and iteratively add a feature to increase the classification criteria. Sequential Backward Selection (SBS) starts with all features and iteratively remove a single feature to increase the classification accuracy. Although combination of features are taken into account with this technique, a high number of computations are necessary since it starts with the set of all features. This may not be feasible for very high dimensional data set. The plus l take away r method combines the former two techniques by first applying SFS for l features and then SBS for r features. Although this method seems promising, a value for l and r have to be given by the user. Sequential Forward Floating Search (SFFS) and Sequential Backward Floating Search (SBFS) are generalization of the two former methods. In this case, l and r are determined automatically.

I hope these two posts have introduced you to feature selection and more specifically to wrapper-based techniques. Feel free to comment.

(1) François, D., High-dimensional data analysis: optimal metrics and feature selection, Université catholique de Louvain, Louvain-la-Neuve, Belgium, 2007.
(2) Reunanen, J., Overfitting in Making Comparisons Between Variable Selection Methods, Journal of Machine Learning Research, 2003, 3, 1371-1382.
(3) Jain, A.K., Duin, R.P.W. and Mao, J., Statistical Pattern Recognition: A Review, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22, 1, 4-37.

Share

Comments

6 Comments on Introduction to feature selection (part 2)

  1. Jeff Zanooda on Fri, 12th Oct 2007 5:06 am
  2. For sake of completeness it might be worth mentioning known problems associated with popular stepwise feature selection methods, and alternatives such as PLS and LARS/Lasso.

  3. Sandro Saitta on Fri, 12th Oct 2007 4:51 pm
  4. Thanks for the link Jeff! The discussion seems really interesting and relevant. I will read it soon.

  5. damien francois (Intelligent machines) on Tue, 19th Feb 2008 9:02 am
  6. Hey Sandro, thanks for citing my thesis as a reference :) !

  7. Sandro Saitta on Wed, 20th Feb 2008 8:59 am
  8. Damien,

    I also refer your work in my thesis since I found it relevant (in particular the discussion of the Euclidean distance in high-dimensional spaces).

  9. John Aitchison on Thu, 21st Feb 2008 8:14 am
  10. Hi Sandro

    You might also want to alert your readers to the possibility of using Bayesian Model Averaging .. including procedure bicreg.

    A good example is here
    http://www.r-project.org/doc/Rnews/Rnews_2005-2.pdf

    regards

    John Aitchison

  11. Sandro Saitta on Thu, 21st Feb 2008 5:37 pm
  12. Hi John,

    Thanks for the link! I will have a look at this Bayesian Model Averaging. And I just saw that you’re back in the blogosphere! Cool!

Tell me what you're thinking...





  • Swiss Association for Analytics

  • Most Popular Posts

  • T-shirts, Mugs & Mousepads


    All benefits given to a charity association
  • Data Mining Search Engine

    Supported by AnalyticBridge

  • Archives

  • Reading Recommandations