Predictive Analytics Toolbox

November 29, 2012 by Sandro Saitta
Filed under: Uncategorized 

Over the years, hundreds of predictive techniques have been developed. Each has its advantages and drawbacks; there is no single best choice that covers all data mining applications.

Usually, the choice of the predictive algorithm depends on one (or more) of the following factors:

  • Accuracy in the current context (after trial and error compared with other algorithms)
  • Best practices from the literature (for example, Support Vector Machines are known to be efficient for image recognition)
  • Data miner preferences and knowledge (a data miner may prefer neural networks because they have used them more often or understand them better)

As is the case for many data miners, I have a predictive algorithm toolbox that allows me to solve 80% of the problems I encounter. It is composed of the following (a short Python sketch follows the list):

  • Pearson’s correlation, to better understand linear relationships among data
  • Decision tree, a powerful method providing readable models
  • Support Vector Machine (SVM), a kernel method quite robust to overfitting
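To make this concrete, here is a minimal sketch of the three tools in Python with SciPy and scikit-learn. The synthetic dataset and every parameter value are illustrative assumptions, not recommendations:

```python
# Minimal sketch of the three-tool toolbox (illustrative parameters only).
from scipy.stats import pearsonr
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1. Pearson's correlation: linear relationship between each feature and the target
for i in range(X.shape[1]):
    r, p = pearsonr(X[:, i], y)
    print(f"feature {i}: r = {r:+.2f} (p = {p:.3f})")

# 2. Decision tree: a readable model (inspect it with sklearn.tree.export_text)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
print("tree accuracy:", cross_val_score(tree, X, y, cv=5).mean())

# 3. SVM: a kernel method, fairly robust to overfitting with a sensible C
svm = SVC(kernel="rbf", C=1.0)
print("SVM accuracy:", cross_val_score(svm, X, y, cv=5).mean())
```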

What are your preferred techniques for predictive analytics? What is your personal toolbox?


Comments

3 Comments on Predictive Analytics Toolbox

  1. Sébastien Derivaux on Thu, 29th Nov 2012 7:45 pm

    I’m a big fan of decision trees and linear regression. If that’s not enough, I use a hand-made mix of genetic algorithms and ensemble classifiers.

    I still don’t get SVM, especially how you find the kernel to use. I love the idea of avoiding overfitting, but what’s the point if you then introduce a soft margin and a space transformation that destroy this nice feature? When I see the number of illustrations that use 4 points as support vectors, I think few people understand the basics of SVM (on Wikipedia, the French version uses 4 points, the German probably 4 as well; only the English one is correct with 3 points).
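One practical answer to the kernel question raised above is to treat the kernel, and the soft-margin parameter C, as hyper-parameters and pick them by cross-validation. A minimal scikit-learn sketch, with an entirely illustrative grid:

```python
# Kernel selection by cross-validation: the kernel and the soft-margin
# parameter C are searched over like any other hyper-parameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

param_grid = {
    "kernel": ["linear", "poly", "rbf"],  # candidate kernels (illustrative)
    "C": [0.1, 1.0, 10.0],                # soft-margin trade-off
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```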

  2. Jeff on Fri, 30th Nov 2012 8:04 pm

    Sandro,

    My favorite “hammer” for classification and regression problems is gradient boosted regression trees. Seamless handling of missing values, mixed-type and potentially correlated predictors, high accuracy, variable importance measures, partial dependence plots to understand average marginal effects of the inputs, etc., make this my first go-to algorithm in the toolbox.

    Generally, I also like ensembles of different classifier types (hybrid ensembles).
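As a rough illustration of the workflow Jeff describes (his comment names no implementation; scikit-learn's HistGradientBoostingRegressor is used here as a stand-in because it accepts missing values natively):

```python
# Gradient boosted trees on data with missing values, plus variable
# importance and a partial-dependence query (all parameters illustrative).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence, permutation_importance

X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape[0]) < 0.1, 1] = np.nan  # knock out 10% of feature 1

model = HistGradientBoostingRegressor(random_state=0).fit(X, y)

# Variable importance via permutation: drop in score when a feature is shuffled
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print("importances:", imp.importances_mean.round(3))

# Average marginal effect of feature 0 on the prediction
pd_result = partial_dependence(model, X, features=[0])
print("partial dependence (feature 0):", pd_result["average"].round(1))
```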

  3. Allan Engelhardt on Fri, 30th Nov 2012 10:26 pm

    Another vote for gradient boosted tree ensembles as the first call for many problems, for the reasons @Jeff mentioned. (I most often use the gbm package in R.)

    After that, mixed ensembles, often using the caret package, which provides a reasonably consistent interface to >140 different models with all the bagging, cross-validation/bootstrapping, and parallel computing already taken care of (a rough Python analogue is sketched after this comment).

    Sounds like Jeff and I should set up an echo chamber somewhere.

    Great post, Sandro
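For readers who don't use R, here is a rough Python analogue of the caret workflow Allan describes: one consistent fit/evaluate interface looped over several model families. The model choices are illustrative, and caret itself covers far more ground:

```python
# A caret-like loop in Python: one uniform interface over several model
# families, with cross-validation folds run in parallel.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)  # folds in parallel
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```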
