Predictive Analytics Toolbox

Over the years, hundreds of predictive techniques have been developed. Each has its advantages and drawbacks, and no single best choice covers all data mining applications.

Usually, the choice of the predictive algorithm depends on one (or more) of the following factors:

  • Accuracy in the current context (after trial and error, compared to other algorithms)
  • Best practices from the literature (for example, Support Vector Machines are known to be efficient for image recognition)
  • Data miner preferences and knowledge (a data miner may prefer neural networks because they have used them more often or understand them better)

As is the case for many data miners, I have a predictive algorithm toolbox that lets me solve 80% of the problems I encounter. It consists of:

  • Pearson’s correlation, to better understand linear relationships among data
  • Decision tree, a powerful method providing readable models
  • Support Vector Machine (SVM), a kernel method quite robust to overfitting
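As a sketch of how this toolbox might look in practice (using Python and scikit-learn; the dataset and parameters below are illustrative assumptions, not from the post):

```python
from scipy.stats import pearsonr
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any tabular classification task would do.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Pearson's correlation: linear relationship of each feature with the target.
correlations = [pearsonr(X_train[:, j], y_train)[0] for j in range(X_train.shape[1])]

# 2. Decision tree: a readable model (inspect it with sklearn.tree.export_text).
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# 3. SVM with an RBF kernel: a kernel method fairly robust to overfitting.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("tree accuracy:", tree.score(X_test, y_test))
print("svm accuracy:", svm.score(X_test, y_test))
```

The three steps mirror the list above: the correlations guide feature understanding, the shallow tree gives a readable model, and the SVM serves as the stronger black-box learner.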

What are your preferred techniques for predictive analytics? What is your personal toolbox?

3 comments found on “Predictive Analytics Toolbox”

  1. I’m a big fan of decision trees and linear regression. If that’s not enough, I use a hand-made mix of genetic algorithms and ensemble classifiers.

    I still don’t get SVM, especially how you find the kernel to use. I love the idea of avoiding overfitting, but what’s the point if you then introduce a soft margin and a space transformation that destroy this nice feature? When I see the number of illustrations that use 4 points as support vectors, I think few people understand the basics of SVM (on Wikipedia, the French version uses 4 points, the German probably 4; only the English one is correct with 3 points).

  2. Sandro,

    My favorite “hammer” for classification and regression problems is gradient boosted regression trees. Seamless handling of missing values, mixed-type and potentially correlated predictors, high accuracy, variable importance measures, partial dependence plots to understand the average marginal effects of the inputs, etc. make this my first go-to algorithm in the toolbox.

    Generally, I also like ensembles of different classifier types (hybrid ensembles).

  3. Another vote for gradient boosted tree ensembles as the first call for many problems, for the reasons @Jeff mentioned. (I most often use the gbm package in R.)

    After that, mixed ensembles. I often use the caret package, which provides a reasonably consistent interface to more than 140 different models, with all the bagging, cross-validation/bootstrapping, and parallel computing already taken care of.

    Sounds like Jeff and I should set up an echo chamber somewhere.

    Great post, Sandro.

Comments are closed.