Readability of Decision Trees

November 25, 2008
Filed under: decision tree, Readability, understandability

One of the most often cited advantages of decision trees is their readability. Several data miners (myself included) justify the use of this technique by pointing out that the resulting model is quite easy to understand (no black box). However, certain issues can make decision trees unreadable.

First, there is normalization (or standardization). In most projects, data have to be normalized before building the decision tree. As a consequence, once you plot the tree, the split values are meaningless in the original units. Of course, you can map the values back to the original scale, but it is an extra step that has to be done.
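To make this concrete, here is a minimal sketch (with a made-up age feature) of how a split threshold learned on standardized data can be mapped back to the original units:

```python
import numpy as np

# Hypothetical feature: customer ages, standardized before tree induction
ages = np.array([18, 25, 32, 41, 58, 63], dtype=float)
mu, sigma = ages.mean(), ages.std()
z = (ages - mu) / sigma  # the values the tree actually sees

# A split learned on standardized data, e.g. "z <= 0.8", means nothing
# to a domain expert until mapped back to the original scale:
threshold_z = 0.8
threshold_original = threshold_z * sigma + mu
print(f"z <= {threshold_z}  corresponds to  age <= {threshold_original:.1f}")
```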

Second is the number of trees. In the project I run at my job, I can have 100 or more decision trees per month (see this post for more details). It is clearly impossible to read all these trees, even if each one is understandable on its own. The same happens with random forests: when 1000 trees vote for a given class, how can one understand the process (or rules) that produces the class output?
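The vote itself is trivial to compute; it is the explanation that gets lost. A toy illustration (class names invented):

```python
from collections import Counter

# 1000 trees each cast a vote; the forest outputs the majority class,
# but the rules behind those 1000 individual votes remain opaque.
votes = ["churn"] * 612 + ["stay"] * 388
prediction, count = Counter(votes).most_common(1)[0]
print(prediction, count)  # the winning class, not an explanation
```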

Decision trees still have a lot of advantages. However, the “readability” advantage must be taken with care: it may be real in some applications, but it can often be a mirage.



11 Comments on Readability of Decision Trees

  1. Tim Manns on Fri, 28th Nov 2008 6:10 am
  2. Hi Sandro,

    I have a few thoughts.

    a) Hey, this bit;
    “In most projects, data have to be normalized before using decision tree. Therefore, once you plot the tree, values are meaningless.”
    – I reckon not necessarily true!

    Depends on your normalisation. You can normalise your data with meaning!

    I like binning into 100 buckets, each with the same number of occurrences (say, customer rows). I do this for a few reasons, one being that I can then report customers as being in the top 5% of buckets, etc. It is also an easy and fast way to rescale lots of data in SQL. CART or C5.0 models using this type of normalised data are actually quite easy to make sense of (e.g., “if stock price is above the 70% bucket”).

    b) Random forest doesn’t work well with big datasets (millions of rows). I use fairly easy CART or C5.0. Sometimes I build a handful of models on subset samples and average the models, but I’m not convinced hundreds of models is the best way to go. I always take time creating new derived ‘information rich’ columns and using these as additional inputs to a decision tree or neural net.

    I agree with the problems you describe and, for those reasons you mention, I don’t follow the steps you describe. Maybe I’m jaded, but I believe Random Forests is a classic example of mad academia over practicality (and yes, I know that’s controversial considering the brilliant guy who created random forests…).

    – Tim
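A quick sketch of the equal-frequency binning Tim describes, using pandas (the price data here is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=10_000))

# Equal-frequency binning: 100 buckets with the same number of rows each.
# Bucket labels 1..100 double as percentiles, so a tree split such as
# "price_bucket > 70" reads directly as "above the 70th percentile".
buckets = pd.qcut(prices, q=100, labels=range(1, 101)).astype(int)
print(buckets.value_counts().min(), buckets.value_counts().max())
```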

  3. Sandro Saitta on Fri, 28th Nov 2008 2:31 pm
  4. Hi Tim,

    Thanks for your comment!

    1) Binning the data into buckets is a nice way to avoid this “unreadability” problem. I have always used normalization or standardization, but never binning. Also, the fact that you have the same number of occurrences in each bin avoids the issue of outliers.

    2) Regarding random forests, I definitely agree on the issues when using this technique (that’s why I don’t use random forests). However, I really like the concept of several models voting for the output class.

  5. Shane Butler on Sun, 30th Nov 2008 11:30 pm
  6. I think random forests are both useful and powerful… with some caveats… you need lots of memory for big data (so not practical for all tasks), and readability is also a problem. Rattle solves the random forest readability issue by producing an importance chart.
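Rattle is R-based, but the same aggregate importance view exists elsewhere; as a rough sketch, scikit-learn's random forest exposes one importance score per feature, which summarizes 100 trees in a single ranked list:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# One number per feature summarizes the whole forest -- far more
# digestible than trying to read 100 trees individually.
ranked = sorted(zip(forest.feature_importances_, data.feature_names),
                reverse=True)
for score, name in ranked[:5]:
    print(f"{name}: {score:.3f}")
```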

  7. Sandro Saitta on Mon, 1st Dec 2008 5:12 pm
  8. Thanks for your contribution Shane!

  9. James Pearce on Thu, 8th Jan 2009 1:07 am
  10. Random forests have the advantage of opening up problems that a single decision tree might not deal with well, such as when a class of interest is relatively rare. I prefer boosted trees for this situation, though.

  11. Sandro Saitta on Thu, 8th Jan 2009 5:05 pm
  12. Thanks for your comment James!

  13. Lucian on Tue, 24th Nov 2009 11:17 am
  14. Hello Sandro,

    could you point me to some basis regarding the necessity of normalizing for decision trees? Or how could it be accomplished? I know about the need to normalize data for some neural networks, but I thought this step is not required for decision trees…

  15. Sandro Saitta on Sat, 28th Nov 2009 11:44 am
  16. @Lucian: Thanks for your comment. As written by Tim, decision trees don’t really need normalization (I think because, with entropy, you work on probabilities). However, I usually prefer to work with normalized data. It is also safer if you later decide to use neural networks instead of decision trees.

    If, like me, you prefer to work with normalized data, then you can simply do a normalization or standardization step before using any data mining algorithm. You can see this post for more details.

  17. Daniel on Sun, 25th Sep 2011 4:37 am
  18. @Sandro: Sir, I have a doubt… why should we still prefer decision trees, even though many advanced techniques have been invented? What is the advantage of using decision trees for uncertain data?

  19. Sandro Saitta on Mon, 3rd Oct 2011 8:18 am
  20. @daniel: There are several advantages. Among them I see readability, easy interpretation of results, implicit feature selection and very little data preprocessing needed. Of course, it depends on the application needs.

  21. vey on Fri, 27th Jul 2012 6:21 am
  22. Hi all, I plan to implement random forests on imbalanced data for my final project in college. From several papers I’ve read, weighted random forest is good enough for handling imbalanced data.
    I read here that random forests are not good enough…
    Can I have the reasons for this? And what about weighted random forest, which implements cost-sensitive learning for handling imbalanced classes?
    I’ll appreciate any ideas 🙂
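One simple cost-sensitive variant to experiment with: a sketch on synthetic data, using scikit-learn's class_weight option rather than the full weighted-random-forest algorithm from the papers vey mentions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic dataset with a ~5% minority class
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight="balanced",
                                  random_state=0).fit(X_tr, y_tr)

# Misclassifying the rare class now costs more, which typically raises
# minority-class recall (often trading away some precision).
r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall plain={r_plain:.2f}  weighted={r_weighted:.2f}")
```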
