The data mining effect

Yes, data mining can have an effect. And I’m not thinking about the effect you feel when you do the most exciting job in the world. In this post, I want to discuss the effect of applying data mining in an iterative way (call it the data mining bias if you prefer). Let me explain this topic with a concrete example.

Think about market basket analysis (association rule mining). Imagine a case study that could happen at Amazon (to cite the best-known example). We collect transactions made by customers. We build a model to suggest other books you may purchase. One month later, we run the same process again. However, the data collected this time was biased by the previous model. After several iterations, we may miss important associations if customers mainly buy what they are recommended.
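To make the loop concrete, here is a minimal sketch in Python. It is not how Amazon or any real recommender works. The books A, B and C, the starting counts, and the 6/4 split between customers who follow the recommendation and those who choose on their own are all made-up assumptions. Each round re-mines the pairwise rule confidence on data that the previous round's recommendation helped produce.

```python
from collections import Counter
from itertools import combinations


def rule_confidence(baskets):
    """Confidence of every pairwise rule a -> b, i.e. P(b in basket | a in basket)."""
    item_counts, pair_counts = Counter(), Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for a, b in combinations(items, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return {pair: n / item_counts[pair[0]] for pair, n in pair_counts.items()}


def simulate(rounds, followers=6, independents=4):
    """Re-mine the rules each round on data shaped by the previous round's model."""
    # Hypothetical starting data: book A bought with B six times, with C four times.
    baskets = [("A", "B")] * 6 + [("A", "C")] * 4
    history = []
    for _ in range(rounds):
        conf = rule_confidence(baskets)
        history.append(conf[("A", "C")])
        # The model recommends whichever companion of A has the higher confidence.
        recommended = max(("B", "C"), key=lambda item: conf[("A", item)])
        # Followers buy the recommendation; independents split evenly between B and C.
        baskets += [("A", recommended)] * followers
        baskets += [("A", "B")] * (independents // 2)
        baskets += [("A", "C")] * (independents - independents // 2)
    return history


if __name__ == "__main__":
    for round_number, confidence in enumerate(simulate(5)):
        print(round_number, round(confidence, 3))
```

With these assumed numbers, customers who choose on their own still pick C half of the time, yet the mined confidence of A -> C starts at 0.4 and keeps falling toward 0.2, because every round adds eight (A, B) baskets and only two (A, C) baskets.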

From another perspective, one may think that data mining (particularly in this case) limits the choices of the user. I have already heard this argument from detractors of data mining. On the other hand, recommendations also increase your chances of finding the right book for you.

What do you think of this issue? First, is it an issue? Is the data mining effect real and problematic? Can we add some randomness in the process to avoid it? This post is aimed at opening the discussion about this issue (or pointing to relevant literature). So, feel free to comment and give your opinion!
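On the randomness question, one simple option (an assumption on my side, not a description of any production system) is an epsilon-greedy rule: most of the time show the model's pick, and with a small probability `epsilon` show a random item from the catalog instead. The function name, the `catalog` list and the default of 0.1 are hypothetical.

```python
import random


def recommend_with_exploration(top_pick, catalog, epsilon=0.1, rng=random):
    """Return (item, explored).

    With probability epsilon the item is drawn at random from the catalog
    (exploration); otherwise it is the model's own top pick (exploitation).
    """
    if rng.random() < epsilon:
        return rng.choice(catalog), True
    return top_pick, False


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    catalog = ["B", "C", "D"]
    shown = [recommend_with_exploration("B", catalog, 0.2, rng) for _ in range(10)]
    print(shown)
```

Keeping the `explored` flag alongside each recommendation means purchases that followed a random suggestion can be identified later, so the next model could be fitted or checked on data that the previous model did not shape.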

13 comments found on “The data mining effect”

  1. This is a pretty interesting topic that I have never really thought about before. It seems like a paradox at first glance.

    I am interested in hearing what others have to say about this topic.

  2. 1) Is it a problem? I think yes. 😉

    2) Why? Because the use of transaction data and market basket analysis more or less generates “stupid” recommendations.

    3) Why stupid? Because a customer represents more than the sum of its transactions.

    4) The solution? Combination of different analytical methods and the consideration of more aspects than just the transaction data.

  3. Unless recommendation / next best product models built by others are a lot better than mine :-), it is unlikely that enough customers will convert to actually alter the population distribution. A control group is always recommended to gauge a model's effectiveness. If this is a concern, one could make the control group large enough and then rebuild/fit new models on that partition to make sure there is not a self-fulfilling bias at play. Part of me wants to say, so what? If we can cross-sell that effectively, do we care?

  4. A big part of the “effectiveness” of the tools resides in the marketing realm, where the question is whether the process of performing the data mining and making recommendations drives the customers to make an additional purchase in the future. I agree with matze’s comment that incorporation of additional analytical methods could yield richer prediction models.

    I find it humorous when Amazon throws in book recommendations based on my wife having used my account once to pull a book for her studies, a purchase that otherwise stands as a pretty wild outlier. On the other hand, if she were to pull up my account again, she would probably find those recommendations of use.

    The end result in that scenario is that when I look at the recommendations, they would not drive me to select a book related to her previous purchase. But they might guide her to a useful new book to consider, and as such the process would have been successful.

  5. @Rick – Or it could be used in choosing a book as a gift for your wife. I certainly have picked mine a couple from the Amazon recommendations based on what she’s looked at.

    @Jeff – Yes. A control group is needed to gauge the model’s effectiveness, but also to stop future models being skewed by the % who convert. (Even a small % could throw the numbers off, and in Amazon’s case a small % is probably a large number.)
    How large should a control group be, and is losing potential sales by having one worth the gain of an unbiased recommendation model?

  6. That some customers convert and actually buy products based on recommendations can be thought of as positive feedback. It means that the recommendation was of use and that it is likely to be of use to others as well. This bias then contributes to improving the recommendations. Market Basket Analysis was given as an example, but Amazon (and others) use more advanced algorithms that include such implicit and/or explicit feedback mechanisms.
    So, for me this is not an issue.
    An interesting trend, though, as you said, is to include randomness, or “diversity”, in recommenders to avoid monotony.

  7. First, thanks to all of you for your interesting comments!

    @matze: I particularly agree with point 3) although I would not use the word “stupid” 🙂

    @Jeff: Good point that the few conversions should not affect the data and thus the next model. However, customer transactions are just an example. We could apply association rules to a case where the recommended actions will be taken 80% of the time.

    @Rick: It seems that sharing an Amazon account is a good way to get ideas for presents 🙂

    @JEdward: In Behavioral Targeting (BT), I was using control groups of 10%. If you use more, advertisers will not be happy (since you show your ad to random visitors). However, 10% is very low in click counts and thus the numbers are rarely statistically significant.

    @Omar: I guess Amazon is already using this randomness to avoid a “local optimum” and generate diversity.

  8. In terms of randomness, how about doing as Wikipedia does and having a “random” link to a product, to encourage curiosity: “hmm, what will I get if I click on this?” Having that as well as recommendations.

  9. It can be a problem if you apply it carelessly to recommender systems. There are solutions though. Holland proved that genetic algorithms optimally allocate computing resources between search space exploration and exploitation if you can’t make any assumptions about the search space. In essence this adds a random element to the recommendations made.

    Collaborative filtering (as used by Amazon) also allows exploration of a larger space, since its search space is very big and people usually have multiple interests.

  10. @hakan: Thanks for this information. I think that one of the future directions of data mining is to use a combination of existing techniques to solve a given problem.
