Finding Interests of Visitors through Data Mining

In a previous post, I mentioned one of my current projects at FinScore. In this post, I will discuss another possibility for online targeting that opens up when customer data and web profiles are merged (for the customers of a given company).

First, I briefly define an interest group (IG) as a set of visitors with the same interests. Each URL of a given website can be mapped to an interest. Examples of interests are Auto, Lifestyle, Entertainment, Sport, etc. A visitor can belong to zero, one or several interest groups at the same time.
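As an illustration of this definition, here is a minimal Python sketch that maps URLs to interests and derives the interest groups of each visitor. The URL prefixes, the `min_views` parameter and the function names are assumptions made for the example; the real mapping is specific to each website.

```python
from collections import defaultdict

# Hypothetical URL-prefix-to-interest mapping (an assumption for this sketch).
URL_PREFIX_TO_INTEREST = {
    "/auto/": "Auto",
    "/lifestyle/": "Lifestyle",
    "/entertainment/": "Entertainment",
    "/sport/": "Sport",
}


def interest_of(url):
    """Return the interest for a URL path, or None if it maps to no interest."""
    for prefix, interest in URL_PREFIX_TO_INTEREST.items():
        if url.startswith(prefix):
            return interest
    return None


def interest_groups(page_views, min_views=1):
    """Map each visitor to the set of interests seen at least `min_views` times.

    `page_views` is an iterable of (visitor_id, url) pairs.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for visitor_id, url in page_views:
        counts[visitor_id]  # register the visitor even if no page is mapped
        interest = interest_of(url)
        if interest is not None:
            counts[visitor_id][interest] += 1
    return {
        visitor: {ig for ig, n in per_ig.items() if n >= min_views}
        for visitor, per_ig in counts.items()
    }


views = [
    ("v1", "/sport/football/results"),
    ("v1", "/auto/reviews/suv"),
    ("v2", "/about"),
]
groups = interest_groups(views)
# v1 belongs to two interest groups, v2 to none.
```

Using a set per visitor reflects the fact that a visitor can be in zero, one or several interest groups.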

Here is how the process works. First, a set of identified visitors (1) is tracked on the website. Since the pages they visit are recorded, one can deduce their IG. We build a model that takes the CRM data of these visitors as input and their IG as output (one binary variable per IG). In our case, this model is trained using a decision tree algorithm. All clients of the company, including the ones that have never been on the website, are then scored using the resulting model. It is thus possible to infer their IG before they visit the website.
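The process above can be sketched in a few lines of Python. This is only a toy illustration and not our production setup: the tiny decision tree is hand-written to keep the example free of dependencies, and the CRM features (age, car ownership), the data and all function names are assumptions. One tree is trained per IG, and each leaf returns the share of positive examples as the score.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_impurity = None, gini(labels)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if impurity < best_impurity:
                best, best_impurity = (f, threshold), impurity
    return best


def build_tree(rows, labels, max_depth=3):
    """Build a nested-dict tree; leaves store the share of positives as score."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return {"score": sum(labels) / len(labels)}
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    }


def score(tree, row):
    """Walk the tree down to a leaf and return its score."""
    while "score" not in tree:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["score"]


# Made-up CRM data of identified visitors: [age, owns_car].
identified_crm = [[25, 0], [30, 1], [45, 1], [60, 0], [22, 0], [55, 1]]
# Made-up IG labels deduced from their web behavior (one binary variable per IG).
identified_ig = {
    "Sport": [1, 1, 0, 0, 1, 0],
    "Auto": [0, 1, 1, 0, 0, 1],
}

# Train one tree per IG, then score clients who never visited the website.
models = {ig: build_tree(identified_crm, y) for ig, y in identified_ig.items()}
new_clients = {"c1": [28, 0], "c2": [50, 1]}
scores = {
    client: {ig: score(tree, row) for ig, tree in models.items()}
    for client, row in new_clients.items()
}
```

Training a separate binary model per IG keeps the groups independent, so a client can receive a high score for several IGs at once.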

One possible use of such a score is to decide which ad, content or product to show to a new visitor (who is already a client of the company). It is also possible to use this IG information with another channel such as mail, e-mail or phone in a 1-to-1 marketing context. For example, clients with a high score for the Sport IG may be contacted about a sport product.
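To make the first use case concrete, here is a small hypothetical Python sketch of picking an ad from the IG scores of a client. The function name, the ad identifiers and the 0.5 threshold are assumptions for the example, not values from our project.

```python
def choose_ad(ig_scores, ads_by_ig, default_ad, min_score=0.5):
    """Pick the ad of the highest-scoring IG, or a default ad.

    The default is used when no score reaches `min_score` or when no ad
    exists for the best IG.
    """
    if not ig_scores:
        return default_ad
    best_ig = max(ig_scores, key=ig_scores.get)
    if ig_scores[best_ig] < min_score or best_ig not in ads_by_ig:
        return default_ad
    return ads_by_ig[best_ig]


# Hypothetical ad inventory, one ad per IG.
ads = {"Sport": "running-shoes-banner", "Auto": "car-insurance-banner"}

ad_for_sporty_client = choose_ad({"Sport": 0.8, "Auto": 0.3}, ads, "generic-banner")
ad_for_unclear_client = choose_ad({"Sport": 0.2, "Auto": 0.1}, ads, "generic-banner")
```

The same scores could feed a mailing or call list instead, for example by selecting all clients whose Sport score is above the threshold.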

So, what do you think of this approach? Have you tried similar approaches or completely different ones? I’m looking forward to reading your comments.

(1) “Identified” means that these visitors are also clients of the company. We can thus merge CRM data with web behavior, for example in a “log in” area.


2 comments on “Finding Interests of Visitors through Data Mining”

  1. Hello Sandro

    Interesting project. If I understand you correctly, you build a flat, static model based on features that are independent of the behavior on the website. The target variable is then deduced from the sites visited, so that you can predict the interests of other visitors as long as they a) are logged in and b) are already clients. As you said, they do not even have to visit the site to get scored.

    So far, this strategy sounds good to me 🙂

    I guess the hardest part is to calculate the label. Merging visited sites, time spent on page and maybe the multiple categories of a site into one crisp value is, well, hard.

    Even harder is to identify how interested a visitor who is not logged in is in certain categories. Now we are leaving the area of flat models, and now it is getting interesting… Any plans in this direction?

    Another point is that the categories may have a hierarchy (do they?). I recently experienced that building models for hierarchical categories is really non-trivial. Any thoughts on this issue?

    kind regards,


    PS: More posts like that. I like them technical 😀

  2. @Steffen: Thanks for your comment. Your explanations really correspond to what we are doing (I’m happy that you got it).

    Just a few points. First, there is only one website where we gather the user behavior. And yes, we predict interests for people who are clients AND whom we can identify (either by logging in or by other means). This identification allows us to make the link between online data (web profile) and offline data (CRM).

    For us, the hardest part is to aggregate the raw data from the web logs (several GB per day) into a web profile for each user.

    When confronted with unidentified visitors, we find their interest groups after they have visited 1 or 2 pages on the website, using human-defined rules. But of course, there may be other ways of doing this.

    We have the following levels: URL -> Categories -> Interest Groups. Even if categories may have hierarchies, we don’t explicitly use this concept in our solution for the moment.

    I’m planning to write more about these topics in the coming weeks 🙂
