Data Mining Research: Could you introduce yourself and explain how you entered the field of data mining?
Vincent Etter: I’m a PhD student in Computer and Communication Sciences at EPFL. I did both my Bachelor's and Master's in Communication Systems at EPFL, and, after a couple of internships in the US, I started my PhD in September 2010. I did my Master's thesis in the Machine Learning department of NEC Labs, in Princeton, and have been fascinated with machine learning and data mining since then. I’ve always loved mathematics and programming, and I found that combining these two sets of skills, in order to process and understand huge amounts of data, is highly challenging and very interesting.
DMR: What is Kickstarter? How did you come up with the idea of predicting Kickstarter results?
VE: Kickstarter is a crowd-funding website: people with an idea for a project can create a campaign on the website to raise the money to realize it. The campaign has a goal (how much money the creator needs to realize the project), and a duration (for how long the campaign will run). Once the campaign has launched, people can pledge money to the project, and get various rewards in return.
A particularity of Kickstarter is that the funding model is “all or nothing”: if the goal is not reached by the time the campaign ends, the creator does not get anything. Thus, both the creator and the backers have a strong interest in knowing whether a campaign will reach its goal and succeed, or not.
I have been using Kickstarter for a couple of years now, and I noticed that projects have very different behaviors: some take off really fast and become viral, while others never get any attention. I wondered if it would be possible to predict which projects succeed before the end of their campaign, and thus started gathering data and building models to study this problem.
DMR: Which data and techniques do you use in your predictions?
VE: I monitor the “Recently Launched” page of Kickstarter to discover new projects. Once I have a project in my database, I regularly check its page to see how much money is pledged, and how many people pledged money. In parallel, I monitor Twitter and record all tweets containing a link to a project.
This means that for each project, I have the time series of the number of backers and the amounts of pledged money, from the beginning of the campaign to its end, as well as all tweets mentioning the project. I then combine a mathematical model (Markov chains) with standard Machine Learning algorithms (k-nearest neighbors and support vector machines) to predict if a project will succeed or not, based on this information.
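As a rough illustration of the k-nearest-neighbors part of this approach, here is a minimal sketch in Python. It is not the code used in the study: the function names, the normalization by the campaign goal, the Euclidean distance, and the toy data are all assumptions made for the example. The idea is to compare the first samples of a running campaign's pledge series with the same portion of finished campaigns, and let the most similar ones vote.

```python
from math import sqrt


def normalize(pledged, goal):
    """Express each sampled pledge total as a fraction of the campaign goal."""
    return [amount / goal for amount in pledged]


def distance(a, b):
    """Euclidean distance between two trajectories (zip stops at the shorter one)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(history, partial, k=3):
    """Estimate the success probability from the k most similar finished campaigns.

    history: list of (trajectory, succeeded) pairs for finished campaigns
    partial: trajectory of the running campaign, observed so far
    """
    n = len(partial)
    # Compare only the first n samples of each finished campaign.
    ranked = sorted(history, key=lambda item: distance(item[0][:n], partial))
    neighbors = ranked[:k]
    return sum(1 for _, succeeded in neighbors if succeeded) / len(neighbors)


# Toy data, invented for illustration (goal of 1000 for every campaign).
history = [
    (normalize([100, 400, 900, 1200], 1000), True),
    (normalize([50, 300, 700, 1100], 1000), True),
    (normalize([10, 20, 40, 60], 1000), False),
    (normalize([0, 30, 50, 90], 1000), False),
    (normalize([200, 500, 800, 1500], 1000), True),
]

fast_start = normalize([80, 350], 1000)
slow_start = normalize([5, 25], 1000)

p_fast = knn_predict(history, fast_start, k=3)
p_slow = knn_predict(history, slow_start, k=3)
print(p_fast, p_slow)
```

In this toy data, the three nearest neighbors of the fast-starting campaign all succeeded, while only one of the three nearest neighbors of the slow-starting campaign did.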
In particular, I have two sets of models: the first uses only the time series of pledges, while the second uses social information, such as the tweets and the list of backers. I found that money-based predictors perform better than the ones using social information, but combining the two gives the best prediction accuracy, with 76% of predictions correct only 4 hours after the beginning of a campaign.
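The interview does not say how the two families of predictors are combined. One simple possibility, shown here purely as an assumption, is a weighted average of the success probabilities that each predictor outputs. The names `combine` and `accuracy`, the weight, and the toy numbers below are all hypothetical; the numbers were chosen so that each predictor errs on different campaigns.

```python
def combine(p_money, p_social, w_money=0.5):
    """Weighted average of two success probabilities (assumed combination rule)."""
    if not 0.0 <= w_money <= 1.0:
        raise ValueError("w_money must be between 0 and 1")
    return w_money * p_money + (1.0 - w_money) * p_social


def accuracy(probabilities, outcomes, threshold=0.5):
    """Fraction of campaigns whose thresholded prediction matches the outcome."""
    hits = sum((p >= threshold) == y for p, y in zip(probabilities, outcomes))
    return hits / len(outcomes)


# Toy predictor outputs for five finished campaigns, invented for illustration.
money = [0.45, 0.55, 0.90, 0.10, 0.80]
social = [0.90, 0.10, 0.30, 0.60, 0.40]
outcomes = [True, False, True, False, True]

combined = [combine(m, s) for m, s in zip(money, social)]

acc_money = accuracy(money, outcomes)
acc_social = accuracy(social, outcomes)
acc_combined = accuracy(combined, outcomes)
print(acc_money, acc_social, acc_combined)
```

Here each predictor alone misclassifies some campaigns, while the averaged score classifies all five correctly. On real data the weight would have to be tuned on held-out campaigns.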
DMR: What is the biggest challenge you faced?
VE: Gathering the data was not trivial, as it required writing custom crawlers to parse Kickstarter and constantly monitor it. Furthermore, finding the best combination of models, and tuning their different parameters, required running hundreds of thousands of experiments on our cluster. However, the models I used are pretty simple, and I am currently working on improving performance by using more elaborate algorithms.