Kickstarter Campaign Analysis


Consider the following: You have an idea and want to start a Kickstarter campaign to raise money in order to make your idea a reality. You’re a bit hesitant though because you don’t want it to end in failure.

So I did a project to see whether we can identify the factors that matter most to the success of a Kickstarter campaign, so that someone can get a sense of their chances before launching. I wanted to answer three questions:

  • What kind of trends are there with regards to success of campaigns?
  • What are the most important factors in the success of a campaign?
  • What might these factors imply?

My initial guess was that a key factor would be the duration of the campaign: you want to give yourself enough time to raise the funds, but not so much that potential backers feel no urgency when they look at it.

How much money you’re trying to raise and what type of item/service it is will also play a part.

We’ll see later if my thinking was correct or not.


For this project I got the Kickstarter campaign data from a site that scrapes and collects data from Kickstarter every month. The data goes back as far as 2015. I combined the data from all the months and removed duplicates, keeping only the most recent record for each campaign. A campaign could be in one of four states: live, cancelled, success, or failed. Since the focus of my project is whether a campaign succeeds, I limited my dataset to campaigns that had either succeeded or failed, which left me with ~170K results.
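The combine, de-duplicate, and filter steps can be sketched in pandas roughly as follows. This is a minimal sketch and not the exact code from the project: the column names `id`, `state`, and `scraped_at`, and the state labels `"successful"` and `"failed"`, are assumptions made for illustration and may not match the scraped files.

```python
import pandas as pd

def combine_snapshots(snapshots, keep_states=("successful", "failed")):
    """Stack monthly snapshots and keep the latest finished row per campaign."""
    combined = pd.concat(snapshots, ignore_index=True)
    # Sort by scrape date so the most recent record of each campaign comes last.
    combined = combined.sort_values("scraped_at")
    latest = combined.drop_duplicates(subset="id", keep="last")
    # Keep only campaigns that ended in one of the two outcomes of interest.
    finished = latest[latest["state"].isin(keep_states)]
    return finished.reset_index(drop=True)

# Two tiny hypothetical monthly snapshots (made-up rows, not real data).
jan = pd.DataFrame({
    "id": [1, 2, 3],
    "state": ["live", "failed", "live"],
    "scraped_at": pd.to_datetime(["2019-01-15"] * 3),
})
feb = pd.DataFrame({
    "id": [1, 3, 4],
    "state": ["successful", "canceled", "live"],
    "scraped_at": pd.to_datetime(["2019-02-15"] * 3),
})

campaigns = combine_snapshots([jan, feb])
print(campaigns[["id", "state"]])
```

In this toy example campaign 1 appears in both months, and only its later record is kept. Campaigns 3 and 4 are dropped because their latest state is neither of the two outcomes.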

I had a few categorical variables, such as the category and subcategory each campaign fell under, and I created dummy columns for them. I also made columns for the year and month the campaign was launched, the number of words in the blurb, and the duration of the campaign in days.
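A sketch of this kind of feature engineering in pandas is shown below. The column names (`launched_at`, `deadline`, `blurb`, `category`) and the output names are assumptions for illustration only, and the real data may store dates in a different format.

```python
import pandas as pd

def add_features(df):
    """Add launch year/month, duration, blurb length, and category dummies."""
    out = df.copy()
    launched = pd.to_datetime(out["launched_at"])
    deadline = pd.to_datetime(out["deadline"])
    out["launch_year"] = launched.dt.year
    out["launch_month"] = launched.dt.month
    # Whole days between launch and deadline.
    out["duration_days"] = (deadline - launched).dt.days
    # Word count of the blurb; a missing blurb counts as zero words.
    out["blurb_words"] = out["blurb"].fillna("").str.split().str.len()
    # One dummy (0/1) column per category value.
    return pd.get_dummies(out, columns=["category"], prefix="cat")

# Two hypothetical campaigns (made-up rows, not real data).
demo = pd.DataFrame({
    "launched_at": ["2016-03-01", "2018-11-10"],
    "deadline": ["2016-03-31", "2018-12-25"],
    "blurb": ["A card game about cats", None],
    "category": ["games", "music"],
})

features = add_features(demo)
print(features[["launch_year", "launch_month", "duration_days", "blurb_words"]])
```

Subcategories could be encoded the same way by adding the subcategory column to the `columns` list, which is where most of the extra dimensionality would come from.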

There was a 57:43 split between the classes.

This is a slight class imbalance, but not a large one, so I did not feel the need to upsample or downsample.

Certain years had a higher success rate than others; campaigns from 2010–2013 did noticeably better than those from other years.

Not too surprisingly, some categories of campaigns had a higher success rate than others, so category may give some insight into whether a campaign will be successful.

I used scikit-learn to split my data into a training set and a test set before running any models. I then ran Random Forest classification models on the data, with some tuning, both with and without the subcategory dummy columns. I wanted to see how much the results would vary, since the models without subcategories are much simpler and have lower dimensionality.
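The split-and-compare workflow can be sketched with scikit-learn along these lines. This is only an illustration on synthetic data from `make_classification`, not the Kickstarter dataset. The helper name `fit_and_score`, the 75/25 split, and the hyperparameters are assumptions, and slicing off half the columns merely stands in for dropping the subcategory dummies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def fit_and_score(X, y, seed=0):
    """Train a random forest on a held-out split and return (accuracy, F1)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return accuracy_score(y_test, pred), f1_score(y_test, pred)

# Synthetic stand-in data with a roughly 57:43 class split.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    weights=[0.43, 0.57], random_state=0,
)

full_scores = fit_and_score(X, y)             # all 20 features
reduced_scores = fit_and_score(X[:, :10], y)  # only the first 10 features
print("full:    accuracy=%.3f, F1=%.3f" % full_scores)
print("reduced: accuracy=%.3f, F1=%.3f" % reduced_scores)
```

Using the same `seed` for both runs keeps the train/test rows identical, so any difference in scores comes from the feature set and not from the split.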

I then ran some XGBoost models as well, since boosting algorithms often perform better than a random forest on this kind of data.

While both accuracy and F1 score increased when the subcategory columns were included, the increase was not large enough, in my opinion, to justify them: subcategories made up 159 of my features, more than double the remaining ones. I decided that the reduction in complexity and dimensionality was a worthwhile trade-off in this case, so I chose the XGBoost model without subcategories as my final model.

I looked at the feature importances from the model to see which features were most relevant in determining the success or failure of a campaign.

Feature importance by weight shows how many times a feature was used to split the data across the trees when predicting success.

Some features that stand out here:

  • When the campaign started and ended
  • How much they were trying to raise (goal)
  • Length (duration of the campaign)
  • The length of the blurb, which is surprisingly an important feature as well

Feature importance by gain shows how much information is gained, on average, when the trees split on a feature.

Some features that stand out here:

  • Splitting on the launch year gave a lot of gain
  • The top features were mostly categories
  • Goal (how much they were trying to raise) was also in the top 10
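To make the difference between the two measures concrete, here is a small toy sketch in plain Python. It does not use XGBoost itself: the list of splits and their gain values are invented for illustration, and the function name is hypothetical. In XGBoost, the corresponding numbers come from `Booster.get_score` with `importance_type="weight"` or `importance_type="gain"`.

```python
from collections import defaultdict

def importance_by_weight_and_gain(splits):
    """Summarise (feature, gain) pairs, one pair per split in the trees.

    weight = number of splits that used the feature
    gain   = average gain over the splits that used the feature
    """
    counts = defaultdict(int)
    total_gain = defaultdict(float)
    for feature, gain in splits:
        counts[feature] += 1
        total_gain[feature] += gain
    weight = dict(counts)
    avg_gain = {f: total_gain[f] / counts[f] for f in counts}
    return weight, avg_gain

# Invented splits: "goal" is used often with modest gains,
# "launch_year" is used once with a large gain.
splits = [
    ("goal", 2.0), ("goal", 1.0), ("goal", 3.0),
    ("launch_year", 12.0),
    ("blurb_words", 0.5), ("blurb_words", 1.5),
]

weight, gain = importance_by_weight_and_gain(splits)
print("by weight:", sorted(weight, key=weight.get, reverse=True))
print("by gain:  ", sorted(gain, key=gain.get, reverse=True))
```

In this toy example `goal` ranks first by weight while `launch_year` ranks first by gain, which is the kind of disagreement between the two rankings described above.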

Conclusion/Real World Application

  • Goal definitely plays a big role: it was an important feature both in how many times it was used in the trees and in the amount of information gained from it
  • Category, or the type of campaign, also contributes a lot of information gain, which makes sense given that certain categories did much better than others, as we saw earlier
  • Start/end date and the year the campaign was launched were important features, but they are not good for future predictions because they vary from year to year and you cannot know at the start of a year how that year will turn out

Next Steps

  • Remove certain features that don’t help with future predictions, such as start/end date, launch/end year
  • Staff pick may not be a useful feature for prediction, but it may be something to aim for, seeing as it was strongly correlated with successful campaigns
  • Make predictions on campaigns that are currently live and see how accurate they are once completed
  • Create a pipeline to be able to feed in newer data and update the model/results

Data Science student at Flatiron School