Kickstarter Campaign Analysis

5 min readAug 5, 2020

Overview

Consider the following: You have an idea and want to start a Kickstarter campaign to raise money in order to make your idea a reality. You’re a bit hesitant though because you don’t want it to end in failure.

So I did a project where I aim to see if we can determine factors that play an important role in determining the success of a Kickstarter campaign to get an idea before someone launches the campaign.

Big Questions

What kind of trends are there with regards to success of campaigns?
What are the most important factors in the success of campaign?
What might these factors imply?

Thoughts going in

A key factor would be the length/duration of the campaign because you want to give yourself enough time to raise the funds, but not so long that there is a lack of urgency when potential backers look at it.

How much money you’re trying to raise and what type of item/service it is will also play a part.

We’ll see later if my thinking was correct or not.

Process

Gathering Data:

For this project I got the Kickstarter campaign data from a site that scrapes and collects data from Kickstarter every month. This data had been collected going back as far as 2015. I combined all the data from all the months and removed duplicates, keeping only the most recent data for a campaign. There were four different states for a campaign: live, cancelled, success and failed. I limited my dataset to only the campaigns that were successful or failed as that is the focus of my project and at the end I was left with ~170K results.

Feature Engineering:

I had a few categorical variables such as categories and subcategories the campaigns fell under which I created dummy columns for. I also made columns for the year and month the campaigns were launched, how many words were in the blurb, and the duration of the campaign in days.

Preliminary EDA:

There was a slight class imbalance, but it was not very significant so I did not feel the need to up or down sample.

We can see there were certain years that had a higher success rate than others, such as the years 2010–2013 doing significantly better than other years.

Not too surprising, but some categories of campaigns had a higher success rate than others so this may give some insight into determining if a campaign will be successful or not.

Modeling:

I used Sci-Kit learn to split my data into a training set and a test set before running any models. I ran some Random Forest classification models on the data with and without the subcategory dummy columns, doing some tuning as well, seeing how varied the results would be as the models would be much simpler and have lower dimensionality without them.

I then ran some XGBoost models as well for Boosting algorithms are more advanced and generally perform better.

Initial Findings/Models

While there was an increase in both accuracy and F1 score when the subcategory columns were included, it was not significant enough in my opinion considering that subcategories made up 159 of my features, more than double the remaining ones. I decided that the reduction of complexity and dimensionality was a beneficial trade off in this case. That is why I chose to stick with the XGBoost model without subcategories as my final model.

Looking a little deeper

I took a look at the feature importance from the model to see what features were most relevant in determining the success or failure of the campaigns.

By weight:

When looking at the feature importance by weight, we are seeing how many times or how often a certain feature was used in the trees when trying to predict success

Some features that stand out here:

When campaign started and ended
How much they were trying to raise(goal)
Length(duration of campaign)
The length of the blurb is surprisingly an important feature as well

By Gain:

When looking at the importance by gain, we are seeing how much information is gained when splitting on a certain feature.