Learning about Video Games Through Regression


Video games are being developed and released all the time. Whether they are made by big or small companies, there are video games out there for all kinds of people.

However, when a video game is in development or about to be released, it helps to know which factors affect its sales so that those sales can be maximized.

Big Questions

Some of the big questions I’ll be trying to answer through this regression analysis are:

Are there noticeable trends or patterns in video game sales?

What factors have the biggest impact on video game sales?

What factors have the biggest impact on video game ratings?


Often, someone who is unfamiliar with a game will look at its rating to get an idea of how good it is before buying it. For this reason, I expect ratings (Metascores) to influence video game sales.

Video games released right before the holiday season might sell more because people are looking to buy gifts for friends and family.

The Process

I retrieved two datasets from Kaggle:

The first dataset covered Video Game Sales and had sales information on over 55 thousand video games.

The second dataset covered Metacritic Scores and had over 15 thousand video games listed. I used the Metacritic scores as the ratings for the video games.

The next step was to clean up and format the video game names and consoles. I did this to make merging the two datasets easier: the names of the games and consoles were not always formatted the same way, and the merged data would not be accurate if they did not match up correctly. I had never realized how many different ways you could write the name of a video game.
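As a rough illustration of this kind of cleanup, here is a minimal sketch that reduces names to a comparable key. The function names, the specific rules (lowercasing, stripping accents and punctuation), and the console alias table are all assumptions for the example, not the exact steps used in this project.

```python
import re
import unicodedata

def normalize_title(title: str) -> str:
    """Reduce a game title to a lowercase, punctuation-free key for matching."""
    # Split accented characters into base letter + accent, then drop non-ASCII.
    text = unicodedata.normalize("NFKD", title)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase, drop apostrophes, and spell out ampersands.
    text = text.lower().replace("'", "").replace("&", " and ")
    # Replace every run of characters that is not a letter or digit with a space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # Collapse repeated spaces and trim the ends.
    return " ".join(text.split())

# Hypothetical alias table: different spellings mapped to one console code.
CONSOLE_ALIASES = {
    "playstation 4": "ps4",
    "ps4": "ps4",
    "xbox one": "xone",
    "xone": "xone",
}

def normalize_console(console: str) -> str:
    """Map a console name to a canonical code, falling back to the cleaned name."""
    key = normalize_title(console)
    return CONSOLE_ALIASES.get(key, key)
```

With rules like these, "Ratchet & Clank" and "ratchet and clank" end up with the same key, so they can be matched across the two datasets.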

After I cleaned and formatted the data as well as I could, I merged both datasets and was left with a little over 8 thousand video games. This was a definite drop from the initial datasets, but it was better than having a larger amount of data that was not accurate.
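The merge itself can be sketched with pandas. The column names (`name_key`, `console_key`, `global_sales`, `metascore`) and the tiny made-up rows below are assumptions for illustration only; they are not the real Kaggle columns or values.

```python
import pandas as pd

# Tiny made-up frames standing in for the two cleaned datasets.
sales = pd.DataFrame({
    "name_key": ["game a", "game b", "game c"],
    "console_key": ["ps4", "wii", "ps2"],
    "global_sales": [1.5, 3.0, 0.2],
})
scores = pd.DataFrame({
    "name_key": ["game a", "game b", "game d"],
    "console_key": ["ps4", "wii", "pc"],
    "metascore": [90, 76, 81],
})

# An inner join keeps only games that appear in both datasets on both keys.
merged = sales.merge(scores, on=["name_key", "console_key"], how="inner")

# An outer join with indicator=True shows how many rows failed to match,
# which helps explain a drop in row count after merging.
audit = sales.merge(scores, on=["name_key", "console_key"],
                    how="outer", indicator=True)
match_counts = audit["_merge"].value_counts()
```

Checking `match_counts` is one way to see how much of the drop comes from games that only exist in one dataset versus names that still do not line up.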

After that came the meat of the project, the part that would give insights into the data. I did some exploratory data analysis (EDA) and looked at overall trends and relationships between the features.

Then I set up and tested linear regression models with and without interactions between the features.

Initial EDA

First, let's take a look at how the average sales of video games relate to the month the game was released. There is a clear relationship between Sales and certain months, particularly the fall months, with November releases having the highest average sales. This supports my hypothesis from earlier, because Black Friday and Cyber Monday fall in November, which is also right before Christmas and New Year's.
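A comparison like this can be computed with a pandas group-by. The sketch below uses made-up rows and assumed column names (`release_date`, `global_sales`), so the numbers are purely illustrative.

```python
import pandas as pd

# Made-up rows; the real data has one row per game.
games = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2010-11-05", "2011-11-20", "2012-03-01", "2013-03-15"]
    ),
    "global_sales": [4.0, 2.0, 1.0, 0.5],
})

# Pull the month number out of the release date.
games["release_month"] = games["release_date"].dt.month

# Average sales per release month, highest first.
avg_by_month = (
    games.groupby("release_month")["global_sales"]
    .mean()
    .sort_values(ascending=False)
)
```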

Next, if we take a look at the average sales of video games by console, we see that certain consoles have more sales on average than others, with PlayStation home consoles dominating. One reason for this could be an imbalance in the data, with more best-selling games from certain consoles and fewer from others.

From the graph above, we can clearly see that Mature-rated games sell more on average, which could be because they are geared toward an older audience with more money to spend.


I ran regression models to determine the important factors for Sales as well as for Metascore ratings. I ran them both with the initial features and after adding polynomial interactions, and in both cases the models did better with the interactions.

Sales RMSE scores for regression models with polynomial interactions

The chart above shows the RMSE results of the regression models for predicting Sales of video games. Filtering solely by high correlation did not remove enough features and resulted in an overfit model.
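As a reminder of what RMSE measures, here is a small plain-Python sketch. The `overfit_gap` helper is a hypothetical name, and comparing train and test RMSE is one common way to spot overfitting, assumed here rather than taken from the project code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def overfit_gap(train_rmse, test_rmse):
    """Positive values mean the model does worse on unseen data than on training data."""
    return test_rmse - train_rmse
```

A model whose test RMSE is much higher than its training RMSE has likely memorized the training data rather than learned a general pattern.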

The K-best features model did have the lowest RMSE, but the Lasso model used half the number of features with only slightly worse results. For that reason, I chose to examine the features in the Lasso model to try to understand which features are important in determining the sales of a video game.
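To show how the two approaches end up with different feature counts, here is a minimal sketch on synthetic data. The data, `k=10`, and `alpha=0.1` are all assumptions chosen for the example and are not the settings used in this project.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, but only the first three drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Scale first so the Lasso penalty treats every feature equally.
X_scaled = StandardScaler().fit_transform(X)

# K-best keeps a fixed number of features, chosen by a univariate score.
selector = SelectKBest(score_func=f_regression, k=10).fit(X_scaled, y)
n_kbest = int(selector.get_support().sum())

# Lasso shrinks unhelpful coefficients to exactly zero, so the number
# of features it keeps is decided by the penalty strength alpha.
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
n_lasso = int(np.count_nonzero(lasso.coef_))
```

The design difference is that K-best always returns exactly `k` features, while Lasso decides how many to keep as part of fitting the model.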

Metascore RMSE scores for regression models with polynomial interactions

Next, if we look at the RMSE scores from the regression models for determining the Metascore (critic rating), we see that once again just filtering out highly correlated features gives an overfit model, and the Lasso model performs the best with the lowest RMSE scores. In this case, the K-best model does use fewer features, but only by a small amount, so I decided to look into the Lasso results for ratings as well.

Regression Analysis Global Sales

Top 10 features by coefficient size for Sales Lasso Model

What we see when we look at the top 10 features by coefficient size is that almost all of them are interaction features. Most are some kind of interaction with metascore.

On its own, Metascore has the largest effect on sales, and that effect is slightly negative. The second largest effect was also negative and was the interaction of age with a November release date, describing older games released in November. This may mean that Sales of games released in November decrease slightly with age.
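Ranking features by coefficient size can be sketched in a few lines of plain Python. The function name `top_features` and the example names and values in the usage note are hypothetical; with a fitted scikit-learn Lasso model, the coefficients would come from its `coef_` attribute.

```python
def top_features(names, coefs, n=10):
    """Return the n (name, coefficient) pairs with the largest absolute coefficient."""
    if len(names) != len(coefs):
        raise ValueError("names and coefs must be the same length")
    # Skip features that Lasso removed by setting their coefficient to zero.
    kept = [(name, coef) for name, coef in zip(names, coefs) if coef != 0]
    # Sort by magnitude so large negative effects rank alongside large positive ones.
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)[:n]
```

For example, `top_features(["a", "b", "c", "d"], [0.5, -2.0, 0.0, 1.0], n=2)` returns `[("b", -2.0), ("d", 1.0)]`. Coefficient sizes are only comparable across features when the features were put on the same scale before fitting.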

Regression Analysis Rating (Metascore)

Top 10 features by coefficient size for Ratings Lasso Model

When we look at the top 10 features by coefficient size, we once again see that almost all of them are interaction features.

The highest impact came from base sales and age. Sales had the largest, and positive, impact, while age had the second highest, but negative, impact on rating.


The first thing we see is that both Sales and Ratings seem to depend on interactions of features, that is, on how certain features work in conjunction to push the results up or down.

The next thing we see is that Sales has the largest effect on Ratings and vice versa, both individually and interactively. It makes sense that a large number of sales would mean a game is likely highly rated, but it was somewhat strange that as Metascore increased, sales seemed to decrease.

Lastly, we saw that games that generally have high sales have lower sales over time. I inferred this from the interaction between Age and games released in November. We saw earlier in our EDA that November games had the highest average sales, and the interaction term had a slight negative coefficient. So it might be that they make less over time, or it might mean that more recent games released in November had slightly higher sales than older November games.


So, how do these results apply to people making and buying games? Let’s take a look:

  • If you are looking to get a game, high sales are a good indicator that it has high ratings.
  • If you are developing a game, a mature shooter released in November will likely see high sales starting out, with sales decreasing over time.
  • There were a large number of publishers/developers, but they may be worth looking into to see their effects (or lack thereof) on Sales and Ratings.

Thank you for reading. This was one of my earlier projects, and there are definitely places where I would do things a bit differently now, which to me shows that I have learned a lot since then. I might go back and make some of those changes if I have the time.

Data Science student at Flatiron School