Our project is an anime recommendation system. Recently everyone has been at home due to the coronavirus, and if you’re like me and really enjoy anime, finding a new one to watch is a job in itself. Finding a new anime is not hard, but randomly picking and watching one that can range from 7 to 1,000 episodes can lead to some bad results. So we built a recommendation system using the Surprise library to find anime a user will enjoy, based on user ratings of different anime as well as the sentiment and vector scores of their reviews.
When building our recommendation system, we used 1,032 anime with 72K user reviews and ratings. The ratings ranged between 1 and 10, and not all anime and users had the same number of reviews. The scoring metric we used to evaluate our models was root mean squared error (RMSE).
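To make the metric concrete, RMSE is the square root of the mean squared difference between actual and predicted ratings; a minimal sketch in plain Python (the example numbers are illustrative, not from our data):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    assert len(y_true) == len(y_pred)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))

# Three toy predictions on a 1-10 rating scale
print(rmse([8, 7, 9], [7, 7, 10]))  # ≈ 0.816
```

Because errors are squared before averaging, RMSE penalizes a few large misses more heavily than many small ones, which suits a rating scale where being off by 3 points is much worse than being off by 1.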
There were four steps to our process:

1. Web-scrape user ratings and reviews from myanimelist.net.
2. Clean and tokenize the reviews.
3. Convert the cleaned reviews into a sentiment score and a vector score.
4. Feed the different ratings into isolated recommendation models.
When processing our data before running it through our recommendation models, we split the reviews and ratings into separate dataframes. We cleaned up the reviews and tokenized them, using TextBlob to get a sentiment score and spaCy to get a normalized vector score for each review. We then concatenated the ratings, sentiment scores, and vector scores into one final dataframe to test our recommendation models on.
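As a rough illustration of the cleaning and tokenizing step, here is a simplified stand-in (our exact cleaning rules, and the TextBlob/spaCy scoring that follows, are not shown here):

```python
import re

def clean_and_tokenize(review):
    """Lowercase the review, strip non-alphanumeric characters,
    and split into tokens. Illustrative only -- the pipeline's
    real cleaning rules may differ."""
    review = review.lower()
    review = re.sub(r"[^a-z0-9\s]", " ", review)
    return review.split()

print(clean_and_tokenize("Best anime EVER!! 10/10, would re-watch."))
# ['best', 'anime', 'ever', '10', '10', 'would', 're', 'watch']
```

Each cleaned review would then be scored twice: once for sentiment polarity and once for a document-vector magnitude, giving the two extra rating-like signals we fed into the models.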
First, we wanted to get an idea of how our ratings were spread out. We can see here that most anime were rated 7 or above. This is because our list contains the top 1,000 anime from the website, which we chose because they were the most likely to have multiple reviews from different users.
When looking at the average ratings for the anime with the most reviews, we again see that ratings are 7 or above. As Antonio mentioned, this is because we were looking at the top-rated anime.
The second graph shows the total ratings/reviews of the 20 users with the most reviews, with the top user having written 285. The number of reviews per user in our dataset ranges from 1 to 285.
After looking at the ratings, we wanted to get an idea of how the spaCy vector score and the sentiment score were distributed. We can see on the right that even though the overall range of the sentiment score was -1 to 1, the bulk of the scores fall into a smaller range of -0.25 to 0.5.
The spaCy score behaved similarly: it ranged from 0 to 8.5, but most of the data fell between 2.5 and 4. We wanted to keep that in mind when comparing RMSE scores later.
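One simple way to account for these different scales when comparing errors is to normalize each RMSE by its score's range; a sketch, with made-up RMSE values purely for illustration:

```python
def normalized_rmse(rmse_value, score_min, score_max):
    """Divide RMSE by the score range so that errors measured on
    different scales (ratings 1-10, sentiment -1 to 1, spaCy ~0 to 8.5)
    become comparable fractions of their own range."""
    return rmse_value / (score_max - score_min)

# Hypothetical RMSE values, not results from our models
print(normalized_rmse(1.2, 1, 10))   # ratings scale  -> ~0.133
print(normalized_rmse(0.25, -1, 1))  # sentiment scale -> 0.125
```

A raw RMSE of 0.25 on a 2-point scale is a larger relative error than it first appears, which is why the ranges matter when reading the comparisons below.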
When doing predictions using only user ratings, our top-performing models were the Baseline model, the KNN Baseline model, and the SVD model. As you can see, the SVD model achieved a slightly better RMSE than the Baseline model, with the KNN Baseline model not far behind. The Baseline model takes user and item biases into account when predicting ratings. The KNN Baseline model uses item-item collaborative filtering. The SVD model is also a collaborative filtering technique, one that reduces dimensionality through matrix factorization. Both the KNN Baseline and SVD models take biases into account as well.
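The bias idea behind the Baseline model can be sketched in a few lines: the predicted rating is the global mean plus how far the user and the item each tend to deviate from it (toy numbers below, not fitted values from our data):

```python
def baseline_predict(mu, user_bias, item_bias, user, item):
    """Baseline estimate: global mean rating plus user and item bias
    terms. Unknown users/items fall back to a bias of 0."""
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)

mu = 7.0                       # global mean rating (toy value)
user_bias = {"alice": 0.5}     # alice rates 0.5 above the average user
item_bias = {"fma": 1.2}       # a well-liked anime rated 1.2 above average

print(baseline_predict(mu, user_bias, item_bias, "alice", "fma"))  # 8.7
print(baseline_predict(mu, user_bias, item_bias, "bob", "fma"))    # 8.2
```

KNN Baseline and SVD build on this same estimate: KNN Baseline adjusts it with ratings of similar items, while SVD adds a learned dot product of latent user and item factors.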
Next, we wanted to look at the sentiment and spaCy scores we got from the reviews. We fed each into the SVD recommendation model to see if they would do a better job than the user ratings. We found that the sentiment score produced a much better RMSE than the spaCy vector score. Now we can look at all the results together for our different models to pick our best one.
And finally we get to our final results. We can see here that the best model by RMSE was the one that used only the sentiment score: even accounting for the sentiment score having the smallest range, its RMSE was still lower than the other models'. However, in the end we decided to go with the SVD model using user ratings, as it gave the most diverse recommendations, whereas the sentiment-score model almost exclusively recommended the very top-rated anime. Still, if a user has fewer than 3 reviews/ratings, we could fall back on the sentiment SVD model to address the cold-start problem.
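That fallback rule is straightforward to express; a sketch, with the fitted models replaced by placeholder strings:

```python
def pick_model(num_ratings, rating_model, sentiment_model, threshold=3):
    """Cold-start heuristic: use the sentiment-based SVD for users with
    fewer than `threshold` ratings, the ratings-based SVD otherwise.
    (The threshold of 3 matches the cutoff mentioned above.)"""
    return sentiment_model if num_ratings < threshold else rating_model

print(pick_model(1, "ratings_svd", "sentiment_svd"))   # sentiment_svd
print(pick_model(10, "ratings_svd", "sentiment_svd"))  # ratings_svd
```

In practice the two placeholders would be the fitted SVD models, and the chosen one would generate that user's top-N recommendations.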
Lastly, we wanted to see how this model would perform in practice, so we input our own ratings for different anime to see if the recommendations would be something we would like. We took ratings for multiple anime from a classmate and ourselves and looked at the top 10 recommendations. Since we didn’t enter every anime we had seen, it recommended some anime we had previously watched and enjoyed, so the model seems to be doing a good job overall.
In conclusion, our best model using just ratings was the collaborative-filtering-based SVD model, and the model with the lowest RMSE overall was the SVD model using the calculated sentiment scores. The downside of the sentiment-score SVD model is that it requires user reviews to give recommendations, so it is not ideal for new users, and that also made it harder to test with real people. However, it might be worth using later on once the user base grows, if we can persuade users to actually write reviews for the anime they watch.
Giving a rating is much easier and quicker, so it is more practical and usable for a general audience. This is why we stuck with the ratings SVD model, and based on our few tests, it seems to have done a good job, even recommending anime we had enjoyed but not entered.
There were some things we couldn’t get to that we wish we could have improved upon. One was to find a different library that could build a recommendation model taking the NLP data and ratings into account simultaneously. Another was to group all user ratings/reviews for anime with multiple seasons under one name, so that later seasons of a show would not come up as recommendations and the results would be more meaningful and practical.