Task: Create a model that can predict the tags or categories that a restaurant should have using user reviews.
Purpose: The purpose of this project is to build a model that predicts the tags (or labels/categories) of a restaurant based on the user reviews for that restaurant. If successful, this could help apply relevant tags to restaurants or be used to periodically update the tags. This would help provide users more accurate search results, so they would not miss a restaurant because of missing tags.
Data Gathering & Data Prep
I collected data from the Yelp Dataset, which can be found on the Yelp website. I was mainly looking at the business data and the review data. They contained data for 200k+ businesses and 8M+ reviews. I then took a subset of the businesses, looking at only those businesses with the “Restaurants” tag.
Once I had filtered down my list of businesses, I did some basic feature engineering and created dummy columns for tags (categories in the dataset) so that I would be able to both perform some EDA as well as use these columns as my target variables later.
When looking at the reviews dataset, I used the business ids from the filtered businesses to see only the reviews for businesses that were restaurants. The last thing I did was to join the individual reviews for each business into one large cumulative review that would be later used as my input or features for my models.
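The filtering, dummy-column creation, and review joining described above can be sketched in pandas. The toy frames below are stand-ins for the Yelp business and review tables (the real dataset uses the `business_id`, `categories`, and `text` columns; everything else here is illustrative):

```python
import pandas as pd

# Toy stand-ins for the Yelp business and review tables
businesses = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "categories": ["Restaurants, Pizza", "Restaurants, Bars", "Auto Repair"],
})
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2", "b3"],
    "text": ["great pie", "thin crust", "nice cocktails", "fixed my car"],
})

# Keep only businesses carrying the "Restaurants" tag
restaurants = businesses[businesses["categories"].str.contains("Restaurants", na=False)]

# Dummy columns for each tag: these become the target variables later
tags = restaurants["categories"].str.get_dummies(sep=", ")

# Keep only reviews for those restaurants, then join each restaurant's
# reviews into one large cumulative review (the model input)
rest_reviews = reviews[reviews["business_id"].isin(restaurants["business_id"])]
cumulative = rest_reviews.groupby("business_id")["text"].apply(" ".join)
```

`str.get_dummies` does the dummy-column step in one call when tags are stored as a comma-separated string, which matches how the Yelp `categories` field is formatted.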
Once the data was prepared and the dummy columns made, there were hundreds of tags. Because this is a multi-label problem and not a multi-class problem, the models become more time consuming and computationally expensive. For the sake of time and computer memory, I limited them to the top 20 tags.
I then kept only restaurants that had at least one of these tags and at least 50 reviews. The minimum review count ensured a larger vocabulary per restaurant, giving a better chance of capturing words relevant to the attached tags.
After all the filtering I was left with about 18 thousand restaurants.
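The top-tag and minimum-review filtering can be sketched as below. The frames are toy data (the project used the top 20 tags and a 50-review minimum; two tags and the same threshold are used here to keep the example small):

```python
import pandas as pd

# Toy dummy-column frame: rows are restaurants, columns are tags
tag_df = pd.DataFrame(
    {
        "Pizza": [1, 0, 0, 1],
        "Bars": [0, 1, 0, 1],
        "Vegan": [0, 0, 1, 0],
    },
    index=["b1", "b2", "b3", "b4"],
)
review_counts = pd.Series({"b1": 80, "b2": 60, "b3": 55, "b4": 10})

# Keep only the most common tags (top 20 in the project; top 2 here)
top_tags = tag_df.sum().nlargest(2).index
tag_df = tag_df[top_tags]

# Require at least one remaining tag and a minimum review count
has_tag = tag_df.sum(axis=1) > 0
enough_reviews = review_counts >= 50
kept = tag_df[has_tag & enough_reviews]
```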
NLP (Natural Language Processing)
Before I could use each restaurant’s cumulative review in my models, I needed to first perform some NLP on the reviews.
I made a function that would go through each review and do the following: tokenize the cumulative review, remove stopwords including punctuation and service/quality words (good, bad, service, etc.), and lastly lemmatize all the words.
After that was done, I performed TF-IDF (Term Frequency-Inverse Document Frequency), which gives more weight to words that are less common, helping to differentiate between groups.
I also separately vectorized the review data through SpaCy, which used word embeddings to create document vectors for each cumulative review, aiming to capture the overall meaning of the words in each document.
For modeling I used scikit-learn to split my data into a training and test set and scikit-multilearn to fit multi-label models to the training set and predict on the testing set.
I used the BinaryRelevance model from scikit-multilearn, a multi-label approach that fits a separate model for each label, predicts each label individually, and puts the results together. The reason behind this was that many restaurant labels may not depend on each other. For example, a restaurant that has a bar can serve almost any kind of food, such as pizza or burgers, and carry both tags; having the bar tag has no effect on whether or not it would have the pizza or burger tags.
So if we take a look at the distribution of the restaurant categories(tags), we see a couple of notable things:
First we see that the number of restaurants in each category ranges from about 1100 to about 5500 restaurants.
Second, we see that even the category with the most restaurants, Food, is not the majority if you looked at it as a single label. This is relevant when we look at the dummy classifier results later.
Next, if we take a look at the counts for the top 30 words in the reviews, we see that many of these words do not seem relevant to categorizing. Food, restaurant, and chicken are some of the most repeated words that may be relevant to classification, but most of the others are not. There are some that are relevant to service or quality so adding some of these to the stopwords might be helpful.
When doing the modeling, I began with a baseline model. In this case I ran a dummy classifier that would predict the majority class using Binary Relevance. I then ran a simple Gaussian Naive Bayes model that is commonly used with text classification. It is also a fairly quick model to run which was also helpful with the amount of text data being used. I lastly used a more advanced model with the Random Forest Classifier. The results are summarized below.
As I mentioned earlier in the EDA section, no tag applied to the majority of the restaurants, so the dummy classifier predicted no labels at all, leading to a precision and F1 score of 0.
The reason I chose Precision as the main metric is that, when predicting restaurant tags, false positives are relatively worse than false negatives. I believe it is better to miss some tags than to mislabel a restaurant with false tags, because false tags give misleading information to consumers, which could be especially detrimental if dietary tags were included. The F1 score is also reported for context, to keep the precision results from being misleading.
The TF-IDF model with the best precision (and F1) score was the Random Forest classifier, as seen above.
SpaCy Vector Models
For the models using the SpaCy vectorized features, I did not run a baseline as the results would be the same as the earlier one for the TF-IDF data. I again used the Gaussian Naive Bayes as a simple model and the Random Forest Classifier as an advanced model. In this case, due to the document vectors having 300 features instead of the 100,000+ features in the TF-IDF data, I was able to run an XGBoost classifier as well.
In this case, I chose my best model to be the Random Forest Classifier, the one where the precision was highest, even though the F1 score was lower. This is again because my main metric is Precision, for the reasons mentioned above, and the drop in the F1 score is not significant enough for me to choose the XGBoost model instead.
Looking a Little Closer
If we take a closer look at a few of the mislabeled restaurants, we see how the model is not giving false positives; all the predicted tags above are applicable to the restaurants, but there are false negatives or missing tags.
One thing that can be seen here is a possible shortcoming of this approach. Since the models are trying to use the user reviews to predict the tags, if there are certain food items that are not brought up in the reviews or rarely mentioned, it would be unable to predict that label. For example, looking at the third restaurant, Humble Wine Bar, it seems to be correctly labeled for the bar aspects, but is missing the pizza label. Reviews about a bar are not likely to talk about the pizza there which is why the model might not have had the information to predict that.
In conclusion, the Random Forest had the highest Precision score in both the TF-IDF and SpaCy vectorized models. In addition, SpaCy vectors used less memory and reduced complexity, so the Random Forest using SpaCy vectors was chosen as the better model.
High precision is valuable here because, at this early stage, false positives are worse than false negatives: false tags give users misinformation, which is especially bad if dietary tags are included. The relatively low F1 score does imply some missed labels.
If there had been more time for this project, or if I revisit it in the future, some things I might try are:
- Use Gridsearch to tune models
- Try assigning different thresholds or weights for different tags
- Try setting up a neural net
- Try expanding the model to include all businesses and tags, not just restaurants, although considerations have to be made since that would be much more memory-intensive and computationally expensive
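The per-tag threshold idea in the list above can be sketched in a few lines. Given per-label probabilities (e.g. from `predict_proba` of a Binary Relevance model), each tag gets its own cutoff; the threshold values here are hypothetical, with a stricter cutoff standing in for a sensitive tag such as a dietary one:

```python
import numpy as np

# Per-label probabilities for two restaurants and two tags
probs = np.array([[0.70, 0.40],
                  [0.55, 0.80]])

# Hypothetical per-tag thresholds: the second tag demands more confidence
thresholds = np.array([0.60, 0.75])

# A tag is assigned only where its probability clears its own threshold
preds = (probs >= thresholds).astype(int)
```

Raising a tag's threshold trades recall for precision on that tag alone, which fits the project's preference for avoiding false positives on sensitive tags.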