Restaurant Tag Predictor

Task: Create a model that can predict the tags or categories that a restaurant should have using user reviews.

Purpose: The purpose of this project is to build a model that predicts the tags (or labels/categories) of a restaurant based on its user reviews. If successful, this could help apply relevant tags to restaurants or be used to update tags periodically. This would help provide users with more accurate search results, so they would not miss a restaurant because of missing tags.


Data Gathering & Data Prep

Once I had filtered down my list of businesses, I did some basic feature engineering and created dummy columns for the tags (called categories in the dataset) so that I could both perform some EDA and use these columns as my target variables later.
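The dummy-column step can be sketched with pandas; the mini-frame below is hypothetical, but in the real Yelp dataset the "categories" field is likewise a comma-separated string of tags:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Yelp business data; in the
# real dataset, "categories" is a comma-separated string of tags.
businesses = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "categories": ["Pizza, Bars", "Burgers", "Bars, Burgers"],
})

# One 0/1 dummy column per tag -- these become the target variables.
tag_dummies = businesses["categories"].str.get_dummies(sep=", ")
businesses = pd.concat([businesses, tag_dummies], axis=1)
```

Each tag column can then be summed for EDA counts or used directly as a binary target.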

When looking at the reviews dataset, I used the business IDs from the filtered businesses to keep only the reviews for businesses that were restaurants. Lastly, I joined the individual reviews for each business into one large cumulative review that would later be used as the input, or features, for my models.
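The join into one cumulative review per business is a groupby-and-concatenate; here is a minimal sketch with toy review text (column names assumed to match the Yelp schema):

```python
import pandas as pd

# Toy reviews table; "business_id" links each review to a restaurant.
reviews = pd.DataFrame({
    "business_id": ["b1", "b2", "b1"],
    "text": ["Great pizza.", "Good burgers.", "Nice patio."],
})

# Concatenate all reviews for each business into one cumulative document,
# which later serves as the model input for that restaurant.
cumulative = reviews.groupby("business_id")["text"].agg(" ".join).reset_index()
```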

Data Processing

I then filtered out restaurants that did not have at least one of these tags and at least 50 reviews. The minimum review count ensures a larger vocabulary, giving a better chance of capturing words relevant to the attached tags.
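That filter can be expressed as a boolean mask over the dummy tag columns and the review count; the frame below is a made-up example:

```python
import pandas as pd

# Hypothetical frame with dummy tag columns and a review count per restaurant.
df = pd.DataFrame({
    "review_count": [120, 30, 75],
    "Pizza":        [1, 1, 0],
    "Bars":         [0, 0, 0],
})
tag_cols = ["Pizza", "Bars"]

# Keep restaurants with at least one tag of interest and at least 50 reviews.
keep = (df[tag_cols].sum(axis=1) >= 1) & (df["review_count"] >= 50)
filtered = df[keep]
```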

After all the filtering, I was left with about 18,000 restaurants.

NLP (Natural Language Processing)

I made a function that processes each cumulative review as follows: tokenize the text, remove stopwords, including punctuation and service/quality words (good, bad, service, etc.), and lastly lemmatize all the words.
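A pure-Python sketch of that function is below. The hand-rolled stopword set and lemma lookup table are toy stand-ins; the real pipeline would use a full stopword list and a proper lemmatizer (e.g. NLTK's WordNetLemmatizer):

```python
import re

# Toy stand-ins: the real function used a full stopword list (plus the
# service/quality words noted above) and a real lemmatizer.
STOPWORDS = {"the", "a", "and", "was", "were", "is", "good", "bad",
             "great", "service"}
LEMMAS = {"pizzas": "pizza", "burgers": "burger", "wings": "wing"}

def preprocess(review):
    # Tokenize: lowercase and keep word tokens only (drops punctuation).
    tokens = re.findall(r"[a-z']+", review.lower())
    # Remove stopwords, including the service/quality words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # "Lemmatize" via the toy lookup table.
    return [LEMMAS.get(t, t) for t in tokens]
```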

After that was done, I applied TF-IDF (Term Frequency–Inverse Document Frequency), which gives more weight to words that are less common across documents, helping to differentiate between groups.

I also separately vectorized the review data through spaCy, which uses word embeddings to create a document vector for each cumulative review, aiming to capture the overall meaning of the words in each document.
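spaCy's `Doc.vector` is the average of the document's token embeddings; the sketch below reproduces that averaging with made-up 3-d vectors (spaCy's medium models ship 300-d vectors, so the real pipeline produced a 300-d vector per cumulative review):

```python
import numpy as np

# Toy 3-d word embeddings; these numbers are made up for illustration.
embeddings = {
    "great": np.array([0.9, 0.1, 0.0]),
    "pizza": np.array([0.2, 0.8, 0.1]),
}

def doc_vector(tokens):
    # Average the embedding of every token in the document, as spaCy does
    # when computing Doc.vector.
    return np.mean([embeddings[t] for t in tokens], axis=0)

vec = doc_vector(["great", "pizza"])
```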


I used the BinaryRelevance model from scikit-multilearn, a multi-label approach that fits a separate model for each label; at prediction time, it predicts each label individually and combines the results. I chose it because many restaurant labels do not depend on each other. For example, a restaurant that has a bar can serve almost any kind of food, such as pizza or burgers, and carry both tags; having the bar tag has no effect on whether it has the pizza or burger tags.
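The sketch below shows what binary relevance does under the hood, with a toy dataset and logistic regression as a hypothetical base classifier (scikit-multilearn's `BinaryRelevance(classifier=...)` wraps this same pattern):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix and two independent binary labels.
X = np.array([[3.0, 0.0], [0.0, 3.0], [3.0, 3.0], [0.0, 0.0]])
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

# Binary relevance: fit one independent classifier per label column.
models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def predict(X_new):
    # Predict each label separately, then stack the columns back together.
    return np.column_stack([m.predict(X_new) for m in models])
```

Because each label gets its own model, a prediction for "Bars" has no influence on the prediction for "Pizza", matching the independence assumption described above.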


So if we take a look at the distribution of the restaurant categories (tags), we see a couple of notable things:

First, we see that the number of restaurants in each category ranges from about 1,100 to about 5,500.

Second, we see that even the category with the most restaurants, Food, would not be a majority class if treated as a single label. This becomes relevant when we look at the dummy classifier results later.

Next, if we take a look at the counts for the top 30 words in the reviews, we see that many of these words do not seem relevant to categorizing. Food, restaurant, and chicken are some of the most repeated words that may be relevant to classification, but most of the others are not. Some relate to service or quality, so adding them to the stopword list might be helpful.
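Those word counts come from a simple frequency tally; `collections.Counter` over the tokenized corpus is one way to get them (the token list below is a toy stand-in for the full corpus):

```python
from collections import Counter

# Toy token list; the same most_common pattern produces the top-30 counts
# over the real tokenized review corpus.
tokens = "food great food place chicken great food".split()
top_words = Counter(tokens).most_common(2)
```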


TF-IDF Models

As I mentioned earlier in the EDA section, no tag applied to the majority of the restaurants, so the dummy classifier predicted every label incorrectly, resulting in precision and F1 scores of 0.

The reason I chose precision as the main metric is that, when predicting restaurant tags, false positives are worse than false negatives. I believe it is better to miss some tags than to mislabel a restaurant with false ones, because that gives consumers misleading information, which could be especially detrimental if dietary tags were included. The F1 score is also reported to provide perspective and keep the results from being misleading.
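A quick way to see the dummy baseline's score of 0: with no majority tag, a most-frequent dummy predicts no tags at all, and scikit-learn's `zero_division` argument reports the resulting undefined precision as 0. The label matrices below are hypothetical:

```python
import numpy as np
from sklearn.metrics import precision_score, f1_score

# Hypothetical multi-label ground truth; no tag covers a majority of rows,
# so a most-frequent dummy predicts 0 for every tag.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_dummy = np.zeros_like(y_true)

# With no positive predictions, precision is 0/0; zero_division=0 reports 0.
p = precision_score(y_true, y_dummy, average="micro", zero_division=0)
f1 = f1_score(y_true, y_dummy, average="micro", zero_division=0)
```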

The TF-IDF model with the best precision (and F1) score was the Random Forest classifier, as seen above.

spaCy Vector Models

In this case, I chose the Random Forest classifier as my best model, the one with the highest precision, even though its F1 score was lower. Again, my main metric is precision for the reasons mentioned above, and the drop in F1 is not significant enough for me to prefer the XGBoost model.

Looking a Little Closer

If we take a closer look at a few of the mislabeled restaurants, we see that the model is not producing false positives; all the predicted tags above are applicable to the restaurants, but there are false negatives, i.e., missing tags.

One thing that can be seen here is a possible shortcoming of this approach. Since the models use the user reviews to predict the tags, if certain food items are never or rarely mentioned in the reviews, the model has no signal to predict the corresponding label. For example, the third restaurant, Humble Wine Bar, seems to be correctly labeled for the bar aspects but is missing the pizza label. Reviews of a wine bar are not likely to discuss its pizza, so the model may not have had the information needed to predict that tag.


High precision is good because, in the early stages, false positives are worse than false negatives: they can spread misinformation, which is especially bad if dietary tags are included. The relatively low F1 score implies some missed labels.

Next Steps

  • Use grid search to tune the models
  • Try assigning different thresholds or weights for different tags
  • Try setting up a neural net
  • Try expanding the model to include all businesses and tags, not just restaurants, although considerations have to be made since that would be much more memory- and compute-intensive

Data Science student at Flatiron School
