Founding a successful restaurant is hard: 60% fail within 3 years. By comparison, hotel chains average a 7% failure rate over 10 years. When I was growing up, my uncle Jim struggled with his restaurant for 3 years before he finally admitted defeat.
Can we use data to improve the chances of starting a successful restaurant? This month, I decided to find out.
The goal for this project is:
Thanks to the Yelp Challenge, the data I needed was easy to get and shockingly consistent. There were next to no anomalous data points.
One challenge of this project is defining success. I don’t have the profits and losses of individual restaurants, so I need to infer success from the observable data I have. I define success with a scoring function whose actual form is somewhat complicated and can be seen in the measuring_success notebook. The top 25% of Yelp restaurants according to this function were labeled as successful.
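The labeling step can be sketched roughly as follows. This is only an illustration: the scores are made up, `label_top_fraction` is a hypothetical helper name, and the success function itself is not reproduced here.

```python
def label_top_fraction(scores, fraction=0.25):
    """Return True for every score that falls in the top `fraction` of all scores."""
    # Number of restaurants to label as successful (at least one)
    n_top = max(1, int(len(scores) * fraction))
    # The lowest score that still makes it into the top group
    cutoff = sorted(scores, reverse=True)[n_top - 1]
    # Note: tied scores at the cutoff are all labeled successful
    return [score >= cutoff for score in scores]


# Hypothetical success scores for eight restaurants
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.5]
labels = label_top_fraction(scores)
```

With eight restaurants, the two highest scores end up labeled as successful.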
One of the key features in my model is the success of businesses in a nearby location. I use a K-d tree to efficiently find the 10 nearest businesses to a given point and then average those success values.
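To make the neighbor feature concrete, here is a minimal pure-Python sketch of a 2-d tree with a k-nearest-neighbor search. It is not the code from this project: the function names and sample businesses are hypothetical, and it treats latitude and longitude as plain planar coordinates, which is only a reasonable approximation within a single city. In practice a library implementation such as `scipy.spatial.cKDTree` does the same job.

```python
import heapq
import math


def build_kdtree(points, depth=0):
    """Recursively build a 2-d tree from (x, y, success) tuples."""
    if not points:
        return None
    axis = depth % 2  # alternate between splitting on x and on y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def k_nearest(node, target, k, heap=None):
    """Collect the k points nearest to target as a heap of (-distance, point)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    point, axis = node["point"], node["axis"]
    dist = math.hypot(point[0] - target[0], point[1] - target[1])
    if len(heap) < k:
        heapq.heappush(heap, (-dist, point))
    elif dist < -heap[0][0]:
        # Closer than the current k-th nearest: replace the farthest kept point
        heapq.heapreplace(heap, (-dist, point))
    diff = target[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    k_nearest(near, target, k, heap)
    # Visit the far side only if the splitting line is closer than the k-th nearest so far
    if len(heap) < k or abs(diff) < -heap[0][0]:
        k_nearest(far, target, k, heap)
    return heap


def neighbor_success(tree, target, k=10):
    """Average the success values of the k businesses nearest to target."""
    neighbors = k_nearest(tree, target, k)
    return sum(point[2] for _, point in neighbors) / len(neighbors)


# Hypothetical businesses: (latitude, longitude, success score)
businesses = [
    (36.10, -115.17, 0.9),
    (36.12, -115.18, 0.4),
    (36.11, -115.15, 0.7),
    (36.20, -115.30, 0.1),
    (36.05, -115.10, 0.3),
]
tree = build_kdtree(businesses)
nearby_success = neighbor_success(tree, (36.11, -115.17), k=3)
```

The pruning test is what makes the search efficient: whole subtrees are skipped whenever they cannot contain a point closer than the neighbors already found.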
I use Latent Dirichlet Allocation (LDA) to extract latent topics and turn them into features. This combination of using an unsupervised learning technique to generate features for a supervised model can provide powerful insights.
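As a rough sketch of this kind of pipeline, the example below chains a bag-of-words step, LDA, and a classifier using scikit-learn. Everything specific here is an assumption made for illustration: the library choice, the tiny made-up corpus and labels, the number of topics, and the choice of classifier.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Made-up text per restaurant and made-up success labels (1 = successful)
texts = [
    "great tacos fresh salsa friendly staff",
    "slow service cold pizza greasy crust",
    "amazing sushi fresh fish attentive staff",
    "burnt burger rude waiter long wait",
    "delicious ramen rich broth cozy room",
    "stale bread cold soup dirty tables",
]
labels = [1, 0, 1, 0, 1, 0]

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),                      # text -> word counts
    LatentDirichletAllocation(n_components=3, random_state=0),  # counts -> topic weights
    RandomForestClassifier(n_estimators=50, random_state=0),    # topic weights -> label
)
pipeline.fit(texts, labels)

# The unsupervised part alone: one row per restaurant, one column per topic
topic_features = pipeline[:-1].transform(texts)
```

The classifier never sees the raw words, only the per-topic weights, so it can only be as predictive as the topics themselves.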
Unfortunately, latent topics in this case were simply not predictive of success. This is despite the topics forming coherent categories.
Explicit topics (i.e., categories) perform much better than LDA topics. They end up predicting nearly as well as location in the aggregate.
I tried a number of different classifiers at first. Random Forests performed the best, but there was still something wrong: the model had way too many false positives. If it predicted that a restaurant would be successful, the restaurant would actually be successful only 50% of the time. This is much better than chance (25%), but I thought we could do better.
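The comparison with chance can be worked out explicitly. The quantity described above is the model's precision, and the snippet below compares it with the 25% base rate. The counts are hypothetical, chosen only to reproduce the 50% figure.

```python
def precision(true_positives, false_positives):
    """Share of predicted-successful restaurants that really are successful."""
    return true_positives / (true_positives + false_positives)


def lift_over_chance(precision_value, base_rate):
    """Relative improvement over picking restaurants at random."""
    return precision_value / base_rate - 1


base_rate = 0.25  # the top 25% of restaurants are labeled successful

# Hypothetical counts: 100 predicted successes, 50 of them correct
model_precision = precision(true_positives=50, false_positives=50)
improvement = lift_over_chance(model_precision, base_rate)
```

A precision of 50% against a 25% base rate is twice as good as chance, that is, a 100% relative increase.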
I adjusted the classifier to use balanced classes in each tree, and this decreased the false positives substantially. Now, when my model predicts that a restaurant will be successful, it is right 66% of the time. Compared with the 25% base rate, this represents more than a 150% increase in the chance of being successful.
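One way to get balanced classes in each tree, assuming scikit-learn, is the `class_weight="balanced_subsample"` option of `RandomForestClassifier`. The sketch below uses synthetic stand-in data with roughly 25% positives. It illustrates the technique and is not necessarily the exact setup used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: about 25% positives, like the top-quartile labels
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100,
    # Reweight the classes using each tree's own bootstrap sample
    class_weight="balanced_subsample",
    random_state=0,
)
clf.fit(X_train, y_train)

# Precision: of the restaurants predicted successful, how many really are
test_precision = precision_score(y_test, clf.predict(X_test))
```

Comparing `test_precision` with and without the `class_weight` argument on your own data is the simplest way to see whether the reweighting helps.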
The most important features:
Since most people who want to start restaurants tend not to be very technically savvy, I decided to build a web app that lets them easily use the model I built. Building the app was extremely straightforward thanks to the awesomeness of Flask.
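For a sense of how little code a Flask endpoint needs, here is a minimal sketch. The `/predict` route, the JSON field names, and the `predict_success` placeholder are all hypothetical and are not taken from the actual app.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_success(latitude, longitude, categories):
    """Placeholder for the trained model; returns a fixed value for illustration."""
    return 0.5


@app.route("/predict", methods=["POST"])
def predict():
    # Read the proposed restaurant's details from the JSON request body
    payload = request.get_json()
    probability = predict_success(
        payload["latitude"], payload["longitude"], payload["categories"]
    )
    return jsonify({"success_probability": probability})
```

Saved as `app.py`, this can be served locally with `flask run`, and a frontend form would then post the location and categories to the endpoint.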
This is a screenshot of the app I built:
The user workflow is:
Using data to inform our decisions when founding a new business can be extremely valuable.
This project also shows how valuable a good frontend can be for communicating insights in data. Without a good interface it would be difficult for people to make use of this data.