Business analytics: what motivates you to open a restaurant there?

The United States now has over 616 thousand restaurants, and the number increases by 7 percent per year. Restaurants are a typical and important business, tightly connected with people’s daily life, and are often treated as an indicator of local economic scale and resident income, reflecting the quality of life in a region. However, no previous study has identified the factors that actually influence the number of restaurants in a region over the last few decades, the period in which the total number of restaurants in the USA grew the most. In this project, we collect data from Yelp, the largest restaurant review website in the USA, and from City-Data, a demographic information website, fit statistical models to the combined data, and summarise the results. Eight factors are found to be significant variables with a linear relationship to the number of restaurants in an area defined by ZIP code. We interpret these eight factors in the social context of the USA and analyse the reasons behind them. Finally, we offer suggestions on choosing a location to people who plan to open a restaurant in the USA.

The restaurant is a typical and important business that is tightly connected with people’s daily life and reflects local economic vitality and residents’ income. According to statistics from NPD, there are over 616 thousand restaurants operating in the United States, and the total increases by 7 percent every year. People prefer to live in a region with sufficient dining options, which gives investors both the opportunity and the motivation to open a restaurant. However, choosing the location is one of the most critical decisions for an investor. The population, economy, security and other demographic factors of an area may affect customers’ behaviour and consumption, and thereby the return of a restaurant business. In this project, we use data from Yelp, the largest dining information website, and from city-data.com, one of the biggest demographic data websites in the US, to build an analytic model and discover the demographic factors that may influence the number of restaurants in a specific region of the US. Our findings could not only reveal the connection between demographic data and the economic scale of the dining industry in the USA, but also help potential restaurant owners choose a good location for their business.

Implementation

Data Selection

The goal of our project is to find the correlation between the number of restaurants and the attributes of a location. We selected the Yelp data set as our source of information on restaurants in the USA. The data set is provided by Yelp, the US-based restaurant review site, and includes six main object types, among them business, review and user. It has 2.2M reviews and 591K tips provided by 552K users for 77K businesses, along with 566K business attributes such as opening hours, ambiance and parking availability. The data was collected mainly from Nevada and Arizona.
The data set is available for download in an already processed format. We chose the business object as the main reference data for our analysis. The data is provided in JSON format, with the detailed template given below.

business
{
    'type': 'business',
    'business_id': (encrypted business id),
    'name': (business name),
    'neighborhoods': [(hood names)],
    'full_address': (localized address),
    'city': (city),
    'state': (state),
    'latitude': latitude,
    'longitude': longitude,
    'stars': (star rating, rounded to half-stars),
    'review_count': review count,
    'categories': [(localized category names)],
    'open': True / False (corresponds to closed, not business hours),
    'hours': {
        (day_of_week): {
            'open': (HH:MM),
            'close': (HH:MM)
        },
        ...
    },
    'attributes': {
        (attribute_name): (attribute_value),
        ...
    },
}

As the template shows, each entry contains the name, type, address, overall rating and other attributes of a business. The ZIP code of the restaurant is available as part of the address field and is later used as the key to join the restaurant data set with the city data set.
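As a rough illustration of how the ZIP code might be pulled out of the address field, the sketch below parses one JSON record and takes the trailing five-digit group of `full_address`. The record, the helper name `extract_zip` and the assumption that the ZIP code always ends the address are ours and are not taken from the project code.

```python
import json
import re

# Assumption: the ZIP code is the last 5-digit group of 'full_address',
# optionally followed by a ZIP+4 suffix.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\s*$")


def extract_zip(full_address):
    """Return the trailing 5-digit ZIP code of an address, or None."""
    match = ZIP_RE.search(full_address.strip())
    return match.group(1) if match else None


# Hypothetical record in the shape of the template above
# (the data set stores one JSON object per line).
line = json.dumps({
    "type": "business",
    "business_id": "hypothetical-id",
    "name": "Example Diner",
    "full_address": "123 Example St\nLas Vegas, NV 89109",
    "categories": ["Restaurants"],
})

record = json.loads(line)
print(extract_zip(record["full_address"]))
```

Records whose address does not end in a ZIP code return `None` and would have to be dropped or fixed by hand before the join.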
We selected www.city-data.com as the source of city-related data. This website allows searching by ZIP code and provides detailed information about each area, such as average household income, percentage of renters and population per sex offender.
The data is displayed on one web page per ZIP code. Since our restaurant data set contains mainly data from Nevada and Arizona, we collected all the current ZIP codes located in these two states. We then wrote a Python program to crawl the website according to the ZIP code list and extract the useful data. We selected 31 attributes for each ZIP code, as listed below.

  • Zip Code
  • Avg. household income
  • Renters number
  • Living cost
  • Population density
  • Males number
  • Females number
  • High school educated %
  • Bachelor educated %
  • Master and up educated %
  • Unemployment %
  • Married %
  • Separated %
  • Widowed %
  • Divorced %
  • Avg. house value
  • Houses number
  • Population per sex offender
  • Lesbian %
  • Gay %
  • Married Household %
  • Unmarried Household %
  • Poverty resident %
  • Median age
  • Household Number
  • Married household number
  • Unmarried household number
  • Number of Married couples with child
  • Number of Single parent households
  • Housing unit without plumbing %
  • Housing unit without kitchen %
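The crawl described above could be sketched roughly as follows. This is not the project’s crawler: the URL pattern, the helper names `fetch_page` and `parse_field`, and the sample markup are all assumptions made for illustration, and the real pages would need their own parsing rules.

```python
import re
import urllib.request

# Assumption: each ZIP code has its own page at this URL pattern.
URL_TEMPLATE = "http://www.city-data.com/zips/{zip_code}.html"


def fetch_page(zip_code, timeout=10):
    """Download the page for one ZIP code and return it as text."""
    url = URL_TEMPLATE.format(zip_code=zip_code)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_field(html, label):
    """Return the first number that follows `label`, or None if absent."""
    text = re.sub(r"<[^>]+>", " ", html)  # strip tags, keep visible text
    match = re.search(re.escape(label) + r"\D*?(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Hypothetical markup standing in for a downloaded page.
sample_html = (
    "<b>Median resident age:</b> 36.5 years<br>"
    "<b>Houses and condos:</b> 12,345"
)

print(parse_field(sample_html, "Median resident age"))
print(parse_field(sample_html, "Houses and condos"))
```

In a real run, `fetch_page` would be called once per ZIP code in the list and the parsed values written out as one CSV row per ZIP code.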

Data Processing

The Yelp data set is provided in JSON format, with one object type per file and one record per line. We imported the Yelp data into an SQLite database for easier data cleaning. From there, we ran several queries to calculate useful statistics on the restaurants in each ZIP code region, for example the average rating, the rating variance and the highest rating.
The city data was crawled from the website and stored in a comma-separated .csv file, which was imported into the same database. We used the ZIP code as the identifier to join the city data with the Yelp data set. The final data was exported to a .csv file for further analysis.
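The aggregation and join described above might look like the following minimal sketch. The table names, column names and rows are hypothetical; only the overall approach (per-ZIP statistics in SQLite, joined to the city data on ZIP code, then exported as CSV) follows the text.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE business (business_id TEXT, zip TEXT, stars REAL);
CREATE TABLE city (zip TEXT PRIMARY KEY, renters INTEGER, median_age REAL);
""")

# Hypothetical rows standing in for the imported Yelp and city-data records.
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", [
    ("b1", "89109", 4.0),
    ("b2", "89109", 3.0),
    ("b3", "85004", 5.0),
])
conn.executemany("INSERT INTO city VALUES (?, ?, ?)", [
    ("89109", 5000, 36.5),
    ("85004", 3000, 33.0),
])

# One output row per ZIP code: restaurant count and rating statistics,
# joined with the demographic attributes of the same ZIP code.
rows = conn.execute("""
SELECT c.zip,
       COUNT(b.business_id) AS num_restaurants,
       AVG(b.stars)         AS avg_stars,
       MAX(b.stars)         AS max_stars,
       c.renters,
       c.median_age
FROM city AS c
JOIN business AS b ON b.zip = c.zip
GROUP BY c.zip
ORDER BY c.zip
""").fetchall()

# Export the joined table as CSV text (written to memory here, not to disk).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["zip", "num_restaurants", "avg_stars", "max_stars",
                 "renters", "median_age"])
writer.writerows(rows)
print(buffer.getvalue())
```

The `num_restaurants` column plays the role of the dependent variable `Number` used in the regression.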

Analytics

We use the number of restaurants in each ZIP code as the dependent variable. The independent variables are:

  • houses: the number of houses and condos
  • renters: the number of renter-occupied apartments
  • cost: the March 2013 cost of living index in the ZIP code
  • density: population density
  • males: male population
  • females: female population
  • sexoffenders: the number of residents per sex offender
  • medianAge: median resident age
  • householdNum
  • inNonFamilyHouseholdNum
  • numMarriedCouplesWithChild
  • numSingleParentHouseholds
  • highschool: the percentage with high school education or higher, among the population aged 25 and over
  • bachelor: the percentage with a bachelor’s degree or higher, among the population aged 25 and over
  • professional: the percentage with a graduate or professional degree, among the population aged 25 and over
  • unemployed: the percentage unemployed, among the population aged 25 and over
  • married: the percentage of married residents
  • separated: the percentage of separated residents
  • widowed: the percentage of widowed residents
  • divorced: the percentage of divorced residents
  • lesbian: the percentage of lesbian couples
  • gay: the percentage of gay men
  • familyHousehold
  • unmarriedHousehold: the percentage of households with unmarried partners
  • povertyResident: residents with income below the poverty level in 2013
  • housingUnitWoPlumbing: the percentage of housing units lacking complete plumbing facilities
  • housingUnitWoKitchen: the percentage of housing units lacking complete kitchen facilities

The adjusted R2 is the R2 (the share of variation in the dependent variable explained by the model) corrected for the number of predictors, so it rises only when an added predictor improves the fit by more than would be expected by chance. It can therefore be used to compare regression models with different sets of predictors. We take the best model to be the one with the largest adjusted R2.
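For reference, the standard formula is adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The short sketch below applies it to the multiple R2 values reported in the regression summaries later in this section. The sample size n = 298 is inferred from the reported residual degrees of freedom (270 = n - 27 - 1), and the function name is ours.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for a model with p predictors fitted on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Full model: multiple R2 = 0.6549 with 27 predictors.
print(round(adjusted_r2(0.6549, 298, 27), 4))

# Final model: multiple R2 = 0.6435 with 13 predictors.
print(round(adjusted_r2(0.6435, 298, 13), 4))
```

The final model has a slightly lower R2 than the full model but a higher adjusted R2, because it uses 14 fewer predictors.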

The candidate predictors of the number of restaurants, collected from city-data.com, are in the following file:

yelp_data_with_sd_final_data.csv

The method of selecting models

1) Fit a multiple linear regression on all the predictors and compute the adjusted R2.
2) Remove each predictor in turn from the model of the previous round and compute the adjusted R2 of each reduced model. The reduced model with the greatest adjusted R2 is selected as the candidate and compared with the model of the previous round.
3) If the candidate’s adjusted R2 is greater than the adjusted R2 of the previous round, repeat step 2 with the candidate model. Otherwise, the previous model is considered the optimal one.
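The three steps above can be sketched in plain Python as follows. This illustrates the procedure on a small synthetic data set and is not the R workflow used in the project. The helper names and the data are hypothetical, and ordinary least squares is solved through the normal equations to keep the example dependency-free.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_adjusted_r2(data, target, predictors):
    """Fit OLS with an intercept and return the adjusted R2 of the fit."""
    y = data[target]
    n, p = len(y), len(predictors)
    design = [[1.0] + [data[name][i] for name in predictors] for i in range(n)]
    k = p + 1
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)


def backward_eliminate(data, target, predictors):
    """Drop one predictor per round while the adjusted R2 keeps increasing."""
    current = list(predictors)
    best = fit_adjusted_r2(data, target, current)          # step 1
    while len(current) > 1:
        candidates = [[q for q in current if q != name] for name in current]
        scored = [(fit_adjusted_r2(data, target, c), c) for c in candidates]
        score, subset = max(scored, key=lambda item: item[0])  # step 2
        if score <= best:                                   # step 3
            break
        current, best = subset, score
    return current, best


# Synthetic data: y depends on x1 only; x2 and x3 are unrelated digits.
x1 = [float(i) for i in range(12)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.2, 0.1]
data = {
    "x1": x1,
    "x2": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0],
    "x3": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0, 2.0, 8.0, 4.0, 5.0],
    "y": [2.0 * a + 1.0 + e for a, e in zip(x1, noise)],
}

kept, score = backward_eliminate(data, "y", ["x1", "x2", "x3"])
print(kept, round(score, 4))
```

Because a round is accepted only when the adjusted R2 strictly increases, the returned model never scores below the full model.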

The process of selecting models

linear_regression.R

1) Clean the data.

# Load the merged data set and keep only the rows without missing values.
setwd("C:/Users/linyanting/Desktop/IS5126/final_project/final data")
yelp <- read.csv("yelp_data_with_sd_final_data.csv", header = TRUE, sep = ",")
g <- complete.cases(yelp)
cleandatasu <- yelp[g, ]

2) Fit a model with all the predictors. Its adjusted R2 is 0.6204.

# Full model: regress the restaurant count on all 27 predictors.
# lm() omits rows with missing values in the model variables by default.
reg1 <- lm(Number ~ houses + renters + cost + density + males + females +
             sexoffenders + medianAge + householdNum + inNonFamilyHouseholdNum +
             numMarriedCouplesWithChild + numSingleParentHouseholds + highschool +
             bachelor + professional + unemployed + married + separated + widowed +
             divorced + lesbian + gay + familyHousehold + unmarriedHousehold +
             povertyResident + housingUnitWoPlumbing + housingUnitWoKitchen,
           data = yelp)
summary(reg1)

Call:
lm(formula = Number ~ houses + renters + cost + density + males +
    females + sexoffenders + medianAge + householdNum + inNonFamilyHouseholdNum +
    numMarriedCouplesWithChild + numSingleParentHouseholds +
    highschool + bachelor + professional + unemployed + married +
    separated + widowed + divorced + lesbian + gay + familyHousehold +
    unmarriedHousehold + povertyResident + housingUnitWoPlumbing +
    housingUnitWoKitchen, data = yelp)

Residuals:
    Min      1Q  Median      3Q     Max
-341.28  -87.72   -9.60   68.73  432.06

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                -2.989e+02  3.640e+02  -0.821 0.412282
houses                     -8.891e-04  7.638e-03  -0.116 0.907416
renters                     7.869e-02  1.192e-02   6.600 2.17e-10 ***
cost                        5.110e+00  3.076e+00   1.661 0.097855 .
density                    -4.681e-03  4.825e-03  -0.970 0.332872
males                       9.023e-03  6.722e-03   1.342 0.180632
females                    -5.887e-03  9.748e-03  -0.604 0.546446
sexoffenders                7.811e-03  2.189e-03   3.568 0.000425 ***
medianAge                   1.779e+00  3.032e+00   0.587 0.557795
householdNum                1.999e-03  6.399e-03   0.312 0.755046
inNonFamilyHouseholdNum    -1.217e-02  1.043e-02  -1.167 0.244356
numMarriedCouplesWithChild -1.227e-03  1.008e-02  -0.122 0.903234
numSingleParentHouseholds  -8.318e-02  1.625e-02  -5.118 5.86e-07 ***
highschool                 -3.032e+02  1.771e+02  -1.712 0.088134 .
bachelor                    4.023e+02  2.033e+02   1.979 0.048780 *
professional               -5.760e+02  3.855e+02  -1.494 0.136258
unemployed                  1.565e+02  2.759e+02   0.567 0.571020
married                    -2.837e+02  2.123e+02  -1.336 0.182550
separated                  -2.012e+03  1.000e+03  -2.011 0.045290 *
widowed                    -3.622e+02  3.886e+02  -0.932 0.352076
divorced                    8.815e+02  3.279e+02   2.689 0.007621 **
lesbian                     1.956e+03  3.017e+03   0.648 0.517296
gay                         5.108e+03  2.418e+03   2.113 0.035527 *
familyHousehold             1.295e+02  1.840e+02   0.704 0.482251
unmarriedHousehold         -1.496e+02  4.665e+02  -0.321 0.748638
povertyResident            -3.289e+01  1.925e+02  -0.171 0.864452
housingUnitWoPlumbing      -5.851e+02  4.377e+02  -1.337 0.182423
housingUnitWoKitchen        7.906e+01  2.421e+02   0.327 0.744279
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 135.8 on 270 degrees of freedom
Multiple R-squared: 0.6549, Adjusted R-squared: 0.6204
F-statistic: 18.97 on 27 and 270 DF, p-value: < 2.2e-16

3) Remove one predictor at a time and compute the adjusted R2 of each reduced model. The following table shows that the largest adjusted R2 is 0.6217, which is reached when houses, numMarriedCouplesWithChild or povertyResident is removed. So we choose the model that removes these predictors.

Predictor removed             Adjusted R2
houses                        0.6217
renters                       0.5607
cost                          0.6179
density                       0.6204
males                         0.6192
females                       0.6212
sexoffenders                  0.6039
medianAge                     0.6213
householdNum                  0.6216
inNonFamilyHouseholdNum       0.6198
numMarriedCouplesWithChild    0.6217
numSingleParentHouseholds     0.5851
highschool                    0.6177
bachelor                      0.6163
professional                  0.6186
unemployed                    0.6213
married                       0.6193
separated                     0.6161
widowed                       0.6205
divorced                      0.6116
lesbian                       0.6212
gay                           0.6155
familyHousehold               0.6211
unmarriedHousehold            0.6216
povertyResident               0.6217
housingUnitWoPlumbing         0.6193
housingUnitWoKitchen          0.6216
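As a quick check of the table, the sketch below looks up which removals give the largest adjusted R2 and which single removal hurts the fit the most. The values are copied from the table; the variable names are ours.

```python
# Adjusted R2 of the model after removing each single predictor.
adjusted_r2_after_removal = {
    "houses": 0.6217, "renters": 0.5607, "cost": 0.6179, "density": 0.6204,
    "males": 0.6192, "females": 0.6212, "sexoffenders": 0.6039,
    "medianAge": 0.6213, "householdNum": 0.6216,
    "inNonFamilyHouseholdNum": 0.6198, "numMarriedCouplesWithChild": 0.6217,
    "numSingleParentHouseholds": 0.5851, "highschool": 0.6177,
    "bachelor": 0.6163, "professional": 0.6186, "unemployed": 0.6213,
    "married": 0.6193, "separated": 0.6161, "widowed": 0.6205,
    "divorced": 0.6116, "lesbian": 0.6212, "gay": 0.6155,
    "familyHousehold": 0.6211, "unmarriedHousehold": 0.6216,
    "povertyResident": 0.6217, "housingUnitWoPlumbing": 0.6193,
    "housingUnitWoKitchen": 0.6216,
}

best = max(adjusted_r2_after_removal.values())
tied = sorted(name for name, value in adjusted_r2_after_removal.items()
              if value == best)
most_costly = min(adjusted_r2_after_removal, key=adjusted_r2_after_removal.get)

print(best, tied)
print(most_costly, adjusted_r2_after_removal[most_costly])
```

Removing renters lowers the adjusted R2 the most, which is consistent with renters having the largest t value in the regression output.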

4) Repeat step 3) with the reduced model until the adjusted R2 no longer increases.

5) Finally, we obtain the best model, which is the one with the largest adjusted R2 (0.6272):

Number = -366.9 - 0.00681*houses + 0.07062*renters + 5.022*cost +
    0.008071*males + 0.008324*sexoffenders -
    0.07256*numSingleParentHouseholds - 255.3*highschool +
    420.6*bachelor - 772.1*professional - 1453*separated +
    918.8*divorced + 5921*gay - 421.9*housingUnitWoPlumbing
# Final model: the 13 predictors that remain after backward elimination.
reg_final <- lm(Number ~ houses + renters + cost + males + sexoffenders +
                  numSingleParentHouseholds + highschool + bachelor + professional +
                  separated + divorced + gay + housingUnitWoPlumbing,
                data = yelp)
summary(reg_final)

Call:
lm(formula = Number ~ houses + renters + cost + males + sexoffenders +
    numSingleParentHouseholds + highschool + bachelor + professional +
    separated + divorced + gay + housingUnitWoPlumbing, data = yelp)

Residuals:
    Min      1Q  Median      3Q     Max
-328.17  -83.37   -7.53   69.00  448.90

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                -3.669e+02  2.532e+02  -1.449 0.148375
houses                     -6.810e-03  4.119e-03  -1.653 0.099383 .
renters                     7.062e-02  5.994e-03  11.782  < 2e-16 ***
cost                        5.022e+00  2.696e+00   1.863 0.063514 .
males                       8.071e-03  3.348e-03   2.411 0.016565 *
sexoffenders                8.324e-03  2.124e-03   3.918 0.000112 ***
numSingleParentHouseholds  -7.256e-02  1.175e-02  -6.174 2.29e-09 ***
highschool                 -2.553e+02  1.371e+02  -1.862 0.063596 .
bachelor                    4.206e+02  1.744e+02   2.412 0.016506 *
professional               -7.721e+02  3.244e+02  -2.380 0.017963 *
separated                  -1.453e+03  8.596e+02  -1.691 0.092013 .
divorced                    9.188e+02  2.650e+02   3.467 0.000607 ***
gay                         5.921e+03  2.196e+03   2.696 0.007437 **
housingUnitWoPlumbing      -4.219e+02  3.660e+02  -1.153 0.249969
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 134.6 on 284 degrees of freedom
Multiple R-squared: 0.6435, Adjusted R-squared: 0.6272
F-statistic: 39.44 on 13 and 284 DF, p-value: < 2.2e-16
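To show how the coefficient estimates in this summary turn into a prediction, the sketch below evaluates the fitted linear equation for a given set of predictor values. The coefficients are copied from the summary; the function name and the input values are hypothetical and do not describe any real ZIP code.

```python
INTERCEPT = -3.669e+02

# Coefficient estimates of the final model, copied from the summary output.
COEFFICIENTS = {
    "houses": -6.810e-03,
    "renters": 7.062e-02,
    "cost": 5.022e+00,
    "males": 8.071e-03,
    "sexoffenders": 8.324e-03,
    "numSingleParentHouseholds": -7.256e-02,
    "highschool": -2.553e+02,
    "bachelor": 4.206e+02,
    "professional": -7.721e+02,
    "separated": -1.453e+03,
    "divorced": 9.188e+02,
    "gay": 5.921e+03,
    "housingUnitWoPlumbing": -4.219e+02,
}


def predict_number(features):
    """Predicted restaurant count: intercept plus the weighted predictors."""
    return INTERCEPT + sum(
        coefficient * features[name]
        for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical input: every predictor at zero except 1000 renters.
example = {name: 0.0 for name in COEFFICIENTS}
example["renters"] = 1000.0
print(predict_number(example))
```

With every other predictor held fixed, each additional 1000 renter-occupied apartments adds about 70.6 restaurants to the prediction. Because the model is linear, predictions for unusual inputs can be negative and should not be read literally.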

Diagnostics

# Draw the standard diagnostic plots for the final model.
plot(reg_final)

Residuals vs Fitted: the residuals are spread fairly evenly around a horizontal line with no distinct pattern, which indicates that there is no unmodelled non-linear relationship.

Normal Q-Q: the residuals lie close to the straight dashed line, so they are approximately normally distributed.

Scale-Location: the residuals are spread randomly along the horizontal line, which indicates that the assumption of equal variance is reasonable.

Residuals vs Leverage: no point lies outside the dashed Cook’s distance lines, so there are no influential cases that would alter the results.

Significant variables

Eight variables (renters, males, sexoffenders, numSingleParentHouseholds, bachelor, professional, divorced, gay) have statistically significant coefficients in the final model: for each of them, |t value| >= 1.96 and the p-value is below 0.05 (the 5% significance level).
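The selection can be reproduced from the p-values of the final model summary, as in this short sketch. The p-values are copied from the summary output (the value reported as "< 2e-16" is entered as its upper bound); the variable names are ours.

```python
# p-values of the final model, in the order of the summary output.
P_VALUES = {
    "houses": 0.099383,
    "renters": 2e-16,  # reported as "< 2e-16"
    "cost": 0.063514,
    "males": 0.016565,
    "sexoffenders": 0.000112,
    "numSingleParentHouseholds": 2.29e-09,
    "highschool": 0.063596,
    "bachelor": 0.016506,
    "professional": 0.017963,
    "separated": 0.092013,
    "divorced": 0.000607,
    "gay": 0.007437,
    "housingUnitWoPlumbing": 0.249969,
}

ALPHA = 0.05  # 5% significance level
significant = [name for name, p in P_VALUES.items() if p < ALPHA]
print(len(significant), significant)
```

The remaining five predictors stay in the model because removing them lowers the adjusted R2, even though they are not individually significant at the 5% level.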

Interpretation

Factor 1 - Renters

In the USA, 51% of renters are under 30. Most of them rent a room in a big city and face high housing costs. The most direct interpretation of this factor is that renters may not have time to cook, or may lack adequate cooking facilities in their rented homes. They are the group of people who create the demand for restaurants.
In addition, cities with a large number of renters are generally big cities, with a high standard of living and a large population, so the number of restaurants there is expected to be large.

Factor 2 - Males Number

One interpretation of this factor is that more males implies a larger population and hence a higher demand for food and dining businesses.

Factor 3 - Number of Single Parents Households

Generally speaking, compared with other types of families, single-parent families, and single-mother households in particular, are more likely to have limited financial resources to cover their living expenses. According to research in the US, 7 in 10 children living with a single mother live in poor financial conditions, a much higher share than in other types of families. Hence single-parent families are more likely to cook at home than to go to a restaurant, in order to save money. This is consistent with the negative coefficient of this variable.

Factor 4 & 5 - Percentage of Bachelor and Graduate/Professional Degree Holder

A bachelor’s degree describes a person’s education level and can reflect the nature of their work and their spending capacity. Well-educated people usually spend more time on work or social activities and desire a higher quality of life, which makes them potential restaurant customers. However, people holding a graduate or professional degree may tend to follow their daily routine and be less willing to invest the time and energy to try new restaurants, which would be consistent with the negative coefficient of this variable.

Factor 6 & 7 - Divorced and Gay Percentage

A higher divorce percentage may indicate a higher percentage of people who like to make changes. They do not adhere to old habits and are less likely to stay at home and cook. Instead, they might dine out more and try new restaurants.
A higher gay percentage may indicate a higher percentage of people who are open to breaking with tradition and accepting new ideas. It also indicates how readily the public in the region accepts new things.
It might therefore be a good idea to open a new restaurant in places where people are more likely to accept change and try something new.

Factor 8 - Population per Sex Offender

A higher population per sex offender means fewer sex offenders relative to the population, which indicates better security and social morality.
Customers prefer to live and spend in a safe place, and investors likewise prefer to do business in a safe area to protect their money. This could explain why the number of restaurants is positively associated with the population per sex offender, that is, negatively associated with the prevalence of sex offenders.

Limitation & Conclusion

  1. Inadequate amount of data
    The restaurant data from Yelp covers only 3 states, which affects the result because it does not fully reflect the restaurant business of the whole USA.
  2. Lack of time-series data
    Both the Yelp data and the city-data data are cross-sectional rather than time-series, which makes causal analysis and other further analysis difficult.
    Moreover, the restaurant data and the data from city-data.com may not match perfectly, as they may have been recorded on different dates.
  3. Limitation of the linear regression model
    The adjusted R2 statistic only penalises the inclusion of unnecessary variables, so a model selected by adjusted R2 can still retain predictors that are not individually significant. For example, housingUnitWoPlumbing has a p-value of 0.25 in the final model.
  4. Other unquantifiable factors
    The data does not include several other unquantifiable factors, such as the style of food or the scale of a restaurant. These factors could make a big difference to the result. For instance, a multi-vendor canteen may be classified as only one restaurant even though it provides a large variety of food options, and one such canteen could be sufficient for a region.
  5. Inter-relationship with other restaurants
    The relationships between restaurants, for example competitive or complementary relationships, are too complicated to be considered in the statistical analysis of this project.

Conclusion

If you are going to open a new restaurant in the USA, this project offers some suggestions for selecting the location of your business. We recommend considering the eight factors identified in this project when choosing the location, instead of simply referring to local GDP per capita or population.

Demo & Visualisation

Data of Las Vegas, NV






Data of Phoenix, AZ






Reference

[1] Kusisto, L., & Hudson, K. (2015). Renters Are Majority in Big U.S. Cities. Retrieved from http://www.wsj.com/articles/renters-are-majority-in-big-u-s-cities-1423432009
[2] NPD. (2013). U.S. Total Restaurant Count Increases by 4,442 Units over Last Year. Retrieved from https://www.npd.com/wps/portal/npd/us/news/press-releases/us-total-restaurant-count-increases-by-4442-units-over-last-year-reports-npd/
[3] Steen, K. V. (2010). Using R for Linear Regression. Retrieved from
[4] Wikipedia. (2016). Gender pay gap in the United States. Retrieved from https://en.wikipedia.org/wiki/Gender_pay_gap_in_the_United_States#cite_note-jec_p80-3

Declaration

This project was carried out in the class IS5126 Hands-on with Business Analytics at the School of Computing, National University of Singapore. The project team comprises 4 members (please refer to the poster attached below). The project uses R and database techniques to conduct a simple analysis.

Appendix: Poster