Kaggle – Restaurant Revenue Prediction (or: Regression with little training data)


Executive Summary

A regression analysis was performed to compare the performance of many different models on a regression problem from Kaggle: Restaurant Revenue Prediction. About 20 models were trained and evaluated. Unfortunately, the amount of training data was so small that there was little performance difference between models under the RMSE metric. The best model found uses a Gradient Boosting Machine and achieved position #57 on the private Kaggle leaderboard, out of 2257 teams. The most important predictor is the restaurant’s opening date, which is not useful for predicting the revenue of future restaurants.

Description from the Kaggle site

Objective

Predict annual restaurant sales based on objective measurements.

With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world’s most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.

Data Fields

  • Id : Restaurant id.
  • Open Date : opening date for a restaurant
  • City : City that the restaurant is in. Note that some names contain Unicode characters.
  • City Group: Type of the city. Big cities, or Other.
  • Type: Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile
  • P1, P2 – P37: There are three categories of these obfuscated data. Demographic data are gathered from third party providers with GIS systems. These include population in any given area, age and gender distribution, development scales. Real estate data mainly relate to the m2 of the location, front facade of the location, car park availability. Commercial data mainly include the existence of points of interest including schools, banks, other QSR operators.
  • Revenue: The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. Please note that the values are transformed so they don’t mean real dollar values.

Evaluation by Root Mean Squared Error (RMSE)
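
The metric is simple to compute directly; a minimal base-R sketch (the rmse helper name is my own, not part of the competition kit):

```r
# Root Mean Squared Error: square root of the mean squared
# difference between predictions and observed values.
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

rmse(c(1, 2, 3), c(1, 2, 3))  # perfect predictions give 0
```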

Data Exploration

Load the data and see how much we have.

# get train data from
# https://www.kaggle.com/c/restaurant-revenue-prediction/download/train.csv.zip
train <- read.csv("../input/train.csv", stringsAsFactors=F)
# get test data from
# https://www.kaggle.com/c/restaurant-revenue-prediction/download/test.csv.zip
test <- read.csv("../input/test.csv", stringsAsFactors=F)
dim(train)
## [1] 137  43
dim(test)
## [1] 100000     42

Here is the main problem: only 137 observations for training. As a result:

  • We will not split the training data in train/validation subsets since the resulting datasets would be too small.
  • We can use all of the training data to train a model, using resampling to validate and adjust the tuning parameters.
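
The resampling strategy from the second bullet can be set up with caret’s trainControl; a sketch, where the fold and repeat counts are illustrative, not necessarily the values used for the final models:

```r
library(caret)

# Repeated k-fold cross-validation: with only 137 rows, every
# observation gets used for both fitting and validation across folds.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# A model is then trained with this control object, e.g.:
# fit <- train(revenue ~ ., data = train, method = "gbm",
#              trControl = ctrl, metric = "RMSE")
```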
sum(!complete.cases(train))
## [1] 0

There are apparently no missing cases.

Let’s have a look at the numerical data as histograms.

library(ggplot2)
library(reshape)
d <- melt(train[,-c(1:5)])
ggplot(d,aes(x = value)) + 
  facet_wrap(~variable,scales = "free_x") + 
  geom_histogram()

We can see several variables with a lot of zeros (P14, P15, …). Other than these zeros, there doesn’t seem to be a lot of skew in the predictors. There is some skew in revenue. Let’s investigate the zeros more:

# Display missing-data patterns.
# don't look at Id or revenue
library(mice)
temp = train[,c(-1,-ncol(train))]
temp[temp==0] = NA
md.pattern(temp)
##    P1 P2 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P19 P20 P21 P22 P23 P28 P3 P29
## 48  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1  1   1
##  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1  1   1
## 85  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1  1   1
##  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1  0   1
##  2  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1  1   0
##     0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0  1   2
##    P14 P15 P16 P17 P18 P24 P25 P26 P30 P31 P32 P33 P34 P35 P36 P37 P27
## 48   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
##  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0
## 85   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##  1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##  2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##     88  88  88  88  88  88  88  88  88  88  88  88  88  88  88  88  89
##    Open.Date City City.Group Type     
## 48         0    0          0    0    4
##  1         0    0          0    0    5
## 85         0    0          0    0   21
##  1         0    0          0    0   22
##  2         0    0          0    0   22
##          137  137        137  137 2048

Looking at row counts:

  • 48 complete observations, 1 missing P27
  • 85 observations missing P14 P15 P16 P17 P18 P24 P25 P26 P30 P31 P32 P33 P34 P35 P36 P37 P27
  • 1 observation missing P3
  • 2 observations missing P29 P14 P15 P16 P17 P18 P24 P25 P26 P30 P31 P32 P33 P34 P35 P36 P37 P27

Column counts indicate that many predictors have 88 zeros. They appear to form a “cluster” of NAs.

We can also look at the distribution of the outcome (revenue):

library(lattice)
histogram(train$revenue)

Conclusions from initial data exploration:

  • No real need to transform predictors;
  • There is a bit of skew in revenue;
  • All these zeros look like missing data and will be converted to NAs prior to imputation;
  • The Open.Date field may be converted into new features (days since opening date, year, month, day).
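
The last bullet can be sketched in base R. The reference date and the new column names here are my own choices:

```r
# Open.Date in the raw CSV is formatted like "07/17/1999".
add_date_features <- function(df, ref = as.Date("2015-01-01")) {
  od <- as.Date(df$Open.Date, format = "%m/%d/%Y")
  df$DaysOpen  <- as.numeric(ref - od)          # days open at the reference date
  df$OpenYear  <- as.numeric(format(od, "%Y"))
  df$OpenMonth <- as.numeric(format(od, "%m"))
  df$OpenDay   <- as.numeric(format(od, "%d"))
  df
}

# train <- add_date_features(train)
```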

There is an interesting visual representation of the data here as a geomap of average revenue. It doesn’t help for modeling but is a nice visual touch.

library(methods)
library(googleVis)
library(plyr)
x <- train[, c(3, 43)]
y <- aggregate(x$revenue, list(City=x$City), mean)
colnames(y)[2] <- "Average Revenue"
z <- count(x, 'City')
colnames(z)[2] <- "Restaurants"
x <- merge(y, z, by = "City")
x$Restaurants <- paste0("Restaurants: ", x$Restaurants)
TFI <- x
GT <- gvisGeoMap(TFI, locationvar = 'City', numvar = 'Average Revenue',
                 hovervar = "Restaurants",
                 options = list(region = 'TR', height = 350,
                                dataMode = 'regions'))
print(GT, file = "../output/geomap.html")

Missing Data, Data Transformation and Feature Creation

Imputation for missing data

Impute using Multivariate Imputation by Chained Equations (mice)

  • For imputation methods, see methods(mice)
  • The rf method is used here, with small values for m and maxit for speed; simpler and faster methods (sample, for example) are an option when there is a lot of data
library(mice)
library(caret)
train[train==0] = NA
tempData <- mice(train, m=1, maxit=2, meth='rf', 
                 seed=501, printFlag=FALSE)
train <- complete(tempData, 1)

The data after imputation (imputed in magenta, observed in blue):

densityplot(tempData)
d <- melt(train[,-c(1:5)])
ggplot(d, aes(x = value)) + 
  facet_wrap(~variable,scales = "free_x") + 
  geom_histogram()

Yep! The values at zero are gone. The predictors still show little skew, so we’ll only transform revenue.
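
A simple option for the target is a log transform: fit models on the log scale and back-transform predictions before scoring. A sketch with hypothetical helper names:

```r
# Fit on log(revenue) to reduce the right skew, then back-transform
# predictions with exp() before computing RMSE on the original scale.
to_log   <- function(y) log(y)
from_log <- function(z) exp(z)

# train$revenue <- to_log(train$revenue)
# ... fit on the log scale ...
# predictions <- from_log(pred_log)
```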

Predictor removal: near-zero variance

  • We don’t have so many predictors that we need to remove any for dimensionality reasons
  • However, we need to check NZV and correlation between predictors
# near zero variance
library(caret)
print (nearZeroVar(train, saveMetrics = FALSE))
## integer(0)

Result: nothing to remove due to near-zero variance.

Predictor removal: correlation

Let’s have a visual look at correlations between numerical variables.

library(corrplot)
nums <- sapply(train, is.numeric)
correlations = cor(train[,nums])
corrplot(correlations, order = "hclust")

This plot shows that there is some correlation between numerical predictors:

  • We will investigate the removal of correlated predictors during modeling
  • There is unfortunately very little correlation between predictors and revenue, so that we don’t expect well-performing models.
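
For the first bullet, caret provides findCorrelation, which flags predictors whose removal reduces high pairwise correlation; a sketch, where the 0.9 cutoff is my own illustrative choice:

```r
library(caret)

# Correlations among numeric columns only (in the real data,
# Id and revenue should be excluded before filtering).
nums <- sapply(train, is.numeric)
cors <- cor(train[, nums])

# Column indices recommended for removal above the cutoff.
high <- findCorrelation(cors, cutoff = 0.9)
# filtered <- train[, nums][, -high]
```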

Let’s look at some scatterplots to see relationships.

featurePlot(train[,c('P1','P2','P3','P4','P5','P6','P7','P8','P9','P10','P11','P12')],
            train$revenue,
            between = list(x = 1, y = 1),
            type = c("g", "p", "smooth"),
            labels = rep("", 2))

featurePlot(train[,c('P13','P14','P15','P16','P17','P18','P19', 'P20','P21','P22','P23','P24')],
            train$revenue,
            between = list(x = 1, y = 1),
            type = c("g", "p", "smooth"),
            labels = rep("", 2))
featurePlot(train[,c('P25','P26','P27','P28','P29','P30', 'P31','P32','P33','P34','P35','P36', 'P37')],
            train$revenue,
            between = list(x = 1, y = 1),
            type = c("g", "p", "smooth"),
            labels = rep("", 2))
  • There doesn’t seem to be much change in revenue with respect to these predictors, so we can expect high RMSE and low R².
  • There are non-linearities so that parametric regression (if used) will have to include polynomial terms.
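
In R, polynomial terms are easy to add with poly(). A self-contained sketch on synthetic data (the real call would use train and the P columns; degree 2 is illustrative):

```r
set.seed(1)
# Synthetic stand-in: the outcome depends quadratically on one predictor.
d <- data.frame(P1 = rnorm(100))
d$revenue <- 2 + d$P1^2 + rnorm(100, sd = 0.1)

lin  <- lm(revenue ~ P1, data = d)           # misses the curvature
quad <- lm(revenue ~ poly(P1, 2), data = d)  # orthogonal polynomial terms

summary(lin)$r.squared   # near 0
summary(quad)$r.squared  # near 1
```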

Outliers

  • Won’t consider this for now.
