Overview/background

Talking Points

  • Introduction

  • Restaurant Problems

  • Competition Overview

  • Discussion of data

Data Preprocessing/Preparation

Data Description

Our Kaggle dataset contained eight csv files. The data comes from two separate sites:

  • Hot Pepper Gourmet (HPG)
  • AirREGI/Restaurant Board (AIR)

The eight csv file descriptions are as follows:

  • air_reserve - reservations made in the AIR system.
  • hpg_reserve - reservations made in the HPG system.
  • air_store_info - information about select AIR restaurants, like name, genre, and area.
  • store_id_relation - join select restaurants that have both systems.
  • air_visit_data - historical visit data for AIR restaurants.
  • sample_submission - prediction dates and store id.
  • date_info - basic information about the calendar dates.

The data covers the data from 2016 until April 2017. The test set covers the last week of April and May of 2017. The test set is split based on time and covers a chosen subset of the AIR restaurants.

Data Processing

This is a relational dataset from two systems. We decided it would be best to create one spreadsheet that contains the following:

  • Store ID
  • Genre
  • Area
  • Day of week
  • Visit Date
  • Holiday?
  • Number of visitors
  • Total Reservations

This process was completed using Alteryx. We imported all csv files and performed basic preliminary processes. Since the air_reserve and hpg_reserve contained each individual reservation, we used a summarize function to sum the number of people per day, per restaurant.

Critiques of other models

Critique 1 - XGBoost, Gradient Boost, and K-Neighbor Regression

Score: 0.482

  • Link - https://www.kaggle.com/tunguz/surprise-me-2
  • Approach
    • Created predictions from Gradient boost regressor, XG boost regressor, and K Neighbor Regressor
    • Made 1st solution by taking a 30% of model 1, 30% of model 2, 40% of model 3
    • Created 2nd solution by applying weights from weighted means from each day of week and holiday at each
    • Took 70% and 30% percentage of 1st and 2nd solutions and increased predictions by additional 10%
  • Pros:
    • Has the best model, as measured by the lowest RMLSE. Hard to find fault with best approach
    • Considered differences amount restaurants by finding weighted means for every day of week and holiday at restaurant
  • Critique
    • Regarding reproducibility, the model lacked rational or explanation of certain steps and chosen parameters
    • Unclear reasoning or documentation of how weights of forming the initial two solutions were chosen

Critique 2 – Ridge Regression

Score: 0.497

  • Link - https://www.kaggle.com/yixinsunn/simple-ridge-regression-lb-0-497
  • Approach
    • Cross validated a ridge regression model for each store, testing different alpha values
    • Save best performing model for that specific store
    • Looped through stores and fit testing data of specific store to saved best model
  • Pros
    • Creative approach to solving problem
    • Flexibility of predictions by fitting unique model to each store
    • Ridge maintains all variables, assuming that all contribute at least some predictive power
  • Critique
    • Ridge regression is fitting for high dimensionality datasets (n << p) and datasets with multicollinearity, which does not apply to way this user’s dataset
    • Computationally expensive to cross validate a model for each store

Critique 3 Mid level - Time Series

Score: 0.51474

  • Link - https://www.kaggle.com/boikodenys/visitor-forecasting-using-only-ts-0-51474#Visualizations:-EDA
  • Approach
    • Examined volume trends of customers to detect seasonal and time series trends, such as checking if data is stationary and checking for volume differences during holidays
    • Predicted volumes of customers via forecasting package with ETS model
  • Pros
    • Considered lags and seasonality in data to make predictions
    • Explored feature importance by comparing mean differences of various features
    • Constructed many exploratory data visualizations to reveals ranges and means in groups such as genre and area. This analysis revealed potential outliers and trends
  • Critique
    • Did not use independent variables provided in the problem, so one could argue that the entire solution missed the mark
    • Cannot identify important factors that influence or explain customer volume trends

Our model choices

  • XGBoost
    • Parallel processing ensures quick computation speed
    • Uses regularization methods that reduce overfitting and perform feature selection by assigning weights to features
    • Results of numerous other submissions used XGBoost algorithm and is overall, very popular algorithm for supervised learning
  • Random Forest
    • Includes capability of displaying feature importance, which is highly interpretable and useful measurement
    • Works with mixture of categorical and numerical variables and suited for multivariate problems
    • Like XGBoost, tree based method is appropriate for
    • Can handle data sets with high dimensionality
  • Lasso and Ridge
    • Both run for comparison purposes, to provide insight on importance of all versus a portion of variables
    • Both employ regularization techniques to avoid overfitting by tuning alpha parameter
    • Data as transformed into sparse matrix increases dimensions and is fitting for lasso and ridge

Model Code Overview

Data Import

Read in csv file

restaurant <- read.csv("merged_document.csv", header=T, stringsAsFactors = TRUE)

Remove columns for full date, air store if, and air area name

restaurant<-restaurant %>% select(3,5:11)
restaurant<-na.omit(restaurant)

Here, we are changing some string variables to factors.

restaurant$day_of_week <- as.factor(restaurant$day_of_week)
restaurant$genre <- as.factor(restaurant$genre)                      
restaurant$year <- as.factor(restaurant$year)
restaurant$month <- as.factor(restaurant$month)
restaurant$day <- as.factor(restaurant$day)

Model 1: Random Forest Model

Here, we are splitting the dataset into a train and test set, using a 70/30 split.

set.seed(1)
train <- sample(nrow(restaurant), 0.7*nrow(restaurant), replace = FALSE)
TrainSet <- restaurant[train,]
ValidSet <- restaurant[-train,]

We create a bagging model given the \(\texttt{mtry}\) is equal to 7, which is all dependent variables I am using in this model.

bag.model = randomForest(visitors~., data = TrainSet, mtry=7, ntree=500, importance=TRUE)

Summarizing the variables impact on the model

importance(bag.model)
##                      %IncMSE IncNodePurity
## day_of_week        22.552929     148888.07
## holiday_flg         9.319675      13640.68
## Total.Reservations 56.433575     453072.01
## genre              36.729997     327914.71
## year                8.481747      15881.73
## month              11.200741     186718.66
## day                -4.114874     363884.50
varImpPlot(bag.model)

yhat=predict(bag.model, newdata = restaurant[-train,])
plot(yhat, ValidSet$visitors)
abline(0,1)

Here are some statistics/metrics of the random forest model:

(MSE.bag <- mean((yhat - ValidSet$visitors)^2))
## [1] 252.0385
(RMSE.bag <- sqrt(MSE.bag))
## [1] 15.87572
SLE.bag <- ((log(yhat+1) - log(ValidSet$visitors + 1))^2)
(RMSLE.bag <- sqrt(mean(SLE.bag)))
## [1] 0.6125625

Model 2: Lasso Model

This chunk splits the data into a train and test set.

x = model.matrix(visitors~., restaurant)[, -1]
y = restaurant$visitors
n <- nrow(restaurant)

The train data will make up 70% of our Lasso model.

trainIndex <- sample(n, .70 * n)
train.x <- x[trainIndex, ]
test.x <- x[-trainIndex, ]
train.y <- restaurant$visitors[trainIndex]
test.y <- restaurant$visitors[-trainIndex]

Creating a matrix or grid of lambdas:

grid = 10 ^ seq(from = 10, to=-2, length=100)

Here are the lasso and 12-fold cross-validation models:

mod.lasso <- glmnet(train.x, train.y, alpha=1, lambda=grid)
cv.out.lasso <- cv.glmnet(train.x, train.y, alpha=1, lambda=grid, nfolds=12) 
plot(cv.out.lasso)

Here is the output for the best lambda:

bestlam.lasso <- cv.out.lasso$lambda.min

Use lasso model to predict on the test set

lasso.pred <- predict(mod.lasso, 
                      s=bestlam.lasso, 
                      newx=test.x)

Here are some statistics and results about lasso model:

(MSE.lasso <- mean((lasso.pred - test.y)^2))
## [1] 245.7336
(RMSE.lasso <- sqrt(MSE.lasso))
## [1] 15.67589
SLE.lasso = ((log(lasso.pred+1) - log(ValidSet$visitors + 1))^2)
(RMSLE.lasso <- sqrt(mean(SLE.lasso)))
## [1] 0.7526947

Model 3: Ridge Model

Using the data from above, here are the ridge model:

mod.ridge <- glmnet(train.x, train.y, 
                    alpha=0, # ridge
                    lambda=grid)

Here is the code to construct a 12-fold cross-validation model:

cv.out.ridge <- cv.glmnet(train.x, train.y, 
                          alpha=0, 
                          lambda=grid,
                          nfolds=12)

Cross-validation ridge model plot:

plot(cv.out.ridge)

Choose best lambda

bestlam.ridge <- cv.out.ridge$lambda.min

Make prediction on the test set

ridge.pred <- predict(mod.ridge, 
                      s=bestlam.ridge, 
                      newx=test.x)

Here are some model statistics and results:

(MSE.ridge <- mean((ridge.pred - test.y)^2))
## [1] 246.0006
(RMSE.ridge<- sqrt(MSE.ridge))
## [1] 15.68441
SLE.ridge = ((log(ridge.pred+1) - log(ValidSet$visitors + 1))^2)
(RMSLE.ridge<-sqrt(mean(SLE.ridge)))
## [1] 0.7551233

Model 4: XGBoost

Read in data:

data <- read.csv('merged_document.csv', stringsAsFactors = TRUE, encoding = 'UTF-8')

Remove visit date

data <- data[, -4]

Change variable that are strings to factors.

data$year <- as.factor(data$year)
data$month <- as.factor(data$month)
data$day <- as.factor(data$day)
data$holiday_flg <- as.factor(data$holiday_flg)
head(data)
##           air_store_id                                 air_area_name
## 1 air_db80363d35f10926               Hokkaid? Asahikawa-shi 6 J?d?ri
## 2 air_db80363d35f10926               Hokkaid? Asahikawa-shi 6 J?d?ri
## 3 air_37189c92b6c761ec               Hokkaid? Asahikawa-shi 6 J?d?ri
## 4 air_db80363d35f10926               Hokkaid? Asahikawa-shi 6 J?d?ri
## 5 air_de88770300008624 Niigata-ken Niigata-shi Gakk?ch?d?ri 1 Banch?
## 6 air_638c35eb25e53eea                Fukuoka-ken Fukuoka-shi Daimy?
##   day_of_week holiday_flg visitors Total.Reservations
## 1      Friday           1        8                  9
## 2    Saturday           1       94                 36
## 3      Sunday           1      132                  4
## 4      Sunday           1       69                 23
## 5      Sunday           1       29                  2
## 6     Tuesday           0        3                  2
##                                  genre year month day
## 1                   Dining bar:Seafood 2016     1   1
## 2                   Dining bar:Seafood 2016     1   2
## 3   Bar/Cocktail:International cuisine 2016     1   3
## 4                   Dining bar:Seafood 2016     1   3
## 5               Izakaya:Japanese style 2016     1   3
## 6 Italian/French:Spain Bar/Italian Bar 2016     1   5
str(data)
## 'data.frame':    6567 obs. of  10 variables:
##  $ air_store_id      : Factor w/ 63 levels "air_04cae7c1bc9b2a0b",..: 52 52 12 52 53 25 14 38 52 53 ...
##  $ air_area_name     : Factor w/ 31 levels "?saka-fu ?saka-shi ?gimachi",..: 11 11 11 11 17 4 16 22 11 17 ...
##  $ day_of_week       : Factor w/ 7 levels "Friday","Monday",..: 1 3 4 4 4 6 7 7 7 7 ...
##  $ holiday_flg       : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 1 1 1 ...
##  $ visitors          : int  8 94 132 69 29 3 24 25 4 6 ...
##  $ Total.Reservations: int  9 36 4 23 2 2 8 3 8 2 ...
##  $ genre             : Factor w/ 25 levels "Bar/Cocktail:Amusement bar",..: 8 8 3 8 15 11 15 15 8 15 ...
##  $ year              : Factor w/ 2 levels "2016","2017": 1 1 1 1 1 1 1 1 1 1 ...
##  $ month             : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day               : Factor w/ 31 levels "1","2","3","4",..: 1 2 3 3 3 5 6 6 6 6 ...

Make dummy variables using

dmy <- dummyVars(" ~ .", data = data)
trsf <- data.frame(predict(dmy, newdata = data))

Print dimensions of dataframe with dummies

dim(trsf)
## [1] 6567  175

Make x matrix without y variable

x_matrix <- data.matrix(subset(trsf, select = -c(visitors)))
dim(x_matrix)
## [1] 6567  174

Make y vector XGBoost expects data as X Matrix, label as Y factor

y_vector <- trsf$visitors

Split dataset into train and test

train_index <- sample(1:nrow(x_matrix) , nrow(x_matrix)*0.8)
x_train <- x_matrix[train_index, ]
y_train <- y_vector[train_index]

x_test <-  x_matrix[-train_index, ]
y_test <- y_vector[-train_index]

Test depth values from 3 to 12

trial.depth <- seq(from = 3, to = 12, by = 1)
trial.rounds <- seq(from = 10, to =500, by = 20)

param.count <- length(trial.depth)*length(trial.rounds)

Initialize lists to capture results

test_error <- rep(0,param.count)      # capture testing error
train_error <- rep(0,param.count)     # capture training error
depth.value <- rep(0,param.count)     # capture value of tested depth
rounds.value <- rep(0,param.count)    # capture value of tested rounds

Train model for every combination of nrounds and max depth

start_time <- Sys.time()
model.count = 0
for (i in trial.depth){
  for (j in trial.rounds) {
    training.xgb <- xgboost(data=x_train,        # X features in matrix form
                            label= y_train,      # Y in vector
                            eta = 0.5,           # learning rate
                            nthread = 1,         # number of parallel threads 
                            nround = j ,         # number of rounds of predictions
                            object="reg:linear", # regression
                            n.trees=500,         # number of trees
                            max.depth=i,         # number of splits
                            verbose = 1)         # print training error
    model.count = model.count+1
    
    # Predict y hats using the xgboost model
    yhat.train=predict(training.xgb,
                       newdata=x_train,
                       n.trees=500)
    
    yhat.test=predict(training.xgb,
                      newdata=x_test,
                      n.trees=500)
    # save MSE Error
    train_error[model.count] <- mean((yhat.train- y_train)^2)
    test_error[model.count] <- mean((yhat.test-y_test)^2)
    depth.value[model.count] <- i
    rounds.value[model.count] <- j
  }
}

end_time <- Sys.time()

This looping takes a while… For this RMarkdown file, it took:

print(end_time - start_time)
## Time difference of 28.70111 mins

Here are the results in a data frame:

results<- cbind(depth.value, rounds.value, train_error, test_error)
results <- data.frame(results)
(row.minMSE <- which.min(results$test_error)) # display row with min MSE
## [1] 126
min(results$test_error)                       # display minimum MSE
## [1] 174.8464
results[row.minMSE, ]                         # display best parameters
##     depth.value rounds.value train_error test_error
## 126           8           10    109.8243   174.8464

Creating a model with best parameters

best.xgb <- xgboost(data=x_train, # X features have to be in matrix form
                       label= y_train, # Y is a vector
                       eta = .5, # learning rate
                       nthread = 1, # number of paralell threads 
                       nround = 190, # number of rounds of predictions
                       object="reg:linear", # regression (for classification, specify "binary:logistic")
                       n.trees=500, # number of trees
                       max.depth=3, # number of splits
                       verbose = 1) # print training error

Predict y-hats using best model

best.pred = predict(best.xgb,
                    newdata = x_test,
                    n.trees = 500)

Some of the predictions were negative, which do not make sense…

best.pred[best.pred < 0] <- 0

Finally, here are some model results:

rmsle(actual = y_test, predicted = best.pred)
## [1] 0.5158976
rmse(actual = y_test, predicted = best.pred) 
## [1] 13.47924
mean((best.pred - y_test)^2)
## [1] 181.6898

Reproducibility

The Machine Learning Reproducibility Checklist

For all models and algorithms presented, check if you include:

  • A clear description of the mathematical setting, algorithm, and/or model.
  • A clear explanation of any assumptions.
  • An analysis of the complexity (time, space, sample size) of any algorithm.

For any theoretical claim, check if you include:

  • A clear statement of the claim.
  • A complete proof of the claim.

For all datasets used, check if you include:

  • The relevant statistics, such as number of examples.
  • The details of train / validation / test splits.
  • An explanation of any data that were excluded, and all pre-processing step(s).
  • A link to a downloadable version of the dataset or simulation environment.
  • For new data collected, a complete description of the data collection process, such as instructions to annotators and methods for quality control.

For all shared code related to this work, check if you include:

  • Specification of dependencies.
  • Training code.
  • Evaluation code.
  • (Pre-)trained model(s).
  • README file includes table of results accompanied by precise command to run to produce those results.

For all reported experimental results, check if you include:

  • The range of hyper-parameters considered, method to select the best hyper-parameter configuration, and specification of all hyper-parameters used to generate results.
  • The exact number of training and evaluation runs.
  • A clear definition of specific measure or statistics used to report results.
  • A description of results with central tendency (e.g. mean) & variation (e.g. error bars).
  • The average runtime for each result, or estimated energy cost.
  • A description of the computing infrastructure used.