Introduction
Restaurant Problems
Competition Overview
Discussion of data
Our Kaggle dataset contained eight csv files. The data comes from two separate sites:
The eight csv file descriptions are as follows:
The data covers the data from 2016 until April 2017. The test set covers the last week of April and May of 2017. The test set is split based on time and covers a chosen subset of the AIR restaurants.
This is a relational dataset from two systems. We decided it would be best to create one spreadsheet that contains the following:
This process was completed using Alteryx. We imported all csv files and performed basic preliminary processes. Since the air_reserve and hpg_reserve contained each individual reservation, we used a summarize function to sum the number of people per day, per restaurant.
Score: 0.482
Score: 0.497
Score: 0.51474
Read in csv file
restaurant <- read.csv("merged_document.csv", header=T, stringsAsFactors = TRUE)
Remove columns for full date, air store if, and air area name
restaurant<-restaurant %>% select(3,5:11)
restaurant<-na.omit(restaurant)
Here, we are changing some string variables to factors.
restaurant$day_of_week <- as.factor(restaurant$day_of_week)
restaurant$genre <- as.factor(restaurant$genre)
restaurant$year <- as.factor(restaurant$year)
restaurant$month <- as.factor(restaurant$month)
restaurant$day <- as.factor(restaurant$day)
Here, we are splitting the dataset into a train and test set, using a 70/30 split.
set.seed(1)
train <- sample(nrow(restaurant), 0.7*nrow(restaurant), replace = FALSE)
TrainSet <- restaurant[train,]
ValidSet <- restaurant[-train,]
We create a bagging model given the \(\texttt{mtry}\) is equal to 7, which is all dependent variables I am using in this model.
bag.model = randomForest(visitors~., data = TrainSet, mtry=7, ntree=500, importance=TRUE)
Summarizing the variables impact on the model
importance(bag.model)
## %IncMSE IncNodePurity
## day_of_week 22.552929 148888.07
## holiday_flg 9.319675 13640.68
## Total.Reservations 56.433575 453072.01
## genre 36.729997 327914.71
## year 8.481747 15881.73
## month 11.200741 186718.66
## day -4.114874 363884.50
varImpPlot(bag.model)
yhat=predict(bag.model, newdata = restaurant[-train,])
plot(yhat, ValidSet$visitors)
abline(0,1)
Here are some statistics/metrics of the random forest model:
(MSE.bag <- mean((yhat - ValidSet$visitors)^2))
## [1] 252.0385
(RMSE.bag <- sqrt(MSE.bag))
## [1] 15.87572
SLE.bag <- ((log(yhat+1) - log(ValidSet$visitors + 1))^2)
(RMSLE.bag <- sqrt(mean(SLE.bag)))
## [1] 0.6125625
This chunk splits the data into a train and test set.
x = model.matrix(visitors~., restaurant)[, -1]
y = restaurant$visitors
n <- nrow(restaurant)
The train data will make up 70% of our Lasso model.
trainIndex <- sample(n, .70 * n)
train.x <- x[trainIndex, ]
test.x <- x[-trainIndex, ]
train.y <- restaurant$visitors[trainIndex]
test.y <- restaurant$visitors[-trainIndex]
Creating a matrix or grid of lambdas:
grid = 10 ^ seq(from = 10, to=-2, length=100)
Here are the lasso and 12-fold cross-validation models:
mod.lasso <- glmnet(train.x, train.y, alpha=1, lambda=grid)
cv.out.lasso <- cv.glmnet(train.x, train.y, alpha=1, lambda=grid, nfolds=12)
plot(cv.out.lasso)
Here is the output for the best lambda:
bestlam.lasso <- cv.out.lasso$lambda.min
Use lasso model to predict on the test set
lasso.pred <- predict(mod.lasso,
s=bestlam.lasso,
newx=test.x)
Here are some statistics and results about lasso model:
(MSE.lasso <- mean((lasso.pred - test.y)^2))
## [1] 245.7336
(RMSE.lasso <- sqrt(MSE.lasso))
## [1] 15.67589
SLE.lasso = ((log(lasso.pred+1) - log(ValidSet$visitors + 1))^2)
(RMSLE.lasso <- sqrt(mean(SLE.lasso)))
## [1] 0.7526947
Using the data from above, here are the ridge model:
mod.ridge <- glmnet(train.x, train.y,
alpha=0, # ridge
lambda=grid)
Here is the code to construct a 12-fold cross-validation model:
cv.out.ridge <- cv.glmnet(train.x, train.y,
alpha=0,
lambda=grid,
nfolds=12)
Cross-validation ridge model plot:
plot(cv.out.ridge)
Choose best lambda
bestlam.ridge <- cv.out.ridge$lambda.min
Make prediction on the test set
ridge.pred <- predict(mod.ridge,
s=bestlam.ridge,
newx=test.x)
Here are some model statistics and results:
(MSE.ridge <- mean((ridge.pred - test.y)^2))
## [1] 246.0006
(RMSE.ridge<- sqrt(MSE.ridge))
## [1] 15.68441
SLE.ridge = ((log(ridge.pred+1) - log(ValidSet$visitors + 1))^2)
(RMSLE.ridge<-sqrt(mean(SLE.ridge)))
## [1] 0.7551233
Read in data:
data <- read.csv('merged_document.csv', stringsAsFactors = TRUE, encoding = 'UTF-8')
Remove visit date
data <- data[, -4]
Change variable that are strings to factors.
data$year <- as.factor(data$year)
data$month <- as.factor(data$month)
data$day <- as.factor(data$day)
data$holiday_flg <- as.factor(data$holiday_flg)
head(data)
## air_store_id air_area_name
## 1 air_db80363d35f10926 Hokkaid? Asahikawa-shi 6 J?d?ri
## 2 air_db80363d35f10926 Hokkaid? Asahikawa-shi 6 J?d?ri
## 3 air_37189c92b6c761ec Hokkaid? Asahikawa-shi 6 J?d?ri
## 4 air_db80363d35f10926 Hokkaid? Asahikawa-shi 6 J?d?ri
## 5 air_de88770300008624 Niigata-ken Niigata-shi Gakk?ch?d?ri 1 Banch?
## 6 air_638c35eb25e53eea Fukuoka-ken Fukuoka-shi Daimy?
## day_of_week holiday_flg visitors Total.Reservations
## 1 Friday 1 8 9
## 2 Saturday 1 94 36
## 3 Sunday 1 132 4
## 4 Sunday 1 69 23
## 5 Sunday 1 29 2
## 6 Tuesday 0 3 2
## genre year month day
## 1 Dining bar:Seafood 2016 1 1
## 2 Dining bar:Seafood 2016 1 2
## 3 Bar/Cocktail:International cuisine 2016 1 3
## 4 Dining bar:Seafood 2016 1 3
## 5 Izakaya:Japanese style 2016 1 3
## 6 Italian/French:Spain Bar/Italian Bar 2016 1 5
str(data)
## 'data.frame': 6567 obs. of 10 variables:
## $ air_store_id : Factor w/ 63 levels "air_04cae7c1bc9b2a0b",..: 52 52 12 52 53 25 14 38 52 53 ...
## $ air_area_name : Factor w/ 31 levels "?saka-fu ?saka-shi ?gimachi",..: 11 11 11 11 17 4 16 22 11 17 ...
## $ day_of_week : Factor w/ 7 levels "Friday","Monday",..: 1 3 4 4 4 6 7 7 7 7 ...
## $ holiday_flg : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 1 1 1 ...
## $ visitors : int 8 94 132 69 29 3 24 25 4 6 ...
## $ Total.Reservations: int 9 36 4 23 2 2 8 3 8 2 ...
## $ genre : Factor w/ 25 levels "Bar/Cocktail:Amusement bar",..: 8 8 3 8 15 11 15 15 8 15 ...
## $ year : Factor w/ 2 levels "2016","2017": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 31 levels "1","2","3","4",..: 1 2 3 3 3 5 6 6 6 6 ...
Make dummy variables using
dmy <- dummyVars(" ~ .", data = data)
trsf <- data.frame(predict(dmy, newdata = data))
Print dimensions of dataframe with dummies
dim(trsf)
## [1] 6567 175
Make x matrix without y variable
x_matrix <- data.matrix(subset(trsf, select = -c(visitors)))
dim(x_matrix)
## [1] 6567 174
Make y vector XGBoost expects data as X Matrix, label as Y factor
y_vector <- trsf$visitors
Split dataset into train and test
train_index <- sample(1:nrow(x_matrix) , nrow(x_matrix)*0.8)
x_train <- x_matrix[train_index, ]
y_train <- y_vector[train_index]
x_test <- x_matrix[-train_index, ]
y_test <- y_vector[-train_index]
Test depth values from 3 to 12
trial.depth <- seq(from = 3, to = 12, by = 1)
trial.rounds <- seq(from = 10, to =500, by = 20)
param.count <- length(trial.depth)*length(trial.rounds)
Initialize lists to capture results
test_error <- rep(0,param.count) # capture testing error
train_error <- rep(0,param.count) # capture training error
depth.value <- rep(0,param.count) # capture value of tested depth
rounds.value <- rep(0,param.count) # capture value of tested rounds
Train model for every combination of nrounds and max depth
start_time <- Sys.time()
model.count = 0
for (i in trial.depth){
for (j in trial.rounds) {
training.xgb <- xgboost(data=x_train, # X features in matrix form
label= y_train, # Y in vector
eta = 0.5, # learning rate
nthread = 1, # number of parallel threads
nround = j , # number of rounds of predictions
object="reg:linear", # regression
n.trees=500, # number of trees
max.depth=i, # number of splits
verbose = 1) # print training error
model.count = model.count+1
# Predict y hats using the xgboost model
yhat.train=predict(training.xgb,
newdata=x_train,
n.trees=500)
yhat.test=predict(training.xgb,
newdata=x_test,
n.trees=500)
# save MSE Error
train_error[model.count] <- mean((yhat.train- y_train)^2)
test_error[model.count] <- mean((yhat.test-y_test)^2)
depth.value[model.count] <- i
rounds.value[model.count] <- j
}
}
end_time <- Sys.time()
This looping takes a while… For this RMarkdown file, it took:
print(end_time - start_time)
## Time difference of 28.70111 mins
Here are the results in a data frame:
results<- cbind(depth.value, rounds.value, train_error, test_error)
results <- data.frame(results)
(row.minMSE <- which.min(results$test_error)) # display row with min MSE
## [1] 126
min(results$test_error) # display minimum MSE
## [1] 174.8464
results[row.minMSE, ] # display best parameters
## depth.value rounds.value train_error test_error
## 126 8 10 109.8243 174.8464
Creating a model with best parameters
best.xgb <- xgboost(data=x_train, # X features have to be in matrix form
label= y_train, # Y is a vector
eta = .5, # learning rate
nthread = 1, # number of paralell threads
nround = 190, # number of rounds of predictions
object="reg:linear", # regression (for classification, specify "binary:logistic")
n.trees=500, # number of trees
max.depth=3, # number of splits
verbose = 1) # print training error
Predict y-hats using best model
best.pred = predict(best.xgb,
newdata = x_test,
n.trees = 500)
Some of the predictions were negative, which do not make sense…
best.pred[best.pred < 0] <- 0
Finally, here are some model results:
rmsle(actual = y_test, predicted = best.pred)
## [1] 0.5158976
rmse(actual = y_test, predicted = best.pred)
## [1] 13.47924
mean((best.pred - y_test)^2)
## [1] 181.6898
For all models and algorithms presented, check if you include:
For any theoretical claim, check if you include:
For all datasets used, check if you include:
For all shared code related to this work, check if you include:
For all reported experimental results, check if you include: