The dataset can be found here: https://www.kaggle.com/mlg-ulb/creditcardfraud
I am trying to use tidymodels to run ranger with 5-fold cross-validation on this dataset.
I have 2 code blocks. The first code block is the original code with the full data. The second code block is almost identical to the first, except that I have subset a portion of the data so the code runs faster. The second block is just to make sure my code works before I run it on the original dataset.
Here is the first code block with the full data:
#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
#load data
df <- read.csv("~creditcard.csv")
#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)
#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)
#ranger model with 5-fold cross validation
rf_spec <-
  rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
all_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)
cv_folds <- vfold_cv(df_train, v = 5)
cv_folds
rf_results <-
  all_wf %>%
  fit_resamples(resamples = cv_folds)
rf_results %>%
  collect_metrics()
Here is the second code block with 1,000 rows:
#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
#load data
df <- read.csv("~creditcard.csv")
###################################################################################
#Testing area#
df <- df %>% arrange(-Class) %>% head(1000)
###################################################################################
#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)
#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)
#ranger model with 5-fold cross validation
rf_spec <-
  rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
all_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)
cv_folds <- vfold_cv(df_train, v = 5)
cv_folds
rf_results <-
  all_wf %>%
  fit_resamples(resamples = cv_folds)
rf_results %>%
  collect_metrics()
1) With the first code block, I can assign cv_folds and print it in the console. The Global Environment pane says cv_folds has 5 obs. of 2 variables. When I View(cv_folds), I see columns labeled splits and id, but there are no rows and no data. When I run str(cv_folds), the console shows a blank line as if R is still "thinking", but there is no red STOP icon I can push. The only thing I can do is force quit RStudio. Maybe I just need to wait longer? I am not sure. When I do the same thing with the smaller second code block, str() works fine.
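One way to look inside the folds without calling str() on the whole object is to pull out a single split and ask for its analysis and assessment sets. The sketch below uses a small made-up data frame (toy_df and toy_folds are hypothetical stand-ins for df_train and cv_folds) and has not been tried on the full dataset.

```r
library(rsample)

# Hypothetical stand-in for df_train: 100 rows, imbalanced Class factor.
set.seed(123)
toy_df <- data.frame(
  x = rnorm(100),
  Class = factor(rep(c("0", "1"), times = c(90, 10)))
)

toy_folds <- vfold_cv(toy_df, v = 5)

# Printing the rset gives a compact summary: one row per fold.
toy_folds

# Inspect one fold at a time instead of the whole nested structure.
first_split <- toy_folds$splits[[1]]
first_split

analysis_dim   <- dim(analysis(first_split))    # rows used for fitting
assessment_dim <- dim(assessment(first_split))  # rows held out in this fold

# If a structural overview is still wanted, limit how deep str() recurses.
str(toy_folds, max.level = 1)
```

Here analysis() and assessment() return ordinary data frames, so the usual tools (dim(), head(), count()) apply to them.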
2) My overall goal for this project is to split the dataset into training and testing sets, then partition the training data with 5-fold cross-validation and train a ranger model on it. Next, I want to examine the metrics of my model on the training data. Then I want to test my model on the testing set and view the metrics. Eventually, I want to swap out ranger for something like xgboost. Please give me advice on which parts of my code I can add or modify to improve it. I am still missing the step of testing my model on the testing set.
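For the missing test-set step, one possible approach is tune's last_fit(), which fits a workflow on the training portion of an initial_split() object and evaluates it once on the testing portion. The following is a minimal sketch on made-up data (toy_df, toy_split and toy_wf are hypothetical stand-ins for df, df_split and all_wf), not a tested solution for the full dataset.

```r
library(tidymodels)

# Hypothetical stand-in for the credit card data: two numeric predictors
# and an imbalanced two-level factor outcome named Class.
set.seed(123)
toy_df <- tibble(
  V1 = rnorm(500),
  V2 = rnorm(500),
  Class = factor(ifelse(V1 + rnorm(500, sd = 0.5) > 1.2, "1", "0"),
                 levels = c("0", "1"))
)

# Stratify so both pieces keep a similar share of the rare class.
toy_split <- initial_split(toy_df, strata = Class)

rf_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")

toy_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)

# Fit on training(toy_split), evaluate once on testing(toy_split).
final_res <- last_fit(toy_wf, split = toy_split)

# Test-set metrics (the default metric set) and row-level predictions.
test_metrics <- collect_metrics(final_res)
test_preds   <- collect_predictions(final_res)

test_metrics
conf_mat(test_preds, truth = Class, estimate = .pred_class)
```

Because the model is a separate specification inside the workflow, swapping ranger for another engine later should mostly mean replacing rf_spec while the rest of the code stays the same.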
I think the Predictions portion of this article might be what I'm aiming for.
https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/
3) When I use rf_results %>% collect_metrics(), it only shows accuracy and roc_auc. How do I get sensitivity, specificity, precision, and recall?
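One way to get more metrics is to build a yardstick metric_set() and pass it to fit_resamples() through its metrics argument. The sketch below uses made-up data (toy_df, toy_folds and toy_wf are hypothetical). It also assumes the installed version of tune supports the event_level argument of control_resamples(); yardstick treats the first factor level as the event by default, which would be "0" (not fraud) here.

```r
library(tidymodels)

# Hypothetical stand-in for the training data.
set.seed(123)
toy_df <- tibble(
  V1 = rnorm(500),
  V2 = rnorm(500),
  Class = factor(ifelse(V1 + rnorm(500, sd = 0.5) > 1.2, "1", "0"),
                 levels = c("0", "1"))
)

toy_folds <- vfold_cv(toy_df, v = 5, strata = Class)

rf_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")

toy_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)

# Class-based and probability-based metrics can be mixed in one set.
# Note that sensitivity and recall are the same quantity.
fraud_metrics <- metric_set(accuracy, roc_auc, sensitivity,
                            specificity, precision, recall)

toy_res <- fit_resamples(
  toy_wf,
  resamples = toy_folds,
  metrics   = fraud_metrics,
  # Assumption: treat the second level ("1") as the event of interest.
  control   = control_resamples(event_level = "second")
)

cv_metrics <- collect_metrics(toy_res)
cv_metrics
```

The same metric_set object can also be passed to tune_grid() or last_fit(), so the cross-validated and test-set numbers are computed the same way.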
4) How do I plot importance? Would I use something like this?
rf_fit <- get_tree_fit(all_wf)
vip::vip(rf_fit, geom = "point")
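A possible way to plot importance is to fit the workflow first, extract the underlying parsnip fit, and hand that to vip. This sketch uses made-up data (toy_df and toy_wf are hypothetical) and assumes a version of workflows that provides extract_fit_parsnip(); older versions used pull_workflow_fit() for the same purpose.

```r
library(tidymodels)
library(vip)

# Hypothetical stand-in for the training data.
set.seed(123)
toy_df <- tibble(
  V1 = rnorm(500),
  V2 = rnorm(500),
  Class = factor(ifelse(V1 + rnorm(500, sd = 0.5) > 1.2, "1", "0"),
                 levels = c("0", "1"))
)

# importance = "impurity" asks ranger to record variable importance.
rf_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

toy_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)

# A workflow has to be fitted before anything can be extracted from it.
fitted_wf <- fit(toy_wf, data = toy_df)
rf_fit    <- extract_fit_parsnip(fitted_wf)

# Raw importance scores stored by ranger, one per predictor.
importance_scores <- rf_fit$fit$variable.importance
importance_scores

# Dot plot of the same scores.
vip(rf_fit, geom = "point")
```

The key point is that the unfitted workflow (all_wf in the question) carries no importance scores; they only exist after fit(), last_fit() or a similar fitting step.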
5) How can I drastically reduce the amount of time it takes for the model to train? The last time I ran ranger with 5-fold cross-validation using caret on this dataset, it took 8+ hours (6 cores, 4.0 GHz, 16 GB RAM, SSD, GTX 1060). I am open to anything (e.g., restructuring the code, AWS computing, parallelization, etc.).
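One inexpensive thing to try is letting ranger use several threads. As far as I know, parsnip passes num.threads = 1 to ranger unless told otherwise, so each forest is grown on a single core by default. The sketch below sets num.threads explicitly and uses a modest number of trees; the data are made up (toy_df and toy_folds are hypothetical), the thread count and trees = 200 are arbitrary choices, and the actual speed-up on the full dataset is not something this example measures.

```r
library(tidymodels)

# Leave one core free; fall back to 1 if the core count is unknown.
cores     <- parallel::detectCores()
n_threads <- if (is.na(cores)) 1 else max(1, cores - 1)

# Hypothetical stand-in for the training data.
set.seed(123)
toy_df <- tibble(
  V1 = rnorm(500),
  V2 = rnorm(500),
  Class = factor(ifelse(V1 + rnorm(500, sd = 0.5) > 1.2, "1", "0"),
                 levels = c("0", "1"))
)

toy_folds <- vfold_cv(toy_df, v = 5, strata = Class)

# Engine arguments in set_engine() are passed through to ranger::ranger().
rf_fast <- rand_forest(trees = 200) %>%
  set_engine("ranger", num.threads = n_threads) %>%
  set_mode("classification")

fast_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_fast)

# Time the resampling run to compare settings on a subset first.
timing <- system.time(
  fast_res <- fit_resamples(fast_wf, resamples = toy_folds)
)

timing
collect_metrics(fast_res)
```

Timing a few settings on a subset like the 1,000-row test block gives a rough idea of what to expect before committing to a run on the full data.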
Edit: This is another way I have tried to set this up
#ranger model with 5-fold cross validation
rf_recipe <- recipe(Class ~ ., data = df_train)
rf_engine <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
rf_grid <- grid_random(
  mtry() %>% range_set(c(1, 20)),
  trees() %>% range_set(c(500, 1000)),
  min_n() %>% range_set(c(2, 10)),
  size = 30
)
all_wf <-
  workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_engine)
cv_folds <- vfold_cv(df_train, v = 5)
cv_folds
#####
rf_fit <- tune_grid(
  all_wf,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = metric_set(roc_auc),
  control = control_grid(save_pred = TRUE)
)
collect_metrics(rf_fit)
rf_fit_best <- select_best(rf_fit)
(wf_rf_best <- finalize_workflow(all_wf, rf_fit_best))