
I have a big dataset that I want to partition based on the values of a particular variable (in my case lifetime), and then run logistic regression on each partition. Following the answer of @tchakravarty in Fitting several regression models with dplyr I wrote the following code:

lifetimemodels = data %>% group_by(lifetime) %>% sample_frac(0.7) %>%
     do(lifeModel = glm(churn ~ ., x = TRUE, family = binomial(link = 'logit'), data = .))

My question now is: how can I use the resulting logistic models to compute the AUC on the rest of the data (the 0.3 fraction that was not sampled), again grouped by lifetime?

Thanks a lot in advance!

Introduce a column training = sample(c(TRUE, FALSE), size = n(), prob = c(0.3, 0.7), replace = TRUE), then withhold the rows where training == TRUE from glm. With that prob vector, roughly 30% of rows are marked TRUE, leaving the remaining ~70% for fitting. – AlexR
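The commenter's suggestion could be sketched like this. This is not the asker's actual data; lifetime, churn, and x1 are hypothetical stand-in columns built just to make the example self-contained:

```r
library(dplyr)

set.seed(1)
# Toy stand-in for the question's data (hypothetical columns)
data <- tibble(lifetime = rep(1:2, each = 100),
               x1 = rnorm(200)) %>%
  mutate(churn = rbinom(200, 1, plogis(x1)))

data_split <- data %>%
  group_by(lifetime) %>%
  # TRUE with probability 0.3: these rows are withheld as the test set
  mutate(training = sample(c(TRUE, FALSE), size = n(),
                           prob = c(0.3, 0.7), replace = TRUE))

models <- data_split %>%
  filter(!training) %>%   # keep only the ~70% used for fitting
  select(-training) %>%   # drop the flag so churn ~ . does not treat it as a predictor
  do(lifeModel = glm(churn ~ ., family = binomial(link = "logit"), data = .))
```

The rows with training == TRUE remain available afterwards (via filter(training)) as the per-group holdout for evaluation.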

1 Answer


You could adapt your dplyr approach to use the tidyr and purrr framework: group/nest the data, then use mutate and the map functions to create list columns that store the pieces of your workflow.

The test/training split you are looking for is part of modelr, a package built to assist modelling within the purrr framework; specifically, the crossv_mc and crossv_kfold functions.
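To see what crossv_mc produces on its own, a minimal sketch (n = 2 random 70/30 splits of mtcars):

```r
library(modelr)

set.seed(1)
cv <- crossv_mc(mtcars, n = 2, test = 0.3)
# cv is a tibble with one row per split: the list columns train and test
# hold resample objects (lightweight pointers into mtcars), and .id labels the split
names(cv)
```

Each train/test element can be passed to a modelling function, or converted with as.data.frame() when a plain data frame is needed.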

A toy example using mtcars (just to illustrate the framework).

library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

analysis <- mtcars %>%
  nest(-cyl) %>%                                        # one row per cyl, rest in list column "data"
  unnest(map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%  # 70/30 train/test split per group
  mutate(model = map(train, ~lm(mpg ~ wt, data = .x))) %>%
  mutate(pred = map2(model, train, predict)) %>%
  mutate(error = map2_dbl(model, test, rmse))

This:

  1. takes mtcars
  2. nests the data into a list column called data, grouped by cyl
  3. splits each data element into training and test sets by mapping crossv_mc over the list column, then uses unnest to expose the train and test list columns
  4. maps the lm model over each train set and stores the fits in model
  5. maps the predict function over model and train and stores the result in pred
  6. maps the rmse function over the model and test sets and stores the result in error.
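To bring this back to the original question (logistic regression and AUC instead of lm and rmse), the same pattern could be sketched as below. The data is a toy stand-in with hypothetical columns lifetime, churn, and x1, and the AUC is computed with a small rank-based (Mann-Whitney) helper so no extra package is needed:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

# Rank-based AUC: probability a random positive scores above a random negative
auc <- function(y, score) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(42)
# Toy stand-in for the question's data (hypothetical columns)
toy <- tibble(lifetime = rep(1:2, each = 150),
              x1 = rnorm(300)) %>%
  mutate(churn = rbinom(300, 1, plogis(x1)))

analysis <- toy %>%
  nest(-lifetime) %>%
  unnest(map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%
  mutate(model = map(train,
                     ~glm(churn ~ ., family = binomial,
                          data = as.data.frame(.x)))) %>%
  mutate(auc = map2_dbl(model, test,
                        ~auc(as.data.frame(.y)$churn,
                             predict(.x, newdata = as.data.frame(.y),
                                     type = "response"))))
```

Each row of analysis then holds, per lifetime group, the fitted glm and its AUC on the withheld 30%.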

There are probably users out there more familiar with this workflow than I am, so please correct or elaborate.