caret: performing grouped regression with train()

Question

Hopefully this isn't a completely idiotic question. I have a dataset df (n = 2228, p = 19) that describes characteristics of 5 breeds of horses. I would like to model the continuous variable price as a function of the other 17 predictor variables (an even mix of categorical and continuous) separately for each breed, after first splitting the data into training and test sets.

library(tidyverse)
library(caret)
library(glmnet)
# pre-processing reveals no undue correlation, linear dependencies, or
# near-zero-variance variables
train <- df %>% group_by(breed) %>% sample_frac(size = 2/3) %>% droplevels()
test <- anti_join(df, train) %>% droplevels()
# I imagine I should be somehow able to do this in the following step but can't
# figure it out
model <- train(price ~ ., data = train, method = "glmnet")
test$pred <- predict(model, newdata = test)

As far as I can tell, I have no issue splitting the data by breed (see the code above). However, I can't figure out how to fit the model grouped by breed. What I would like to do is analogous to the following from the nlme package: lmList(price ~ . | breed, data = df)

Code looks reasonable to me, what's your question? Check out createDataPartition() in caret; it was designed to handle the training/testing split. — Nate
Gotcha. I don't know how to do training on the fly for multiple groups with caret. You could always use split and lapply to crank through it, but I bet someone on here can offer a cleaner solution. — Nate

Stephen · Accepted Answer · 2016-09-28T13:30:44

I think what you want to do is something like

horse_typex <- df %>% filter(breed == typex)

for each type of horse, and then split these up into test and training sets.
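
One way to do the filter-then-split step for every breed in one pass is split() plus lapply(), as a rough sketch rather than a definitive recipe. The toy df below is simulated and its columns (age, height, sex) are made up for illustration; only breed and price come from the question. It assumes every factor predictor still has at least two levels within each breed's training set, otherwise the formula interface will fail.

```r
library(caret)   # train(), createDataPartition(), RMSE()
library(glmnet)  # backend for method = "glmnet"

set.seed(1)

# Toy stand-in for the asker's df (hypothetical columns, simulated values)
n <- 300
df <- data.frame(
  breed  = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  age    = runif(n, 2, 20),
  height = rnorm(n, 160, 8),
  sex    = factor(sample(c("mare", "gelding"), n, replace = TRUE))
)
df$price <- 5000 + 200 * df$height - 300 * df$age + rnorm(n, sd = 2000)

# One data frame per breed
by_breed <- split(df, df$breed)

fits <- lapply(by_breed, function(d) {
  d$breed <- NULL  # constant within a group, so drop it from the predictors
  idx <- createDataPartition(d$price, p = 2/3, list = FALSE)
  train_set <- d[idx, ]
  test_set  <- d[-idx, ]
  model <- train(price ~ ., data = train_set, method = "glmnet")
  test_set$pred <- predict(model, newdata = test_set)
  list(model = model, test = test_set)
})

# Out-of-sample RMSE per breed
rmse_by_breed <- sapply(fits, function(f) RMSE(f$test$pred, f$test$price))
```

fits is a named list with one entry per breed, so fits$A$model is the tuned glmnet fit for breed A and fits$A$test holds its held-out rows with predictions attached.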

If you want to fit a linear regression, consider modelling the log of the price instead, since price is likely right-skewed.
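
A minimal sketch of the log-price idea, using plain lm() on simulated data (the horses data frame, its columns, and the coefficients are invented for illustration). Predictions come back on the log scale, so they are exponentiated to return to price units; note that this back-transform is closer to a median than a mean prediction.

```r
set.seed(1)

# Simulated, right-skewed prices (hypothetical columns and coefficients)
n <- 200
horses <- data.frame(
  age    = runif(n, 2, 20),
  height = rnorm(n, 160, 8)
)
horses$price <- exp(7 + 0.01 * horses$height - 0.05 * horses$age +
                      rnorm(n, sd = 0.5))

# Simple random 2/3 : 1/3 split
idx <- sample(n, size = round(2/3 * n))
train_set <- horses[idx, ]
test_set  <- horses[-idx, ]

# Fit on the log scale; '.' expands to every column other than price
fit <- lm(log(price) ~ ., data = train_set)

# Predictions are on the log scale, so exponentiate to get price units
test_set$pred <- exp(predict(fit, newdata = test_set))
```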
