Hopefully this isn't a completely idiotic question. I have a dataset df (n = 2228, p = 19) that describes characteristics of 5 breeds of horses. I would like to model the continuous variable price as a function of the other 17 predictor variables (an even mix of categorical and continuous) for each breed, by first splitting the data into training and test sets.
library(tidyverse)
library(caret)
library(glmnet)
# pre-processing reveals no undue correlation, linear dependencies, or
# near-zero variance variables
train <- df %>% group_by(breed) %>% sample_frac(size = 2/3) %>% ungroup() %>% droplevels()
test <- anti_join(df, train) %>% droplevels()
# I imagine I should somehow be able to do this in the following step, but
# I can't figure it out
model <- train(price ~ ., data = train, method = "glmnet")
test$pred <- predict(model, newdata = test)
As far as I can tell I have no issue splitting the data by breed (see the above code). However, I can't figure out how to fit the model grouped by breed. What I would like to do is analogous to the following from the package nlme, i.e. lmList(price ~ . | breed, data = df)
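In case it helps clarify the goal, here is a minimal base-R sketch of the per-breed fit using split() and lapply(). Since df isn't shared and caret/glmnet may not be installed, iris stands in for df (Species playing the role of breed, Sepal.Length the role of price), and lm() stands in for train(..., method = "glmnet"):

```r
# Fit one model per group: split the data frame by the grouping factor,
# then fit a model to each piece
models <- lapply(split(iris, iris$Species), function(d) {
  d$Species <- NULL                     # drop the now-constant grouping column
  lm(Sepal.Length ~ ., data = d)        # stand-in for the glmnet fit
})

# Predict for new rows of a given group with that group's model
preds <- predict(models[["setosa"]],
                 newdata = subset(iris, Species == "setosa"))
```

The same pattern should carry over directly: split(train, train$breed), fit with caret::train() inside the function, and predict each test subset with the matching list element.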
createDataPartition() in caret was designed to handle the training/testing split – Nate
split and lapply to crank through it, but I bet someone on here can offer a cleaner solution – Nate
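For the split step Nate mentions, a base-R sketch of a stratified (per-group) train/test split, again using iris as a stand-in since df isn't available; with caret installed, createDataPartition(df$breed, p = 2/3, list = FALSE) would give equivalent row indices:

```r
set.seed(42)  # for reproducibility

# Sample 2/3 of the row indices within each group, so every group is
# represented proportionally in the training set
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(i) sample(i, size = floor(2/3 * length(i)))))

train_set <- iris[idx, ]
test_set  <- iris[-idx, ]
```

Indexing by row number also avoids the anti_join() step, which can silently drop duplicated rows.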