0
votes

Hopefully this isn't a completely idiotic question. I have a dataset df, n = 2228, p = 19, that describes characteristics of 5 breeds of horses. I would like to model the continuous variable price as a function of the other 17 predictor variables (an even mix of categorical and continuous) for each breed, after first splitting the data into training and test sets.

library(tidyverse)
library(caret)
library(glmnet)
# pre-processing reveals no undue correlation, linear dependencies, or
# near-zero-variance variables
train <- df %>% group_by(breed) %>% sample_frac(size = 2/3) %>% droplevels()
test <- anti_join(df, train) %>% droplevels()
# I imagine I should be somehow able to do this in the following step but can't
# figure it out
model <- train(price ~ ., data = train, method = "glmnet")
test$pred <- predict(model, newdata = test)

As far as I can tell I have no issue splitting the data by breed (see the above code). However, I can't figure out how to fit the model grouped by breed. What I would like to do is analogous to the following from the nlme package: lmList(price ~ . | breed, data = df)

3
Code looks reasonable to me; what's your question? Check out createDataPartition() in caret, which was designed to handle the training/testing split. - Nate
@NathanDay Sorry, clarified question. - user6571411
Gotcha. I don't know how to do training on the fly for multiple groups with caret. You could always use split and lapply to crank through it, but I bet someone on here can offer a cleaner solution. - Nate

3 Answers

1
votes

I think what you want to do is something like

horse_typex <- df %>% filter(breed == typex)

for each type of horse, and then split these up into test and training sets.

If you want to fit a linear regression, consider modelling the log of price instead, since price is likely to be right-skewed.
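A minimal sketch of that idea in base R, assuming a toy data frame in place of df (the columns age and height and the simulated prices are made up for illustration). It splits by breed, drops the breed column (constant within each subset), holds out a third of each breed, and fits lm on the log scale:

```r
set.seed(1)

# Hypothetical toy data standing in for df; only price and breed come from the question
toy <- data.frame(
  breed  = factor(rep(c("A", "B", "C"), each = 40)),
  age    = runif(120, 2, 20),
  height = rnorm(120, 160, 5)
)
toy$price <- exp(8 + 0.05 * toy$height - 0.03 * toy$age + rnorm(120, sd = 0.2))

# One data frame per breed, without the now-constant breed column
by_breed <- lapply(split(toy, toy$breed), function(d) d[setdiff(names(d), "breed")])

fits <- lapply(by_breed, function(d) {
  # 2/3 of the rows for training, the rest for testing
  idx  <- sample(nrow(d), size = floor(2 / 3 * nrow(d)))
  fit  <- lm(log(price) ~ ., data = d[idx, ])
  test <- d[-idx, ]
  # Predictions are on the log scale; exp() maps them back to the price scale
  test$pred <- exp(predict(fit, newdata = test))
  list(model = fit, test = test)
})
```

Each element of fits then holds the fitted model and the held-out rows with predictions for one breed, e.g. fits$A$test.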

0
votes

Try:

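# dlply() comes from plyr; calling it as plyr::dlply avoids masking dplyr functions
# breed is constant within each subset, so it is dropped from the predictors
models <- plyr::dlply(df, "breed", function(d_breed)
  train(price ~ ., data = d_breed[names(d_breed) != "breed"], method = "glmnet"))

0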
votes

I recommend trying to use purrr

library(purrr)

models <- train %>% 
            split(.$breed) %>% 
            # use price rather than .$price, otherwise `.` also pulls price in as a predictor
            map(~ train(price ~ ., data = .x, method = "glmnet"))

or with dplyr

models <- train %>% 
            group_by(breed) %>% 
            do(model = train(price ~ ., data = ., method = "glmnet"))  # named, so the fits are stored in a list-column

I have not tested this on your data, so it may need adjusting, but it is worth a try.
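To get predictions on the test set, each per-breed model has to be paired with the test rows of the same breed. A rough sketch, assuming a toy data frame in place of df (the columns age, height and weight, the helper drop_breed and the simulated prices are all hypothetical) and that the glmnet package is installed:

```r
library(caret)  # train(), createDataPartition(); method = "glmnet" needs glmnet installed
library(purrr)  # map(), map2(), map2_dbl()

set.seed(42)

# Hypothetical toy data standing in for df
toy <- data.frame(
  breed  = factor(rep(c("A", "B"), each = 60)),
  age    = runif(120, 2, 20),
  height = rnorm(120, 160, 5),
  weight = rnorm(120, 500, 40)
)
toy$price <- 1000 + 50 * toy$height - 80 * toy$age + rnorm(120, sd = 100)

# Stratified split: roughly 2/3 of each breed goes to training
idx       <- createDataPartition(toy$breed, p = 2 / 3, list = FALSE)
train_set <- toy[idx, ]
test_set  <- toy[-idx, ]

# breed is constant within each subset, so drop it before fitting
drop_breed <- function(d) d[setdiff(names(d), "breed")]
train_by   <- map(split(train_set, train_set$breed), drop_breed)
test_by    <- split(test_set, test_set$breed)

# One glmnet model per breed, then predictions and RMSE on the matching test rows
models <- map(train_by, ~ train(price ~ ., data = .x, method = "glmnet"))
preds  <- map2(models, test_by, ~ predict(.x, newdata = .y))
rmse   <- map2_dbl(preds, test_by, ~ sqrt(mean((.x - .y$price)^2)))
```

Because split() orders both lists by the levels of breed, models and test_by line up element by element, and rmse ends up as a named vector with one test error per breed.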