0
votes

I first use grep to obtain all variable names that begin with the prefix "h_". I then collapse that array into a single string, separated by plus signs. Is there a way to then use this string in a linear regression?

For example:

holiday_array <- grep("^h_", names(df), value = TRUE)  # "^" anchors the match to the start of the name
holiday_string <- paste(holiday_array, collapse = ' + ')
r_3 <- lm(log(assaults) ~ year + month + holiday_string, data = df)

I get the error: variable lengths differ (found for 'holiday_string')

I can do it like this, for example:

  # holiday_vars is the character vector of "h_" column names (holiday_array above)
  holiday_formula <- as.formula(paste('log(assaults) ~ attend_v + year + month +', paste(holiday_vars, collapse = ' + ')))
  r_3 <- lm(holiday_formula, data = df)

But I don't want to have to type a separate formula construction for each new set of controls. I want to be able to add the "string" inside the lm function. Is this possible?

The above is problematic, because let's say I want to then add another set of control variables to the formula contained in holiday_formula, so something like this:

weather_vars <- grep("w_", names(df), value=TRUE)
weather_formula <- as.formula(paste(holiday_formula, paste("+", weather_vars, collapse='+')))

Not sure how you would do the above.

1
I saw that, but I don't want to have to construct a new "as.formula()" call for every regression. I want to be able to use holiday_string as I tried above, to make it clear how each regression differs from the preceding one. Is there a way to do this? For example, I would like to write something like year + month + as.formula(holiday_string), but I can't. - Parseltongue
Out of the box, I don't think so. holiday_string is a character string, and log(assaults) ~ year + month is a formula. I suppose you could overload the '+' function so that it recognizes a string on one side and a formula on the other and sticks them together with as.formula(paste(...)). - atiretoo

1 Answer

3
votes

I don't know of a simple way to construct the formula argument other than the one you are rejecting. (I considered update.formula and rejected it, since it would also have required as.formula.) Here is an alternative way to achieve the same goal. It uses the "."-expansion feature of R formulas, where "." stands for every column of the data not otherwise named in the formula, and relies on the ability of the [ function to accept a character vector for column selection:

  # "." expands to the remaining columns of the subsetted data frame (here, the holiday variables)
  r_3 <- lm(log(assaults) ~ attend_v + year + month + . ,
            data = df[ , c('assaults', 'attend_v', 'year', 'month', holiday_vars)] )
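
For what it's worth, here is a minimal sketch of how successive sets of controls might be stacked without retyping the formula each time. It is not part of the approach above: it swaps in base R's reformulate() and update(), and it runs on a made-up data frame. The names base_vars, f_holiday, f_weather, r_4 and r_4_dot, the toy columns, and the omission of attend_v are all assumptions made for illustration.

```r
# Toy data: column names mirror the question, the values are invented.
set.seed(1)
month <- rep(1:12, times = 5)
df <- data.frame(
  assaults = rpois(60, 20) + 1,          # strictly positive, so log() is defined
  year     = rep(2001:2005, each = 12),
  month    = month,
  h_xmas   = as.integer(month == 12),    # hypothetical holiday dummies
  h_july4  = as.integer(month == 7),
  w_rain   = runif(60),                  # hypothetical weather controls
  w_temp   = rnorm(60, 15, 8)
)

holiday_vars <- grep("^h_", names(df), value = TRUE)
weather_vars <- grep("^w_", names(df), value = TRUE)
base_vars    <- c("year", "month")

# reformulate() builds a formula from a character vector of term labels.
# The response is passed as a call so log(assaults) stays a function call.
f_holiday <- reformulate(c(base_vars, holiday_vars),
                         response = quote(log(assaults)))
r_3 <- lm(f_holiday, data = df)

# update() adds the next block of controls; "." stands for the existing terms.
f_weather <- update(f_holiday, reformulate(c(".", weather_vars)))
r_4 <- lm(f_weather, data = df)

# The same second model via the "."-expansion approach from the answer.
r_4_dot <- lm(log(assaults) ~ .,
              data = df[, c("assaults", base_vars, holiday_vars, weather_vars)])
```

Each regression then differs from the previous one by a single update() call naming the added block of variables, and the last two fits should agree because they contain the same terms.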