0
votes

I am transitioning certain regression tasks from SAS to R. These are garden variety hedonic price regressions run against time-series cross-section sales datasets. As a typical example, consider a dataset called Sales that includes fields ParcelID, SaleYear, SalePrice plus a suite of property characteristics Bdrms, Baths, etc. (ParcelID,SaleYear) is a key for the table, and suppose it's successfully been read into an R dataframe.

I want to augment Sales with a series of annual dummies, e.g. d2000, d2001, ... d2014 based on the value of SaleYear. In SAS/SQL I do this using a select * statement containing a macro with a for loop that creates and names each dummy using a case statement. This yields a new dataset which includes the desired dummies.

Apparently R can do this more elegantly with factor() and model.matrix() and doubtless many other ways as well. My problem is that at this stage of my R career I'm not able to adapt solutions to similar problems posted on stackoverflow to my particular problem.

Also, our naming conventions require that all dummy variable names be of the form d_*.

Then there's the matter of specifying dummies in a regression call. Proc reg in SAS allows an indexed series of explanatory variables with integer suffixes to be specified in the model statement in an abbreviated form (a numbered range list), e.g. d_2000-d_2002 instead of d_2000 d_2001 d_2002. I believe there is a nice way to do this in R's lm() facility as well. However, I don't want simply to include dummies for all distinct values of SaleYear less a reference category selected by R. Model variations use different spans of years for development and testing, so I want to be able to conveniently specify the range of annual dummies to be included.

Thanks very much in advance. I realize these are rather naïve questions, but I hope to be able to answer such myself with a little more R practice and some suggestions. The interaction variables will be the next challenge.

Thanks again.

2
There is an interaction function that takes factors as arguments. (IRTFM)

2 Answers

2
votes

Here is an example using the economics data set in ggplot2 which creates year dummies:

library(ggplot2)

head(economics) 
str(economics)

# convert date to a year and make that a factor
year <- factor(as.POSIXlt(economics$date)$year + 1900)

lm(unemploy ~ pop + year - 1, economics)

Omit the -1 if you would rather have an intercept and have one year dropped.
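
If the dummies need to exist as actual columns named d_*, as the question requires, one possible approach is to build them with model.matrix() and rename the columns. This is only a sketch: the Sales data frame below is a made-up toy version of the table described in the question, and yr and dummies are illustrative names.

```r
# Hypothetical toy version of the Sales table from the question
Sales <- data.frame(
  ParcelID  = 1:6,
  SaleYear  = c(2000, 2001, 2001, 2002, 2003, 2003),
  SalePrice = c(150, 162, 158, 171, 180, 176),
  Bdrms     = c(3, 3, 2, 4, 3, 4)
)

# One indicator column per distinct SaleYear; "- 1" drops the intercept
# so that every year keeps its own column
yr <- factor(Sales$SaleYear)
dummies <- model.matrix(~ yr - 1)

# Rename the columns from "yr2000", "yr2001", ... to the required d_* form
colnames(dummies) <- paste0("d_", levels(yr))

# Append the dummy columns to the original data frame
Sales <- cbind(Sales, dummies)
```

After this, Sales carries one 0/1 column per sale year (d_2000, d_2001, and so on), which can be referenced by name in later model calls.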

0
votes

Besides the elegant method from G.G, there are other ways to handle ranges. You could build the column names with paste or sprintf, or look them up with grep or match; any of these can be used inside "[" to restrict the columns passed to the data argument. More complete answers would be possible if you offered more concrete examples.

# "%02d" zero-pads integers portably; the 0 flag with "%s" is platform-dependent
paste0("d_20", sprintf("%02d", 0:12))
 [1] "d_2000" "d_2001" "d_2002" "d_2003" "d_2004" "d_2005" "d_2006" "d_2007" "d_2008" "d_2009"
[11] "d_2010" "d_2011" "d_2012"
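
A vector of names like this can stand in for the SAS numbered range list by passing it to reformulate(), which assembles a formula from character strings. The following is a sketch under assumptions, not a definitive recipe: the toy Sales data, the Bdrms regressor, and the chosen year span are all invented for illustration.

```r
# Hypothetical toy data that already carries d_* year dummies
set.seed(1)
Sales <- data.frame(
  SaleYear = rep(2000:2004, each = 4),
  Bdrms    = rep(c(2, 3, 4, 3), times = 5)
)
Sales$SalePrice <- 100 + 10 * Sales$Bdrms +
  5 * (Sales$SaleYear - 2000) + rnorm(nrow(Sales))
for (y in 2000:2004) {
  Sales[[paste0("d_", y)]] <- as.integer(Sales$SaleYear == y)
}

# Choose the span of annual dummies for this model variation
yrs <- 2001:2003
dummy_names <- paste0("d_", yrs)

# reformulate() builds SalePrice ~ Bdrms + d_2001 + d_2002 + d_2003
f <- reformulate(c("Bdrms", dummy_names), response = "SalePrice")

# Fit on 2000-2003 only, so that 2000 acts as the reference year
fit <- lm(f, data = Sales, subset = SaleYear >= 2000 & SaleYear <= 2003)
coef(fit)
```

Changing yrs (and the subset condition) is then enough to switch between development and testing spans without retyping the dummy names.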