3
votes

Can anyone suggest a dplyr answer to the following question? Split data.frame by country, and create linear regression model on each subset

For completeness, the question and answer from the link are included below.

Question

For reference, here's Josh's question:

I have a data.frame of data from the World Bank which looks something like this;

  country date BirthRate     US.   
4   Aruba 2011    10.584 25354.8
5   Aruba 2010    10.804 24289.1
6   Aruba 2009    11.060 24639.9
7   Aruba 2008    11.346 27549.3
8   Aruba 2007    11.653 25921.3
9   Aruba 2006    11.977 24015.4

All in all there 70 something sub sets of countries in this data frame that I would like to run a linear regression on. If I use the following I get a nice lm for a single country;

andora = subset(high.sub, country == "Andorra")

andora.lm = lm(BirthRate~US., data = andora)

anova(andora.lm)
summary(andora.lm)

But when I try to use the same type of code in a for loop, I get an error which I'll print below the code;

high.sub = subset(highInc, date > 1999 & date < 2012)
high.sub <- na.omit(high.sub)
highnames <- unique(high.sub$country)

for (i in highnames) {
  linmod <- lm(BirthRate~US., data = high.sub, subset = (country == "[i]"))  
}

#Error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  0 (non-NA) cases

If I can get this loop to run I would ideally like to append the coefficients and even better the r-squared values for each model to an empty data.frame. Any help would be greatly appreciated.

Answer

For reference, here's jlhoward's answer (incorporating BondedDust's comment) making use of the *apply functions found in this excellent question: R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate

models <- sapply(unique(as.character(df$country)),
                 function(cntry)lm(BirthRate~US.,df,subset=(country==cntry)),
                 simplify=FALSE,USE.NAMES=TRUE)

# to summarize all the models
lapply(models,summary)
# to run anova on all the models
lapply(models,anova)

#This produces a named list of models, so you could extract the model for Aruba as:
models[["Aruba"]]
3
Check several examples in ?do and posts on SOHenrik

3 Answers

7
votes

Returning a list from dplyr is not possible yet. If you just need the intercept and slope @jazzurro 's answer is the way, but if you need the whole model you need to do something like

library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))

Then if you want to perform ANOVA on each fitted model, you can do it using rowwise

models %>% rowwise %>% do(anova(.$mod))

but again the result is coerced to a data frame and is not quite the same as doing lapply(models$mod, anova).

For now (ie until the next version of dplyr) if you need to store the whole result in a list, you can just use dlply from plyr, like plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d))), or of course if you do not absolutely have to use dplyr you can go for @SvenHohenstein 's answer which looks like a better way of doing this anyway.

4
votes

Have a look at the lmList function of the nlme package:

library(nlme)
lmList(BirthRate ~ US. | country, df)

Here, | country is used to create a regression for each individual country.

2
votes

Here is one way to use dplyr. I created Brazil using your data. So you have identical results for the two countries. You see intercept and slope in the final data frame.

DATA & CODE

structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Aruba", "Brazil"), class = "factor"), 
date = c(2011L, 2010L, 2009L, 2008L, 2007L, 2006L, 2011L, 
2010L, 2009L, 2008L, 2007L, 2006L), BirthRate = c(10.584, 
10.804, 11.06, 11.346, 11.653, 11.977, 10.584, 10.804, 11.06, 
11.346, 11.653, 11.977), US. = c(25354.8, 24289.1, 24639.9, 
27549.3, 25921.3, 24015.4, 25354.8, 24289.1, 24639.9, 27549.3, 
25921.3, 24015.4)), .Names = c("country", "date", "BirthRate", 
"US."), class = "data.frame", row.names = c("4", "5", "6", "7", 
"8", "9", "10", "11", "12", "13", "14", "15"))

#   country date BirthRate     US.
#4    Aruba 2011    10.584 25354.8
#5    Aruba 2010    10.804 24289.1
#6    Aruba 2009    11.060 24639.9
#7    Aruba 2008    11.346 27549.3
#8    Aruba 2007    11.653 25921.3
#9    Aruba 2006    11.977 24015.4
#10  Brazil 2011    10.584 25354.8
#11  Brazil 2010    10.804 24289.1
#12  Brazil 2009    11.060 24639.9
#13  Brazil 2008    11.346 27549.3
#14  Brazil 2007    11.653 25921.3
#15  Brazil 2006    11.977 24015.4

group_by(mydf, country) %>%
    do({model = lm(BirthRate ~ US., data = .);
       data.frame(int = coef(model)[1], slope = coef(model)[2])})

#  country      int        slope
#1   Aruba 11.02503 8.393295e-06
#2  Brazil 11.02503 8.393295e-06