
I am trying to use data.table, lapply and a function call to run multiple regressions against the same variable. I would like to get a simple table as output showing each variable and the coefficient of determination for each.

I am using RStudio 1.2.1335 and data.table 1.12.2. The data set I am using is "http://users.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Appendix%20C%20Data%20Sets/APPENC02.txt"

library(data.table)
cnames<-c("ID","County","State","Area","Pop","Young","Old","Phys","Beds","Crime","HighSchool","BA","Poverty","Unemploy","PerCapitaIncome","TotalIncome","Region")
df62<-fread("APPENC02.txt", col.names=cnames)
df62[,c("ID", "County","State","Region"):=NULL]
variability<-function(y){
     model<-eval(substitute(lm(Phys~y, data=df62)))
     anova<-anova(model)
     SSR<- anova$`Sum Sq`[1]
     SSE<- anova$`Sum Sq`[2]
     SSTO<-SSR+SSE
     R2<-SSR/SSTO
     return(R2)
}
df62[ , lapply(.SD, variability)]

This works if the last line is:

df62[ , lapply(.SD, variability), by=Phys]

Error message when I omit the 'by' clause: "Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, : object 'i' not found"

If I group by the variable 'Phys', I get correct results, but I have each result needlessly repeated.

Can you address what is the benefit of using eval(substitute())? - Roman Luštrik
So to clarify, you want to do 13 different regressions where Phys is the dependent variable, and all the other numeric variables are independent? - mysteRious
Yes - 13 different regressions with Phys as the dependent variable. - Ed Young
The eval(substitute()) lets me use the name of the variable inside the function. I got the idea from adv-r.had.co.nz/Computing-on-the-language.html - Ed Young
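
As context for the eval(substitute()) discussion, here is a small base-R sketch of what substitute() captures when a function is called through lapply. The helper name show_expr is made up for illustration. It offers a likely explanation for the "object 'i' not found" message, not a confirmed diagnosis.

```r
# Hypothetical helper: returns the unevaluated expression supplied for y, as text
show_expr <- function(y) deparse(substitute(y))

# Called directly, substitute() sees the name the caller typed
direct <- show_expr(Beds)

# Called through lapply, it sees lapply's internal placeholder instead
via_lapply <- lapply(list(Beds = 1:3), show_expr)

print(direct)           # "Beds"
print(via_lapply$Beds)  # "X[[i]]"
```

If that is what happens here, the formula built inside lapply would be Phys ~ X[[i]] rather than Phys ~ Beds, and lm would then have to look up X and i, which are not columns of the data.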

1 Answer


We can build the formula with reformulate. The function takes two arguments, 'data' and 'y', where 'y' is the name of a predictor column passed as a string.

variability<-function(data, y){
     model<- lm(reformulate(y, "Phys"), data=data)
     anova<-anova(model)
     SSR<- anova$`Sum Sq`[1]
     SSE<- anova$`Sum Sq`[2]
     SSTO<-SSR+SSE
     R2<-SSR/SSTO
     return(R2)
}
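
To illustrate what reformulate does in the function above, here is a minimal sketch using the built-in mtcars data as a stand-in (mtcars and the columns mpg and wt are assumptions for the example, not part of the original data). It also checks that SSR/SSTO from the anova table agrees with the R-squared reported by summary() for a single-predictor model.

```r
# Build the formula mpg ~ wt from character strings
f <- reformulate("wt", response = "mpg")
print(f)

# Fit the simple regression on the stand-in data
fit <- lm(f, data = mtcars)

# R-squared from the anova table: SSR / (SSR + SSE)
a <- anova(fit)
r2_anova <- a$`Sum Sq`[1] / sum(a$`Sum Sq`)

# R-squared as reported by summary()
r2_summary <- summary(fit)$r.squared

print(all.equal(r2_anova, r2_summary))
```

Since the two values agree for a one-predictor model, summary(model)$r.squared could replace the manual sums of squares if a shorter function body is preferred.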

Select the column names of interest

nm1 <- setdiff(names(df62), "Phys")

Loop over the column names and apply the function, passing .SD as the data

setnames(df62[, lapply(nm1, variability, data = .SD)], nm1)[]
#          Area       Pop      Young          Old      Beds     Crime   HighSchool         BA     Poverty    Unemploy PerCapitaIncome TotalIncome
#1: 0.006095652 0.8840674 0.01432791 9.788323e-06 0.9033826 0.6731538 1.804622e-05 0.05605789 0.004113459 0.002551878       0.0999411   0.8989137
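
Since the question asks for a table with one row per variable, here is a sketch of a long-format variant in base R. It uses the built-in mtcars data with mpg as the response as a stand-in; the function name r2_for and the object r2_table are made up for illustration.

```r
# Hypothetical helper: R-squared of the simple regression response ~ y
r2_for <- function(data, y, response) {
  fit <- lm(reformulate(y, response), data = data)
  summary(fit)$r.squared
}

# Every column except the response is treated as a predictor
predictors <- setdiff(names(mtcars), "mpg")

# One row per predictor: its name and the R-squared of its regression
r2_table <- data.frame(
  variable = predictors,
  R2 = vapply(predictors, r2_for, numeric(1), data = mtcars, response = "mpg"),
  row.names = NULL
)

# Show the strongest single predictors first
print(r2_table[order(-r2_table$R2), ])
```

For the data in this question, the same pattern should work with data = df62 and response = "Phys".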

data

cnames<-c("ID","County","State","Area","Pop","Young","Old","Phys","Beds","Crime","HighSchool","BA","Poverty","Unemploy","PerCapitaIncome","TotalIncome","Region")

df62 <- fread("http://users.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Appendix%20C%20Data%20Sets/APPENC02.txt", col.names = cnames)
df62[,c("ID", "County","State","Region"):=NULL]