1
votes

I am trying to calculate the correlation coefficient between the residuals of a linear regression and the independent variable p.

Basically, the linear regression estimates the current sales as a function of the current price p and the past price p1. The vector of current prices mydf$p has length 8, but the residuals form a vector of length 7 because one row was dropped due to the NA value of p1.

# lag vector and pad with NAs
# Source: http://heuristically.wordpress.com/2012/10/29/lag-function-for-data-frames/
lagpad <- function(x, k) {
  if (!is.vector(x)) 
    stop('x must be a vector')
  if (!is.numeric(x)) 
    stop('x must be numeric')
  if (!is.numeric(k))
    stop('k must be numeric')
  if (1 != length(k))
    stop('k must be a single number')
  c(rep(NA, k), x)[1 : length(x)] 
}

mydf <- data.frame(p = c(10, 8, 10, 9, 10, 9, 10, 8))
mydf$p1 <- lagpad(mydf$p,1)
mydf$sales <- with(mydf, 200 - 15 * p + 5 * p1) + rnorm(nrow(mydf), 0,0.13)

model <- lm(data = mydf, formula = 'sales ~ p + p1')

print(summary(model))

print(cor(residuals(model), mydf$p))
# Error in cor(residuals(model), mydf$p) : incompatible dimensions

In this particular case, it is easy to use mydf$p[2:8] instead of mydf$p. However, in general, there may be multiple rows at arbitrary locations that are dropped because they contain NAs. How do I access the independent variables that were actually used in the regression after the rows containing NA have been removed?

One of my attempts was based on the R documentation for lm. I tried to access the "x" matrix through model[['x']] but that did not work.

3 Answers

1
votes

You can get the actual data used to fit the model from model$model, and from there the p column:

cor(residuals(model), model$model$p)

Alternatively, is.na(mydf$p1) will tell you which rows in mydf have an NA in column p1:

cor(residuals(model), mydf$p[!is.na(mydf$p1)])

In general, is.na(x) tells us whether elements in x are NA or not:

> is.na(c(1,2,NA,4,NA,6))
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
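
To illustrate the general case with NAs in several rows, here is a minimal sketch. The data frame df and its values are made up for this example (they are not from the question). It shows that model.frame(fit), which is the same data as fit$model, keeps only the rows lm actually used, and that complete.cases() selects the same rows from the original data.

```r
# Hypothetical data: NAs scattered across several rows of p1
set.seed(1)
df <- data.frame(p  = c(10, 8, 10, 9, 10, 9, 10, 8, 9, 10),
                 p1 = c(NA, 10, 8, NA, 9, 10, 9, NA, 8, 9))
df$sales <- 200 - 15 * df$p + 5 * df$p1 + rnorm(nrow(df), 0, 0.13)

fit <- lm(sales ~ p + p1, data = df)

# Rows that survived the default NA handling (na.omit)
used <- model.frame(fit)
nrow(used)               # same length as residuals(fit)
length(residuals(fit))

# Same rows, selected by hand from the original data
keep <- complete.cases(df[, c("sales", "p", "p1")])
all.equal(df$p[keep], used$p)

cor(residuals(fit), used$p)
```

Note that with an intercept in the model, ordinary least squares residuals are uncorrelated with every regressor by construction, so this correlation should come out as zero up to rounding error.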

1
votes

model.matrix(model) seems to be what you are looking for. Then you can select the variables you want with [] and the column number or name.
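
A short sketch of that, rebuilding the data from the question (the set.seed call and the head()-based lag are my own shortcuts, not part of the original code):

```r
# Rebuild the question's data; the lag is written inline instead of using lagpad()
mydf <- data.frame(p = c(10, 8, 10, 9, 10, 9, 10, 8))
mydf$p1 <- c(NA, head(mydf$p, -1))
set.seed(42)
mydf$sales <- with(mydf, 200 - 15 * p + 5 * p1) + rnorm(nrow(mydf), 0, 0.13)

model <- lm(sales ~ p + p1, data = mydf)

# Design matrix of the fitted model: only the rows that were used,
# with one column per term plus the intercept
X <- model.matrix(model)
colnames(X)
dim(X)

# Select the column by name
cor(residuals(model), X[, "p"])
```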

1
votes

The x matrix is only created if you specify x = TRUE in your call to lm. Then model$x will give you the value of x (this is more idiomatic than model[['x']]).
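
For example, something along these lines should work. The data is rebuilt from the question, and the names model_default and model_x are just for this sketch:

```r
# Rebuild the question's data (seed and inline lag are assumptions of this sketch)
mydf <- data.frame(p = c(10, 8, 10, 9, 10, 9, 10, 8))
mydf$p1 <- c(NA, head(mydf$p, -1))
set.seed(42)
mydf$sales <- with(mydf, 200 - 15 * p + 5 * p1) + rnorm(nrow(mydf), 0, 0.13)

# Default fit: the design matrix is not stored in the object
model_default <- lm(sales ~ p + p1, data = mydf)
is.null(model_default[["x"]])

# With x = TRUE the design matrix is kept as component "x"
model_x <- lm(sales ~ p + p1, data = mydf, x = TRUE)
dim(model_x$x)

cor(residuals(model_x), model_x$x[, "p"])
```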

lm handles missing values by just completely omitting an observation where a value is missing. Maybe you want to do something like:

cor(residuals(model), mydf$p[!is.na(mydf$p1)])

?
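
Another option, sketched here as an alternative rather than as what lm does by default, is to fit with na.action = na.exclude. The residuals are then padded with NA at the dropped rows, so they line up with the original data frame and cor() can discard the incomplete pairs itself. The object name model_ex and the seed are assumptions of this sketch.

```r
# Rebuild the question's data (seed and inline lag are assumptions of this sketch)
mydf <- data.frame(p = c(10, 8, 10, 9, 10, 9, 10, 8))
mydf$p1 <- c(NA, head(mydf$p, -1))
set.seed(42)
mydf$sales <- with(mydf, 200 - 15 * p + 5 * p1) + rnorm(nrow(mydf), 0, 0.13)

# na.exclude drops NA rows for fitting but pads residuals back to full length
model_ex <- lm(sales ~ p + p1, data = mydf, na.action = na.exclude)
length(residuals(model_ex))   # same as nrow(mydf)
residuals(model_ex)[1]        # NA, because p1 is missing in row 1

# Let cor() skip the pairs that contain an NA
cor(residuals(model_ex), mydf$p, use = "complete.obs")
```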