
I have a problem with confidence intervals and predictions.

I have a data set (called 'data') consisting of 158 observations of two variables, S and N, though for some observations N is not available. I have been able to plot a regression line and 95% confidence intervals using qplot. So far so good. Now, I have a second, completely different, data set (called 'data2') with 127 observations of N, and I would like to know which S values these correspond to and what the confidence intervals are for those S values. I can't seem to predict these values. Maybe someone could help me out here?

This is what I tried:

data.lm = lm(data$S~data$N)
newdata = data.frame(data2$N)
predict(data.lm, newdata, interval=c("confidence"))

This gives me a warning message

Warning message:
 'data2' had 127 rows but variables found have 158 rows 

and it gives 158 rows of fit, upper and lower values but they obviously don't belong to my data2 N values.

         fit      lwr      upr
1   37.88919 37.66022 38.11816
2   38.38123 38.23795 38.52451
3         NA       NA       NA
4   37.59720 37.26820 37.92621
5   38.09655 37.92488 38.26823
6   37.77301 37.50590 38.04012
...

Same problem when I try specific values such as

data.lm = lm(data$S~data$N)
newdata = data.frame(N=5)
predict(data.lm, newdata, interval=c("confidence"))

it gives me a warning and the exact same output.

I'm probably being stupid here, but I found a lot of similar questions, and the solution always seemed to be exactly what I tried. Why does predict not give me one row of fit, upr and lwr, but instead seems to do something to the data the lm is based on?

Thank you very much in advance

EDIT:

The data I used:

structure(list(S = c(36.7735, 36.7735, 36.7735, 36.7735, 36.7735, 
36.7735, 36.7735, 36.7735, 36.7735, 37.307, 37.307, 37.307, 37.307, 
37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 
37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 
37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 
37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 
37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 37.307, 
37.307, 37.307, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 
38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 
38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 
38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 
38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 
38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 
38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 
38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.35525, 38.766, 
38.766, 38.766, 38.766, 38.766, 38.766, 38.766, 38.766, 38.766, 
39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 
39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 
39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 
39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 
39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 39.639, 
39.639), N = c(7.740086957, 9.716043478, NA, 6.567521739, 8.572826087, 
7.273521739, 8.689478261, NA, 8.112565217, 9.370289089, 8.429912766, 
9.178733143, 8.136725442, 9.127494831, 7.91849608, 8.775866462, 
8.733992185, 8.47272603, 8.700879331, 9.57630994, 9.184129237, 
9.501760687, 10.04023077, 9.887214462, 7.947499285, 8.681177515, 
10.14076961, 8.990465816, 10.35920222, 8.793812067, 8.962143225, 
NA, 10.89773618, 9.646558574, NA, 8.708896587, 8.482467842, 9.490473018, 
9.724324492, 9.185016805, 9.367232547, 9.447726264, 10.49359078, 
9.086775124, 8.951230645, 8.438922723, 7.612619197, 8.961837755, 
NA, 8.473436422, 9.487274967, 8.839257463, 8.019280063, 8.829296324, 
9.089621228, 12.66471665, NA, 7.93418751, 8.442549778, 12.43150655, 
12.78812747, 9.499177641, 8.88329767, 12.06733547, 8.694287059, 
8.733657869, 8.976294071, 11.61797642, NA, 9.223855496, 12.14555242, 
9.177782834, 10.50860256, 8.830982089, 9.338875366, 11.10966871, 
9.009297476, 9.114841643, 9.145197506, 7.508668256, 8.49838577, 
11.70012856, 8.859038138, 9.984367135, 11.18147471, 8.504456058, 
9.30440283, 8.491741245, 9.154016228, 7.969788358, 8.890420803, 
9.391405036, 8.023003384, 12.06142165, 10.0134321, 7.829115845, 
8.619827639, 7.965320738, 9.718533292, 9.642541995, 9.221551363, 
9.638749044, 8.728496275, 7.882667305, 8.059467865, 10.88596514, 
11.52200146, 8.465388516, 10.89040717, 8.652714649, 8.570009902, 
9.575021118, 10.20114206, 8.030898045, 9.325947744, 9.383493864, 
NA, 10.98718012, 13.58808295, 9.987675873, 11.59305101, 8.559274188, 
10.87432015, 9.530456451, NA, 13.39915598, 14.50068995, 11.4377845, 
9.874845508, 8.419345084, 9.833591752, 8.734194935, NA, 8.751516192, 
10.74365351, 10.94957982, 11.43931675, 9.26461008, 10.88196331, 
10.01986719, 8.521178027, 8.346310841, 9.116175981, 12.55888826, 
11.55922318, 11.62731629, 9.974676715, 8.659476016, 9.714302784, 
11.69627731, 9.404085345, 8.417580572, 10.26841052, 8.0505316, 
14.56194307, 8.496000239, 8.36501204, 9.105109509)), .Names = c("S", 
"N"), class = "data.frame", row.names = c(NA, -158L))

And the new data set for which I would like to predict S values:

structure(list(N = c(7.01, 8.02, 9.82, 7.83, 7.49, 8.41, 7.92, 
9.7, 7.097, 8, 8.29, 8.34, 7.71, 7.87, 8.782, 8.17, 7.86, 7.665, 
7.715, 10.6, 8.06, 7.53, 8.75, 8.29, 7.89, 8.94, 9.58, 9.26, 
9.91, 11.6, 9.666, 10.96, 8.809, 9.142, 7.193, 8.616, 9.035, 
9.123, 8.102, 8.137, 8.966, 8.333, 6.678, 8.856, 10.96, 8.401, 
9.729, 8.755, 8.199, 9.004, 7.94, 8.84, 8.55, 8.26, 7.93, 9.03, 
10.3, 10.1, 9.23, 8.41, 7.595, 7.351, 7.251, 8.606, 9.35, 7.786, 
7.445, 9.441, 8.844, 8.411, 9.086, 8.609, 7.975, 7.203, 11.88, 
6.786, 8.36, 11.1, 11.5, 11.57, 8.755, 12.64, 7.07, 10.58, 8.47, 
8.13, 8.45, 9.21, 9.36, 10, 10.4, 12.5, 10.1, 10.2, 9.54, 7.78, 
9.12, 8.41, 8.94, 9.22, 12.3, 9.75, 9.13, 10.4, 8.22, 8.4, 10.2, 
9.95, 11.1, 10.6, 9.84, 10.1, 12.7, 8.2, 8.55, 11.6, 10.5, 8.09, 
9.42, 11.2, 12.3, 7.776, 7.007, 7.306, 7.475, 7.469, 9.593)), .Names = "N", 
class = "data.frame", row.names = c(NA, 
-127L))
Can you post the data you used? As an example, you could use dput(data) and dput(newdata). - RLave

1 Answer


This is due to the way you have specified your model. You are naming the original data.frame inside the formula, so predict will always look up that data rather than the corresponding variable in newdata.

mdl1 <- lm(mtcars$hp~mtcars$disp)
predict(mdl1,data.frame(disp=1:3))
        1         2         3         4         5         6         7         8 
115.74296 115.74296  92.99022 158.62312 203.25349 144.18388 203.25349 109.92351 
        9        10        11        12        13        14        15        16 
107.34195 119.06836 119.06836 166.41155 166.41155 166.41155 252.25938 247.00875 
       17        18        19        20        21        22        23        24 
238.25770  80.16993  78.85727  76.84453  98.28461 184.87627 178.75054 198.87796 
       25        26        27        28        29        30        31        32 
220.75559  80.30119  98.37212  87.34579 199.31551 109.17967 177.43788  98.67840 
Warning message:
'newdata' had 3 rows but variables found have 32 rows 
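
One way to see why newdata is ignored is to look at the term names each model stores. This is only a small sketch using the built-in mtcars data; mdl1 and mdl2 are the two ways of specifying the same regression.

```r
# Same regression, specified two ways.
mdl1 <- lm(mtcars$hp ~ mtcars$disp)    # data.frame named inside the formula
mdl2 <- lm(hp ~ disp, data = mtcars)   # bare variable names plus data argument

# The predictor in mdl1 is literally called "mtcars$disp", so a column
# named "disp" in newdata never matches it and predict() re-evaluates
# mtcars$disp from the workspace instead.
names(coef(mdl1))
names(coef(mdl2))
```

The first call shows a predictor named `mtcars$disp`, the second one named `disp`, which is the name predict can match against a column of newdata.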

What you should do is use the formula to specify only the variable names, and give the original data source to lm via the data argument:

mdl2 <- lm(hp~disp,mtcars)
predict(mdl2,data.frame(disp=1:3))
       1        2        3 
46.17208 46.60964 47.04719
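
Applied to the question, the same fix would look roughly like the sketch below. The small data and data2 frames here are made-up stand-ins (assumptions, not the values posted in the question) so that the snippet runs on its own; with the real data you would skip their construction.

```r
# Hypothetical stand-ins for the asker's 'data' and 'data2'.
set.seed(1)
data <- data.frame(N = runif(20, min = 6, max = 15))
data$S <- 35 + 0.3 * data$N + rnorm(20, sd = 0.5)
data$N[c(3, 8)] <- NA                      # some N values are missing
data2 <- data.frame(N = c(7.01, 8.02, 9.82))

# Bare variable names in the formula; the data.frame goes in 'data'.
# Rows with a missing N are dropped by lm's default NA handling.
data.lm <- lm(S ~ N, data = data)

# newdata must have a column called N, matching the formula.
pred <- predict(data.lm, newdata = data2, interval = "confidence")

# One row of fit, lwr and upr per row of data2.
cbind(data2, pred)
```

Note that interval = "confidence" gives an interval for the mean of S at each N; if you want a range for individual new observations, interval = "prediction" is the other option predict offers for lm fits.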