1
votes

I would like to estimate the length of individuals based on their age and sex, using OLS in R. To that purpose I built the following model: m1 <- lm(length ~ age + sex, data = data.frame). Next, I created some simple residual plots by running:

op <- par(mfrow = c(2,2))
plot(resid(m1) ~ fitted(m1))
plot(resid(m1) ~ data.frame$age)
plot(resid(m1) ~ data.frame$sex)
qqnorm(resid(m1)); qqline(resid(m1))
par(op)

yielding this plot: enter image description here Strange enough, the fitted values do not seem to have the range [165,180], but rather [165,170] ∪ [175,180] (top left plot). I do not understand why this is happening.

Below some sample data producing the plots above:

    structure(list(length = c(173, 170, 172, 160, 162.5, 180, 179.5, 
175, 168, 186.5, 163.5, 170.5, 160, 175.5, 186.5, 176.5, 168, 
180.5, 179, 167, 183, 188.5, 176, 165, 170, 171, 176, 172, 187, 
189, 180, 175.5, 162.5, 187, 164, 177, 170.5, 159.5, 161.5, 167, 
178, 180.5, 168.5, 162, 171, 173, 171.5, 174.5, 177, 158, 175, 
170, 183.5, 166, 174.5, 174, 176, 165, 163.5, 171.5, 161, 173, 
165, 186, 171, 164.5, 182.5, 156.5, 156, 175, 168.5, 195, 164, 
167.5, 168, 165.5, 172.5, 167, 175, 190, 170.5, 166, 155, 179.5, 
175, 185, 174, 158.5, 172.5, 172.5, 173, 177, 161.5, 173.5, 159, 
181, 176, 181.5, 167.5, 170.5), age = c(31.0965092402464, 67.7481177275838, 
60.9062286105407, 54.776180698152, 57.8316221765914, 42.0287474332649, 
47.1786447638604, 51.315537303217, 68.0876112251882, 32.3613963039014, 
52.1259411362081, 50.7652292950034, 53.6947296372348, 64.6242299794661, 
66.9733059548255, 66.8829568788501, 63.668720054757, 73.533196440794, 
57.7659137577002, 43.7262149212868, 51.2416153319644, 30.7953456536619, 
73.0403832991102, 52.2984257357974, 46.2614647501711, 35.7618069815195, 
74.0670773442847, 35.6878850102669, 43.3894592744695, 65.0458590006845, 
55.9671457905544, 71.2306639288159, 58.5653661875428, 40.0520191649555, 
39.9698836413415, 44.0109514031485, 34.4722792607803, 47.5400410677618, 
51.8822724161533, 46.9596167008898, 39.0143737166324, 49.0349075975359, 
39.3812457221081, 48.2518822724162, 37.0376454483231, 30.425735797399, 
31.5838466803559, 74.9459274469541, 46.3353867214237, 56.0602327173169, 
54.4476386036961, 58.4120465434634, 47.64681724846, 39.047227926078, 
45.2183436002738, 48.0246406570842, 41.5140314852841, 61.0732375085558, 
52.2600958247776, 62.9760438056126, 70.715947980835, 70.5735797399042, 
40.2436687200548, 35.0198494182067, 41.1772758384668, 57.2210814510609, 
64.2710472279261, 59.6221765913758, 63.0088980150582, 48.5366187542779, 
30.0369609856263, 48.8898015058179, 49.7741273100616, 54.7624914442163, 
61.284052019165, 37.0102669404517, 58.4695414099932, 55.3483915126626, 
39.4579055441478, 49.3333333333333, 37.9712525667351, 57.388090349076, 
70.8199863107461, 37.0212183436003, 51.3675564681725, 48.3860369609856, 
35.895961670089, 39.5208761122519, 37.4209445585216, 46.8692676249144, 
65.3826146475017, 51.9425051334702, 33.2594113620808, 55.1156741957563, 
33.9493497604381, 33.2895277207392, 42.0369609856263, 29.4976043805613, 
54.9514031485284, 36.2327173169062), sex = structure(c(1L, 2L, 
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 
1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 
2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 
1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 
1L, 2L), .Label = c("0", "1"), class = "factor")), row.names = c(1L, 
2L, 3L, 4L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 17L, 
18L, 19L, 20L, 21L, 22L, 23L, 25L, 26L, 27L, 28L, 29L, 30L, 32L, 
33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 
47L, 48L, 49L, 50L, 51L, 53L, 54L, 56L, 57L, 58L, 59L, 60L, 62L, 
63L, 64L, 65L, 66L, 67L, 68L, 70L, 73L, 74L, 75L, 76L, 77L, 78L, 
79L, 81L, 82L, 85L, 86L, 87L, 88L, 90L, 92L, 94L, 95L, 96L, 97L, 
98L, 99L, 100L, 102L, 103L, 104L, 105L, 106L, 107L, 109L, 110L, 
111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L), class = "data.frame")

Does anyone know the flaw?

1
The reason is probably that you did not include the interaction age:sex in your model.Roland
@Roland Good suggestion, but including age:sex did not solve the problem.Dion
Look at library(ggplot2); ggplot(DF, aes(x = age, y = length, color = factor(sex))) + geom_point() + geom_smooth(method = "lm") There is no "flaw", the fitted lines just predict a perfect separation of length in the given age range.Roland

1 Answers

1
votes

Following @Roland's comment: the plot shows that the within-group variation of the predicted values is much smaller than the observed values.

library(ggplot2); theme_set(theme_bw())
lm1 <- lm(length~age+sex,data=dd)
pp <- expand.grid(age=30:75,sex=factor(0:1))
pp$length <- predict(lm1,newdata=pp)
ggplot(dd,aes(age,length,colour=sex))+
    geom_point()+
    geom_point(data=pp,shape=2)

enter image description here