22
votes

I have 54 points. They represent offer and demand for products. I would like to show there is a break point in the offer.

First, I sort the x-axis (offer) and remove the values that appears twice. I have 47 values, but I remove the first and last ones (doesn't make sense to consider them as break points). Break is of length 45:

Break<-(sort(unique(offer))[2:46])

Then, for each of these potential break points, I estimate a model and I keep in "d" the residual standard error (sixth element in model summary object).

d<-numeric(45)
for (i in 1:45) {
model<-lm(demand~(offer<Break[i])*offer + (offer>=Break[i])*offer)
d[i]<-summary(model)[[6]] }

Plotting d, I notice that my smaller residual standard error is 34, that corresponds to "Break[34]": 22.4. So I write my model with my final break point:

model<-lm(demand~(offer<22.4)*offer + (offer>=22.4)*offer)

Finally, I'm happy with my new model. It's significantly better than the simple linear one. And I want to draw it:

plot(demand~offer)
i <- order(offer)
lines(offer[i], predict(model,list(offer))[i])

But I have a warning message:

Warning message:
In predict.lm(model, list(offer)) :
  prediction from a rank-deficient fit may be misleading

And more important, the lines are really strange on my plot.

My plot with the supposedly two segments, but not joining

Here are my data:

demand <- c(1155, 362, 357, 111, 703, 494, 410, 63, 616, 468, 973, 235,
            180, 69, 305, 106, 155, 422, 44, 1008, 225, 321, 1001, 531, 143,
            251, 216, 57, 146, 226, 169, 32, 75, 102, 4, 68, 102, 462, 295,
            196, 50, 739, 287, 226, 706, 127, 85, 234, 153, 4, 373, 54, 81,
            18)
offer <- c(39.3, 23.5, 22.4, 6.1, 35.9, 35.5, 23.2, 9.1, 27.5, 28.6, 41.3,
           16.9, 18.2, 9, 28.6, 12.7, 11.8, 27.9, 21.6, 45.9, 11.4, 16.6,
           40.7, 22.4, 17.4, 14.3, 14.6, 6.6, 10.6, 14.3, 3.4, 5.1, 4.1,
           4.1, 1.7, 7.5, 7.8, 22.6, 8.6, 7.7, 7.8, 34.7, 15.6, 18.5, 35,
           16.5, 11.3, 7.7, 14.8, 2, 12.4, 9.2, 11.8, 3.9)
3
54 points does not seem a very large amount of points to detect such a transistion. You could opt to also post this in stats.stackexchange.com with the specfic question if this is enough points to detect the break in the data. Just my 2 ct.Paul Hiemstra
I think this is pretty statistically dubious. Better to estimate the breakpoint in the model itself (although that makes it non linear). You can't trust the p-vals or standard errors from the current informal estimation process.hadley
54 points isn't a large amount, I agree, but both my linear and my piecewise linear regressions are significant. And moreover, the residual standard error of the piecewise linear model is 103.9 compared to 121.3 for the linear model with two less degrees of freedom. The piecewise model is significantly better.Antonin

3 Answers

31
votes

Here is an easier approach using ggplot2.

require(ggplot2)
qplot(offer, demand, group = offer > 22.4, geom = c('point', 'smooth'), 
   method = 'lm', se = F, data = dat)

EDIT. I would also recommend taking a look at this package segmented which supports automatic detection and estimation of segmented regression models.

enter image description here

UPDATE:

Here is an example that makes use of the R package segmented to automatically detect the breaks

library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx - 35, 0) - 1.5*pmax(xx - 70, 0) + 15*pmax(zz - .5, 0) + 
  rnorm(100,0,2)
dati <- data.frame(x = xx, y = yy, z = zz)
out.lm <- lm(y ~ x, data = dati)
o <- segmented(out.lm, seg.Z = ~x, psi = list(x = c(30,60)),
  control = seg.control(display = FALSE)
)
dat2 = data.frame(x = xx, y = broken.line(o)$fit)

library(ggplot2)
ggplot(dati, aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = dat2, color = 'blue')

segmented

7
votes

Vincent has you on the right track. The only thing "weird" about the lines in your resulting plot is that lines draws a line between each successive point, which means that "jump" you see if it simply connecting the two ends of each line.

If you don't want that connector, you have to split the lines call into two separate pieces.

Also, I feel like you can simplify your regression a bit. Here's what I did:

#After reading your data into dat
Break <- 22.4
dat$grp <- dat$offer < Break

#Note the addition of the grp variable makes this a bit easier to read
m <- lm(demand~offer*grp,data = dat)
dat$pred <- predict(m)

plot(dat$offer,dat$demand)
dat <- dat[order(dat$offer),]
with(subset(dat,offer < Break),lines(offer,pred))
with(subset(dat,offer >= Break),lines(offer,pred))

which produces this plot:

enter image description here

4
votes

The weird lines are simply due to the order in which the points are plotted. The following should look better:

i <- order(offer)
lines(offer[i], predict(model,list(offer))[i])

The warning comes from the fact that the * character is interpreted by lm.

> lm(demand~(offer<22.4)*offer + (offer>=22.4)*offer)
Call:
lm(formula = demand ~ (offer < 22.4) * offer + (offer >= 22.4) * offer)
Coefficients:
            (Intercept)         offer < 22.4TRUE                    offer  
                -309.46                   356.08                    29.86  
      offer >= 22.4TRUE   offer < 22.4TRUE:offer  offer:offer >= 22.4TRUE  
                     NA                   -20.79                       NA  

In addition, (offer<22.4)*offer is a discontinuous function: this is where the discontinuity comes from.

The following should be closer to what you want.

model <- lm(
  demand ~ ifelse(offer<22.4,offer-22.4,0) 
           + ifelse(offer>=22.4,offer-22.4,0) )