2
votes

So I was reading this paper, about how regressing x on y doesn't give you an equivalent line to regressing y~x (long story short, you are minimizing different residuals in each case), and I thought I'd see if if made any difference to my data. Unfortunately I can't get ggplot2's stat_smooth to regress x on y.

here is a sample of 100 rows from my data:

> dput(df)
structure(list(Sample1 = c(4.98960685567417, 3.47204999768371, 
-1.02623042506767, -3.32192809488736, 4.92376518190495, 5.18057364383907, 
3.37344268231229, 7.17221940674516, -1.53797027495265, 4.59078735278587, 
-3.32192809488736, 4.13323387842297, -3.32192809488736, 1.55320685650568, 
3.45928530433997, 6.52941436696241, 0.291721646202104, 1.34692630168789, 
2.588263571484, 7.93984265473926, -1.02623042506767, -3.32192809488736, 
5.97393529545249, 4.96272548511184, -3.32192809488736, 3.19245860152084, 
-0.181177668537811, 1.1702151401323, 7.19741957077731, -3.32192809488736, 
3.09679962523049, 6.89612648849508, 4.49710160528899, 2.90303096540262, 
2.44022334122376, 4.53409217014818, 4.7590708357255, 4.72481633684295, 
0.691615956272056, 3.49724548154186, 3.31795986458613, 5.83755830779617, 
4.36304299927157, -0.0295430166578704, 5.02251917128349, 7.38120720286448, 
2.00357119131309, 3.28216732938001, 4.78231786359764, -0.764299371525642, 
3.92619319211129, 0.0406883788967327, 2.85511116340049, 5.66764040870341, 
1.5770354481397, 6.94329764296121, 3.9308239558015, 5.74724483501107, 
5.88247371232242, 3.60547757169518, -3.32192809488736, -0.542695575918393, 
5.28191680995643, 2.77496647475118, -1.34650507208309, -2.74783703944563, 
6.41591061993452, 3.7539739823111, 2.49127670210767, 6.65016370221803, 
6.32032425410637, -3.32192809488736, 0.402556327534611, 2.34628748804301, 
0.89488462084739, 1.1702151401323, -3.32192809488736, 4.72748036766425, 
5.61681844161155, 2.03831731682591, -0.443475473403791, -2.74783703944563, 
-3.32192809488736, 4.10482914538061, 5.3663866285164, 2.15373368651261, 
-3.32192809488736, 5.45745992658215, -0.764299371525642, 1.69060972560112, 
0.171658206055067, 4.46553470954628, 4.93993527938277, 2.23086351589855, 
0.2329385503416, 6.14110973873979, -0.443475473403791, -3.32192809488736, 
4.16502606126426, 5.45906373012262), Sample2 = c(4.87035389289972, 
3.05631305243025, 0.82323632916493, -2.4296633454796, 4.91811337313532, 
5.03007694504678, 3.33859900165761, 6.54820168349871, -2.4296633454796, 
4.86188407042978, -3.32192809488736, 4.62717002820498, -1.67106299528127, 
-3.32192809488736, 2.98790308889454, 6.05226752694174, 0.596977520519267, 
1.99826710332906, 2.23966464285132, 8.01717854320643, -0.00159952300352469, 
-1.48670630366969, 5.43594909921552, 4.95640818811881, -3.32192809488736, 
4.90168009665292, 0.555561013584398, 2.27836717662796, 6.47511546712932, 
-3.32192809488736, 3.01870866264235, 2.08814992134377, 3.65522335205242, 
3.36281234142216, 1.68917367193269, 3.90649842473784, 4.95242469001903, 
4.78780929337046, 3.21768687367553, 3.85621590296538, 3.41005295130021, 
5.53250156756501, 3.7175760684724, 0.676405696816667, 5.14971539956163, 
6.98398467911331, 3.3567969981478, 3.88157621603906, 4.90168009665292, 
-0.0647853872737311, 4.55018561922281, 1.21774415322831, 2.60706104004096, 
5.87421631057342, 2.22652958465421, 7.02865467720292, 3.39838702576746, 
5.19424993242169, 5.32786325515541, 2.67632233586628, -2.80788789937207, 
0.172877949731557, 4.84264318258294, 2.7144567398541, -0.00159952300352469, 
-1.17646253367489, 6.33756586506636, 3.68433673855881, 2.62719159116047, 
6.51076754571708, 6.31214443849056, -1.88247779017497, 0.714537926272974, 
3.17045554987033, 0.32851270527504, 9.05953630113146, -2.80788789937207, 
4.36480327392371, 5.88366056411048, 1.63054822649657, 2.01364185790907, 
-1.04323142745032, -2.80788789937207, 2.88298417549747, 5.60661313436485, 
1.9024101424183, -0.516096487844633, 5.42017137768702, -0.921270583844273, 
2.36482913745982, 2.46888455306976, 4.55807157102875, 5.01105219152827, 
1.19094552584428, 0.987970258058337, 5.55106017924769, 2.15899824583894, 
-3.32192809488736, 3.92287721859118, 5.06185416274169)), .Names = c("Sample1", 
"Sample2"), class = "data.frame", row.names = c(NA, -100L))

When I try to plot using ggplot2, I can easy put the regression y on x on the plot:

> ggplot(df, aes(x=Sample1,y=Sample2)) + geom_point() + stat_smooth(formula = y~x, method = "lm")

But if I try to use anything else in the formula I get a can't find object error:

> ggplot(df, aes(x=Sample1,y=Sample2)) + geom_point() + stat_smooth(formula = y~x, method = "lm") +  stat_smooth(formula = x~y, method = "lm")
Error in eval(expr, envir, enclos) : object 'y' not found

I tried using the original column names (Sample1 and Sample2) instead of x and y, and giving the stat_smooth its own aes, all gave the same error.

I assume this is a name space error: the formula is being evaluated in the wrong environment somehow, but I don't know why one works but not the other, or how to fix the problem.

Any ideas appreciated.

2
Ian, do you think you might un-accept James' answer? It isn't doing what any of us thought. The second line representing the inverse regression model is plotting in a coordinate system that is not what is shown on the plot. Hence the two regression lines don't cross at / pass through the bivariate mean (well one does, but not the inverse line).Gavin Simpson
Oops, sorry, I just ran it and saw that it gave what looked to be correct.Ian Sudbery

2 Answers

2
votes

You need to think about what stat_smooth(formula = x~y, method = "lm") is doing in the context of the plot you are trying to draw.

The error is most likely arising because when formula is evaluated in the formula parsing code, the variable named on the right hand side of ~ (y in the error case) is not in the set of data defined internally by ggplot as corresponding to the x-axis of the plot. ggplot must isolate the x variable data and only look up variables from that data object.

Now back to thinking about your plot. You drew a plot with x on the x-axis and y on the y-axis. The linear model of x ~ y has the data vectors on the opposite "axes". So from the grammar point of view, the definition you gave doesn't make sense. I think what I am trying to say is that the right hand side of the formula can only contain x or functions of x, as that is all that is available internally.

If you want to do this, add the regression line for x ~ y yourself, something like:

mod <- lm(Sample1 ~ Sample2, data = df)
pdat <- with(df, data.frame(Sample2= seq(min(Sample2), max(Sample2), 
                                         length = 100)))
pdat <- transform(pdat, Sample1 = predict(mod, newdata = pdat))

ggplot(df, aes(x=Sample1,y=Sample2)) + 
  geom_point() + 
  stat_smooth(formula = y~x, method = "lm") +
  geom_line(data = pdat, aes(x = Sample1, y = Sample2))

You could add the confidence interval using geom_ribbon() and compute it by using the interval = "confidence" argument to predict.lm. You'll need to modify the

pdat <- transform(pdat, Sample1 = predict(mod, newdata = pdat))

to accommodate this, but that is left as an exercise for the reader.

2
votes

A work-around is to use another aes() call in the smoother call to switch around x and y:

ggplot(df, aes(x=Sample1,y=Sample2)) + 
    geom_point() + 
    stat_smooth(method="lm") + 
    stat_smooth(aes(x=Sample2,y=Sample1), method="lm")