2
votes

I'm using the iris dataset in R. I filtered the dataset so iris$Species == setosa or versicolor. I then created a scatter plot where the x axis is Sepal.Length and the y axis is Sepal.Width. The points were highlighted based on Species, and 2 different linear regression lines were added to the scatter plot based on species.

Here are my questions:

  1. Is it possible to get the slope equations/ slope values for the 2 lines (setosa or versicolor) from the scatter plot? If so, how?
  2. Is it possible to use a statistical test to see if the slope equations/ slope values for the 2 lines (setosa or versicolor) were significantly different from one another?

Please let me know if/when you can.

Thanks ahead of time.

-P.S.

Here is the figure:

Scatter plot of Sepal.Length X Sepal.Width with dots as Species where Species is setosa or versicolor

Here is the R code to generate the plot:


# creates data for scatter plot

## dataset of interest
iris

## for iris
colnames (iris)

### creates dataset with just cases where iris$Species == setosa or versicolor

#### unique values for iris$Species
unique(iris$Species)

#### loads tidyverse package
library(tidyverse)

##### filters dataset with just cases where iris$Species == setosa or versicolor
iris__setosa_or_versicolor <- iris %>% filter(iris$Species != "virginica")

##### turns iris__setosa_or_versicolor to dataframe
iris__setosa_or_versicolor <- data.frame(iris__setosa_or_versicolor)

##### unique values for iris__setosa_or_versicolor$Species
unique(iris__setosa_or_versicolor$Species)

## creates scatter plot

### loads ggplot2
library(ggplot2)

### Basic scatter plot
scatter_plot__sepal_length_x_sepal_width__points_is_species <- ggplot(iris__setosa_or_versicolor, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point()
scatter_plot__sepal_length_x_sepal_width__points_is_species

### Basic scatter plot with regression line added
scatter_plot__sepal_length_x_sepal_width__points_is_species <- ggplot(iris__setosa_or_versicolor, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point() + geom_smooth(method=lm, se=FALSE, color="green")
scatter_plot__sepal_length_x_sepal_width__points_is_species

### Basic scatter plot separated by Species
scatter_plot__sepal_length_x_sepal_width__points_is_species <- ggplot(iris__setosa_or_versicolor, aes(x=Sepal.Length, y=Sepal.Width, color=Species, shape=Species)) + geom_point() + geom_smooth(method=lm, se=FALSE, fullrange=TRUE) + labs(title="Scatter plot of Sepal.Length X Sepal.Width with dots as Species where Species is setosa or versicolor", x="Sepal.Length", y = "Sepal.Width") + scale_colour_manual(values = c("#ff0000","#0000ff"))
scatter_plot__sepal_length_x_sepal_width__points_is_species

scatter_plot__sepal_length_x_sepal_width__points_is_species <- 
  scatter_plot__sepal_length_x_sepal_width__points_is_species + theme(panel.background = element_rect(fill = "white", colour = "white", size = 0.5, linetype = "solid"), panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "lightblue"), panel.grid.minor = element_line(size = 0.25, linetype = 'solid', colour = "lightblue"))
scatter_plot__sepal_length_x_sepal_width__points_is_species

scatter_plot__sepal_length_x_sepal_width__points_is_species <- 
  scatter_plot__sepal_length_x_sepal_width__points_is_species + geom_point(size=3)
scatter_plot__sepal_length_x_sepal_width__points_is_species

### displays scatter plot
scatter_plot__sepal_length_x_sepal_width__points_is_species

EDIT 1:

Response to comment:

What do you mean in 2.? Do you want to also add the result of the test as an annotation on the figure? Or just compare the slopes independently of the figure? Please, edit your question. I will answer once this is clear. (As a general comment, try to avoid including in the example code snippet details that are irrelevant to your question, such as change in background color and size of points.)

I'm interested in comparing the slopes independent of the figure. I want to see if there are differences between the regression lines and how to interpret those differences.

Response to answer:

Run a regression using lm.

Then use ANCOVA on those regressions to see a slope difference.

Thanks. I think I've tried to do what you said. The anova table comparing the model with v. without the interaction was significant. I think this means that there is a difference between the slopes of the regression based on the grouping variable species. Is this interpretation correct?

The code is below. was the code done properly?

Follow up question for this: How do I find the slopes of the 2 regression lines (iris$Species = setosa v. versicolor) based on the figures?

Here is the code with ANCOVA comparing the 2 regressions:


## comparing the slopes of the regression lines using ANCOVA
# ---- NOTE: DV - Sepal.Width
# ---- NOTE: IV - Sepal.Length
# ---- NOTE: grouping variable: Species
# ---- NOTE: dataset: iris__setosa_or_versicolor
# ---- NOTE: based on this site: https://stats.stackexchange.com/questions/51780/how-to-perform-an-ancova-in-r

### create interaction_regression_model
interaction_regression_model <- aov(Sepal.Width~Sepal.Length*Species,data=iris__setosa_or_versicolor)

#### gives summary of interaction_regression_model
summary(interaction_regression_model)

### create no_interaction_regression_model
no_interaction_regression_model <- aov(Sepal.Width~Sepal.Length+Species,data=iris__setosa_or_versicolor)

#### gives summary of no_interaction_regression_model
summary(no_interaction_regression_model)

### compare 2 regression models, using ancova through anova command
anova(no_interaction_regression_model,interaction_regression_model)

2
You can add the equations using stat_poly_eq() from package 'ggpmisc'. See the answers to this old and popular question. There are also other answers using other approaches, but I prefer my approach as described in my answer to this earlier question.Pedro Aphalo
What do you mean in 2.? Do you want to also add the result of the test as an annotation on the figure? Or just compare the slopes independently of the figure? Please, edit your question. I will answer once this is clear. (As a general comment, try to avoid including in the example code snippet details that are irrelevant to your question, such as change in background color and size of points.)Pedro Aphalo

2 Answers

1
votes

The thing to remember is that a plot is a drawing - it's a way to help you visualize and understand your data. It is not the same thing as the data itself. You cannot manipulate, transform, process, convert or statistically analyze a drawing the way you can with data.

Similarly, a regression line drawn on a plot is not the same thing as a linear regression. Yes, the plotting software has to do a linear regression to get the line, but you should not try to extract information about the regression from the plot. This is doing things backwards. If you want to do a regression, do a regression.

To make things simple and equivalent to your data set, we will remove the virginica species from the iris data set:

iris_filtered <- subset(iris, Species != "virginica", drop = TRUE)

Now we perform a linear regression of Sepal.Width according to Species and Sepal.Length. We do this using the function lm . We want to know if the slope of Sepal.Length is different between Species, so we model the interaction between them. The following line does all that:

model <- lm(Sepal.Width ~ Species * Sepal.Length, data = iris_filtered)

Now we review our model:

summary(model)

#> Call:
#> lm(formula = Sepal.Width ~ Species * Sepal.Length, data = iris_filtered)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.72394 -0.16281 -0.00306  0.15936  0.60954 
#> 
#> Coefficients:
#>                                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     -0.5694     0.5352  -1.064 0.290049    
#> Speciesversicolor                1.4416     0.6891   2.092 0.039069 *  
#> Sepal.Length                     0.7985     0.1067   7.487 3.41e-11 ***
#> Speciesversicolor:Sepal.Length  -0.4788     0.1292  -3.707 0.000351 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 0.2632 on 96 degrees of freedom
#> Multiple R-squared:  0.707,  Adjusted R-squared:  0.6978 
#> F-statistic:  77.2 on 3 and 96 DF,  p-value: < 2.2e-16

Here's what this model tells us:

  • The setosa line crosses the y axis at -0.5694cm, but the p value is not significantly different from 0 (p = 0.29)
  • The versicolor line crosses the y axis 1.4416 cm higher than the setosa line, at 0.8722cm (-0.5694 + 1.4416 = 0.8722). This difference is just statisctically significant at p = 0.039.
  • The Sepal.Width increases 0.7985cm for every 1cm increase of Sepal.Length for the setosa species. This slope is highly significantly different from 0.
  • The Sepal.Width increases by 0.3197cm per cm increase in Sepal.Length (0.7985 - 0.4788 = 0.3917) in species versicolor. This is significantly different from the gradient for setosa (p = 0.000351).

So we have our actual model, and the gradients from it, and we know the difference in slopes is significant and we have done it with 3 lines of code, and we didn't need to plot anything.

Just to prove this works though, let's plot these lines "by hand" over our data to show how the regression looks:

with(iris[iris$Species == "setosa",], 
     plot(Sepal.Length, Sepal.Width, col = "red", xlim = c(4, 7), ylim = c(2, 4.5)))
with(iris[iris$Species == "versicolor",], 
     points(Sepal.Length, Sepal.Width, col = "blue"))
abline(a = -0.5694, b = 0.7985, col = "red")
abline(a = 0.8722, b = 0.3197, col = "blue")

enter image description here

0
votes

Run a regression using lm.

Then use ANCOVA on those regressions to see a slope difference.