2
votes

I'm three weeks into my R class (please be patient with me even if it seems obvious where I went wrong!), and I am struggling with a homework problem that uses the R ggplot2 library. Using the built-in diamonds data frame, the problem is to make a scatter plot with regression lines for log(carat) and log(price), but plotting only the Fair and Ideal cut diamonds.

This is what the plot is supposed to look like

enter image description here

Quick background: the three variables in question here are carat (num), cut (Fair, Good, Very Good, Premium, Ideal), and price (int).

I start with the following code:

library(ggplot2)

set.seed(123)
d <- ggplot(diamonds[sample(nrow(diamonds),5000),]) #this was provided to us in the homework

d + geom_point(aes(x = log(carat), y = log(price), colour = cut)) +
  labs(title = 'Regression line for Fair and Ideal Cut Diamonds') +
  stat_smooth(aes(x = log(carat), y = log(price), colour = cut), method = "gam")

Here's what I got

enter image description here

Now, I know this is incorrect, because "colour = cut" shows ALL the cuts, but I only want "Fair" and "Ideal". The professor hinted that we should try diamonds$cut %in% c(...), so I tried it in many different ways. One of my latest (wrong) attempts is:

d + geom_point(aes(x = log(carat), y = log(price), colour = diamonds[diamonds$cut%in%c("Fair","Ideal")]), alpha = 0.5) +
 labs(title = 'Regression line for Fair and Ideal Cut Diamonds') +
 stat_smooth(aes(x = log(carat), y = log(price), colour = diamonds[diamonds$cut%in%c("Fair","Ideal")]), method = "gam")

I keep getting error messages regardless of where I try to subset diamonds$cut (e.g., "Length of logical index vector for '[' must equal number of columns" and "Aesthetics must be either length 1 or the same as the data (5000): colour").

How do I extract just the Fair and Ideal cut to make this graph?

Any help is appreciated!

2
You should filter your data first. See this tutorial: suzan.rbind.io/2018/02/dplyr-tutorial-3 (Tung)
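
A minimal sketch of what "filter first" could look like with dplyr, assuming dplyr is installed alongside ggplot2. The name fair_ideal is made up for illustration, and the sampling line is the one from the homework.

```r
library(ggplot2)  # provides the diamonds data frame
library(dplyr)    # provides filter() and the %>% pipe

set.seed(123)

# Take the 5000-row sample first, then keep only the Fair and Ideal rows.
fair_ideal <- diamonds[sample(nrow(diamonds), 5000), ] %>%
  filter(cut %in% c("Fair", "Ideal"))

# Every layer now sees only the two cuts, so colour = cut shows just those.
p <- ggplot(fair_ideal, aes(x = log(carat), y = log(price), colour = cut)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "gam") +
  labs(title = "Regression line for Fair and Ideal Cut Diamonds")
p
```

Because the filtering happens before ggplot() is called, nothing inside aes() needs to change.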

2 Answers

1
votes

Subset the data first, and then pass the result to ggplot() through its data argument. I'm not sure how to filter the cut column from inside the mapping itself, i.e. within aes(colour = cut). The plot doesn't look exactly like the one in your post, if that matters at this point. Hopefully this helps.

library(ggplot2)

set.seed(123)
z <- diamonds[sample(nrow(diamonds),5000),]
z <- z[z$cut %in% c("Fair", "Ideal"),]

d <- ggplot(data = z) +
  geom_point(aes(x = log(carat), y = log(price), colour = cut), alpha = 0.5) +
  labs(title = 'Regression line for Fair and Ideal Cut Diamonds') +
  stat_smooth(aes(x = log(carat), y = log(price), colour = cut), method = "gam")
d

Created on 2019-03-21 by the reprex package (v0.2.1)
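
On filtering while keeping the provided d object untouched: one possible approach, sketched here rather than taken from the homework, is to give each layer its own data argument. ggplot2 layers accept a function for data, which is called with the plot's data and should return a data frame. The helper name keep_fair_ideal is hypothetical.

```r
library(ggplot2)

set.seed(123)
d <- ggplot(diamonds[sample(nrow(diamonds), 5000), ])  # as provided in the homework

# Hypothetical helper: keeps only the Fair and Ideal rows of the data it is given.
keep_fair_ideal <- function(df) df[df$cut %in% c("Fair", "Ideal"), ]

# Each layer applies the helper to the plot data before drawing.
p <- d +
  geom_point(aes(x = log(carat), y = log(price), colour = cut),
             data = keep_fair_ideal, alpha = 0.5) +
  stat_smooth(aes(x = log(carat), y = log(price), colour = cut),
              data = keep_fair_ideal, method = "gam") +
  labs(title = "Regression line for Fair and Ideal Cut Diamonds")
p
```

The filter still acts on the rows of the data, not on the colour aesthetic; aes() only describes how columns map to visual properties.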

0
votes

Use subset() to filter the data. To get closer to your expected graph, one modification is to change the method to 'auto' in stat_smooth() so that the line follows the data points. The chart won't look the same every time, because we are sampling randomly.

library(ggplot2)

df <- diamonds[sample(nrow(diamonds), 50000), ]

# keep only the Fair and Ideal rows
df_fair_ideal <- subset(df, cut %in% c("Fair", "Ideal"))

ggplot(df_fair_ideal, aes(x = log(carat), y = log(price), color = cut)) +
  labs(title = 'Regression line for Fair and Ideal Cut Diamonds') +
  geom_point(alpha = 0.5) +  # alpha belongs on the layer; ggplot() ignores it
  xlim(min(log(df_fair_ideal$carat)), max(log(df_fair_ideal$carat))) +
  stat_smooth(method = "auto", se = TRUE)

enter image description here
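
As for the error messages quoted in the question, they likely come from how `[` works on a data frame: with a single index and no comma it selects columns, so a logical vector with one entry per row does not fit. A small sketch, using made-up names keep, col_attempt, and fair_ideal:

```r
library(ggplot2)  # provides the diamonds data frame

# Logical vector with one entry per ROW of diamonds.
keep <- diamonds$cut %in% c("Fair", "Ideal")

# No comma: the index is applied to COLUMNS, so a row-length vector is rejected.
col_attempt <- tryCatch(diamonds[keep], error = function(e) "error")

# With a comma: the index is applied to ROWS, which is the intended subset.
fair_ideal <- diamonds[keep, ]
all(fair_ideal$cut %in% c("Fair", "Ideal"))
```

Even the row-indexed form would not work as a colour aesthetic, because aes() expects a column (or a vector as long as the data), not a filtered data frame.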