3
votes

I have a dataframe of ~108m rows of data, in 7 columns. I use this R script to make a boxplot of it:

ggplot(expanded_results, aes(factor(hour), dynamic_nox)) +
  geom_boxplot(fill="#6699FF", outlier.size = 0.5, lwd=.1) +
  scale_y_log10() +
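  # fun replaces fun.y (deprecated since ggplot2 3.3.0); colour is set outside aes() to get a literal red line
  stat_summary(fun=mean, geom="line", aes(group=1), colour="red") +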
  ylab(expression(Exposure~to~NO[x])) + 
  xlab(expression(Hour~of~the~day)) +
  ggtitle("Hourly exposure to NOx") +
  theme(axis.text=element_text(size=12, colour="black"),
        axis.title=element_text(size=12, colour="black"),
        plot.title=element_text(size=12, colour="black"),
        legend.position="none")

The graph looks like this. It is mostly fine, but it would be better to have a labelled value towards the top of the Y axis. I guess it should be something like 1000, given that the Y axis is a log10 scale, but I'm not sure how to do this.

enter image description here

Any ideas please?

EDIT: In response to DrDom, who suggested adding scale_y_log10(breaks=c(0,10,100,1000)): the output of doing that is this:

enter image description here

The output of the following: scale_y_log10(breaks=c(0,10,100,1000), limits=c(0,1000))

is an error:

Error in seq.default(dots[[1L]][[1L]], dots[[2L]][[1L]], length = dots[[3L]][[1L]]:
'from' cannot be NA, NaN or infinite

In response to Jaap, who suggested the following code:

library(ggplot2)
library(scales)

ggplot(expanded_results, aes(factor(hour), dynamic_nox)) +
  geom_boxplot(fill="#6699FF", outlier.size = 0.5, lwd=.1) +
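  # fun replaces fun.y (deprecated since ggplot2 3.3.0); colour is set outside aes() to get a literal red line
  stat_summary(fun=mean, geom="line", aes(group=1), colour="red") +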
  scale_y_continuous(breaks=c(0,10,100,1000,3000), trans="log1p") +
  labs(title="Hourly exposure to NOx", x=expression(Hour~of~the~day), y=expression(Exposure~to~NO[x])) +
  theme(axis.text=element_text(size=12, colour="black"), axis.title=element_text(size=12, colour="black"),
        plot.title=element_text(size=12, colour="black"), legend.position="none")

It produces this graph. Have I done something wrong? I'm still missing a '1000' tick label. A tick in between the 10 and the 100 would also be good, given that is where most of the data is.

enter image description here

2
Try to add scale_y_log10(breaks=c(0,10,100,1000)) or scale_y_log10(breaks=c(0,10,100,1000), limits=c(0,1000)) - DrDom
Hi DrDom. Thanks for your suggestions. The results are added to my post above. Note that the data looks a bit different, as running the graph creation again takes about 20 minutes, so I just used a subset of the data for demonstration purposes. Not quite there yet. :-( - TheRealJimShady
I have written a function which does that automatically: stackoverflow.com/a/54325289/3082472 - akraf

2 Answers

3
votes

Instead of using scale_y_log10 you can also use scale_y_continuous together with a log transformation from the scales package. When you use the log1p transformation, which computes log(1 + x) and therefore maps 0 to 0, you can also include a 0 in your breaks: scale_y_continuous(breaks=c(0,1,3,10,30,100,300,1000,3000), trans="log1p")

Your complete code will then look like this (notice that I also combined the title arguments in labs):

library(ggplot2)
library(scales)

ggplot(expanded_results, aes(factor(hour), dynamic_nox)) +
  geom_boxplot(fill="#6699FF", outlier.size = 0.5, lwd=.1) +
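  # fun replaces fun.y (deprecated since ggplot2 3.3.0); colour is set outside aes() to get a literal red line
  stat_summary(fun=mean, geom="line", aes(group=1), colour="red") +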
  scale_y_continuous(breaks=c(0,1,3,10,30,100,300,1000,3000), trans="log1p") +
  labs(title="Hourly exposure to NOx", x=expression(Hour~of~the~day), y=expression(Exposure~to~NO[x])) +
  theme(axis.text=element_text(size=12, colour="black"), axis.title=element_text(size=12, colour="black"),
        plot.title=element_text(size=12, colour="black"), legend.position="none")
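
To illustrate why a 0 break is possible with log1p but not with log10, here is a minimal base R sketch. The variable names (brks, log10_pos, log1p_pos) are made up for illustration, and it only compares the two transformations on the break values; it does not draw anything.

```r
# Candidate breaks, including zero (same values as in the scale above).
brks <- c(0, 1, 3, 10, 30, 100, 300, 1000, 3000)

# log10 sends 0 to -Inf, so a zero break (or a zero lower limit)
# has no finite position on a log10 axis.
log10_pos <- log10(brks)

# log1p(x) computes log(1 + x), which maps 0 to 0 and keeps every break finite.
log1p_pos <- log1p(brks)

print(is.finite(log10_pos))
print(is.finite(log1p_pos))

# expm1 is the inverse of log1p, so the positions map back to the original breaks.
print(all.equal(expm1(log1p_pos), brks))
```

Note that log1p spacing differs noticeably from log10 spacing for small values (roughly below 10), so the two axes will not look identical near the bottom.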
4
votes

You can modify your log scale by passing a breaks argument to scale_y_log10(). The breaks must not include 0, because the log of each break is taken as well, and log10(0) is -Inf.

df<-data.frame(x=1:10000,y=1:10000)
ggplot(df,aes(x,y))+geom_line()+
      scale_y_log10(breaks=c(1,5,10,85,300,5000))
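
Applied to a boxplot like the one in the question, this might look like the sketch below. The toy data frame is simulated and purely hypothetical (it is not the asker's data), and the expand_limits(y = 1000) call is an addition based on an assumption: that the 1000 label was missing because the plotted data did not reach 1000. ggplot2 only draws breaks that fall inside the range of the scale, so stretching the range to include 1000 should make that label appear.

```r
library(ggplot2)

# Simulated stand-in for the real data (hypothetical values, strictly positive).
set.seed(1)
toy <- data.frame(
  hour = factor(rep(0:23, each = 200)),
  dynamic_nox = rlnorm(24 * 200, meanlog = log(30), sdlog = 0.8)
)

p <- ggplot(toy, aes(hour, dynamic_nox)) +
  geom_boxplot(fill = "#6699FF", outlier.size = 0.5) +
  # Breaks without 0, with extra ticks between the powers of ten.
  scale_y_log10(breaks = c(1, 3, 10, 30, 100, 300, 1000)) +
  # Extend the axis range so that it includes 1000 without dropping any data.
  expand_limits(y = 1000)

print(p)
```

expand_limits is used here rather than the limits argument of the scale, because scale limits discard values outside them before the boxplot statistics are computed, whereas expand_limits only widens the range.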