Here are the first 9 rows (out of 54) and the first 8 columns (out of 1003) of my dataset:

 stream n rates     means          1         2         3         4
 1   Brooks 3   3.0 0.9629152 0.42707006 1.9353659 1.4333884 1.8566225
 2  Siouxon 3   3.0 0.5831929 0.90503736 0.2838483 0.2838483 1.0023212
 3 Speelyai 3   3.0 0.6199235 0.08554021 0.7359903 0.4841935 0.7359903
 4   Brooks 4   7.5 0.9722707 1.43338843 1.8566225 0.0000000 1.3242210
 5  Siouxon 4   7.5 0.5865031 0.50574543 0.5057454 0.2838483 0.4756304
 6 Speelyai 4   7.5 0.6118634 0.32252396 0.4343109 0.6653132 2.2294652
 7   Brooks 5  10.0 0.9637475 0.88984211 1.8566225 0.7741612 1.3242210
 8  Siouxon 5  10.0 0.5804420 0.47501800 0.7383634 0.5482181 0.6430847
 9 Speelyai 5  10.0 0.5959238 0.15079491 0.2615963 0.4738504 0.0000000

Here is a simple plot I made from the values in the means column for all rows with the stream name Speelyai (18 rows).

enter image description here

The means column is the mean of the entire row. Each numbered column represents one simulation, so the means column is the mean of 1000 simulations. I would like to plot the actual simulation values as well: I think it would be informative to show not only the mean (as a line) but also the "raw" data (the simulations) as points. I see that I can use geom_point(), but I am not sure how to get all the points for every row that has the stream name "Speelyai".

THANKS


enter image description here

As you can see, the scales are very different, which I would expect, given that these points come from simulations (resampling the original data). But how could I overlay these points on my original plot in a way that still preserves the visual content? In this image the line looks almost flat, but in my original image we can see that it fluctuates quite a bit, just on a small scale.

2 Answers


I would suggest reshaping your data into long format rather than wide. For example:

library("tidyr")
library("ggplot2")
my_data_tidy <- gather(my_data, column, value, -c(stream, n, rates, means))
ggplot(subset(my_data_tidy, stream == "Speelyai"), aes(rates, value)) +
  geom_point() +
  stat_summary(fun = "mean", geom = "line")  # `fun` replaces `fun.y`, deprecated in ggplot2 3.3.0

Note this will also recalculate the means from your data. If you wanted to use your existing means, you could do:

ggplot(subset(my_data_tidy, stream == "Speelyai"), aes(rates, value)) +
  geom_point() + geom_line(aes(rates, means), data = subset(my_data, stream == "Speelyai"))
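
For anyone who wants to try the approach without the original file, here is a minimal, self-contained sketch. The data frame `toy` and its values are made up for illustration (3 simulations instead of 1000), and it uses `pivot_longer()`, the current tidyr replacement for `gather()`; the column names `simulation` and `value` are my own choice, not something from the question.

```r
library(tidyr)
library(ggplot2)

# Hypothetical toy data in the same wide layout as the question:
# one row per stream/rate, one column per simulation.
set.seed(1)
toy <- data.frame(
  stream = rep(c("Brooks", "Speelyai"), times = 3),
  n      = rep(3:5, each = 2),
  rates  = rep(c(3, 7.5, 10), each = 2)
)
sims <- matrix(runif(nrow(toy) * 3), ncol = 3,
               dimnames = list(NULL, c("1", "2", "3")))
toy <- cbind(toy, means = rowMeans(sims), sims)

# Wide to long: every simulation value becomes its own row.
toy_long <- pivot_longer(toy, cols = -c(stream, n, rates, means),
                         names_to = "simulation", values_to = "value")

# Raw simulations as points, mean per rate as a line.
speelyai <- subset(toy_long, stream == "Speelyai")
p <- ggplot(speelyai, aes(rates, value)) +
  geom_point(alpha = 0.3) +
  stat_summary(fun = "mean", geom = "line")
```

Printing `p` draws the plot. The `alpha` setting is optional; with 1000 simulations per rate it may help keep the line visible through the points.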

I agree with @NickKennedy that reshaping your data from wide to long is a good idea:

library(reshape)
x2<-melt(x,id=c("stream","n","rates"))
x2<-x2[which(x2$variable!="means"),] # this eliminates the entries for means

Now it's time to re-calculate the means:

library(data.table)
setDT(x2)
setkey(x2,"stream")
means.sp<-x2["Speelyai",.(mean.stream=mean(value)),by=rates]

So now you can plot:

library(ggplot2)
p<-ggplot(means.sp,aes(rates,mean.stream))+geom_line()

Which is exactly what you had, so now let's add the points:

p<-p+geom_point(data=x2[x2$stream=="Speelyai",],aes(rates,value))

Notice that in the call to geom_point you need to pass data= explicitly, because you are working with a different dataset from the one you specified in the call to ggplot.

========== EDIT TO ADD =============

Replying to your comments, and borrowing from the answer @akrun gave you here: you'll first need to add the calculation of the error:

df2 <- data.frame(stream=c('Brooks', 'Siouxon', 'Speelyai'), 
      value=c(0.944062036, 0.585852702, 0.583984402), stringsAsFactors=FALSE)
x2$error <- x2$value-df2$value[match(x2$stream, df2$stream)]    

And then change the call to geom_point:

geom_point(data=x2[x2$stream=="Speelyai",],aes(rates,error))
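
In case the match() step is unclear, here is a small sketch of what it does, using made-up numbers. `sims_long` and `ref` are hypothetical stand-ins for x2 and df2, with values chosen so the arithmetic is easy to check by hand.

```r
# Hypothetical long-format simulations (stand-in for x2).
sims_long <- data.frame(
  stream = c("Brooks", "Speelyai", "Speelyai", "Siouxon"),
  rates  = c(3, 3, 7.5, 3),
  value  = c(1.5, 0.25, 0.75, 1.0),
  stringsAsFactors = FALSE
)

# Hypothetical per-stream reference values (stand-in for df2).
ref <- data.frame(
  stream = c("Brooks", "Siouxon", "Speelyai"),
  value  = c(1.0, 0.5, 0.5),
  stringsAsFactors = FALSE
)

# match() returns, for each row of sims_long, the row of ref with the same stream.
idx <- match(sims_long$stream, ref$stream)

# Subtract that stream's reference value from each simulated value.
sims_long$error <- sims_long$value - ref$value[idx]
```

Here `idx` is 1, 3, 3, 2, so every Speelyai row is compared against the Speelyai reference value no matter what order the rows are in.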