I was trying to follow a guide and generate this sort of plot.

My data is in a data frame called SIGSW.test, and my response variable (SI) is binary. I have a glm that I am using to generate predictions, which are saved as pr.bms in the data frame. I want to graphically represent the true/false positives/negatives at various thresholds. pr.bms.type is TP, TN, FP, or FN.
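For context, here is a minimal sketch of how a column like pr.bms.type could be derived from SI and pr.bms. The helper name classify_outcome, the 0.5 cutoff, and the use of >= at the boundary are assumptions made for illustration, not something taken from the guide.

```r
# Hypothetical helper: label each row by comparing the predicted
# probability to a cutoff and checking it against the observed outcome.
classify_outcome <- function(observed, predicted, threshold = 0.5) {
  ifelse(predicted >= threshold,
         ifelse(observed == 1, "TP", "FP"),   # predicted positive
         ifelse(observed == 1, "FN", "TN"))   # predicted negative
}

threshold <- 0.5  # assumed cutoff
labels <- classify_outcome(c(0, 1, 0, 1), c(0.03, 0.71, 0.97, 0.20), threshold)
```

Applied to the data frame, this would look something like SIGSW.test$pr.bms.type <- classify_outcome(SIGSW.test$SI, SIGSW.test$pr.bms, threshold).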

However, when I try the following code:

ggplot(data=SIGSW.test, aes(x=SI, y=pr.bms)) + 
   geom_violin(fill=rgb(1,1,1,alpha=0.6), color=NA) + 
   geom_jitter(aes(color=SIGSW.test$pr.bms.type), size=5, alpha=0.6) +
   geom_hline(yintercept=threshold, color="red", alpha=0.6) +
   scale_color_discrete(name = "type") +
   labs(title=sprintf("Threshold at %.2f", threshold))

R generates this image.

It's giving me two columns of data points, with the observed outcome on the X axis and the predicted probability on the Y axis (what I want), but it appears that the two violin plots are combined into one. Since I cannot replicate the author's plot with his own code and data, I suspect that there is a flaw in the code. I'm not very good with ggplot, so I can't figure out exactly what is going wrong. It seems to me that it should be creating two violin plots, one for each outcome, since the violin layer should be using the aesthetics defined in the ggplot function. Can anyone explain what's going wrong and how to fix it? I've seen a number of threads on here explaining how to overlay two violin plots, but I can't figure out how to make two violin plots of data defined by a discrete variable. I'd use the by() function if I could, but I don't think that works with ggplot2.

For reference, here's a sample of some of my data:

      SI      pr.bms      pr.aic      pr.bic pr.bms.type
19869  0 0.029985210 0.009071122 0.014855376          TN
36670  0 0.013641325 0.018143617 0.019764735          TN
9586   0 0.004428973 0.004363135 0.004356827          TN
41570  1 0.709464654 0.693148738 0.742891240          TP
32356  0 0.347295868 0.274694216 0.284724446          TN
14922  0 0.019798409 0.014157925 0.011422388          TN
52048  0 0.317284825 0.363881394 0.305525690          TN
43269  0 0.972736555 0.985057882 0.909592318          FP
45043  0 0.962467774 0.932087650 0.928091617          FP
4608   0 0.006653427 0.013383884 0.014138802          TN

Thanks

Here is the guide I was trying to follow: r-bloggers.com/illustrated-guide-to-roc-and-auc - user17325

1 Answer

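Never mind, I answered my own question. The response variable was being treated as continuous, both in my data and when I was trying to replicate the author's data. With a continuous x, ggplot2 puts every row into a single group, so geom_violin draws one violin spanning both values instead of one per outcome. I fixed the problem by converting the response variable to a factor.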

SIGSW.test$SI<-as.factor(SIGSW.test$SI)
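For anyone who wants to check the effect end to end, here is a minimal sketch on synthetic data. The simulated values, the seed, and the 0.5 threshold are assumptions for illustration only; the column names and the plotting calls follow the question.

```r
library(ggplot2)

# Synthetic stand-in for SIGSW.test (NOT the real data).
set.seed(1)
n <- 200
SI <- rbinom(n, 1, 0.3)
# Positives get higher simulated probabilities than negatives.
pr.bms <- ifelse(SI == 1, rbeta(n, 4, 2), rbeta(n, 2, 6))
threshold <- 0.5  # assumed cutoff

SIGSW.test <- data.frame(SI = SI, pr.bms = pr.bms)
SIGSW.test$pr.bms.type <- with(SIGSW.test,
  ifelse(pr.bms >= threshold,
         ifelse(SI == 1, "TP", "FP"),
         ifelse(SI == 1, "FN", "TN")))

# The fix: a discrete x gives ggplot2 one group (and one violin) per outcome.
SIGSW.test$SI <- as.factor(SIGSW.test$SI)

p <- ggplot(data = SIGSW.test, aes(x = SI, y = pr.bms)) +
  geom_violin(fill = rgb(1, 1, 1, alpha = 0.6), color = NA) +
  geom_jitter(aes(color = pr.bms.type), size = 5, alpha = 0.6) +
  geom_hline(yintercept = threshold, color = "red", alpha = 0.6) +
  scale_color_discrete(name = "type") +
  labs(title = sprintf("Threshold at %.2f", threshold))

print(p)
```

Writing aes(x = factor(SI), y = pr.bms) inside the ggplot call should have the same effect without modifying the data frame.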

I'm posting the answer instead of deleting this in case anyone else is as dumb as me.