I have a binomial glm
of a presence/absence response variable and a factor variable with 9 levels like this:
data$y<-factor(data$y,levels=c(0,1),labels=c("absent","present"))
table(data$y,data$site_name)
Andulay Antulang Basak Dauin Poblacion District 1 Guinsuan Kookoo's Nest Lutoban Pier Lutoban South Malatapay Pier
absent 4 4 1 0 3 1 5 5 2
present 2 2 5 6 3 5 1 1 4
model <- glm(y~site_name,data=data,binomial)
Just skipping the model inference and validation for brevity's sake, how do I plot per site a probability of getting "present" in a boxplot with its confidence interval? What I would like is kind of what is shown in Plot predicted probabilities and confidence intervals in R but I would like to show it with a boxplot, as my regression variable site_name is a factor with 9 levels, not a continuous variable.
I think I can calculate the necessary values as follows (but am not 100% sure about the correctness):
Function to convert the model coefficients back to probabilities of success:
calc_val <- function(x){return(round(1/(1+1/(exp(x))),3))}
Predicted probabilities based on the model:
prob <- tapply(predict(model,type="response"),data$site_name,function(x){round(mean(x),3)})
means <- as.data.frame(prob)
75% and 95% confidence intervals for the predicted probabilities:
ci <- cbind(confint(model,level=0.9),confint(model,level=0.5))
rownames(ci) <- gsub("site_name","",rownames(ci))
ci <- t(apply(ci,1,calc_val))
Join it all together in one table
ci<-cbind(means,ci)
ci
prob 5 % 95 % 25 % 75 % Pr(>|z|) stderr
Andulay 0.333 0.091 0.663 0.214 0.469 0.42349216 0.192
Antulang 0.333 0.112 0.888 0.304 0.696 1.00000000 0.192
Basak 0.833 0.548 0.993 0.802 0.964 0.09916496 0.152
Dauin Poblacion District 1 1.000 0.000 NA 0.000 1.000 0.99097988 0.000
Guinsuan 0.500 0.223 0.940 0.474 0.819 0.56032414 0.204
Kookoo's Nest 0.833 0.548 0.993 0.802 0.964 0.09916496 0.152
Lutoban Pier 0.167 0.028 0.788 0.130 0.501 0.51171512 0.152
Lutoban South 0.167 0.028 0.788 0.130 0.501 0.51171512 0.152
Malatapay Pier 0.667 0.364 0.972 0.640 0.903 0.25767454 0.192
So my questions are twofold:
- Is the calculation of probability and confidence interval correct?
- How do I plot this in a bloxplot (box and whiskers plot)?
EDIT Here is some sample data via dput
(which also modified the tables above to match the data):
# dput(data[c("y", "site_name")])
data <- structure(list(y = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("absent", "present"), class = "factor"), site_name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 9L, 9L, 9L, 9L, 9L, 9L, 4L, 4L, 4L, 4L, 4L, 4L, 8L, 8L, 8L, 8L, 8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("Andulay", "Antulang", "Basak", "Dauin Poblacion District 1", "Guinsuan", "Kookoo's Nest", "Lutoban Pier", "Lutoban South", "Malatapay Pier"), class = "factor")), .Names = c("y", "site_name"), row.names = c(125L, 123L, 126L, 124L, 128L, 127L, 154L, 159L, 157L, 158L, 156L, 155L, 111L, 114L, 116L, 115L, 112L, 113L, 152L, 151L, 148L, 150L, 153L, 149L, 143L, 146L, 144L, 147L, 142L, 145L, 164L, 165L, 161L, 163L, 160L, 162L, 120L, 122L, 121L, 117L, 118L, 119L, 137L, 136L, 139L, 141L, 140L, 138L, 129L, 134L, 131L, 135L, 133L, 130L), class = "data.frame")
#
dput(data[c("y", "site_name")])
please. (the hope is that people can copy your data from your question to their R session - we cant do this with the format you have posted) - user20650