1
votes

I've read the posts on R boxplots and dealing with outliers. I can't simply delete the outliers, but they are so extreme that my boxplots are essentially lines. I saw this post on a similar issue: https://stats.stackexchange.com/questions/114744/how-to-present-box-plot-with-an-extreme-outlier

But I don't know R well enough to even know what kind of code was used to make those plots.

Here is my example data that I've been trying to make look nice without hiding values.

Inhibitor   Trial2   Trial3
grak         0.20     0.45
grab        11.00    31.55
hhus         0.21     0.18
hhuf        0.341     0.32
kkul         1.66     0.80
kkju         0.45     0.30
juik         0.30     0.20
jtui         0.80     0.40
test         0.233     0.36


boxplot(df$Trial1, df$Trial2, ylab="Rate", xlab="Trial")

Here is what my boxplot looks like

I also saw this post: https://stats.stackexchange.com/questions/63203/boxplot-equivalent-for-heavy-tailed-distributions and tried to reproduce it with my data, but I don't know how to make it work with more than one x value, and I get errors at almost every step. The main error appears when I follow the very last example and try to create my boxplot.

Something like below:

enter image description here

I also tried to reproduce this example graph, which looks like another good option (below):

enter image description here

I used this code but I got the following error:

df <- read.csv("Inhibitor.csv", header=TRUE)
xout <- boxplot(df$Trial1, df$Trail2, horizontal=TRUE)$out
xin <- df[!(df %in% xout)]
noutl1 <- sum(xout<median(df$Trial1))
noutl2 <- sum(xout<median(df$Trail2))
nouth1 <- sum(xout>median(df$Trial1))
nouth2 <- sum(xout>median(df$Trail2))
boxplot(xin, horizontal=TRUE, ylim=c(min(xin)*1.15, max(xin)*1.15))

Error in FUN(X[[i]], ...) : 
  only defined on a data frame with all numeric variables

I essentially want my main boxplots to be visually appealing (y limits between 0 and 10), with a strip plot on top (y limits between 10 and 30) showing only the outlier points. I am open to suggestions if anyone has other ways of showing data with extreme outliers. Thank you all!

2
Please post the attempted code that raises that error. - Parfait
Done, I added the code. I used the exact code from the link I supplied earlier to see if I could mimic it, but I could not. - CuriousDude

2 Answers

0
votes

You are getting the error Error in FUN(X[[i]], ...) because min and max are applied to the entire xin data frame, which still contains the non-numeric Inhibitor column. To run the provided code without this error, apply these functions only to the numeric columns of xin, with code like the following:

# Column names follow the sample data posted in the question (Trial2, Trial3)
boxplot(xin$Trial2, horizontal=TRUE,
        ylim=c(min(c(xin$Trial2, xin$Trial3))*1.15, max(c(xin$Trial2, xin$Trial3))*1.15))
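
To see why the error occurs, here is a minimal sketch with a made-up three-row stand-in for xin (the name xin_demo and its contents are assumptions for illustration only). It shows min failing on a data frame that has a character column, and one way to compute the range over the numeric columns only:

```r
# Hypothetical stand-in for xin: one character column plus two numeric ones
xin_demo <- data.frame(
  Inhibitor = c("grak", "hhus", "kkul"),
  Trial2 = c(0.20, 0.21, 1.66),
  Trial3 = c(0.45, 0.18, 0.80)
)

# min() on the whole data frame fails because Inhibitor is not numeric
res <- try(min(xin_demo), silent = TRUE)
failed <- inherits(res, "try-error")

# Select the numeric columns first, then take the range over all their values
num_cols <- vapply(xin_demo, is.numeric, logical(1))
rng <- range(unlist(xin_demo[num_cols]))
```

Selecting columns with is.numeric avoids hard-coding column names, which may help if more trials are added later.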

My preferred solution (assuming you need a boxplot that includes all of the provided data) would be to transform the axis scale. The following code plots Rate on the y-axis using a log base 2 scale.

library(ggplot2)
library(tidyr)
library(scales)

df <- data.frame(
  Inhibitor= c("grak", "grab", "hhus", "hhuf", "kkul", "kkju", "juik", "jtui", "test"),
  Trial2 = c(0.20, 11.00, 0.21, 0.341, 1.66, 0.45, 0.30, 0.80, 0.233),
  Trial3 = c(0.45, 31.55, 0.18, 0.32, 0.80, 0.30, 0.20, 0.40, 0.36)
)
# Gather the `Trial2` and `Trial3` columns into long format for ggplot2
df2 <- gather(df, `Trial2`, `Trial3`, key="Trial", value = "Rate")

# Plot with ggplot2
ggplot(data = df2, mapping = aes(x = Trial, y = Rate))+
  stat_boxplot(geom = 'errorbar')+
  geom_boxplot()+
  scale_y_continuous(trans = log2_trans())
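
Since the question uses base R graphics, a rough base R counterpart of the log-scale idea may be useful. This is only a sketch: it passes log = "y" to boxplot, which gives a logarithmic axis whose tick marks will differ from the base 2 scale above. The vectors trial2 and trial3 are just the sample values posted in the question.

```r
# Sample values from the question
trial2 <- c(0.20, 11.00, 0.21, 0.341, 1.66, 0.45, 0.30, 0.80, 0.233)
trial3 <- c(0.45, 31.55, 0.18, 0.32, 0.80, 0.30, 0.20, 0.40, 0.36)

# log = "y" only changes how the axis is drawn; the box statistics
# are still computed on the untransformed values
bp <- boxplot(list(Trial2 = trial2, Trial3 = trial3),
              log = "y", xlab = "Trial", ylab = "Rate (log scale)")
```

A log axis requires strictly positive values, which holds for this data.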

Another option is a boxplot with a broken axis, using a library such as plotrix:

library(plotrix)
gap.boxplot(df$Trial2, df$Trial3, gap=list(top=c(11.50, 31.00),bottom=c(NA,NA)))

The problem with using a broken axis from plotrix on this data is that the outliers are so extreme that a single axis break generally will not produce a clean plot with both Trial2 and Trial3 in the same plot.
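
If a single axis break does not work well, one alternative is to stack two panels with base graphics, which is close to what the question describes: boxplots limited to 0 to 10, with the extreme points shown in a strip above. The following is a sketch under stated assumptions, not a polished solution: the cutoff of 10 is taken from the question, and the panel heights and margins are arbitrary choices.

```r
df <- data.frame(
  Trial2 = c(0.20, 11.00, 0.21, 0.341, 1.66, 0.45, 0.30, 0.80, 0.233),
  Trial3 = c(0.45, 31.55, 0.18, 0.32, 0.80, 0.30, 0.20, 0.40, 0.36)
)
cutoff <- 10                       # assumed split between the two panels

long <- stack(df)                  # long format: columns `values` and `ind`
high <- long[long$values > cutoff, ]

# Two stacked panels: a short one on top, a taller one below
layout(matrix(1:2, nrow = 2), heights = c(1, 3))

# Top panel: only the points above the cutoff, aligned with the boxes below
op <- par(mar = c(0.5, 4, 2, 1))
plot(as.integer(high$ind), high$values,
     xlim = c(0.5, ncol(df) + 0.5),
     ylim = c(cutoff, max(high$values) * 1.05),
     xaxt = "n", xlab = "", ylab = "Rate")

# Bottom panel: the boxplots, restricted to 0..cutoff
par(mar = c(4, 4, 0.5, 1))
boxplot(df, ylim = c(0, cutoff), xlab = "Trial", ylab = "Rate")

# Restore the previous margins and the single-panel layout
par(op)
layout(1)
```

The two panels use different y scales, so it is worth saying so in the caption to avoid misleading readers.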

0
votes

The linked example runs on a vector of values, while you are attempting to run it on an entire data frame. Consider reshaping your wide data frame to long format and then running your plot. Also consider boxplot.stats, which avoids the unneeded plot output of boxplot:

# Column names follow the sample data posted in the question (Trial2, Trial3)
rdf <- reshape(df, 
               varying=list(paste0("Trial", 2:3)), 
               v.names = "Trial",                
               times=paste0("Trial", 2:3), 
               timevar="Indicator",
               direction="long")

x <- rdf$Trial
xout <- boxplot.stats(x, coef=3)$out
xin <- rdf[!(rdf$Trial %in% xout),]
nouth <- sum(xout > median(xin$Trial))   # number of outliers above the median
noutl <- sum(xout < median(xin$Trial))   # number of outliers below the median
boxplot(Trial ~ Indicator, xin, horizontal=TRUE, 
        ylim=c(min(xin$Trial)*1.15, max(xin$Trial)*1.15))
text(x=max(xin$Trial)*1.17, y=1, labels=paste0(as.character(nouth)," >"))
text(x=min(xin$Trial)*1.17, y=1, labels=paste0("< ",as.character(noutl)))
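
To make the role of coef concrete, here is a small sketch that runs boxplot.stats on the pooled sample values from the question (the variable names are assumptions for illustration). With coef=3 only the two extreme values are flagged; with the default coef=1.5 the value 1.66 is flagged as well.

```r
# Pooled sample values from the question (Trial2 followed by Trial3)
x <- c(0.20, 11.00, 0.21, 0.341, 1.66, 0.45, 0.30, 0.80, 0.233,
       0.45, 31.55, 0.18, 0.32, 0.80, 0.30, 0.20, 0.40, 0.36)

# boxplot.stats computes the same statistics as boxplot without drawing anything
out3  <- boxplot.stats(x, coef = 3)$out     # 11.00 31.55
out15 <- boxplot.stats(x, coef = 1.5)$out   # 11.00  1.66 31.55
```

A larger coef widens the whisker range, so fewer points are treated as outliers and removed from the main plot.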
