I have a data frame called samples_type that looks like this:

Status   variable        value
PAT       SPP1        1,994629e+00
PAT       SPP1        1,179033e+00
PAT       SPP1        2,901539e+00
PAT       SPP1        1,140857e+00
PAT       SPP1        1,467056e+00
PAT       SPP1        2,579037e+00

The "Status" column can take two values: PAT or CON. The "variable" column can take many values: SPP1, CCL24, ENG56 ...

I would like to make boxplots of values for each combination of Status:variable.

So far I have two pieces of code:

boxplot(value ~ Status:variable, data = samples_type,
        col = c("red", "limegreen"), las = 2, outline = FALSE)

and:

p0 <- ggplot(data = samples_type, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = Status)) +
  facet_wrap(~ variable, scales = "free")

The first code gives me all the boxplots in ONE GRAPH, without outliers. I want to split them into separate panels, as par(mfrow=c(...,...)) would do. How can I do that?

The second code uses ggplot2. I managed to separate the boxplots, BUT as you can see, I did not manage to remove the outliers, and my boxplots are too small because of them. How can I remove the outliers? I searched Stack Overflow for how to remove outliers with ggplot2 and found an answer for ONLY ONE boxplot, but not for multiple boxplots, and I have no idea how to do that.

EDIT: the plots produced by each piece of code:

[Figure: boxplot with the first code]
[Figure: boxplot with the second code]

2 Answers

3
votes

General

It is a bit difficult to help because you did not provide a minimal data set, so I fall back on a built-in one.

library(dplyr)  ## provides %>% and select()

mt <- mtcars %>% select(cyl, mpg, am)
## add some outliers
mt <- rbind(mt, data.frame(cyl = c(4, 6, 8), mpg = rep(100, 3), am = 0))

Base R

You can split your data according to one of your variables, set the mfrow accordingly and use an apply function to generate each plot separately:

## split your data according to one variable
dl <- split(mt, mt$am)

## set the mfrow (1 row, 2 columns)
par(mfrow = 1:2)
## a more general choice would be a square grid like this
## (needs to be adapted for edge cases)
## par(mfrow = c(ceiling(sqrt(length(dl))), ceiling(sqrt(length(dl)))))

## loop through all data sets
lapply(dl, function(d) boxplot(mpg ~ cyl, data = d, outline = FALSE))
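
The square grid in the comment above can leave a whole row of panels empty. As a hedged alternative, base R ships grDevices::n2mfrow(), which returns a c(rows, columns) pair with enough cells for a given number of plots. The sketch below splits mtcars by cyl (my choice, only to get three panels) and restores the previous par settings afterwards; the panel titles are illustrative.

```r
mt <- mtcars[, c("cyl", "mpg", "am")]

## split by cyl here (an assumption for illustration) to get three panels
dl <- split(mt, mt$cyl)

## n2mfrow() returns c(rows, columns) with enough cells for length(dl) plots
layout_dims <- grDevices::n2mfrow(length(dl))

## par() returns the previous settings, so they can be restored later
old_par <- par(mfrow = layout_dims)
for (nm in names(dl)) {
  boxplot(mpg ~ am, data = dl[[nm]], outline = FALSE,
          main = paste("cyl =", nm))
}
par(old_par)
```

Using a for loop over names(dl) instead of lapply() keeps the group name available for the title and avoids printing the list of return values.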

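However, boxplot(., outline = FALSE) does not really remove your outliers: it only stops drawing them. The box and whiskers are still computed from all observations, and the y-axis is scaled to the whiskers.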
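
One way to check what outline = FALSE changes is to ask boxplot() for its statistics without drawing anything (plot = FALSE). This is a minimal sketch on the mt data from above, rebuilt here in base R so the block is self-contained; the names with_outline and without_outline are only illustrative.

```r
mt <- mtcars[, c("cyl", "mpg", "am")]
## add the same artificial outliers as above
mt <- rbind(mt, data.frame(cyl = c(4, 6, 8), mpg = rep(100, 3), am = 0))

## plot = FALSE returns the computed statistics instead of drawing
with_outline    <- boxplot(mpg ~ cyl, data = mt, plot = FALSE)
without_outline <- boxplot(mpg ~ cyl, data = mt, outline = FALSE, plot = FALSE)

## five-number summaries per group (whisker ends, hinges, median)
same_stats <- identical(with_outline$stats, without_outline$stats)

## values flagged as outliers are still reported in $out
flagged <- without_outline$out
```

If same_stats is TRUE and flagged still contains the mpg = 100 rows, the flag only affects what is drawn, not what is computed.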

ggplot

For your second question, you can first hide your outliers via

geom_boxplot(aes(fill = Status), outlier.shape = NA)

and then adjust the y-range via ylim depending on your data.

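Note. Strictly speaking, outlier.shape = NA is not needed for points that fall outside the ylim range, because ylim drops them anyway (with a warning, and before the box statistics are computed). Setting it still hides any outliers that remain inside the range, and it makes the code more explicit about what you want to do.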
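
Because ylim() removes out-of-range rows before the statistics are computed, the boxes themselves can change. A hedged sketch of the difference, comparing ylim() with coord_cartesian(ylim = ...), which only zooms the view: it uses the mt data from above, and the object names (base_plot, p_lim, p_zoom) are mine.

```r
library(ggplot2)

mt <- mtcars[, c("cyl", "mpg", "am")]
mt <- rbind(mt, data.frame(cyl = c(4, 6, 8), mpg = rep(100, 3), am = 0))

base_plot <- ggplot(mt, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA) +
  facet_wrap(~am)

## ylim() drops the mpg = 100 rows before the box statistics are computed
p_lim <- base_plot + ylim(c(0, 50))

## coord_cartesian() keeps all rows and only restricts the visible range
p_zoom <- base_plot + coord_cartesian(ylim = c(0, 50))

## upper hinges as computed by each plot (layer_data() may warn for p_lim
## about the removed rows)
upper_lim  <- layer_data(p_lim)$upper
upper_zoom <- layer_data(p_zoom)$upper
```

Which variant is appropriate depends on whether the extreme values should influence the quartiles: p_zoom keeps them in the computation, p_lim does not.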

Example with a Built-in Dataset

library(tidyverse)

## plot w/ outliers shown
ggplot(mt, aes(x = factor(cyl), y = mpg)) + 
   geom_boxplot() + 
   facet_wrap(~am)

## plot with outliers removed
ggplot(mt, aes(x = factor(cyl), y = mpg)) + 
   geom_boxplot(outlier.shape = NA) + 
   facet_wrap(~am) + 
   ylim(c(0, 50))

Caveat

In your update you added the plots, and I see that you use free scales. That makes this approach unusable, because ylim cannot be specified on a per-panel basis.
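
One possible workaround for free scales, offered as a sketch rather than a definitive solution, is to compute the box statistics per group yourself and draw them with geom_boxplot(stat = "identity"). No outlier points exist in the plotted data then, so each free panel scales to its own whiskers. The samples_type data frame below is a hypothetical stand-in with made-up values, and boxplot.stats() uses hinges that can differ slightly from the quantiles ggplot2 computes by default.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
## hypothetical stand-in for samples_type (values are invented)
samples_type <- data.frame(
  Status   = rep(c("PAT", "CON"), each = 20, times = 2),
  variable = rep(c("SPP1", "CCL24"), each = 40),
  value    = c(rnorm(40, mean = 2), rnorm(40, mean = 200, sd = 20))
)
## inject one extreme value per variable
samples_type$value[c(1, 41)] <- c(50, 5000)

## whisker ends, hinges and median per Status:variable combination
five <- function(x) boxplot.stats(x)$stats
box_stats <- samples_type %>%
  group_by(variable, Status) %>%
  summarise(ymin = five(value)[1], lower = five(value)[2],
            middle = five(value)[3], upper = five(value)[4],
            ymax = five(value)[5], .groups = "drop")

## draw precomputed boxes; each panel gets its own y-axis
ggplot(box_stats, aes(x = Status, ymin = ymin, lower = lower,
                      middle = middle, upper = upper, ymax = ymax,
                      fill = Status)) +
  geom_boxplot(stat = "identity") +
  facet_wrap(~ variable, scales = "free")
```

Computing the statistics up front keeps the extreme values in the quartile calculation while leaving them out of the drawing, which matches what outline = FALSE does in base R.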

1
votes

Thanks to @thothal, here is the final code that works:

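library(purrr)  ## provides iwalk()

dl <- split(samples_type, samples_type$variable)
par(mfrow = c(ceiling(sqrt(length(dl))), ceiling(sqrt(length(dl)))))
## .x is the data for one variable, .y is that variable's name
iwalk(dl, ~ boxplot(value ~ Status, data = .x,
      outline = FALSE, col = c("red", "limegreen"), main = .y))

"iwalk" comes from the package "purrr"; it calls the function once per list element, passing the element as .x and its name as .y.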

As a reminder, here is the head of the data frame samples_type:

Status   variable        value
PAT       SPP1        1,994629e+00
PAT       SPP1        1,179033e+00
PAT       SPP1        2,901539e+00
PAT       SPP1        1,140857e+00
PAT       SPP1        1,467056e+00
PAT       SPP1        2,579037e+00

The "Status" column can take two values: PAT or CON. The "variable" column can take many values: SPP1, CCL24, ENG56 ...
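
For readers who prefer not to depend on purrr, the iwalk() call in the final code above can be sketched with a plain for loop in base R. The samples_type data frame here is a hypothetical stand-in with invented values, used only to make the block runnable.

```r
set.seed(42)
## hypothetical stand-in for samples_type (values are invented)
samples_type <- data.frame(
  Status   = rep(c("PAT", "CON"), times = 30),
  variable = rep(c("SPP1", "CCL24", "ENG56"), each = 20),
  value    = rexp(60)
)

## one data frame per variable
dl <- split(samples_type, samples_type$variable)

## square grid large enough for all panels
side <- ceiling(sqrt(length(dl)))
old_par <- par(mfrow = c(side, side))

## one boxplot per variable, titled with the variable name;
## Status levels are sorted alphabetically, so "red" is CON
for (nm in names(dl)) {
  boxplot(value ~ Status, data = dl[[nm]],
          outline = FALSE, col = c("red", "limegreen"), main = nm)
}

## restore the previous graphics settings
par(old_par)
```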