
Our datasets contain a few absolutely huge outliers. If we plot the data (e.g. in a boxplot) with the outliers included, the axis is squeezed so much that the plot is useless. Log-scaling doesn't help. But we still want to tell the reader that the outliers exist (how many, and on which side of the boxplot, positive or negative), preferably without adding text to the caption by hand. Is there a good method for this? Preferably in R, Matplotlib or Seaborn.

This is different from e.g. "Ignore outliers in ggplot2 boxplot", because I don't want to ignore the outliers: I want to show that they exist, but not plot them.

Sample code:

# from https://stackguides.com/questions/5677885/ignore-outliers-in-ggplot2-boxplot
> library("ggplot2")
> df = data.frame(y = c(-100, rnorm(100), 100))
> ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))

We see a boxplot that is useless because of the presence of outliers. If we follow the accepted answer at that link, we remove the outliers in a very nice way, but now the reader doesn't realise there were any outliers.

EDIT: A couple of comments/answers ask what I actually want, but that is precisely the difficulty -- I know I want an automated graphical presentation of the outliers (together with the main data), but I don't know exactly what it should look like. I hope someone in the community knows some best practice for this situation. I don't need help writing code to find outliers or add text to plots.

Could you post some sample data and a bit of basic code to visualise the problem? - Thomas Kühn
Possible duplicate of "Ignore outliers in ggplot2 boxplot" - Wimpel
@ThomasKühn, added code. - jmmcd
@Wimpel, not a duplicate, as described in the edit. - jmmcd
I don't think there is an established convention. An idea: draw all the outliers in, say, red, and place a red arrow near the border to signal an out-of-bounds outlier; optionally place its value beside the arrow, in fine red print. - gboffi

1 Answer


The base function boxplot.stats() is what you need. See its help page (?boxplot.stats) for details on how outliers are identified. Here's one way to find and report on the presence of outliers.

  set.seed(123) # make reproducible
  y <- c(rnorm(3, -100), rnorm(3, 100), rnorm(100, 1))
  y <- sample(y) # mix 'em up
  out <- boxplot.stats(y)$out # find outliers
  lo <- out[out < median(y)] # collect low
  hi <- out[out > median(y)] # collect high
  sel.lo <- which(y %in% lo) # collect positions of low
  sel.hi <- which(y %in% hi) # collect positions of high

# Report on what was found
  sprintf("%d low outliers and %d high outliers found",
    length(lo), length(hi))
# [1] "3 low outliers and 3 high outliers found"

You could replace the values identified by sel.lo and sel.hi with placeholders at a more reasonable distance for plotting purposes. Of course, changing the data and reapplying boxplot() would likely change the statistics, and with them which points count as outliers.
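One way to use placeholders without disturbing the statistics is to compute them once with boxplot(..., plot = FALSE), move only the stored outlier positions, and draw the result with bxp(). This is a minimal sketch, assuming the same simulated y as above; the name pad and its factor of 0.25 are arbitrary choices for illustration, not part of any convention.

```r
set.seed(123)                                   # same simulated data as above
y <- c(rnorm(3, -100), rnorm(3, 100), rnorm(100, 1))

b <- boxplot(y, plot = FALSE)                   # statistics from the full data
whisk <- b$stats[c(1, 5), 1]                    # ends of the whiskers
pad <- 0.25 * diff(whisk)                       # illustrative placeholder offset

# Move every outlier to a placeholder just beyond its whisker.
# Only the plotted positions change; box and whiskers stay as computed.
b$out <- ifelse(b$out < b$stats[3, 1], whisk[1] - pad, whisk[2] + pad)

bxp(b, ylim = whisk + c(-2, 2) * pad, main = "Outliers at placeholders")
```

All low outliers end up stacked on one point and all high outliers on another, so the plot shows only that they exist and on which side; the counts still have to come from something like the sprintf() report above.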

If it is important to keep the original boxplot statistics while stopping the outliers from dictating the scale, the axis limits can be set from the values returned by boxplot.stats().

  # ends of the whiskers, widened by 10% (assumes the whiskers straddle 0)
  ylim <- 1.1 * boxplot.stats(y)$stats[c(1, 5)]
  par(mfrow = c(1,2), las = 2, mar = c(1, 4, 3, 1))
  boxplot(y, main = "All data")
  boxplot(y, ylim = ylim, main = "Outliers ignored")
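To show that outliers exist without plotting them, the clipped plot can be annotated with a count and an arrow on each side, along the lines of the suggestion in the comments. The following is only a sketch of one possible presentation, not an established convention; the names n.lo, n.hi and d, the colour, and the label wording are all illustrative choices.

```r
set.seed(123)                                   # same simulated data as above
y <- c(rnorm(3, -100), rnorm(3, 100), rnorm(100, 1))

w <- boxplot.stats(y)$stats[c(1, 5)]            # ends of the whiskers
ylim <- w + c(-0.3, 0.3) * diff(w)              # leave room for annotations
n.lo <- sum(y < ylim[1])                        # points below the visible range
n.hi <- sum(y > ylim[2])                        # points above the visible range

boxplot(y, ylim = ylim, main = "Off-scale outliers flagged")
d <- 0.15 * diff(w)                             # arrow length in data units
if (n.hi > 0) {                                 # arrow pointing off the top
  arrows(0.7, ylim[2] - d, 0.7, ylim[2], length = 0.08, col = "red")
  text(0.7, ylim[2] - d / 2, pos = 4, col = "red",
       labels = sprintf("%d off-scale, max %.1f", n.hi, max(y)))
}
if (n.lo > 0) {                                 # arrow pointing off the bottom
  arrows(0.7, ylim[1] + d, 0.7, ylim[1], length = 0.08, col = "red")
  text(0.7, ylim[1] + d / 2, pos = 4, col = "red",
       labels = sprintf("%d off-scale, min %.1f", n.lo, min(y)))
}
```

Counting points outside ylim, rather than everything in boxplot.stats(y)$out, means that mild outliers that still fit on the axis are drawn as usual and only the truly off-scale ones are summarised.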

[Figure: boxplot examples, with panels "All data" (left) and "Outliers ignored" (right)]