I have a large set of data points that I am trying to plot as a boxplot. Some of the outliers have exactly the same value, and gnuplot draws them side by side on a horizontal line. I found the question "How to set the horizontal distance between outliers in gnuplot boxplot", but it doesn't help much, since changing that distance is apparently not possible.

Is it possible to group the outliers together, print one point and then print a number in brackets beside it to indicate how many points there are? I think this would make it more readable in a graph.

For context, I have three boxplots per x value and six x values in one graph. I am using gnuplot 5 and have already reduced the pointsize, but beyond a certain point that no longer reduces the distance. I hope you can help!

Edit:

    set terminal pdf
    set output 'dat.pdf'
    file0 = 'dat1.dat'
    file1 = 'dat2.dat'
    file2 = 'dat3.dat'
    set pointsize 0.2
    unset title
    set xlabel 'X'
    set ylabel 'Y'
    # Read the column names from the first line of the first file
    header = system('head -1 '.file0)
    N = words(header)

    set xtics ('' 1)
    set for [i=1:N] xtics add (word(header, i) i)

    set style data boxplot
    # One plot command, so that all boxplots end up in the same graph
    plot file0 using (1-0.25):1:(0.2) with boxplot lw 2 lc rgb '#8B0000' fs pattern 16 title 'A', \
         file1 using (1):1:(0.2) with boxplot lw 2 lc rgb '#00008B' fs pattern 4 title 'B', \
         file2 using (1+0.25):1:(0.2) with boxplot lw 2 lc rgb '#006400' fs pattern 5 title 'C', \
         for [i=2:N] file0 using (i-0.25):i:(0.2) with boxplot lw 2 lc rgb '#8B0000' fs pattern 16 notitle, \
         for [i=2:N] file1 using (i):i:(0.2) with boxplot lw 2 lc rgb '#00008B' fs pattern 4 notitle, \
         for [i=2:N] file2 using (i+0.25):i:(0.2) with boxplot lw 2 lc rgb '#006400' fs pattern 5 notitle

What is the best way to implement it with this code already in place?

1 Answer

There is no option to have this done automatically. The steps required to do it manually in gnuplot are:

(In the following I assume that the data file data.dat has only a single column.)

  1. Analyze your data with stats to determine the boundaries for the outliers:

    stats 'data.dat' using 1
    range = 1.5 # (this is the default value of the `set style boxplot range` value)
    lower_limit = STATS_lo_quartile - range*(STATS_up_quartile - STATS_lo_quartile)
    upper_limit = STATS_up_quartile + range*(STATS_up_quartile - STATS_lo_quartile)
    
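  2. Count only the outliers and write the counts to a temporary file. `smooth frequency` sums the second column for each distinct value, so every outlier value ends up with its number of occurrences and every other value with 0: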

    set table 'tmp.dat'
    plot 'data.dat' using 1:($1 > upper_limit || $1 < lower_limit ? 1 : 0) smooth frequency
    unset table
    
  3. Plot the boxplot without the outliers, and the outliers with the labels plotting style:

    set style boxplot nooutliers
    plot 'data.dat' using (1):1 with boxplot,\
         'tmp.dat' using (1):($2 > 0 ? $1 : 1/0):(sprintf('(%d)', int($2))) with labels offset 1,0 left point pt 7
    

These three steps need to be repeated for every single boxplot.
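
For several columns, the repetition could be wrapped in a loop. The following is only a sketch, not tested against the real data: it assumes gnuplot 5, takes `dat1.dat` and the `head -1` header trick from the question, and the names `file`, `k`, `lo`, `hi`, the `S` prefix and the scratch files `outliers_<i>.dat` are made up for illustration.

```gnuplot
# Sketch only: repeat the three steps for every column of one file.
# Assumptions: the first line of the file is a header with one word per
# column, and outliers_<i>.dat are scratch files that may be overwritten.
file = 'dat1.dat'
N = words(system('head -1 '.file))
k = 1.5    # default of `set style boxplot range`

do for [i=1:N] {
    # Step 1: quartiles of column i, stored as S_lo_quartile, S_up_quartile
    stats file using i name 'S' nooutput
    iqr = S_up_quartile - S_lo_quartile
    lo = S_lo_quartile - k*iqr
    hi = S_up_quartile + k*iqr

    # Step 2: count how often each outlier value occurs in column i
    set table sprintf('outliers_%d.dat', i)
    plot file using i:(column(i) > hi || column(i) < lo ? 1 : 0) smooth frequency
    unset table
}

# Step 3: boxplots without outliers, plus one labelled point per outlier value
set style boxplot nooutliers
plot for [i=1:N] file using (i):i:(0.2) with boxplot notitle, \
     for [i=1:N] sprintf('outliers_%d.dat', i) \
         using (i):($2 > 0 ? $1 : 1/0):(sprintf('(%d)', int($2))) \
         with labels offset 1,0 left point pt 7 notitle
```

For the three-files-per-x layout in the question, the same loop would presumably be run once per file with its own scratch file names, and the x positions `(i)` shifted to `(i-0.25)` and `(i+0.25)` in both the boxplot and the labels part.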

Disclaimer: This procedure should work in principle, but without example data I could not test it.
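
To try the procedure without a real data file, a small self-contained check along these lines may help. The values are invented: 30 occurs three times and -20 once, and both are far enough from the rest that they should fall outside the whiskers. It also assumes a gnuplot 5 build that accepts `stats` on a datablock and `set table $datablock`, so no temporary file is needed; `$DATA`, `$OUTLIERS`, `k` and `iqr` are illustrative names.

```gnuplot
# Made-up sample data (hypothetical values, one per line)
$DATA << EOD
-20
1
2
2
3
3
3
3
4
4
4
4
5
5
5
6
6
7
30
30
30
EOD

# Step 1: outlier boundaries from the quartiles
stats $DATA using 1 nooutput
k = 1.5    # default of `set style boxplot range`
iqr = STATS_up_quartile - STATS_lo_quartile
lower_limit = STATS_lo_quartile - k*iqr
upper_limit = STATS_up_quartile + k*iqr

# Step 2: per-value outlier counts, written to a datablock instead of a file
set table $OUTLIERS
plot $DATA using 1:($1 > upper_limit || $1 < lower_limit ? 1 : 0) smooth frequency
unset table

# Step 3: boxplot without outliers, one labelled point per outlier value
set style boxplot nooutliers
set xrange [0:2]
plot $DATA using (1):1 with boxplot notitle, \
     $OUTLIERS using (1):($2 > 0 ? $1 : 1/0):(sprintf('(%d)', int($2))) \
         with labels offset 1,0 left point pt 7 notitle
```

Printing `$OUTLIERS` with `print $OUTLIERS` after step 2 is a quick way to see which values were counted as outliers before looking at the plot.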