3
votes

To quickly visualize the differences between measurements, I want to use gnuplot to draw two (later more) boxplots combined in a single plot. Basically, I want to visualize the five-number summary of each measurement (Min., 1st Qu., Median, 3rd Qu., Max.), plus the Mean.
Each column in my 'datafile' represents samples of a measurement.
My data is in this form:

    A      B    C D
    1.008 1.008 . .
    0.909 0.909 . .
    0.975 0.975
    2.647 2.647
    6.530 1.901
    1.819 0.909
    1.819 0.909
    2.695 0.909
    0.529 0.529
    0.964 0.964
    2.728 0.909
    1.819 0.909
    4.133 1.108
   11.275 6.133
    5.920 5.920
        .     .

and I would like it to look like the boxplot demo.
However, I cannot get the demo to work: it seems to use a third column to shift one boxplot to the right, and I do not really understand how that works.
For clarification, in R I would do something like this:

    par(mfrow=c(1,3))
    b1 <- boxplot(datafile$A)
    b2 <- boxplot(datafile$B)
    b3 <- boxplot(datafile$C)

I'm also wondering how I can plot the boxplots on the same scale. I'm worried that the few really high values might stretch the max. whiskers of the boxplot so much that the box itself becomes too tiny for me to see differences between the medians of the two boxes.


Edit:
The suggested solution was OK until I tried to also plot the rest of my data. With all of my data, the plots become so crowded that it is impossible to see anything.
Below is an example with only the first 1000 entries of the rest of my data.
[Image: the resulting plot, crowded with outlier points]

How can I include the outliers in the boxes? (I do not want to discard them.)

2 Answers

3
votes

In the demo, a fixed number is used to set the x-position of each boxplot:

    plot 'data.txt' using (0):1 with boxplot

This plots the data in the first column as a boxplot placed at the x-value 0. For two boxplots it is, accordingly:

    set style data boxplot
    plot 'data.txt' using (0):1, '' using (1):2

Gnuplot cannot determine the number of columns automatically, but you can automate this to some extent by reading the header line through the shell:

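    file = 'data.txt'
    # read the header line via the shell (requires the head command)
    header = system('head -n 1 '.file)
    N = words(header)

    # label each x-position with its column name
    set xtics ('' 1)
    set for [i=1:N] xtics add (word(header, i) i)

    set style data boxplot
    unset key
    # one boxplot per column, placed at x = i
    plot for [i=1:N] file using (i):i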

If I duplicate the two columns you showed, and label them with A B C D, I get the following plot with gnuplot 4.6.3:

[Image: four boxplots side by side, labeled A, B, C and D]

As you can see, outliers are not included in the whiskers; they are drawn as separate points. To hide the outliers, use set style boxplot nooutliers.
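As a rough sketch of how hiding the outliers could be combined with a fixed y-range, so that a few large values do not squash the boxes: the file name 'data.txt' and the limits 0 and 8 are assumptions chosen for the sample data in the question, not values that will suit every data set.

```gnuplot
# Sketch only: 'data.txt' and the y-range limits are assumed values.
set style data boxplot
set style boxplot nooutliers   # do not draw the points beyond the whiskers
set yrange [0:8]               # assumed limits; keeps the boxes readable
# Alternative to a fixed range: a logarithmic y-axis
# set logscale y
unset key
plot 'data.txt' using (1):1, '' using (2):2
```

A fixed y-range also puts all boxplots on the same scale, at the cost of clipping whatever lies above the upper limit; the logarithmic axis keeps every value visible but makes the box heights harder to compare by eye.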

0
votes

I had the same issue and found out the reason for it. If you have the value for an outlier multiple times in your data set, then gnuplot will plot them in a line, resulting in a graph similar to what you have shown.

Apparently you can't avoid this or suppress the duplicate points. What you can do is tell gnuplot to extend the whiskers so that they mark the maximum and minimum values as well. According to Wikipedia, this is one of the accepted conventions for whiskers. I don't know whether it suits your plots, but it resolves the issue by circumventing it.
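A minimal sketch of that whisker setting, assuming the data file is called 'data.txt' and that the fraction option of set style boxplot accepts 1.0 in your gnuplot version (worth checking with help set style boxplot):

```gnuplot
# Sketch only: 'data.txt' is an assumed file name.
set style data boxplot
# Ask for whiskers that span 100 % of the points, so that no value
# is left over to be drawn as a separate outlier point.
set style boxplot fraction 1.0
unset key
plot 'data.txt' using (1):1, '' using (2):2
```

With this setting no data is discarded; the extreme values simply end up at the whisker ends instead of being drawn as individual points.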

I'm not sure whether this helps you, but maybe somebody who comes across this will find it useful, or can even suggest a way to remove the duplicate points for an outlier.