3
votes

I have been learning Gnuplot for about a day now, and I would like to use a boxplot to spot outliers in a data set at a glance.

So let us say I am conducting an experiment:

  • On 10 subjects
  • I make the 10 subjects repeat a task 100 times, to reach 3 specific targets.
  • I collect how many times they reach Target1, Target2, Target3.

These results are gathered in the file data_File_new.dat, shown below:

    Name    Target1   Target2   Target3
    subject1    10  30  50
    subject2    11  31  51
    subject3    9   29  49
    subject4    12  32  52
    subject5    8   28  48
    subject6    13  33  53
    subject7    7   27  47
    subject8    50  34  54
    subject9    6   50  46
    subject10 15    35  20  

Now I create a boxplot from this data:

    file = 'data_File_new.dat'
    header = system('head -1 '.file);
    N=words(header)
    set title 'BoxPlot Subject Success'
    set ylabel 'Number Of Success'
    set xtics border in scale 0,0 nomirror norotate  offset character 0, 0, 0 autojustify
    set xtics norangelimit
    set xtics rotate -45
    set xtics ('' 2)
    set for [i=2:N] xtics add (word(header, i) i)
    set style data boxplot
    plot for [i=2:N] file using (i):i

The result is a boxplot with outliers plotted as solid points (I wanted to post the picture, but I need 10 reputation to post an image). It tells me whether there are outliers or not. However, I want to know more: I want to know who the outliers are, that is:

  • Subject 8 is an outlier for Target 1
  • Subject 9 is an outlier for Target 2
  • Subject 10 is an outlier for Target 3

Since Gnuplot knows these points are outliers, I expect Gnuplot to store them in some kind of list. I would like to tell Gnuplot 'plot the outliers and label them with the word of the first column (subjectx) corresponding to the line they belong to'.

Then, when I open the boxplot, I can see at a glance not only that there are outliers but also who they are.

Does anyone know how to do this? I looked on the forum and saw some people doing this in R but not in Gnuplot.

1
No, you cannot automatically label those outliers. The statistical calculations are done internally and you cannot 'attach' a label to any of the results. You could only later, if you know the values, use the labels plotting style for some labelling. – Christoph
Thank you. I am going to look for a solution in R. – Robin74
@Christoph in case you're interested, I have posted a solution. – Tom Fenech
It works. Thanks a lot for your expertise Tom, it is very helpful. – Robin74

1 Answer

2
votes

It's not the prettiest bit of gnuplot code but it can be done!

Gnuplot's stats command can be used to obtain the upper and lower quartiles, which are what the boxplot is built from. You can then use a conditional expression to plot, with labels, the points that lie outside the whisker range. The tricky part is that the plot command is built up as a string before being passed to eval at the end. Like I said, not too pretty!

    file = 'data_File_new.dat'
    header = system('head -1 '.file)
    N=words(header)
    set title 'BoxPlot Subject Success'
    set ylabel 'Number Of Success'
    set xtics border in scale 0,0 nomirror norotate  offset character 0, 0, 0 autojustify
    set xtics norangelimit
    set xtics rotate -45
    set xtics ('' 2)
    set for [i=2:N] xtics add (word(header, i) i)
    # whiskers reach at most r times the interquartile range beyond the box
    r = 1.5
    set style boxplot range r
    unset key
    # start with the boxplots, then append one labels plot per column
    cmd = "plot for [i=2:N] file using (i):i with boxplot"
    do for [i=2:N] {
        # every ::1 skips the header line
        stats file using i every ::1 nooutput
        lq = STATS_lo_quartile
        uq = STATS_up_quartile
        ir = uq - lq
        min = lq - r * ir
        max = uq + r * ir
        # %g keeps fractional limits; %d would truncate them to integers
        cmd = cmd . sprintf(", file using (%d):($%d < %g || $%d > %g ? $%d : 1/0):1 every ::1 with labels offset 5,0", i, i, min, i, max, i)
    }
    eval cmd

[Image: final plot with labeled outliers]
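
If a label turns up (or fails to turn up) where you don't expect it, it may help to print the limits that the script derives from stats for each column. This is only a minimal sketch: it assumes the same data_File_new.dat, hard-codes the column range 2:4 instead of reading it from the header, and needs a gnuplot version that supports do for (4.6 or later). The exact quartile values depend on how gnuplot computes them, so no expected output is shown.

```gnuplot
file = 'data_File_new.dat'
r = 1.5   # same factor as 'set style boxplot range r'

# columns 2..4 hold Target1..Target3 (hard-coded here as an assumption)
do for [i=2:4] {
    # every ::1 skips the header line, as in the script above
    stats file using i every ::1 nooutput
    ir = STATS_up_quartile - STATS_lo_quartile
    lower = STATS_lo_quartile - r * ir
    upper = STATS_up_quartile + r * ir
    print sprintf("column %d: values below %g or above %g get a label", i, lower, upper)
}
```

Any row whose value in that column falls outside the printed limits is one that the conditional in the plot command will label with the name from column 1.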