3
votes

I plotted a simple data matrix

39  135 249 1   91  8   28  0   0   74  17  65  560
69  0   290 26  254 88  31  0   18  53  4   63  625
66  186 344 0   9   0   0   0   18  54  0   74  554
80  41  393 0   0   0   2   0   6   51  0   65  660
271 112 511 1   0   274 0   0   0   0   16  48  601
88  194 312 0   110 0   0   0   44  13  2   76  624
198 147 367 0   15  0   0   3   9   44  3   39  590

using a standard boxplot (i.e. one whose whiskers extend 1.5 × IQR from Q1 and Q3). Each column is a variable and each row an observation.

Nevertheless, I obtained two different plots from R (RStudio 1.0.44) and MATLAB 2018. In particular, the whiskers extend differently.

In Matlab I'm using the following code:

% clearing workspace
clear all;
close all;
clc;

% change to the directory of the active script, where the txt data file is
tmp = matlab.desktop.editor.getActive;
cd(fileparts(tmp.Filename));
clear tmp;

% reading data
df = readtable('pippo.txt', 'Delimiter', '\t', 'ReadVariableNames', false);
df = table2array(df);

figure(1);
boxplot(df(:, 1:end-1), 'Whisker', 1.5);
ylim([0 600]);

which produces the following graph: [image: MATLAB boxplot]

In R I'm using the following code:

rm(list = ls())

# getting the directory of the active document
working_dir <- dirname(rstudioapi::getActiveDocumentContext()$path)

# setting the working directory where I find the txt file with data
setwd(working_dir)

df <- read.table("pippo.txt")
jpeg('r_boxplot.jpg')
boxplot(df[,1:12], las=2, ylim=c(0,600), range=1.5)
dev.off()

which produces the following graph: [image: R boxplot]

Observation 1: if I omit the parameters 'Whisker' and 'range' from the two scripts, I obtain the same two graphs as above; that is expected, as 1.5 seems to be the default whisker value in both.

Observation 2: both MATLAB and R seem to read the data correctly; both workspaces show the same matrix.

What am I missing? Which graph should I trust?


2 Answers

2
votes

explanation for R boxplot code

MATLAB code for boxplots

Going through both functions, I found that they appear to calculate exactly the same thing, even down to how they define the IQR.

R claims to do the following for the boxplot:

upper whisker = min(max(x), Q_3 + 1.5 * IQR)
lower whisker = max(min(x), Q_1 – 1.5 * IQR)
where IQR = Q_3 – Q_1, the box length.

MATLAB claims to do this for its boxplot:

p75 + w(p75 – p25) 
p25 – w(p75 – p25)
where p25 and p75 are the 25th and 75th percentiles, respectively.

Even the definition of the whisker extension is the same, with MATLAB stating:

%   The plotted whisker extends to the adjacent value, which is the most 
%   extreme data value that is not an outlier. Set whisker to 0 to give 
%   no whiskers and to make every point outside of p25 and p75 an outlier.

And R states:

Range determines how far the plot whiskers extend out from the box. If range is 
positive, the whiskers extend to the most extreme data point which is no more than 
range times the interquartile range from the box. A value of zero causes the whiskers 
to extend to the data extremes.

Personally, I suspected it had to do with the way the underlying computations are performed.

Edit: After experimenting with the code, I can confirm that the difference comes from how the quartiles are computed.

R code

> quantile(a, c(.25, .75))
25% 75% 
301 380 
> 380+1.5*(380-301)
[1] 498.5
> 301-1.5*(380-301)
[1] 182.5

Matlab code

prctile(te,[25,75])
ans =

  295.5000  386.5000

W75 = p75 + 1.5*(p75-p25)
W25 = p25 - 1.5*(p75-p25)

W75 =

   523


W25 =

   159

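I used the 3rd column of your data to test how the quantiles are calculated. As you can see, the 25% and 75% values are not very different, but they differ just enough to give wider whisker cutoffs in MATLAB. For this column the maximum, 511, lies above R's upper cutoff (498.5), so R draws it as an outlier, but below MATLAB's (523), so MATLAB extends the whisker up to it.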
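For anyone without both tools at hand, here is a minimal Python sketch of the two quantile conventions as I understand them from the documentation: R's default `quantile()` interpolates at position (n - 1) * p, while MATLAB's `prctile` treats the i-th sorted value as the 100 * (i - 0.5) / n percentile. The function names `quantile_r_type7` and `percentile_matlab` are made up for this illustration and are not part of either tool.

```python
def quantile_r_type7(values, p):
    """Interpolate at 0-based position (n - 1) * p (assumed R default, type 7)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def percentile_matlab(values, p):
    """Interpolate at 0-based position n * p - 0.5, clamped to the data range
    (assumed to mirror MATLAB's prctile)."""
    xs = sorted(values)
    n = len(xs)
    pos = min(max(n * p - 0.5, 0), n - 1)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


# Third column of the data matrix from the question.
column3 = [249, 290, 344, 393, 511, 312, 367]

for name, q in (("R", quantile_r_type7), ("MATLAB", percentile_matlab)):
    q1, q3 = q(column3, 0.25), q(column3, 0.75)
    iqr = q3 - q1
    # Quartiles, then the lower and upper 1.5 * IQR cutoffs.
    print(name, q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```

This should reproduce the quartiles and cutoffs shown in the two console outputs above. Note that R's `boxplot` actually takes its box edges from Tukey's hinges rather than from `quantile()`; for this seven-row column the two give the same values.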

1
votes

From the MATLAB boxplot documentation:

On each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually using the '+' symbol.

You likely want to check out the outlier computation.

Under the optional 'Whisker' input (default 1.5), you can see this explanation:

boxplot draws points as outliers if they are greater than q3 + w × (q3 – q1) or less than q1 – w × (q3 – q1), where w is the maximum whisker length, and q1 and q3 are the 25th and 75th percentiles of the sample data, respectively.
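To make the rule concrete, here is a small Python sketch of it, applied to the third column of the question's data. The helper name `whisker_ends` is hypothetical, the quartile values are the ones reported in Hojo's answer, and the special handling that both tools give to a whisker length of zero is deliberately left out.

```python
def whisker_ends(values, q1, q3, w=1.5):
    """Return (lower end, upper end, outliers) for a positive whisker length w."""
    reach = w * (q3 - q1)
    lower_fence, upper_fence = q1 - reach, q3 + reach
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    # Assumption: if no data point lies inside the fences, fall back to the box edges.
    return min(inside, default=q1), max(inside, default=q3), outliers


# Third column of the data matrix from the question.
column3 = [249, 290, 344, 393, 511, 312, 367]

r_result = whisker_ends(column3, q1=301, q3=380)           # quartiles from R
matlab_result = whisker_ends(column3, q1=295.5, q3=386.5)  # quartiles from MATLAB

print("R quartiles:     ", r_result)
print("MATLAB quartiles:", matlab_result)
```

With R's quartiles the value 511 is flagged as an outlier and the upper whisker stops at 393; with MATLAB's quartiles nothing is flagged and the whisker reaches 511. The rule is the same in both cases; only the quartiles fed into it differ.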

If you set the 'Whisker' option to 0.7, you get the same plot as seen in your R code:

boxplot(df(:, 1:end-1), 'Whisker', 0.7);

[image: MATLAB boxplot with 'Whisker' set to 0.7]
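As a rough cross-check of the 0.7 setting, the Python sketch below computes the whisker ends of every plotted column under three settings: R-style quartiles with 1.5, MATLAB-style quartiles with 1.5, and MATLAB-style quartiles with 0.7. The helpers `quartiles` and `whiskers` are hypothetical names, and the quartile formulas are my reading of R's default `quantile()` and MATLAB's `prctile`, so treat it as an approximation of what the two tools do rather than their actual code.

```python
def quartiles(values, tool):
    """Return (q1, q3) under the assumed convention of the given tool."""
    xs = sorted(values)
    n = len(xs)

    def at(pos):
        # Linear interpolation at a 0-based fractional position, clamped to the data.
        pos = min(max(pos, 0), n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    if tool == "R":
        return at((n - 1) * 0.25), at((n - 1) * 0.75)
    return at(n * 0.25 - 0.5), at(n * 0.75 - 0.5)


def whiskers(values, tool, w):
    """Most extreme data points lying within w * IQR of the box."""
    q1, q3 = quartiles(values, tool)
    reach = w * (q3 - q1)
    inside = [v for v in values if q1 - reach <= v <= q3 + reach]
    return min(inside), max(inside)


# Data matrix from the question; the last column is dropped, as in both scripts.
rows = [
    [39, 135, 249, 1, 91, 8, 28, 0, 0, 74, 17, 65, 560],
    [69, 0, 290, 26, 254, 88, 31, 0, 18, 53, 4, 63, 625],
    [66, 186, 344, 0, 9, 0, 0, 0, 18, 54, 0, 74, 554],
    [80, 41, 393, 0, 0, 0, 2, 0, 6, 51, 0, 65, 660],
    [271, 112, 511, 1, 0, 274, 0, 0, 0, 0, 16, 48, 601],
    [88, 194, 312, 0, 110, 0, 0, 0, 44, 13, 2, 76, 624],
    [198, 147, 367, 0, 15, 0, 0, 3, 9, 44, 3, 39, 590],
]
columns = list(zip(*rows))[:12]

differs_at_15 = []  # columns where MATLAB-style whiskers at 1.5 differ from R-style
differs_at_07 = []  # columns where MATLAB-style whiskers at 0.7 differ from R-style
for i, col in enumerate(columns, start=1):
    r = whiskers(col, "R", 1.5)
    if whiskers(col, "MATLAB", 1.5) != r:
        differs_at_15.append(i)
    if whiskers(col, "MATLAB", 0.7) != r:
        differs_at_07.append(i)
    print(i, r, whiskers(col, "MATLAB", 1.5), whiskers(col, "MATLAB", 0.7))

print("differ at 1.5:", differs_at_15)
print("differ at 0.7:", differs_at_07)
```

Under these assumptions the whisker ends at 0.7 agree with the R-style ones for all twelve columns, while at 1.5 they differ for columns 1, 3 and 5. The box edges themselves still follow each tool's own quartiles, so the two plots are not identical in every detail, and 0.7 is a value that happens to work for this data rather than a general conversion factor.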

The equivalent input for R's boxplot is range (docs):

Range: this determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.

This appears to be the same definition as the one quoted above from the MATLAB docs; please refer to Hojo's answer for slightly more detail about the IQR computation.