3
votes

From https://en.wikipedia.org/wiki/Box_plot

The whisker of the box plot has the following possible definitions:

  • the minimum and maximum of all of the data[1]
  • the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile
  • one standard deviation above and below the mean of the data
  • the 9th percentile and the 91st percentile
  • the 2nd percentile and the 98th percentile.

I am wondering, in pandas:

df['data'].plot(kind = 'box',  sym='bD')

which definition is the whisker using?

Also, for the matplotlib library:

ax.boxplot(dfa.duration)

which definition is the whisker using?

Thanks!


1 Answer

7
votes

The boxplot documentation says this about the whiskers:

whis : float, sequence, or string (default = 1.5)

As a float, determines the reach of the whiskers beyond the first and third quartiles. In other words, where IQR is the interquartile range (Q3-Q1), the upper whisker will extend to the last datum less than Q3 + whis*IQR. Similarly, the lower whisker will extend to the first datum greater than Q1 - whis*IQR. Beyond the whiskers, data are considered outliers and are plotted as individual points. Set this to an unreasonably high value to force the whiskers to show the min and max values. Alternatively, set this to an ascending sequence of percentiles (e.g., [5, 95]) to set the whiskers at specific percentiles of the data. Finally, whis can be the string 'range' to force the whiskers to the min and max of the data.

The only definition in the question's list that cannot be set directly is "one standard deviation"; all the others are readily set with this argument (for example, whis=[9, 91] or whis=[2, 98] for the percentile variants). The default, whis=1.5, is the 1.5 IQR definition.
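As a rough check of the above, here is a minimal sketch that compares the default whiskers against a hand-computed 1.5 IQR rule. It uses matplotlib.cbook.boxplot_stats, the helper that computes the statistics behind boxplot; the sample data is made up for illustration. The min/max case is written as whis=[0, 100] rather than 'range', on the assumption that the percentile-pair form is accepted by the matplotlib version in use.

```python
import numpy as np
from matplotlib import cbook

# Made-up sample: 200 normal draws plus two obvious outliers.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(size=200), [8.0, -9.0]])

# Default whiskers (whis=1.5) as computed by matplotlib.
default = cbook.boxplot_stats(data)[0]

# The same rule by hand: furthest data points within 1.5 IQR of the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
inside = data[(data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)]
manual_lo, manual_hi = inside.min(), inside.max()

# Percentile-based whiskers, and whiskers spanning the full range.
pct = cbook.boxplot_stats(data, whis=[5, 95])[0]
full = cbook.boxplot_stats(data, whis=[0, 100])[0]

print("default:", default["whislo"], default["whishi"])
print("manual: ", manual_lo, manual_hi)
print("5/95:   ", pct["whislo"], pct["whishi"])
print("min/max:", full["whislo"], full["whishi"])
```

The "default" and "manual" lines should agree, and the two injected outliers should appear in default["fliers"] rather than at the whisker ends.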

pandas.DataFrame.boxplot calls the matplotlib function, so the whiskers in the two plots should be identical.
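One way to confirm this is to draw the same data with both libraries and compare the whisker end points. This is only a sketch with made-up data; it assumes return_type="dict" is available on DataFrame.boxplot so that the matplotlib artists can be inspected, and it uses the non-interactive Agg backend so that no window is needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up sample with one obvious outlier.
rng = np.random.default_rng(1)
df = pd.DataFrame({"data": np.concatenate([rng.normal(size=100), [7.0]])})

fig, (ax_pd, ax_mpl) = plt.subplots(1, 2)

# pandas version: return the dict of matplotlib artists.
pd_artists = df.boxplot(column="data", ax=ax_pd, return_type="dict")

# Plain matplotlib version on the same values.
mpl_artists = ax_mpl.boxplot(df["data"].to_numpy())

def whisker_ends(artists):
    # Each whisker is a line from the box edge to the whisker end.
    return [line.get_ydata()[1] for line in artists["whiskers"]]

pd_ends = whisker_ends(pd_artists)
mpl_ends = whisker_ends(mpl_artists)
print(pd_ends, mpl_ends)
```

If the two printed lists match, both calls are using the same default (whis=1.5). Passing whis=... to df.boxplot should be forwarded to matplotlib in the same way.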