I'd like to show the number of observations on each boxplot, as shown here: https://python-graph-gallery.com/38-show-number-of-observation-on-boxplot/

I'm able to get the medians and counts in lists as the link above shows. However, I have a factorplot with hue, so each x-tick holds several side-by-side boxes and the tick positions alone don't give me the x-position of each box.

Using the seaborn tips data set, I have the following:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

sns.set_style("whitegrid")
tips = sns.load_dataset("tips")

# Note: in seaborn >= 0.9, factorplot is named catplot and size is named height
g = sns.factorplot(x="sex", y="total_bill", hue="smoker",
                   col="time", data=tips, kind="box", size=4, aspect=.7)

# Calculate number of obs per group & median to position labels
medians = tips.groupby(['time','sex','smoker'])['total_bill'].median().values
nobs =  tips.groupby(['time','sex','smoker']).size()
nobs = [str(x) for x in nobs.tolist()]
nobs = ["n: " + i for i in nobs]


plt.show()

Here is the plot

I'd like to place the "n: [# of observations]" label right above each median, and I'm wondering if there's a way to get the x-position of each box. Also, assume some groups don't always have both male and female, so it can't just be hard-coded.

1 Answer

There are several tricky things going on here:

  1. You have two axes (subplots), one for each value of time. You need to iterate through these.

  2. You have multiple x-offset boxplots on each axis. You need to account for this.

  3. Once you know where you're drawing, you need to know which group is drawn there, since ordering ('Yes' first or 'No' first? 'Male' first or 'Female'?) isn't guaranteed.

Fortunately, if you keep the groupby results indexed (in this case, multi-indexed by time, sex, and smoker), you only need the time, sex, and smoker labels to look up the correct value. All three are available with a little digging: time from the axes title, sex from the x-tick labels, and smoker from the legend labels. The resulting code looks something like the following (note the changes to medians and nobs):

# Keep both results as multi-indexed Series so they can be looked up by label
medians = tips.groupby(['time','sex','smoker'])['total_bill'].median()
nobs = tips.groupby(['time','sex','smoker']).apply(lambda x: 'n: {}'.format(len(x)))

for ax in plt.gcf().axes:
    # Facet titles look like "time = Lunch"; keep the part after " = "
    ax_time = ax.get_title().partition(' = ')[-1]

    for tick, label in enumerate(ax.get_xticklabels()):
        ax_sex = label.get_text()

        for j, ax_smoker in enumerate(ax.get_legend_handles_labels()[1]):
            # For two hue levels this gives -0.2 and +0.2 around the tick
            x_offset = (j - 0.5) * 2/5
            med_val = medians[ax_time, ax_sex, ax_smoker]
            num = nobs[ax_time, ax_sex, ax_smoker]

            # Draw the count just above the median line of the matching box
            ax.text(tick + x_offset, med_val + 0.1, num,
                    horizontalalignment='center', size='x-small', color='w', weight='semibold')

Resulting plot
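
The `(j - 0.5) * 2/5` offset in the code above is specific to two hue levels. As a rough sketch of how it might generalize, the helper below computes the box centers for any number of hue levels. The name `hue_offsets` is hypothetical, and `group_width=0.8` is an assumption based on seaborn's default boxplot `width`; adjust it if you pass a different `width`.

```python
def hue_offsets(n_levels, group_width=0.8):
    """Return the x-offset of each dodged box center relative to its tick.

    group_width=0.8 is an assumption (seaborn's default boxplot width).
    """
    # Each hue level gets an equal slice of the group width
    box_width = group_width / n_levels
    # Center the slices around 0 so the tick sits in the middle of the group
    return [(j - (n_levels - 1) / 2) * box_width for j in range(n_levels)]


# With two hue levels this reproduces the offsets used in the answer
two_level = hue_offsets(2)
```

In the loop, `x_offset = hue_offsets(len(labels))[j]` would then replace the hard-coded expression, where `labels` is the list of legend labels.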

To verify, here is the nobs series:

time    sex     smoker
Lunch   Male    Yes       n: 13
                No        n: 20
        Female  Yes       n: 10
                No        n: 25
Dinner  Male    Yes       n: 47
                No        n: 77
        Female  Yes       n: 23
                No        n: 29