I have a dataset that looks something like this:

status age_group
failure 18-25
failure 26-30
failure 18-25
success 41-50

and so on...

sns.countplot(y='status', hue='age_group', data=data)

When I run countplot on the full dataset, I get this:

dataset countplot hued by age_group

My question: how can I plot a graph that is normalized by the number of occurrences of each age_group, directly with seaborn? Without this adjustment the graph is really misleading; for example, the >60 age group appears the most simply because it contains more people. I searched the documentation but could not find a built-in function for this case.

Thanks in advance.

1 Answer

The easiest way to show the proportions is via sns.histplot(..., multiple='fill'). To force an order for the age groups and the status, creating ordered categories can help.

Here is some example code, tested with seaborn 0.11.1:

import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import numpy as np
import pandas as pd

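# Random example data: 100 rows, each with a status and an age group
data = pd.DataFrame({'status': np.random.choice(['Success', 'Failure'], 100, p=[.7, .3]),
                     'age_group': np.random.choice(['18-45', '45-60', '> 60'], 100, p=[.2, .3, .5])})
# Ordered categoricals fix the order of the age groups and of the status levels
data['age_group'] = pd.Categorical(data['age_group'], ordered=True, categories=['18-45', '45-60', '> 60'])
data['status'] = pd.Categorical(data['status'], ordered=True, categories=['Failure', 'Success'])
# multiple='fill' stretches each age group's bar to 100%, showing proportions instead of counts
ax = sns.histplot(y='age_group', hue='status', multiple='fill', data=data)
# Format the x axis as percentages (1.0 corresponds to 100%)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel('Percentage')
plt.show()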

histplot with multiple='fill'

Now, to create the exact plot from the question, some pandas manipulations can build the required dataframe:

  • count the values for each age group and status
  • divide these by the total for each age group
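As a rough illustration of these two steps, here is a minimal sketch on a tiny made-up frame. The rows and the lowercase labels are hypothetical, chosen only to mirror the shape of the question's data; the result should match what pd.crosstab(..., normalize='index') returns.

```python
import pandas as pd

# Hypothetical toy data in the shape of the question's dataset.
data = pd.DataFrame({
    'status': ['failure', 'failure', 'failure', 'success', 'success', 'success'],
    'age_group': ['18-25', '26-30', '18-25', '41-50', '18-25', '26-30'],
})

# Step 1: count the rows for each (age_group, status) pair.
counts = data.groupby(['age_group', 'status']).size().unstack(fill_value=0)

# Step 2: divide each row by the total for its age group,
# so that every age group sums to 1.
proportions = counts.div(counts.sum(axis=1), axis=0)

print(proportions)
```

Each row of proportions now describes one age group independently of how many people it contains, which is what removes the bias toward the largest group.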

There are probably shortcuts, but this is how I juggled it with pandas (edited following a comment by @PatrickFitzGerald to use pd.crosstab()):

# df = data.groupby(['status', 'age_group']).agg(len).reset_index(level=0) \
#     .pivot(columns='status').droplevel(level=0, axis=1)
# totals = df.sum(axis=1)
# df['Success'] /= totals
# df['Failure'] /= totals
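# Fraction of each status within every age group (each row sums to 1)
df = pd.crosstab(data['age_group'], data['status'], normalize='index')
# Long format: one row per (age_group, status) pair, as barplot expects
df1 = df.melt(var_name='status', value_name='percentage', ignore_index=False).reset_index()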
ax = sns.barplot(y='status', x='percentage', hue='age_group', palette='rocket', data=df1)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel('Percentage')
ax.set_ylabel('')
plt.show()

percentage barplot