4
votes

I am trying to replicate the very useful tabulate twoway function in Stata, where on the same output you can see the frequency, the corresponding row percentage and the corresponding column percentage for each possible combination of two categorical variables in the data.

That is, each cell in the resulting table contains three stacked values corresponding to the above numbers.

Example output in Stata: this is the tabulation of a single variable car type, where, in the data, i have 52 observations where car type is domestic and 22 observations where car type equals foreign.

 Domestic   Foreign  Total

 52         22       74 
 70.27      29.73    100.00 
 100.00     100.00   100.00 

Is it possible to do that in Pandas? Example:

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6,
                   'B' : ['A', 'B', 'C'] * 8,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D' : np.random.randn(24),
                   'E' : np.random.randn(24)})

something that would combine

pd.crosstab(df.A,df.B)

with pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)

and pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=0)

on the same output. What do you think? I am thinking about aggfunc but here I need to get the whole dataframe, not just a Series...

Thanks!

1

1 Answers

2
votes

Consider concatenating all three into one df, using crosstab's margins. Below uses your example:

newdf = pd.concat([pd.crosstab(df.A, df.B, margins=True),
                   pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=0),                   
                   pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)])

newdf = newdf.loc[['one','two','three','All']]              # RE-ORDER INDEX VALUES                                 
newdf = newdf[['A', 'B', 'C', 'All']]                       # RE-ORDER COLUMNS 
newdf['All'] = np.where(pd.isnull(newdf['All'])   &         # ROW PERCENTAGES SUM
                        (newdf['A'] + newdf['B'] + newdf['C']==1),
                        1, newdf['All'])
print(newdf)
#              A         B         C   All
# A                                        
# one    4.000000  4.000000  4.000000  12.0
# one    0.500000  0.500000  0.500000   NaN
# one    0.333333  0.333333  0.333333   1.0
# two    2.000000  2.000000  2.000000   6.0
# two    0.250000  0.250000  0.250000   NaN
# two    0.333333  0.333333  0.333333   1.0
# three  2.000000  2.000000  2.000000   6.0
# three  0.250000  0.250000  0.250000   NaN
# three  0.333333  0.333333  0.333333   1.0
# All    8.000000  8.000000  8.000000  24.0