Creating Proc Frequency tables in Python with count and percentage

Question

I'm relatively new to Python but made some solid progress however i've come unstuck with converting my process in SAS utilising Proc Freq one way frequency tables into python.

A scaled down version of my dataset is as follows:

import pandas as pd

data = {'Code':  [719, 719,719,719,719,719],
        'National': [1001, 1001,1001,1001,1001,1001],
        'Regional': [3005,3005,3005,3005,3005,3005],
         'Local': [2002,2002,2002,2002,2002,2002],
        'Collection Variable': [1,1,2,3,3,3]}


df = pd.DataFrame (data, columns = ['Code','National', 'Regional', 'Local','Collection Variable'])

Outputs in SAS would take this and add a frequency and percentage column for the collection variable columns, results as below:

National  Regional  Local  Code  Collection  Count  Percent
                                  Variable   

  1001      3005    2002   719       1         2      0.333333
  1001      3005    2002   719       2         1      0.166667
  1001      3005    2002   719       3         3      0.5

Is anything like this possible in python? I've managed to get the percentages, but running in separate columns but not able to get the additional columns.

Furthermore, this would happen for a number of different Collection Variables. In SAS it creates a new table for each variable, is that possible in python with some kind of loop?

Thanks for any help.

Sahith Kurapati Sahith Kurapati · Accepted Answer · 2020-06-09T11:27:53

These 2 lines should be enough to get the Count and Percentage columns (it also accounts for nan values):

df = df = df.astype(str).groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'Count'})
df['Percentage'] = df[df['Collection Variable'] != 'nan']['Count'].apply(lambda x: x/sum(df[df['Collection Variable'] != 'nan']['Count']))

Output:

National  Regional  Local  Code  Collection  Count  Percent
                                  Variable   

  1001      3005    2002   719       1         2      0.333333
  1001      3005    2002   719       2         1      0.166667
  1001      3005    2002   719       3         3      0.5

Hope that helps :)

Creating Proc Frequency tables in Python with count and percentage

1 Answers