0
votes

I'm relatively new to Python but made some solid progress however i've come unstuck with converting my process in SAS utilising Proc Freq one way frequency tables into python.

A scaled down version of my dataset is as follows:

import pandas as pd

data = {'Code':  [719, 719,719,719,719,719],
        'National': [1001, 1001,1001,1001,1001,1001],
        'Regional': [3005,3005,3005,3005,3005,3005],
         'Local': [2002,2002,2002,2002,2002,2002],
        'Collection Variable': [1,1,2,3,3,3]}


df = pd.DataFrame (data, columns = ['Code','National', 'Regional', 'Local','Collection Variable'])

Outputs in SAS would take this and add a frequency and percentage column for the collection variable columns, results as below:

National  Regional  Local  Code  Collection  Count  Percent
                                  Variable   

  1001      3005    2002   719       1         2      0.333333
  1001      3005    2002   719       2         1      0.166667
  1001      3005    2002   719       3         3      0.5

Is anything like this possible in python? I've managed to get the percentages, but running in separate columns but not able to get the additional columns.

Furthermore, this would happen for a number of different Collection Variables. In SAS it creates a new table for each variable, is that possible in python with some kind of loop?

Thanks for any help.

1

1 Answers

0
votes

These 2 lines should be enough to get the Count and Percentage columns (it also accounts for nan values):

df = df = df.astype(str).groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'Count'})
df['Percentage'] = df[df['Collection Variable'] != 'nan']['Count'].apply(lambda x: x/sum(df[df['Collection Variable'] != 'nan']['Count']))

Output:

National  Regional  Local  Code  Collection  Count  Percent
                                  Variable   

  1001      3005    2002   719       1         2      0.333333
  1001      3005    2002   719       2         1      0.166667
  1001      3005    2002   719       3         3      0.5

Hope that helps :)