337
votes

I am using pandas as a db substitute as I have multiple databases (oracle, mssql, etc) and I am unable to make a sequence of commands to a SQL equivalent.

I have a table loaded in a DataFrame with some columns:

YEARMONTH, CLIENTCODE, SIZE, .... etc etc

In SQL, to count the amount of different clients per year would be:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

And the result would be

201301    5000
201302    13245

How can I do that in pandas?

10
I have done table.groupby(['YEARMONTH'])['CLIENTCODE'].unique() and came with two series indexed by YEARMONTH and with all the unique values. How to count the amount of values on each series?Adriano Almeida
For some, value_counts might be the answer you are looking for: pandas.pydata.org/pandas-docs/stable/generated/…sachinruk

10 Answers

506
votes

I believe this is what you want:

table.groupby('YEARMONTH').CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3
117
votes

Here is another method, much simple, lets say your dataframe name is daat and column name is YEARMONTH

daat.YEARMONTH.value_counts()
51
votes

Interestingly enough, very often len(unique()) is a few times (3x-15x) faster than nunique().

12
votes

I am also using nunique but it will be very helpful if you have to use an aggregate function like 'min', 'max', 'count' or 'mean' etc.

df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min')     #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max')     #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean')    #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count')   #count
5
votes

Using crosstab, this will return more information than groupby nunique

pd.crosstab(df.YEARMONTH,df.CLIENTCODE)
Out[196]: 
CLIENTCODE  1  2  3
YEARMONTH          
201301      2  1  0
201302      1  2  1

After a little bit modify ,yield the result

pd.crosstab(df.YEARMONTH,df.CLIENTCODE).ne(0).sum(1)
Out[197]: 
YEARMONTH
201301    2
201302    3
dtype: int64
4
votes

Distinct of column along with aggregations on other columns

To get the distinct number of values for any column (CLIENTCODE in your case), we can use nunique. We can pass the input as a dictionary in agg function, along with aggregations on other columns:

grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
                                      'other_col_1': ['sum', 'count']})

# to flatten the multi-level columns
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]

# if you wish to reset the index
grp_df.reset_index(inplace=True)
0
votes

With new pandas version, it is easy to get as dataframe

unique_count = pd.groupby(['YEARMONTH'], as_index=False).agg(uniq_CLIENTCODE =('CLIENTCODE',pd.Series.count))
0
votes

Here an approach to have count distinct over multiple columns. Let's have some data:

data = {'CLIENT_CODE':[1,1,2,1,2,2,3],
        'YEAR_MONTH':[201301,201301,201301,201302,201302,201302,201302],
        'PRODUCT_CODE': [100,150,220,400,50,80,100]
       }
table = pd.DataFrame(data)
table

CLIENT_CODE YEAR_MONTH  PRODUCT_CODE
0   1       201301      100
1   1       201301      150
2   2       201301      220
3   1       201302      400
4   2       201302      50
5   2       201302      80
6   3       201302      100

Now, list the columns of interest and use groupby in a slightly modified syntax:

columns = ['YEAR_MONTH', 'PRODUCT_CODE']
table[columns].groupby(table['CLIENT_CODE']).nunique()

We obtain:

YEAR_MONTH  PRODUCT_CODE CLIENT_CODE        
1           2            3
2           2            3
3           1            1
0
votes

Now you are also able to use dplyr syntax in python to do it:

>>> from datar.all import f, tibble, group_by, summarise, n_distinct
>>> 
>>> data = tibble(
...     CLIENT_CODE=[1,1,2,1,2,2,3],
...     YEAR_MONTH=[201301,201301,201301,201302,201302,201302,201302]
... )
>>> 
>>> data >> group_by(f.YEAR_MONTH) >> summarise(n=n_distinct(f.CLIENT_CODE))
   YEAR_MONTH       n
      <int64> <int64>
0      201301       2
1      201302       3

I am the author of the datar package. Feel free to submit issues if you have any questions.

0
votes

create an pivot table and use the nunique series function

ID=[ 123, 123, 123, 456, 456, 456, 456, 789, 789] 
domain=['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
'vk.com', 'google.com', 'twitter.com', 'vk.com']
df=pd.DataFrame({'id':ID,'domain':domain})
fp=pd.pivot_table(data=df,index='domain',aggfunc=pd.Series.nunique)
print(fp)

output:

               id
domain          
facebook.com   1
google.com     1
twitter.com    2
vk.com         3