3
votes

I have a large .csv file with roughly 150M rows. I can still fit the entire data set into memory and use Pandas to groupby and combine. Example...

aggregated_df = df.groupby(["business_partner", "contract_account"]).sum()

In the above example the dataframe contains two integer columns, business_partner and contract_account, which are used as keys for the grouping operation. The remaining columns can be assumed to all be floating-point features which I would like to aggregate.
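For illustration, a toy frame with the same shape (the feature column names here are made up) would be:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the real data: two integer key columns
# plus floating-point feature columns (feature names are hypothetical)
df = pd.DataFrame({
    "business_partner": [1, 1, 2, 2],
    "contract_account": [10, 10, 20, 30],
    "feature_a": [0.5, 1.5, 2.0, 3.0],
    "feature_b": [1.0, 1.0, 1.0, 1.0],
})

# Sum every feature column within each (business_partner, contract_account) group
aggregated_df = df.groupby(["business_partner", "contract_account"]).sum()
print(aggregated_df)
```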

However, this only uses 1 of the 48 cores on my workstation. I am trying to use Vaex in order to take advantage of all of my cores, but cannot figure out the API calls to perform groupby and combine. Perhaps it is not yet possible in Vaex?

Edit(s):

  1. I am aware that this operation can be done in dask, but for this question I want to focus on Vaex.
2
What does your data look like? - vlemaistre
@davidrpugh would you prefer to use dask for this operation? It has the same API as pandas (it is built on top of pandas, in a way). from dask.distributed import Client; client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB'); df.groupby(["col1", "col2"]).sum().compute() - Nithin Varghese

2 Answers

8
votes

You can find a working example in https://docs.vaex.io/en/latest/api.html#vaex.dataframe.DataFrameLocal.groupby

Going with your example of grouping by 2 columns and getting a sum aggregation:

import numpy as np
import pandas as pd
import vaex

# Create input dataframe

n = 10**6  # Change this to adjust df size

a_c1 = [1, 2, 3] * n
a_c2 = [1, 1, 2, 2, 3, 3] * int(n / 2)
a_x = np.arange(float(len(a_c1)))
df = pd.DataFrame({'c1': a_c1, 'c2': a_c2, 'x1': a_x, 'x2': a_x})

# Convert dataframe to vaex

# dfv = vaex.from_pandas(df)  # This also works, but it's slower
dfv = vaex.from_arrays(c1=a_c1, c2=a_c2, x1=a_x, x2=a_x)

df_result1 = df.groupby(['c1', 'c2']).sum()
df_result2 = dfv.groupby(['c1', 'c2'], agg='sum')

The output structure will be slightly different:

> print(df_result1)

                 x1            x2
c1 c2                            
1  1   7.499985e+11  7.499985e+11
   2   7.500000e+11  7.500000e+11
2  1   7.499990e+11  7.499990e+11
   3   7.500005e+11  7.500005e+11
3  2   7.499995e+11  7.499995e+11
   3   7.500010e+11  7.500010e+11

> print(df_result2)

  #    c1    c2      x_1_sum      x_2_sum
  0     2     3  7.50000e+11  7.50000e+11
  1     2     1  7.49999e+11  7.49999e+11
  2     3     2  7.5e+11      7.5e+11
  3     3     3  7.50001e+11  7.50001e+11
  4     1     2  7.5e+11      7.5e+11
  5     1     1  7.49998e+11  7.49998e+11
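To compare the two results directly, the pandas MultiIndex can be flattened with reset_index, which gives the same flat key-columns layout as the Vaex output (row order may still differ, so sort before comparing). A small sketch in plain pandas, on a shrunken version of the same data:

```python
import numpy as np
import pandas as pd

# Small frame with the same column layout as the example above
df = pd.DataFrame({
    "c1": [1, 2, 3, 1, 2, 3],
    "c2": [1, 1, 2, 2, 3, 3],
    "x1": np.arange(6.0),
    "x2": np.arange(6.0),
})

grouped = df.groupby(["c1", "c2"]).sum()

# Flatten the MultiIndex into ordinary key columns, like the Vaex result,
# then sort so the row order is deterministic
flat = grouped.reset_index().sort_values(["c1", "c2"]).reset_index(drop=True)
print(flat)
```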

0
votes

As far as I know, you will have to adjust the limits and number of bins for the grouping manually, but the 'binby' argument should do the job in Vaex:

df.sum([list of columns you want summed],binby=["business_partner", "contract_account"],limits=['minmax','minmax'],
       shape=[business_partner_bins,contract_account_bins])

If you want all columns summed, you could replace [list of columns you want summed] with df.column_names.
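The binby approach is essentially a fixed-grid aggregation: the key columns are cut into equal-width bins over their 'minmax' range and the value columns are summed per bin. To illustrate the semantics only (Vaex does this in parallel over the whole file), here is a rough single-key sketch in plain pandas using pd.cut, with made-up data:

```python
import pandas as pd

# Hypothetical data: one integer key column and one value column
df = pd.DataFrame({
    "business_partner": [1, 1, 2, 3, 3, 3],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Three equal-width bins over the min..max range of the key,
# analogous to limits='minmax', shape=3 in Vaex
bins = pd.cut(df["business_partner"], bins=3)
binned_sum = df.groupby(bins, observed=False)["value"].sum()
print(binned_sum)
```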

PS: I just noticed there is a groupby function in Vaex, but I have no experience using it: https://vaex.readthedocs.io/en/latest/api.html?highlight=groupby#vaex.dataframe.DataFrameLocal.groupby