3
votes

I have a large .csv file with roughly 150M rows. I can still fit the entire data set into memory and use Pandas to groupby and combine. Example...

aggregated_df = df.groupby(["business_partner", "contract_account"]).sum()

In the above example the dataframe contains two integer columns, business_partner and contract_account, which are used as keys for the grouping operation. The remaining columns can be assumed to all be floating-point features which I would like to aggregate.
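For illustration, a toy frame with the same shape (the feature column names here are made up) would be:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the real data: two integer key columns
# plus floating-point feature columns (feature names are hypothetical)
df = pd.DataFrame({
    "business_partner": [1, 1, 2, 2],
    "contract_account": [10, 10, 20, 30],
    "feature_a": [0.5, 1.5, 2.0, 3.0],
    "feature_b": [1.0, 1.0, 1.0, 1.0],
})

# Sum every feature column within each (business_partner, contract_account) group
aggregated_df = df.groupby(["business_partner", "contract_account"]).sum()
print(aggregated_df)
```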

However, this only uses 1 of the 48 cores on my workstation. I am trying to use Vaex in order to take advantage of all of my cores, but cannot figure out the API calls to perform groupby and combine. Perhaps it is not yet possible in Vaex?

Edit(s):

  1. I am aware that this operation can be done in dask, but for this question I want to focus on Vaex.
2
What does your data look like? - vlemaistre
@davidrpugh would you prefer to use dask for this operation? It has the same API as pandas (it is built on top of pandas, in a way). from dask.distributed import Client; client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB'); df.groupby(["col1", "col2"]).sum().compute() - Nithin Varghese

2 Answers

8
votes

You can find a working example in https://docs.vaex.io/en/latest/api.html#vaex.dataframe.DataFrameLocal.groupby

Going with your example of grouping by 2 columns and getting a sum aggregation:

import numpy as np
import pandas as pd
import vaex

# Create input dataframe

n = 10**6  # Change this to adjust df size

a_c1 = [1, 2, 3] * n
a_c2 = [1, 1, 2, 2, 3, 3] * int(n / 2)
a_x = np.arange(float(len(a_c1)))
df = pd.DataFrame({'c1': a_c1, 'c2': a_c2, 'x1': a_x, 'x2': a_x})

# Convert dataframe to vaex

# dfv = vaex.from_pandas(df)  # This also works, but it's slower
dfv = vaex.from_arrays(c1=a_c1, c2=a_c2, x1=a_x, x2=a_x)

df_result1 = df.groupby(['c1', 'c2']).sum()
df_result2 = dfv.groupby(['c1', 'c2'], agg='sum')

The output structure will be slightly different:

> print(df_result1)

                 x1            x2
c1 c2                            
1  1   7.499985e+11  7.499985e+11
   2   7.500000e+11  7.500000e+11
2  1   7.499990e+11  7.499990e+11
   3   7.500005e+11  7.500005e+11
3  2   7.499995e+11  7.499995e+11
   3   7.500010e+11  7.500010e+11

> print(df_result2)

  #    c1    c2      x_1_sum      x_2_sum
  0     2     3  7.50000e+11  7.50000e+11
  1     2     1  7.49999e+11  7.49999e+11
  2     3     2  7.5e+11      7.5e+11
  3     3     3  7.50001e+11  7.50001e+11
  4     1     2  7.5e+11      7.5e+11
  5     1     1  7.49998e+11  7.49998e+11
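To compare the two results directly, the pandas MultiIndex can be flattened with reset_index, which gives the same flat key-columns layout as the Vaex output (row order may still differ, so sort before comparing). A small sketch in plain pandas, on a shrunken version of the same data:

```python
import numpy as np
import pandas as pd

# Small frame with the same column layout as the example above
df = pd.DataFrame({
    "c1": [1, 2, 3, 1, 2, 3],
    "c2": [1, 1, 2, 2, 3, 3],
    "x1": np.arange(6.0),
    "x2": np.arange(6.0),
})

grouped = df.groupby(["c1", "c2"]).sum()

# Flatten the MultiIndex into ordinary key columns, like the Vaex result,
# then sort so the row order is deterministic
flat = grouped.reset_index().sort_values(["c1", "c2"]).reset_index(drop=True)
print(flat)
```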

0
votes

As far as I know, you will have to adjust the limits and number of bins for the grouping manually, but the 'binby' argument should do the job in Vaex:

df.sum([list of columns you want summed],binby=["business_partner", "contract_account"],limits=['minmax','minmax'],
       shape=[business_partner_bins,contract_account_bins])

If you want all columns summed, you could replace [list of columns you want summed] with df.column_names.
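The binby approach is essentially a fixed-grid aggregation: the key columns are cut into equal-width bins over their 'minmax' range and the value columns are summed per bin. To illustrate the semantics only (Vaex does this in parallel over the whole file), here is a rough single-key sketch in plain pandas using pd.cut, with made-up data:

```python
import pandas as pd

# Hypothetical data: one integer key column and one value column
df = pd.DataFrame({
    "business_partner": [1, 1, 2, 3, 3, 3],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Three equal-width bins over the min..max range of the key,
# analogous to limits='minmax', shape=3 in Vaex
bins = pd.cut(df["business_partner"], bins=3)
binned_sum = df.groupby(bins, observed=False)["value"].sum()
print(binned_sum)
```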

PS: I just noticed there is a groupby function in Vaex, but I have no experience using it: https://vaex.readthedocs.io/en/latest/api.html?highlight=groupby#vaex.dataframe.DataFrameLocal.groupby