I have a directory with 600 small parquet files. I want to do ETL on those parquet files and merge them into files of roughly 128 MB each. What is the optimal way to process the data?
Should I read each file in the parquet directory, concatenate them into a single dataframe, and do a groupby? Or should I just pass the parquet directory name to dd.read_parquet and process it that way? A sketch of both approaches is below.
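Here is a minimal sketch of the two approaches I am comparing (the paths, the `key` column, and the output location are just placeholders, not my real schema):

```python
import glob
import dask.dataframe as dd

# Approach 1: read file by file and concatenate into one dataframe.
files = glob.glob("data/parquet_dir/*.parquet")
parts = [dd.read_parquet(f) for f in files]
ddf_concat = dd.concat(parts)

# Approach 2: point read_parquet at the whole directory.
ddf_dir = dd.read_parquet("data/parquet_dir/")

# Either way, the ETL step is something like a groupby, then writing
# back out with partitions targeting ~128 MB each.
result = ddf_dir.groupby("key").sum()
result = result.repartition(partition_size="128MB")
result.to_parquet("data/output/")
```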
My impression is that reading file by file creates a very large Dask task graph, too big to even render as an image. I assume it also spawns that many tasks across the threads, which leads to a memory error.
Which way is best to read the parquet files for processing as a Dask dataframe: file by file, or the entire directory at once?