0
votes

I have a directory with 600 small Parquet files. I want to do ETL on those files and merge them into Parquet files of roughly 128 MB each. What is the optimal way to process the data?

Should I read each file in the Parquet directory, concatenate them into a single data frame, and do a groupby? Or should I pass the directory name to dd.read_parquet and process it that way? Roughly, the two options I mean are sketched below.
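This is just an illustration of the two options I am comparing (the directory path is a placeholder for my real data):

```python
import glob

import dask.dataframe as dd

# Option 1: read each file separately and concatenate
# ("data/parquet/" is a placeholder for my directory)
files = sorted(glob.glob("data/parquet/*.parquet"))
df_files = dd.concat([dd.read_parquet(f) for f in files])

# Option 2: hand the whole directory to read_parquet
df_dir = dd.read_parquet("data/parquet/")

# ... then do the groupby / ETL on df_files or df_dir ...
```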

I feel like reading the files one by one creates a very large Dask task graph that cannot even be rendered as an image. I also suspect it will try to run that many tasks across the threads, which leads to a memory error.

Which way is best to read the Parquet files into a Dask DataFrame: file by file, or passing the entire directory?

1
Please provide an MCVE. - rpanai

1 Answer

0
votes

Unfortunately there is no single best way to read a Parquet file that covers all situations. To answer the question properly, we would need to know more about your situation.
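For what it's worth, a minimal sketch of the directory-based option from the question might look like the following. This is not a recommendation, since the right choice depends on the details of your situation; it assumes all files share a schema, and the paths and the 128 MB target come straight from the question rather than from any measurement:

```python
import dask.dataframe as dd

# Read every Parquet file in the directory as one Dask DataFrame
# ("data/parquet/" is a placeholder path).
df = dd.read_parquet("data/parquet/")

# ... ETL / groupby steps would go here ...

# Repartition so each partition is roughly 128 MB, then write the
# result back out as fewer, larger Parquet files.
df = df.repartition(partition_size="128MB")
df.to_parquet("data/parquet_merged/")
```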