I have a directory with 600 small parquet files. I want to do ETL on those parquet files and merge them into files of roughly 128 MB each. What is the optimal way to process the data?
Should I read each file in the parquet directory, concatenate them into a single dataframe, and do a groupby? Or should I just pass the parquet directory name to dd.read_parquet and process it that way? A sketch of both approaches is below.
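Here is a minimal sketch of the two approaches I am comparing (the paths, the `key` column, and the output location are just placeholders, not my real schema):

```python
import glob
import dask.dataframe as dd

# Approach 1: read file by file and concatenate into one dataframe.
files = glob.glob("data/parquet_dir/*.parquet")
parts = [dd.read_parquet(f) for f in files]
ddf_concat = dd.concat(parts)

# Approach 2: point read_parquet at the whole directory.
ddf_dir = dd.read_parquet("data/parquet_dir/")

# Either way, the ETL step is something like a groupby, then writing
# back out with partitions targeting ~128 MB each.
result = ddf_dir.groupby("key").sum()
result = result.repartition(partition_size="128MB")
result.to_parquet("data/output/")
```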
My impression is that reading file by file creates a very large Dask task graph, too big to even render as an image. I assume it also spawns that many tasks across the threads, which leads to a memory error.
Which way is best to read the parquet files for processing as a Dask dataframe: file by file, or the entire directory at once?