0 votes

I have about 8 million rows of data with about 500 columns. When I try to write it with Spark as a single file using coalesce(1), it fails with an OutOfMemoryException.

I know this is a lot of data for one executor, but as far as I understand the Parquet write process, it only holds the data for one row group in memory before flushing it to disk, and then continues with the next one.

My executor has 16 GB of memory and it cannot be increased any further. The data contains a lot of strings.

So what I am interested in are settings that let me tweak the process of writing big Parquet files for wide tables.

I know I can enable/disable dictionary encoding and increase/decrease the block and page sizes.
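For reference, a minimal sketch of where these knobs live (the table name, output path, and concrete values are placeholders, not a recommendation):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Parquet writer settings go on the Hadoop configuration of the write job.
    // Smaller row groups mean less data is buffered before each flush.
    spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)
    spark.sparkContext.hadoopConfiguration.setInt("parquet.page.size", 512 * 1024)
    // Dictionary pages for ~500 mostly-string columns add to the writer's memory footprint.
    spark.sparkContext.hadoopConfiguration.setBoolean("parquet.enable.dictionary", false)

    val df = spark.table("my_wide_table")   // placeholder for the real DataFrame
    df.coalesce(1).write.mode("overwrite").parquet("/output/wide_table")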

But what would be a good configuration for my needs?

But why do you need a single file? Generally it would be bad practice, since you kill all parallelism with that. - Vladislav Varslavans
That is OK, I can live with a longer runtime on write. I don't want too many read requests when I query the table. Not repartitioning would result in a lot of small files, which is also bad practice. Since Parquet is splittable, I have parallelism on read. - Joha
Have you also considered writing without coalesce and then merging the Parquet files? - Vladislav Varslavans
Yes, we had that before, but we wanted to do it in a single step. - Joha
So my question is really about the Parquet configuration. - Joha

1 Answer

0 votes

I don't think that Parquet really contributes to the failure here, and tweaking its configuration probably won't help.

coalesce(1) is a drastic operation that affects all upstream code. As a result, all processing is done on a single node, and by your own account, your resources are already very limited.

You didn't provide any information about the rest of the pipeline, but if you want to stay with Spark, your best hope is replacing coalesce with repartition. If the OOM occurs in one of the preceding operations, that might help.
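A minimal sketch of the difference, assuming a placeholder DataFrame df and output path:

    // coalesce(1) avoids a shuffle, so Spark may squeeze the upstream
    // stages onto a single task as well:
    df.coalesce(1).write.parquet("/output/single_file")

    // repartition(1) inserts a shuffle, so the preceding stages keep their
    // parallelism and only the final write runs on one task:
    df.repartition(1).write.parquet("/output/single_file")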