I have about 8m rows of data with about 500 columns.
When I try to write it with spark as a single file coalesce(1) it fails with an OutOfMemoryException.
I know this is a lot of data on one executor, but as far as I understand the write process of parquet, it only holds the data for one row group in memory, before flushing it to disk and then continues with the next one.
My executor has 16gb of memory and it cannot be increased any further. The data contains a lot of strings.
So what I am interested in is, some settings where I can tweak the process of writing big parquet files for wide tables.
I know i can enable/disable dictionary, increase/decrease block- and pagesize.
But what would be a good configuration for my needs?
coaelsceand then merging parquet files? - Vladislav Varslavans