I'm very new to PySpark. I'm building a TF-IDF representation and want to store it on disk as an intermediate result. The IDF scoring gives me a SparseVector representation.
However, when I try to save it as Parquet, I get an OOM error. I'm not sure whether the SparseVector is internally converted to a dense representation; if so, that would produce roughly 25k columns, and according to this thread, saving such wide data in a columnar format can lead to OOM.
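For context, here is roughly what I'm doing (a simplified sketch; the real column names, paths, and featurizer settings differ):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

spark = SparkSession.builder.getOrCreate()

# ~2 GB input CSV with a text column
df = spark.read.csv("docs.csv", header=True)

tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
cv_model = CountVectorizer(inputCol="words", outputCol="rawFeatures",
                           vocabSize=25000).fit(tokens)
tf = cv_model.transform(tokens)
idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)
tfidf = idf_model.transform(tf)  # "features" is a SparseVector column

# This is the step that hits OOM
tfidf.write.parquet("tfidf.parquet")
```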
So, any idea what the cause might be? The executor memory is 8g and I'm operating on a 2 GB CSV file.
Should I try increasing the memory, or save it as CSV instead of Parquet? Any help is appreciated. Thanks in advance.
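For reference, memory is currently configured like this (simplified; 8g executor memory as mentioned, everything else left at defaults):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tfidf-job")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```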
Update 1
As pointed out, since Spark performs lazy evaluation, the error could come from an upstream stage, so I tried a show and a collect before the write. Both seemed to run fine without throwing errors. So, is this still an issue related to Parquet, or do I need to do some other debugging?
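Concretely, the checks looked like this (simplified):

```python
# Both of these force the upstream stages and completed without errors
tfidf.show(5, truncate=False)
rows = tfidf.collect()
print(len(rows))

# The OOM still only happens at this point
tfidf.write.mode("overwrite").parquet("tfidf.parquet")
```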