2 votes

I'm very new to PySpark. I'm building a TF-IDF representation and want to store it on disk as an intermediate result. The IDF scoring step gives me a SparseVector column.

However, when I try to save it as Parquet, I get an OOM error. I'm not sure whether Spark internally converts the SparseVector to a dense one; in that case it would expand to some 25k columns, and according to this thread, saving data that wide in a columnar format can lead to OOM.
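For reference, the pipeline looks roughly like the sketch below (column names, the CSV path, and the output path are placeholders rather than my exact code):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF

    spark = SparkSession.builder.getOrCreate()

    # Read the ~2 GB CSV and keep the text column.
    docs = spark.read.csv("docs.csv", header=True).select("text")

    # TF-IDF: tokenize, hash term frequencies, then rescale with IDF.
    tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
    tf = HashingTF(inputCol="words", outputCol="raw_features").transform(tokens)
    tfidf = IDF(inputCol="raw_features", outputCol="features").fit(tf).transform(tf)
    # "features" is the SparseVector column produced by the IDF scoring.

    # This is the step that runs out of memory:
    tfidf.write.parquet("tfidf.parquet")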

So, any idea what might be the cause? I have executor memory set to 8g and I'm operating on a 2 GB CSV file.
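(In case it matters, the memory setting amounts to something like the sketch below; I'm not pasting my exact setup.)

    from pyspark.sql import SparkSession

    # Equivalent of passing --executor-memory 8g to spark-submit;
    # driver memory is left at its default.
    spark = (SparkSession.builder
             .config("spark.executor.memory", "8g")
             .getOrCreate())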

Should I try increasing the memory, or save it as CSV instead of Parquet? Any help is appreciated. Thanks in advance.

Update 1

As pointed out, since Spark performs lazy evaluation, the error could come from an upstream stage, so I tried a show and a collect before the write. They seemed to run fine without throwing errors. So, is this still an issue related to Parquet, or do I need to debug something else?
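Concretely, the check looked something like this (a sketch; variable and path names are placeholders):

    tfidf.show(5)           # prints a few rows without errors
    rows = tfidf.collect()  # also completes without an OOM

    tfidf.write.parquet("tfidf.parquet")  # still fails here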


1 Answer

5 votes

Parquet doesn't provide native support for Spark ML / MLlib Vectors, and neither are these first-class citizens in Spark SQL.

Instead, Spark represents Vectors using a struct with four fields:

  • type - ByteType
  • size - IntegerType (optional, only for SparseVectors)
  • indices - ArrayType(IntegerType) (optional, only for SparseVectors)
  • values - ArrayType(DoubleType)

and uses column metadata to distinguish these from plain structs, plus UDT wrappers to map back to the external types. No conversion between the sparse and dense representations is needed. Nonetheless, depending on the data, such a representation may require memory comparable to a full dense array.
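As an illustration, you can see this by inspecting the schema of a DataFrame holding a Vector column; a rough sketch (exact output can vary between Spark versions):

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(Vectors.sparse(4, [1, 3], [1.0, 5.5]),)], ["features"])

    df.printSchema()
    # root
    #  |-- features: vector (nullable = true)

    # The underlying Spark SQL / Parquet representation is the struct
    # described above:
    print(df.schema["features"].dataType.sqlType().simpleString())
    # struct<type:tinyint,size:int,indices:array<int>,values:array<double>>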

Please note that an OOM on write is not necessarily related to the writing process itself. Since Spark is in general lazy, the exception can be caused by any of the upstream stages.
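One rough way to narrow this down (names are placeholders) is to force the upstream computation separately from the write:

    # Materialize the upstream stages without writing anything.
    tfidf.count()

    # If the count succeeds but this still fails, the problem is closer to the
    # Parquet write itself. Note that the lineage is recomputed here unless the
    # DataFrame is cached beforehand.
    tfidf.write.parquet("/tmp/tfidf.parquet")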