
This is a follow-on query to my previous one: following the suggestion there, I computed the row-over-row percentage changes, and since the first row of the resulting dataframe (df_diff) was all null values, I did:

df_diff = df_diff.dropna()
df_diff.count()

The second statement throws the following error:

Py4JJavaError: An error occurred while calling o1844.count.
: java.lang.OutOfMemoryError: Java heap space

When I try the above code on the toy df posted in the previous question it works fine, but with my actual dataframe (834 rows, 51 columns) the above error occurs. Any guidance as to why this is happening and how to handle it would be much appreciated. Thanks
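
For context, df_diff was produced roughly along these lines — a minimal sketch assuming the lag-over-an-ordered-window approach suggested in the previous question (column names are illustrative):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Sketch of the kind of computation assumed to have produced df_diff:
w = Window.orderBy("date")
price_cols = [c for c in df.columns if c != "date"]
df_diff = df.select(
    "date",
    *[((F.col(c) - F.lag(c).over(w)) / F.lag(c).over(w)).alias(c)
      for c in price_cols]
)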

EDIT:

In my actual dataframe (df) of 834 x 51, the first column is the date and the remaining 50 columns are closing stock prices for which I'm trying to get the daily percentage changes. Partitioning the window by the date column made no difference to the error in PySpark, and there doesn't seem to be any other natural candidate to partition by.
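
For what it's worth, the date column cannot work even in principle: with one row per date, every partition would hold a single row, so lag() has nothing to look back at (a sketch assuming unique dates; "AAPL" is a hypothetical column name):

w = Window.partitionBy("date").orderBy("date")
# Each one-row partition makes lag() return null on every row:
df.select("date", F.lag("AAPL").over(w).alias("AAPL_prev")).show()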

The only thing that sort of worked was doing this in spark-shell. There, without partitions, I was getting warning messages ...

WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

... until I called cache() on the dataframe, but this is not ideal for a large df.
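
Translated to PySpark, the cache-based workaround looks roughly like this (a sketch; caching avoids recomputing df but does not remove the shuffle to a single partition, hence the caveat about large dataframes):

from pyspark.sql import Window, functions as F

df.cache()
df.count()  # force materialization before the single-partition window runs
w = Window.orderBy("date")
df_diff = df.select(
    "date",
    *[((F.col(c) - F.lag(c).over(w)) / F.lag(c).over(w)).alias(c)
      for c in df.columns if c != "date"]
).dropna()
df_diff.count()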


1 Answer


Your original code is simply not scalable. The following window definition

w = Window.orderBy("index")

requires shuffling all the data to a single partition, and as such is useful only for small, local datasets.

Depending on the data, you can try a more complex approach, like the one shown in Avoid performance impact of a single partition mode in Spark window functions.
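
For this particular lag-by-one computation, one scalable alternative (an illustrative sketch only, not necessarily the approach from the linked answer) is to attach an explicit row index and self-join each row to its predecessor, replacing the single-partition window with an ordinary shuffle join:

from pyspark.sql import functions as F

# Assumes df is already in date order; zipWithIndex preserves that order.
indexed = df.rdd.zipWithIndex() \
    .map(lambda pair: pair[0] + (pair[1],)) \
    .toDF(df.columns + ["idx"])

price_cols = [c for c in df.columns if c != "date"]

# Shift the index by one so that row i joins against row i-1's prices.
prev = indexed.select(
    (F.col("idx") + 1).alias("idx"),
    *[F.col(c).alias(c + "_prev") for c in price_cols]
)

# The inner join drops the first row (no predecessor), mirroring dropna().
df_diff = indexed.join(prev, "idx").select(
    "date",
    *[((F.col(c) - F.col(c + "_prev")) / F.col(c + "_prev")).alias(c)
      for c in price_cols]
)

The join shuffles by idx across many partitions, so no single executor has to hold the whole dataset.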