My Spark (PySpark) ETL job that uses a window function has stopped working, and I suspect the cause is skew in the data. The window does something like this:
from pyspark.sql import Window, functions as F

windowSpec = Window.partitionBy('user').orderBy('time').rowsBetween(1, 1)
next_time = F.lead('time', 1).over(windowSpec)
What if the data has some outlier users with a lot of rows? When Spark partitions by user to evaluate the window, I imagine I could end up with a partition that is too big. I am seeing only two of the many jobs fail (job may be the wrong terminology).
How do I check this? I know I can run df.groupBy('user').count() and look for outlier users, but how do I see how big the partitions that the window function needs will be? I'm hoping Spark will automatically put a few big users in one partition and a lot of small users in others.
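Here is roughly what I have tried so far. This is only a sketch: df is the DataFrame from above, and I am assuming that repartition('user') hash-partitions the data the same way the window's partitionBy does, which I am not sure is true.

# Per-user row counts: the biggest users show how skewed the key is.
user_counts = df.groupBy('user').count().orderBy(F.desc('count'))
user_counts.show(20)

# Rows per physical partition after hashing on 'user' -- is this a fair
# proxy for the partitions the window function will actually work on?
partition_sizes = (
    df.repartition('user')
      .withColumn('pid', F.spark_partition_id())
      .groupBy('pid').count()
      .orderBy(F.desc('count'))
)
partition_sizes.show(20)

Is the second check meaningful here, or does the window's shuffle behave differently from an explicit repartition on the same column?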