I have PySpark code that assigns rows to numbered groups, incrementing the group number whenever a certain condition is met. I'm having trouble figuring out how to transform this code, efficiently, into one that can be applied to groups (here, per ID).
Take this sample dataframe df:
df = sqlContext.createDataFrame(
    [
        (33, [], '2017-01-01'),
        (33, ['apple', 'orange'], '2017-01-02'),
        (33, [], '2017-01-03'),
        (33, ['banana'], '2017-01-04')
    ],
    ('ID', 'X', 'date')
)
This code achieves what I want for this sample df: it orders by date and creates groups ('grp') that increment whenever the size column goes back to 0.
from pyspark.sql.functions import col, size, sum  # note: pyspark's sum shadows the builtin
from pyspark.sql.window import Window

df \
    .withColumn('size', size(col('X'))) \
    .withColumn(
        "grp",
        sum((col('size') == 0).cast("int")).over(Window.orderBy('date'))
    ).show()
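For reference, the output for this sample df looks something like this (the exact rendering of the array column may vary by Spark version):

+--+---------------+----------+----+---+
|ID|              X|      date|size|grp|
+--+---------------+----------+----+---+
|33|             []|2017-01-01|   0|  1|
|33|[apple, orange]|2017-01-02|   2|  1|
|33|             []|2017-01-03|   0|  2|
|33|       [banana]|2017-01-04|   1|  2|
+--+---------------+----------+----+---+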
This is partly based on Pyspark - Cumulative sum with reset condition.
Now what I am trying to do is apply the same approach to a dataframe that has multiple IDs, achieving a result that looks like this:
df2 = sqlContext.createDataFrame(
    [
        (33, [], '2017-01-01', 0, 1),
        (33, ['apple', 'orange'], '2017-01-02', 2, 1),
        (33, [], '2017-01-03', 0, 2),
        (33, ['banana'], '2017-01-04', 1, 2),
        (55, ['coffee'], '2017-01-01', 1, 1),
        (55, [], '2017-01-03', 0, 2)
    ],
    ('ID', 'X', 'date', 'size', 'group')
)
Edit for clarity:
1) For the first date of each ID, the group should be 1, regardless of what shows up in any other column.
2) For each subsequent date, I need to check the size column. If the size is 0, I increment the group number. If it is any non-zero, positive integer, I carry the previous group number forward.
I've seen a few ways to handle this in pandas, but I'm having difficulty understanding how they apply in PySpark, and the ways in which grouped data is handled differently in pandas vs Spark (e.g. do I need to use something called UDAFs?).
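For what it's worth, the obvious adaptation I can see is to partition the window by ID. Here is a minimal sketch of that attempt, reusing the imports above and a hypothetical multi-ID input df_multi (essentially df2 minus the size and group columns):

# hypothetical multi-ID input with the same schema as df above
df_multi = sqlContext.createDataFrame(
    [
        (33, [], '2017-01-01'),
        (33, ['apple', 'orange'], '2017-01-02'),
        (33, [], '2017-01-03'),
        (33, ['banana'], '2017-01-04'),
        (55, ['coffee'], '2017-01-01'),
        (55, [], '2017-01-03')
    ],
    ('ID', 'X', 'date')
)

# same cumulative-sum trick, but with the window partitioned per ID
w = Window.partitionBy('ID').orderBy('date')

df_multi \
    .withColumn('size', size(col('X'))) \
    .withColumn(
        'group',
        sum((col('size') == 0).cast('int')).over(w)
    ).show()

This handles requirement 2, but not requirement 1: for ID 55 the first row has a non-zero size, so its cumulative sum starts at 0 rather than 1 (the groups come out as 0, 1 instead of the desired 1, 2).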