I have a Spark dataframe in Python whose rows are already in an order that sections them into the right groups, marked by a column "start_of_section" whose value is 1 at the first row of each section and 0 otherwise. Within each section, every column other than "value" and "start_of_section" is identical. I want to collapse each section into a single row that keeps those shared column values and has a column "list_values" containing an array of the "value" entries from each row of the section.
So some rows might look like:
Row(category='fruit', object='apple', value=60, start_of_section=1)
Row(category='fruit', object='apple', value=160, start_of_section=0)
Row(category='fruit', object='apple', value=30, start_of_section=0)
and in the new dataframe this would be
Row(category='fruit', object='apple', list_values=[60, 160, 30])
(Edit: note that the column "start_of_section" should not have been included in the final dataframe.)
The issue I've had while researching this is that I've only found ways of grouping by column value without regard to row order, so those approaches would wrongly produce two rows: one grouping all rows with "start_of_section"=1 and one grouping all rows with "start_of_section"=0.
What code can achieve this?