I have a PySpark DataFrame, df1, that looks like:
CustomerID CustomerValue
12 .17
14 .15
14 .25
17 .50
17 .01
17 .35
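For reference, df1 can be reproduced with something like the following (a sketch assuming an existing SparkSession named spark; the literal values are just the sample data above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df1 = spark.createDataFrame(
    [(12, 0.17), (14, 0.15), (14, 0.25), (17, 0.50), (17, 0.01), (17, 0.35)],
    ["CustomerID", "CustomerValue"],
)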
I have a second PySpark DataFrame, df2, which is df1 grouped by CustomerID with CustomerValue summed. It looks like this:
CustomerID CustomerValueSum
12 .17
14 .40
17 .86
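df2 is built roughly like this (a sketch of the groupBy/sum aggregation; the alias CustomerValueSum matches the column shown above):

from pyspark.sql import functions as F

# Group df1 by CustomerID and sum CustomerValue per group
df2 = (
    df1.groupBy("CustomerID")
       .agg(F.sum("CustomerValue").alias("CustomerValueSum"))
)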
I want to add a third column to df1 that is df1['CustomerValue'] divided by df2['CustomerValueSum'] for the matching CustomerID. The result would look like:
CustomerID CustomerValue NormalizedCustomerValue
12 .17 1.00
14 .15 .38
14 .25 .62
17 .50 .58
17 .01 .01
17 .35 .41
In other words, I'm trying to convert this Python/Pandas code to PySpark:
normalized_list = []
for idx, row in df1.iterrows():
    normalized_list.append(
        row.CustomerValue / df2[df2.CustomerID == row.CustomerID].CustomerValueSum
    )
df1['NormalizedCustomerValue'] = [val.values[0] for val in normalized_list]
How can I do this?