My work environment mainly uses PySpark, but from some Googling, transposing in PySpark looks very complex. I would like to keep it in PySpark, but if it is much easier to do in Pandas, I will convert the Spark dataframe to a Pandas dataframe. The dataset isn't big enough for performance to be an issue, I would think.
I would like to transform a dataframe with multiple columns into rows:
Input:
import pandas as pd

df = pd.DataFrame({
    'Record': {0: 1, 1: 2, 2: 3},
    'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
    'Hospital Address': {0: '1234 Street 429',
                         1: '553 Alberta Road 441',
                         2: '994 Random Street 923'},
    'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
    'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
    'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
    'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
Record Hospital Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4
1 Red Cross 1234 Street 429 Effective Effective Normal Effective
2 Alberta Hospital 553 Alberta Road 441 Effecive Normal Normal Effective
3 General Hospital 994 Random Street 923 Normal Effective Normal Effective
Output:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
Looking at PySpark examples, it seems complicated: PySpark Dataframe melt columns into rows
And looking at Pandas examples, it looks much easier. But there are so many different Stack Overflow answers, some saying to use pivot, melt, stack, unstack, and more, that it ends up being confusing.
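From those answers, `melt` looks like the closest fit. A minimal sketch of what I think it would look like with the columns from the sample above (note `melt` keeps the original `Record` values rather than renumbering rows like my sample output does):

```python
import pandas as pd

df = pd.DataFrame({
    'Record': {0: 1, 1: 2, 2: 3},
    'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
    'Hospital Address': {0: '1234 Street 429',
                         1: '553 Alberta Road 441',
                         2: '994 Random Street 923'},
    'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
    'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
    'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
    'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

# melt keeps id_vars fixed and unpivots every remaining column
# into a (Name, Value) pair, one row per original cell
long_df = df.melt(
    id_vars=['Record', 'Hospital', 'Hospital Address'],
    var_name='Name',
    value_name='Value')

# melt emits all Medicine_1 rows first; sort to group rows per Record
long_df = long_df.sort_values(['Record', 'Name']).reset_index(drop=True)
print(long_df)
```

If the sequential 1..12 `Record` numbering from my sample output is really needed, I assume something like `long_df['Record'] = long_df.index + 1` after the sort would do it.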
So if anyone has an easy way to do this in PySpark, I am all ears. If not, I will happily take Pandas answers.
Thank you very much for your help!
explode in pyspark, see stackoverflow.com/questions/55378047/… – Umar.H