I have a PySpark DataFrame (say df1) which has the following columns:
1. category: a string
2. array1: an array of elements
3. array2: an array of elements
Following is an example of df1:
+--------+--------------+--------------+
|category| array1| array2|
+--------+--------------+--------------+
|A | [x1, x2, x3]| [y1, y2, y3]|
|B | [u1, u2]| [v1, v2]|
+--------+--------------+--------------+
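For reference, df1 can be reproduced with a snippet like the following (the session setup is the usual boilerplate, and the x1/y1/... values are just placeholder strings standing in for my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two rows, each carrying a pair of equal-length arrays
df1 = spark.createDataFrame(
    [("A", ["x1", "x2", "x3"], ["y1", "y2", "y3"]),
     ("B", ["u1", "u2"], ["v1", "v2"])],
    ["category", "array1", "array2"],
)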
For each row, the length of array1 is equal to the length of array2. Across rows, however, the arrays may have different lengths (three elements in row A, two in row B above).
I want to form two new columns (say element1 and element2) such that in each row, element1 and element2 contain the elements from the same position of array1 and array2 respectively. Following is an example of the output DataFrame (say df2) that I want:
+--------+--------------+--------------+----------+----------+
|category| array1| array2| element1| element2|
+--------+--------------+--------------+----------+----------+
|A | [x1, x2, x3]| [y1, y2, y3]| x1| y1|
|A | [x1, x2, x3]| [y1, y2, y3]| x2| y2|
|A | [x1, x2, x3]| [y1, y2, y3]| x3| y3|
|B | [u1, u2]| [v1, v2]| u1| v1|
|B | [u1, u2]| [v1, v2]| u2| v2|
+--------+--------------+--------------+----------+----------+
Below is what I have tried so far, but it gives me element1/element2 pairs from different positions in addition to the rows I want:
import pyspark.sql.functions as F

df2 = (
    df1.select("*", F.explode("array1").alias("element1"))
       .select("*", F.explode("array2").alias("element2"))
)
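As far as I can tell, the second explode runs once for every row produced by the first, so each input row yields the full Cartesian product of positions rather than only the matched pairs. A quick check on the category A row shows the blow-up:

# 9 rows (3 x 3 position combinations), not the 3 matched pairs I expect
df2.filter(F.col("category") == "A").count()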