I have two static lists, group_1 and group_2:
group_1 = [a,b,c,d,e,f,g]
group_2 = [h,i,j,k]
I have a PySpark DataFrame df1, as shown below.
Example1:
df1:
+-----+---------------------------+-------------------------+
|id   |array1                     |array2                   |
+-----+---------------------------+-------------------------+
|id1  |[a,b,c,d,group_1,group_2]  |[a,b,c,d,e,f,g,h,i,j,k]  |
+-----+---------------------------+-------------------------+
output_df:
+-----+-------------------+-------------------+
|id   |col1               |col2               |
+-----+-------------------+-------------------+
|id1  |[a,b,c,d]          |[a,b,c,d]          |
|id1  |[e,f,g]            |group_1            |
|id1  |[h,i,j,k]          |group_2            |
+-----+-------------------+-------------------+
The array2 column always contains the elements of the array1 column, with each group expanded into its members. That is how my source DataFrame (source_df1) will look.
In the array1 column there are individual elements such as a, b, c, d as well as the group elements group_1 and group_2, but taken together they are all distinct.
Now I want to create a PySpark DataFrame by exploding array1 in such a way that the individual and group elements are categorized as shown in output_df.
Example1 observation: In output_df, the second record (group_1) has only [e,f,g], because the other elements of group_1 (a, b, c, d) are already listed as individual elements.
Example2:
source_df1:
+-----+---------------------------+-------------------------+
|id   |array1                     |array2                   |
+-----+---------------------------+-------------------------+
|id1  |[a,b,group_1,group_2]      |[a,b,c,d,e,f,g,h,i,j,k]  |
+-----+---------------------------+-------------------------+
output_df:
+-----+-------------------+-------------------+
|id   |col1               |col2               |
+-----+-------------------+-------------------+
|id1  |[a,b]              |[a,b]              |
|id1  |[c,d,e,f,g]        |group_1            |
|id1  |[h,i,j,k]          |group_2            |
+-----+-------------------+-------------------+
Example2 observation: In output_df, the second record (group_1) has only [c,d,e,f,g], because the other elements of group_1 (a, b) are already listed as individual elements.
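To make the expected logic concrete, here is a minimal plain-Python sketch of the categorization for a single row. It is not PySpark and only illustrates the rule from the two observations. The function name categorize and the GROUPS dictionary are hypothetical names I made up for illustration, the elements are written as strings, and I am assuming that a group member only counts if it also appears in array2.

```python
from typing import Dict, List, Tuple, Union

# Static group definitions from above, with elements written as strings.
# GROUPS is a hypothetical name used only for this illustration.
GROUPS: Dict[str, List[str]] = {
    "group_1": ["a", "b", "c", "d", "e", "f", "g"],
    "group_2": ["h", "i", "j", "k"],
}

Row = Tuple[List[str], Union[List[str], str]]


def categorize(array1: List[str], array2: List[str],
               groups: Dict[str, List[str]] = GROUPS) -> List[Row]:
    """Return the (col1, col2) pairs expected for one input row."""
    # Entries of array1 that are not group names are individual elements.
    individual = [x for x in array1 if x not in groups]
    taken = set(individual)
    # Assumption: only members that actually occur in array2 are kept.
    present = set(array2)

    rows: List[Row] = []
    if individual:
        # Individual elements form one record, repeated in both columns.
        rows.append((individual, individual))
    for name in array1:
        if name in groups:
            # A group keeps only members not already listed individually.
            members = [m for m in groups[name]
                       if m in present and m not in taken]
            rows.append((members, name))
    return rows


# Example2 input from above
for col1, col2 in categorize(["a", "b", "group_1", "group_2"],
                             list("abcdefghijk")):
    print(col1, col2)
```

In PySpark I would presumably need something equivalent, either a UDF wrapping logic like this followed by an explode, or built-in array functions, but I am not sure which approach is appropriate.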
Can anyone please help me achieve this?