I have a table (dataframe) as following:
scala> df1.printSchema
root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
I need to create an array of data in col1 for each element in col2, as follow:
scala> df2.printSchema
root
|-- id: string (nullable = true)
|-- c1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- c2: string (nullable = true)
df2.c2 is for each element in df1.col2 and df2.c1 is array of df1.col1 element.
SQL (hive) or Spark/Scala can be helpful.
More explanation:
df1:
+----------------------------+
| id | col1 | col2 |
+----------------------------+
| 1 | q1 |[i1, i2] |
| 1 | q2 |[i1, i3] |
| 1 | q3 |[i2, i4] |
| 2 | q4 |[i5] |
| 2 | q5 |[i6] |
| 3 | q6 |[i7,i1,i2] |
| 3 | q7 |[i1] |
+----------------------------+
df2:
+----------------------------+
| id | c1 | c2 |
+----------------------------+
| 1 | [q1, q2] | i1 |
| 1 | [q1, q3] | i2 |
| 1 | [q2] | i3 |
| 1 | [q3] | i4 |
| 2 | [q4] | i5 |
| 2 | [q5] | i6 |
| 3 | [q6] | i7 |
| 3 | [q6, q7] | i1 |
| 3 | [q6] | i2 |
+----------------------------+