Aggregate data from a dataframe column based on another column in Scala

Question

I have a table (dataframe) as following:

    scala> df1.printSchema
      root
       |-- id: string (nullable = true)
       |-- col1: string (nullable = true)
       |-- col2: array (nullable = true)
       |    |-- element: string (containsNull = true)

I need to create an array of data in col1 for each element in col2, as follow:

     scala> df2.printSchema
      root
       |-- id: string (nullable = true)
       |-- c1: array (nullable = true)
       |    |-- element: string (containsNull = true)
       |-- c2: string (nullable = true)

df2.c2 is for each element in df1.col2 and df2.c1 is array of df1.col1 element.

SQL (hive) or Spark/Scala can be helpful.

More explanation:

df1:

  +----------------------------+
  | id | col1 |       col2     |
  +----------------------------+
  | 1  |  q1  |[i1, i2]        |
  | 1  |  q2  |[i1, i3]        |
  | 1  |  q3  |[i2, i4]        |
  | 2  |  q4  |[i5]            |
  | 2  |  q5  |[i6]            |
  | 3  |  q6  |[i7,i1,i2]      |
  | 3  |  q7  |[i1]            |
  +----------------------------+

df2:

  +----------------------------+
  | id |    c1      |    c2    |
  +----------------------------+
  | 1  |  [q1, q2]  |    i1    |
  | 1  |  [q1, q3]  |    i2    |
  | 1  |  [q2]      |    i3    |
  | 1  |  [q3]      |    i4    |
  | 2  |  [q4]      |    i5    |
  | 2  |  [q5]      |    i6    |
  | 3  |  [q6]      |    i7    |
  | 3  |  [q6, q7]  |    i1    |
  | 3  |  [q6]      |    i2    |
  +----------------------------+

could you please also add some sample data and your expectation? would be easy to understand the req. am little confused on this line "df1.col1 will be added to df2.c2 if df1.col2 element is in the same row". Thanks — vikrant rana
i removed it, since it was confusing. It was extra not-needed explanation and basically i need to collect all entries in col1 into a set for each element of the array in col2. — Alan
Is this question duplicated: stackoverflow.com/q/57447955/2700344 ? — leftjoin
No. what I did is: first explode col2 and then group by for both id and col1. if it is correct? — Alan
You need to aggregate col1 group by id, col2(exploded). See my answer — leftjoin

leftjoin leftjoin · Accepted Answer · 2019-08-12T07:25:32

First explode col2, then aggregate col1 array using collect_set:

select  d.id, collect_set(d.col1) as c1, s.c2
   from df1 d lateral view explode(d.col2) s as c2
group by d.id, s.c2

Aggregate data from a dataframe column based on another column in Scala

1 Answers