1
votes

I have two tables with the following schema as example:

   scala> df1.printSchema
   root
    |-- id: string (nullable = true)

   AND

   scala> df2.printSchema
   root
    |-- col1: string (nullable = true)
    |-- col2: array (nullable = true)
    |    |-- element: string (containsNull = true)

I want to get all col1 in df2 where an element in col2 array is equal to id in df1. Something such as df3 is output:

   scala> df3.printSchema
   root
    |-- c1: array (nullable = true)
    |    |-- element: string (containsNull = true)
    |-- c2: string (nullable = true)

where df3.c2 is basically df1.id and df3.c1 is array of all df2.col1 that satisfy the mentioned equality.

any SQL (hive) or Scala solution is very helpful.

2

2 Answers

4
votes

In Hive:

select collect_set(df2.col1) as col1, df1.id as col2
 from df1
inner join 
(
select --explode col2 array 
      col1, s.c2 as col2 
 from df2 lateral view explode(col2) s as c2 
) df2 on df1.id = df2.col2
group by df1.id;
1
votes

I don't think you need a subquery for this:

select collect_set(df2.col1) as col1, df1.id as col2
from df2 lateral view
     explode(col2) s as c2 join
     df1
     on df1.id = s.c2
group by df1.id;