2 votes

I have 2 dataframes that I would like to join.

DF1:

root
 |-- myStruct: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- region: long (nullable = true)
 |-- first_name: string (nullable = true)

DF2:

root
 |-- id: string (nullable = true)
 |-- region: long (nullable = true)
 |-- second_name: string (nullable = true)

My join statement is:

df1.join(df2, Seq("id", "region"), "leftouter")

but it fails with:

USING column `id` cannot be resolved on the left side of the join. The left-side columns: myStruct, first_name

I am running Spark 2.2 with Scala.


2 Answers

4 votes

You can use the `.` notation to select a field from a struct column. So to select id from df1 you use myStruct.id, and to select region you use myStruct.region.

And since the column names on the two sides are not the same, you can use the === notation to express the join condition:

df1.join(df2, df1("myStruct.id") === df2("id") && df1("myStruct.region") === df2("region"), "leftouter")

The joined dataframe should have the following schema:

root
 |-- myStruct: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- region: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- id: string (nullable = true)
 |-- region: long (nullable = true)
 |-- second_name: string (nullable = true)

You can drop the unnecessary columns after the join, or select only the columns you need.
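A minimal, self-contained sketch of the whole flow (the sample rows and the local SparkSession setup here are assumptions for illustration, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

val spark = SparkSession.builder.master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// Build df1 with a struct column matching the question's schema
val df1 = Seq(("a1", 1L, "John"))
  .toDF("id", "region", "first_name")
  .select(struct($"id", $"region").as("myStruct"), $"first_name")

// df2 has flat join keys
val df2 = Seq(("a1", 1L, "Doe")).toDF("id", "region", "second_name")

val joined = df1.join(
  df2,
  df1("myStruct.id") === df2("id") && df1("myStruct.region") === df2("region"),
  "leftouter")

// Drop df2's duplicate join keys, keeping the ones inside myStruct
val result = joined.drop(df2("id")).drop(df2("region"))
```

Using `drop(df2("id"))` with a Column reference (rather than a plain string) disambiguates which of the two `id` columns to remove.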

I hope the answer is helpful.

0 votes

This is because in DF1, id is a field of the column myStruct, which is of type struct. In order to join, you can do something like:

val df = df1
.withColumn("id", col("myStruct.id"))
.withColumn("region", col("myStruct.region"))

df.join(df2, Seq("id", "region"), "leftouter")

This essentially extracts id and region from the struct into top-level columns, so the USING join can resolve them.
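Because Seq("id", "region") is a USING-style join, each key appears only once in the output. If you only want the copies inside myStruct, you can drop the flattened keys afterwards (a sketch, assuming the df and df2 from above):

```scala
val result = df.join(df2, Seq("id", "region"), "leftouter")
  .drop("id", "region") // the keys are still available as myStruct.id / myStruct.region
```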