2 votes

I have 2 dataframes that I would like to join.

DF1:

root
 |-- myStruct: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- region: long (nullable = true)
 |-- first_name: string (nullable = true)

DF2:

root
 |-- id: string (nullable = true)
 |-- region: long (nullable = true)
 |-- second_name: string (nullable = true)

My join statement is:

df1.join(df2, Seq("id", "region"), "leftouter")

but it fails with:

USING column `id` cannot be resolved on the left side of the join. The left-side columns: myStruct, first_name

I am running Spark 2.2 with Scala.


2 Answers

4 votes

You can use the `.` notation to select a field from a struct column. So to select id from df1 you use myStruct.id, and to select region you use myStruct.region.

And since the column names on the two sides are not the same, you can use the === notation to express the join condition:

df1.join(df2, df1("myStruct.id") === df2("id") && df1("myStruct.region") === df2("region"), "leftouter")

The joined dataframe should have the following schema:

root
 |-- myStruct: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- region: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- id: string (nullable = true)
 |-- region: long (nullable = true)
 |-- second_name: string (nullable = true)

You can drop the unnecessary columns after the join, or select only the columns you need.
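A minimal, self-contained sketch of the whole flow (the sample rows and the local SparkSession setup here are assumptions for illustration, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

val spark = SparkSession.builder.master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// Build df1 with a struct column matching the question's schema
val df1 = Seq(("a1", 1L, "John"))
  .toDF("id", "region", "first_name")
  .select(struct($"id", $"region").as("myStruct"), $"first_name")

// df2 has flat join keys
val df2 = Seq(("a1", 1L, "Doe")).toDF("id", "region", "second_name")

val joined = df1.join(
  df2,
  df1("myStruct.id") === df2("id") && df1("myStruct.region") === df2("region"),
  "leftouter")

// Drop df2's duplicate join keys, keeping the ones inside myStruct
val result = joined.drop(df2("id")).drop(df2("region"))
```

Using `drop(df2("id"))` with a Column reference (rather than a plain string) disambiguates which of the two `id` columns to remove.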

I hope the answer is helpful.

0 votes

This is because in DF1, id is a field of the column myStruct, which is of type struct. In order to join, you can do something like:

val df = df1
.withColumn("id", col("myStruct.id"))
.withColumn("region", col("myStruct.region"))

df.join(df2, Seq("id", "region"), "leftouter")

This essentially extracts id and region from the struct into top-level columns, so the USING join can resolve them.
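Because Seq("id", "region") is a USING-style join, each key appears only once in the output. If you only want the copies inside myStruct, you can drop the flattened keys afterwards (a sketch, assuming the df and df2 from above):

```scala
val result = df.join(df2, Seq("id", "region"), "leftouter")
  .drop("id", "region") // the keys are still available as myStruct.id / myStruct.region
```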