Spark Dataframe Join - Duplicate column (non-joined column)

Question

I have two Dataframes df1 (Employee table) & df2 (Department table) with following schema :

df1.columns
// Arrays(id,name,dept_id)

and

df2.columns
// Array(id,name)

After i join these two tables on df1.dept_id and df2.id :

val joinedData = df1.join(df2,df1("dept_id")===df2("id"))
joinedData.columns
// Array(id,name,dept_id,id,name)

While saving it in file,

joined.write.csv("<path>")

it gives error :

 org.apache.spark.sql.AnalysisException: Duplicate column(s) : "name", "id" found, cannot save to file.;

I read about using Sequence of Strings to avoid column duplication but that is for columns on which join is to be performed. I need a similar functionality for non-joined columns.

Is there a direct way to embed table name with repeated column so that it can be saved ?

I came up with a solution of matching columns of both dfs and renaming duplicate columns to append table-name to column-name. But is there a direct way ?

Note : This will be a generic code with only column details on which join is performed. Rest columns will be known at runtime only. So we can't rename columns by hard-coding it.

Raphael Roth Raphael Roth · Accepted Answer · 2018-01-17T09:54:25

I would just keep all columns by making sure they have different names, e.g. by prepending an identifier to the column names:

val df1Cols = df1.columns
val df2Cols = df2.columns

// prefixes to column names
val df1pf = df1.select(df1Cols.map(n => col(n).as("df1_"+n)):_*)
val df2pf = df2.select(df2Cols.map(n => col(n).as("df2_"+n)):_*)

df1pf.join(df2pf,
    $"df1_dept_id"===$"df2_id",
 )

Spark Dataframe Join - Duplicate column (non-joined column)

4 Answers