I have the following Spark DataFrames:
df1 with columns (id, name, age)
df2 with columns (id, salary, city)
df3 with columns (name, dob)
I want to join all three DataFrames using PySpark. This is the SQL statement I need to replicate:
SQL:
select df1.*, df2.salary, df3.dob
from df1
left join df2 on df1.id=df2.id
left join df3 on df1.name=df3.name
I tried something like the following in PySpark, but I am receiving an error.
joined_df = df1.join(df2,df1.id=df2.id,'left')\
.join(df3,df1.name=df3.name)\
.select(df1.(*),df2(name),df3(dob)
My question: Can we join all the three DataFrames in one go and select the required columns?
df1.(*) is invalid syntax. - pault