Spark data frame execution

Question

I want to understand how Spark executes DataFrames. I have gone through the logs and the explain plan, but it is still not clear to me. Say I have a Spark program with a series of DataFrames like the ones below, each of which does something:

df1 = gets some data
df2 = gets some other data
df3 = df1.join(df2....)
df4 = df3.join(some other data set)
df5 = df3.join(some other data set)
df6 = df4.join(some other data set)
df7 = df5.join(some other data set)
df6.write...()
df7.write...()

Let's say the above is the series of DataFrames. When df6.write is issued, do df1, df2, df3 and df4 get executed, and when df7.write is issued, do df1, df2, df3 and df5 get executed again? Is it a good idea to persist the df3 DataFrame?

Gsquare Gsquare · Accepted Answer · 2017-03-26T08:01:51

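When df6.write is issued, do df1, df2, df3 and df4 get executed, and when df7.write is issued, do df1, df2, df3 and df5 get executed again?

Ans: Yes. Each write is a separate action, and unless an intermediate result is persisted, Spark generally recomputes the full lineage behind it for each action.

Is it a good idea to persist the df3 DataFrame?

Ans: Yes. df3 is shared by both write paths, so persisting it lets the second write reuse the result instead of recomputing the join of df1 and df2.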
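
To make the recomputation concrete, here is a minimal sketch in plain Python. It is a toy model of lazy evaluation, not Spark itself: the LazyFrame class, its methods, the other_a to other_d inputs and the compute_count counter are all hypothetical names made up for illustration. It only counts how often each node in the question's pipeline would be computed, with and without persisting df3. In a real job the corresponding calls would be DataFrame.persist() or DataFrame.cache() on df3, and unpersist() once both writes are done.

```python
# Toy model of lazy evaluation with optional caching. This is NOT Spark;
# LazyFrame and all of its methods are hypothetical, for illustration only.

class LazyFrame:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = parents
        self.persisted = False
        self._cached = None
        self.compute_count = 0  # how many times this node was actually computed

    def join(self, other, name):
        # A transformation only records lineage; nothing is computed here.
        return LazyFrame(name, parents=(self, other))

    def persist(self):
        # Mark the node so its result is kept after the first computation.
        self.persisted = True
        return self

    def _compute(self):
        # Reuse the stored result if this node was persisted and already computed.
        if self.persisted and self._cached is not None:
            return self._cached
        # Otherwise walk the lineage: compute every parent, then this node.
        inputs = [parent._compute() for parent in self.parents]
        self.compute_count += 1
        result = (self.name, tuple(inputs))
        if self.persisted:
            self._cached = result
        return result

    def write(self):
        # An action: triggers computation of the whole lineage behind this node.
        return self._compute()


def build_pipeline(persist_df3):
    # Mirrors the pipeline from the question; the other_* inputs are placeholders.
    df1 = LazyFrame("df1")
    df2 = LazyFrame("df2")
    df3 = df1.join(df2, "df3")
    if persist_df3:
        df3.persist()
    df4 = df3.join(LazyFrame("other_a"), "df4")
    df5 = df3.join(LazyFrame("other_b"), "df5")
    df6 = df4.join(LazyFrame("other_c"), "df6")
    df7 = df5.join(LazyFrame("other_d"), "df7")
    df6.write()
    df7.write()
    frames = (df1, df2, df3, df4, df5, df6, df7)
    return {frame.name: frame.compute_count for frame in frames}


without_persist = build_pipeline(persist_df3=False)
with_persist = build_pipeline(persist_df3=True)

print(without_persist)
# {'df1': 2, 'df2': 2, 'df3': 2, 'df4': 1, 'df5': 1, 'df6': 1, 'df7': 1}
print(with_persist)
# {'df1': 1, 'df2': 1, 'df3': 1, 'df4': 1, 'df5': 1, 'df6': 1, 'df7': 1}
```

In this model df1, df2 and df3 are computed twice without persistence and once with it, which is the behaviour the answer describes. Real Spark adds details the sketch ignores, such as storage levels and eviction of cached data under memory pressure.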
