
I have three DataFrames: df1, df2, and df3. Each DataFrame has approximately 3 million rows. df1 and df3 have approximately 8 columns each; df2 has only 3 columns.
(The source text file of df1 is approximately 600 MB.)

These are the operations performed (sketched in code right after this list):

  • df_new = df1 left join df2 -> group by df1 columns -> select df1 columns, first(df2 columns)

  • df_final = df_new outer join df3

  • df_split1 = df_final filtered using condition1

  • df_split2 = df_final filtered using condition2

  • write df_split1 and df_split2 into a single table after performing different operations on both DataFrames
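
In code, the whole pipeline looks roughly like this. This is only a sketch: the join key ("key") and the split columns ("ColX"/"ColY", taken from the comments below) are placeholders for the real schema, and "target_table" is a hypothetical table name.

    from pyspark.sql import functions as F

    # Step 1: left join df1 with df2 on a placeholder key, group by all of
    # df1's columns, and keep the first matching df2 value per group.
    agg_exprs = [F.first(c).alias(c) for c in df2.columns if c != "key"]
    df_new = (df1.join(df2, df1["key"] == df2["key"], "left_outer")
                 .groupBy(*[df1[c] for c in df1.columns])
                 .agg(*agg_exprs))

    # Step 2: outer join the result with df3 (again on a placeholder key).
    df_final = df_new.join(df3, df_new["key"] == df3["key"], "outer")

    # Steps 3-4: split df_final on the two conditions (per the comments
    # below, condition1/condition2 are null checks on two columns).
    df_split1 = df_final.filter(df_final["ColX"].isNull())
    df_split2 = df_final.filter(df_final["ColY"].isNotNull())

    # Step 5: apply the per-split transformations (not shown here), then
    # write both splits into a single table. unionAll is the Spark 1.x
    # name (union in 2.x); saveAsTable is the 1.3-era write path
    # (the df.write API only arrived in 1.4).
    df_split1.unionAll(df_split2).saveAsTable("target_table")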

This entire process takes 15 minutes in PySpark 1.3.1, with the default partition value = 10, executor memory = 30G, driver memory = 10G, and I have used cache() wherever necessary.
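
Roughly, the setup corresponds to the following configuration (a sketch: I am assuming "default partition value" maps to spark.sql.shuffle.partitions, and the memory settings are normally passed to spark-submit rather than set in code):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = (SparkConf()
            .set("spark.executor.memory", "30g")
            # driver memory usually has to be set via spark-submit, before
            # the driver JVM starts; shown here only for completeness
            .set("spark.driver.memory", "10g")
            # assumed to be what "default partition value = 10" refers to
            .set("spark.sql.shuffle.partitions", "10"))

    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)  # HiveContext, since Hive tables are involved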

But when I use Hive queries, it takes hardly 5 minutes. Is there any particular reason why my DataFrame operations are slow, and is there any way I can improve the performance?

Even though you are caching, the data still needs to be loaded into Spark, whereas it is already in Hive. - eliasah
How do you describe your condition1/condition2? - Reactormonk
Does this mean Hive is faster than Spark? - Mohan
The conditions are ColX is null and ColY is not null. - Mohan
Why do you use such an outdated Spark version? There is a huge performance gap between 1.3 and 1.6. - zero323

1 Answer


You should be careful with the use of JOIN.

Joins in Spark can be really expensive, especially between two large DataFrames, because each join generally has to shuffle both sides across the network. You can avoid that expensive shuffle by repartitioning the two DataFrames on the join column, or by using the same partitioner on both, so that matching keys are already co-located.
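
For example, something along these lines ("key" is a placeholder join column; note that column-based repartition on DataFrames only arrived in Spark 1.6, so on 1.3.1 you would have to drop to the RDD API and apply the same partitioning on both sides):

    # Spark 1.6+ DataFrame API: repartition both sides on the join column
    # so that matching keys land in the same partitions before the join.
    df1_part = df1.repartition(200, "key")
    df2_part = df2.repartition(200, "key")
    joined = df1_part.join(df2_part,
                           df1_part["key"] == df2_part["key"], "left_outer")

    # Rough Spark 1.3 equivalent: pair RDDs hash-partitioned the same way
    # (same number of partitions) on both sides of the join.
    rdd1 = df1.rdd.map(lambda r: (r["key"], r)).partitionBy(200)
    rdd2 = df2.rdd.map(lambda r: (r["key"], r)).partitionBy(200)
    joined_rdd = rdd1.join(rdd2)  # co-partitioned, so the join avoids a shuffle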