
I have three DataFrames: df1, df2, and df3. Each DataFrame has approximately 3 million rows. df1 and df3 have approximately 8 columns each; df2 has only 3 columns.
(The source text file of df1 is approximately 600 MB.)

These are the operations performed (sketched in code right after this list):

  • df_new = df1 left join df2 -> group by df1 columns -> select df1 columns, first(df2 columns)

  • df_final = df_new outer join df3

  • df_split1 = df_final filtered using condition1

  • df_split2 = df_final filtered using condition2

  • write df_split1 and df_split2 into a single table after performing different operations on both DataFrames
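
In code, the whole pipeline looks roughly like this. This is only a sketch: the join key ("key") and the split columns ("ColX"/"ColY", taken from the comments below) are placeholders for the real schema, and "target_table" is a hypothetical table name.

    from pyspark.sql import functions as F

    # Step 1: left join df1 with df2 on a placeholder key, group by all of
    # df1's columns, and keep the first matching df2 value per group.
    agg_exprs = [F.first(c).alias(c) for c in df2.columns if c != "key"]
    df_new = (df1.join(df2, df1["key"] == df2["key"], "left_outer")
                 .groupBy(*[df1[c] for c in df1.columns])
                 .agg(*agg_exprs))

    # Step 2: outer join the result with df3 (again on a placeholder key).
    df_final = df_new.join(df3, df_new["key"] == df3["key"], "outer")

    # Steps 3-4: split df_final on the two conditions (per the comments
    # below, condition1/condition2 are null checks on two columns).
    df_split1 = df_final.filter(df_final["ColX"].isNull())
    df_split2 = df_final.filter(df_final["ColY"].isNotNull())

    # Step 5: apply the per-split transformations (not shown here), then
    # write both splits into a single table. unionAll is the Spark 1.x
    # name (union in 2.x); saveAsTable is the 1.3-era write path
    # (the df.write API only arrived in 1.4).
    df_split1.unionAll(df_split2).saveAsTable("target_table")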

This entire process takes 15 minutes in PySpark 1.3.1, with the default partition value = 10, executor memory = 30G, driver memory = 10G, and I have used cache() wherever necessary.
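
Roughly, the setup corresponds to the following configuration (a sketch: I am assuming "default partition value" maps to spark.sql.shuffle.partitions, and the memory settings are normally passed to spark-submit rather than set in code):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = (SparkConf()
            .set("spark.executor.memory", "30g")
            # driver memory usually has to be set via spark-submit, before
            # the driver JVM starts; shown here only for completeness
            .set("spark.driver.memory", "10g")
            # assumed to be what "default partition value = 10" refers to
            .set("spark.sql.shuffle.partitions", "10"))

    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)  # HiveContext, since Hive tables are involved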

But when I use Hive queries, it takes hardly 5 minutes. Is there any particular reason why my DataFrame operations are slow, and is there any way I can improve the performance?

Even though you are caching, the data still needs to be loaded into Spark, whereas it is already in Hive. - eliasah
How do you describe your condition1/condition2? - Reactormonk
Does this mean Hive is faster than Spark? - Mohan
The conditions are ColX is null and ColY is not null. - Mohan
Why do you use such an outdated Spark version? There is a huge performance gap between 1.3 and 1.6. - zero323

1 Answer


You should be careful with the use of JOIN.

Joins in Spark can be really expensive, especially between two large DataFrames, because each join generally has to shuffle both sides across the network. You can avoid that expensive shuffle by repartitioning the two DataFrames on the join column, or by using the same partitioner on both, so that matching keys are already co-located.
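
For example, something along these lines ("key" is a placeholder join column; note that column-based repartition on DataFrames only arrived in Spark 1.6, so on 1.3.1 you would have to drop to the RDD API and apply the same partitioning on both sides):

    # Spark 1.6+ DataFrame API: repartition both sides on the join column
    # so that matching keys land in the same partitions before the join.
    df1_part = df1.repartition(200, "key")
    df2_part = df2.repartition(200, "key")
    joined = df1_part.join(df2_part,
                           df1_part["key"] == df2_part["key"], "left_outer")

    # Rough Spark 1.3 equivalent: pair RDDs hash-partitioned the same way
    # (same number of partitions) on both sides of the join.
    rdd1 = df1.rdd.map(lambda r: (r["key"], r)).partitionBy(200)
    rdd2 = df2.rdd.map(lambda r: (r["key"], r)).partitionBy(200)
    joined_rdd = rdd1.join(rdd2)  # co-partitioned, so the join avoids a shuffle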