
Recently, I developed a Spark Streaming application using Scala and Spark. In this application, I have made extensive use of implicit classes (the "pimp my library" pattern) to implement general utilities, such as writing a DataFrame to HBase, by creating an implicit class that extends Spark's DataFrame. For example:

implicit class DataFrameExtension(private val dataFrame: DataFrame) extends Serializable {
  // Custom methods to perform some computations
}
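
For illustration, here is a fuller sketch of what this looks like (saveToHBase is a hypothetical stand-in for my actual utilities):

    import org.apache.spark.sql.DataFrame

    object DataFrameExtensions {
      implicit class DataFrameExtension(private val dataFrame: DataFrame) extends Serializable {
        // Hypothetical extension method standing in for the real utilities
        def saveToHBase(table: String): Unit = {
          // ... write dataFrame out to HBase via a connector ...
        }
      }
    }

    // At the call site the method reads as if it were defined on DataFrame itself:
    //   import DataFrameExtensions._
    //   df.saveToHBase("my_table")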

However, a senior architect on my team refactored the code (citing style mismatch and performance as reasons) and copied these methods to a new class. These methods now accept a DataFrame as an argument.
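
In other words, roughly this shape (names again hypothetical):

    import org.apache.spark.sql.DataFrame

    // After the refactor: a plain utility class, with the DataFrame passed in
    class DataFrameUtils extends Serializable {
      def saveToHBase(dataFrame: DataFrame, table: String): Unit = {
        // same computation as before, now taking the DataFrame as an argument
      }
    }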

Can anyone help me with the following?

  1. Do Scala's implicit classes create any overhead at run time?
  2. Does passing a DataFrame object between methods create any overhead, either in terms of method calls or serialization?
  3. I have searched a bit, but couldn't find any style guide that gives guidance on choosing implicit classes or methods over traditional methods.

Thanks in advance.

Almost all Scala libraries use implicits, not to mention scala.collection and Spark itself. Why does that colleague of yours even use Scala if he thinks that's bad? – shay__

1 Answer


Do Scala's implicit classes create any overhead at run time?

Not in your case. There is some overhead when the wrapped type is an AnyVal (it then has to be boxed). Implicits are resolved at compile time, and apart from perhaps a few virtual method calls there should be no overhead.
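
To make the AnyVal point concrete, here is a minimal sketch (not from your code): extension methods on a primitive are usually written as a value class, which lets the compiler elide the wrapper allocation at the call site:

    // A value class: because it extends AnyVal, calling 512L.megabytes
    // normally allocates no RichLong wrapper at all. Boxing (and thus
    // some overhead) only happens when the value class must be treated
    // as a reference type, e.g. stored in a collection or used as Any.
    implicit class RichLong(private val n: Long) extends AnyVal {
      def megabytes: Long = n * 1024L * 1024L
    }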

Does passing a DataFrame object between methods create any overhead, either in terms of method calls or serialization?

No, no more than for any other type. Obviously there will be no serialization.

... if I pass DataFrames between methods in Spark code, it might create a closure and, as a result, bring in the parent class that holds the DataFrame object.

Only if you use variables from the enclosing scope inside your DataFrame expressions, for example filter($"col" === myVar) where myVar is declared in the scope of the method. In that case, Spark might serialize the wrapping class, but it is easy to avoid. Please remember that DataFrames are passed quite often and quite deep inside Spark's own code, and probably in every other library you might be using (data sources, for example).
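
Here is a minimal sketch of that pitfall in its most common form, a lambda that closes over an instance member, together with the usual fix (class and field names are made up):

    import org.apache.spark.sql.Dataset

    class MyJob {                        // note: not Serializable
      val threshold = 42                 // instance member

      def bad(ds: Dataset[Int]): Dataset[Int] =
        // `threshold` is really `this.threshold`, so the lambda closes over
        // the whole MyJob instance -> "Task not serializable"
        ds.filter(x => x > threshold)

      def good(ds: Dataset[Int]): Dataset[Int] = {
        val t = threshold                // copy into a local val first
        ds.filter(x => x > t)            // now only the Int is captured
      }
    }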

It is very common (and handy) to use implicit extension classes the way you did.