I am assessing the performance of different ways of loading Parquet files in Spark and the differences are staggering.
In our Parquet files, we have nested case classes of the following shape:
case class C(/* about a dozen attributes */)
case class B(/* about a dozen attributes */, cs: Seq[C])
case class A(/* about a dozen attributes */, bs: Seq[B])
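(To make the examples below concrete, here is a stripped-down sketch with placeholder attribute names only: fieldToSum is the top-level field being summed, and c1 stands in for a nested numeric attribute.)
// Placeholders only; the real classes each have about a dozen attributes.
case class C(c1: Long /*, ... */)
case class B(b1: Long /*, ... */, cs: Seq[C])
case class A(fieldToSum: Long /*, ... */, bs: Seq[B])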
It takes a while to load them from Parquet, so I benchmarked different ways of loading these case classes and summing a field, on both Spark 1.6 and 2.0.
Here is a summary of the benchmark I did:
val df: DataFrame = sqlContext.read.parquet("path/to/file.gz.parquet").persist()
df.count()
// Spark 1.6
// Play Json
// 63.169s
df.toJSON.flatMap(s => Try(Json.parse(s).as[A]).toOption)
.map(_.fieldToSum).sum()
// Direct access to field using Spark Row
// 2.811s
df.map(row => row.getAs[Long]("fieldToSum")).sum()
// Some small library we developed that accesses fields using Spark Row
// 10.401s
df.toRDD[A].map(_.fieldToSum).sum()
// DataFrame with SQL-like DSL
// 0.239s
df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)
// Dataset with RDD-style code
// 34.223s
df.as[A].map(_.fieldToSum).reduce(_ + _)
// Dataset with column selection
// 0.176s
df.as[A].select($"fieldToSum".as[Long]).reduce(_ + _)
// Spark 2.0
// Performance is similar except for:
// Direct access to field using Spark Row
// 23.168s
df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)
// Some small library we developed that accesses fields using Spark Row
// 32.898s
df.toRDD[A].map(_.fieldToSum).sum()
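A simple wall-clock wrapper like the following is enough to reproduce this kind of measurement (a sketch only, not the exact harness behind the numbers above):
// Sketch of a wall-clock timer; label and formatting are illustrative only.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}
// e.g. time("Dataset with column selection") {
//   df.as[A].select($"fieldToSum".as[Long]).reduce(_ + _)
// }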
I understand why the performance of methods using Spark Row is degraded when upgrading to Spark 2.0, since DataFrame is now a mere alias of Dataset[Row].
That's the cost of unifying the interfaces, I guess.
On the other hand, I'm quite disappointed that the promise of Dataset is not kept: performance with RDD-style coding (map and flatMap) is much worse than when the Dataset is used like a DataFrame with the SQL-like DSL.
Basically, to have good performance, we need to give up type safety.
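A quick way to see where the extra cost comes from is to compare the physical plans of the two styles (output omitted; Spark 2.0 shown):
// The typed map goes through the encoder (deserialize each row to A, apply the lambda, re-serialize),
// while the column selection stays on the internal row representation.
df.as[A].map(_.fieldToSum).explain()
df.as[A].select($"fieldToSum".as[Long]).explain()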
What is the reason for such a difference between Dataset used as an RDD and Dataset used as a DataFrame?
Is there a way to improve encoding performance in Dataset so that RDD-style coding performs as well as SQL-style coding? For data engineering, it's much cleaner to have RDD-style code.
Also, working with the SQL-like DSL would require flattening our data model instead of using nested case classes. Am I right that good performance is only achieved with flat data models?
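For illustration, summing the placeholder attribute c1 buried in A.bs[].cs[] through the DSL already requires something like the following (hypothetical field name, assuming the usual implicits/functions imports), which is what pushes us towards either flattening or RDD-style code:
import org.apache.spark.sql.functions._
// Summing the placeholder attribute c1 nested in A.bs[].cs[] with the DSL:
df.select(explode($"bs").as("b"))
  .select(explode($"b.cs").as("c"))
  .agg(sum($"c.c1"))
  .collect().head.getAs[Long](0)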