15
votes

I know the advantages of Datasets (type safety, etc.), but I can't find any documentation on the limitations of Spark Datasets.

Are there specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?

Currently all our data engineering flows use Spark (Scala) DataFrames. We would like to use Datasets for all our new flows, so knowing the limitations and disadvantages of Datasets would help us.

EDIT: This is not a duplicate of "Spark 2.0 Dataset vs DataFrame", which explains some operations on DataFrames and Datasets, or of the other questions, most of which explain the differences between RDD, DataFrame, and Dataset and how they evolved. This question is about when NOT to use Datasets.

2
It is an odd question given that it is the way forward. - thebluephantom
Why so? There should be some scenarios where Spark DataFrames are best suited; we know that DataFrame is Dataset[Row]. - Ranga Vure
In any event, I did not do a minus 1. I am not a fan of Row. Eventually Datasets will prevail. It can use mapPartitions if I remember correctly. It blends RDD stuff as well. - thebluephantom
Possible duplicate of Spark 2.0 Dataset vs DataFrame - Raphael Roth

2 Answers

12
votes

There are a few scenarios where I find that a DataFrame (or Dataset[Row]) is more useful than a typed Dataset.

One example is consuming data without a fixed schema, such as JSON files containing records of different types with different fields. With a DataFrame I can easily "select" the fields I need without knowing the whole schema, or even use a runtime configuration to specify the fields I'll access.
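As a rough sketch of that pattern (the records, the field names, and the `wantedFields` list are made up for illustration and stand in for whatever your files and configuration contain):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("untyped-select").getOrCreate()
import spark.implicits._

// Hypothetical mixed records: each JSON line carries a different set of fields.
val raw = Seq(
  """{"kind":"click","user":"a","x":10,"y":20}""",
  """{"kind":"purchase","user":"b","amount":9.99}"""
).toDS()

// Spark infers one merged schema; fields absent from a record come back as null.
val events = spark.read.json(raw)

// The field list could be loaded from a config file or a command-line argument
// (hard-coded here as an assumption).
val wantedFields = Seq("kind", "user")
val selected = events.select(wantedFields.map(col): _*)

selected.show()
```

No case class describing the full record is needed, which is the point: a typed Dataset would require one type covering every possible field.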

Another consideration is that Spark can optimize the built-in Spark SQL operations and aggregations better than UDAFs and custom lambdas. For example, taking the square root of a value in a column is a built-in Spark SQL function (df.withColumn("rootX", sqrt("X"))). Doing the same thing in a lambda (ds.map(r => Math.sqrt(r.X))) would be less efficient, since Spark can't optimize your lambda function as effectively.
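One way to see the difference yourself is to compare the physical plans. The sketch below assumes a hypothetical case class `Measurement` with a single `X` field; the exact plan text varies by Spark version, so no output is shown.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sqrt}

val spark = SparkSession.builder().master("local[*]").appName("sqrt-comparison").getOrCreate()
import spark.implicits._

// Hypothetical record type, used only for this illustration.
case class Measurement(X: Double)

val ds = Seq(Measurement(4.0), Measurement(9.0)).toDS()

// Built-in column expression: Spark sees the whole computation.
val builtIn = ds.withColumn("rootX", sqrt(col("X")))

// Typed lambda: Spark has to deserialize each row into a Measurement,
// call the function, and serialize the result again.
val viaLambda = ds.map(m => math.sqrt(m.X))

// Compare the two plans; the lambda version typically contains extra
// deserialize/serialize steps around the map.
builtIn.explain()
viaLambda.explain()
```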

There are also many untyped DataFrame functions (such as the statistical functions) that are implemented for DataFrames but not for typed Datasets. Even if you start out with a Dataset, you'll often find that by the time you've finished your aggregations you're left with a DataFrame, because those functions work by creating new columns, which changes the schema of your Dataset.
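A small sketch of how that plays out, using hypothetical `Sale` and `RegionTotal` case classes: the aggregation and the `stat` helper both take column names as strings, and getting back to a typed Dataset takes a second case class that matches the new schema.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("typed-to-untyped").getOrCreate()
import spark.implicits._

// Hypothetical types for illustration only.
case class Sale(region: String, amount: Double, units: Double)
case class RegionTotal(region: String, total: Double)

val sales: Dataset[Sale] = Seq(
  Sale("east", 10.0, 1.0),
  Sale("east", 30.0, 3.0),
  Sale("west", 20.0, 2.0)
).toDS()

// groupBy(...).agg(...) is untyped: the result is a DataFrame, not a Dataset[Sale].
val totalsDF: DataFrame = sales.groupBy("region").agg(sum("amount").as("total"))

// Statistical helpers live under .stat and refer to columns by name.
val correlation: Double = sales.stat.corr("amount", "units")

// Returning to a typed Dataset needs a case class matching the new schema.
val totalsDS: Dataset[RegionTotal] = totalsDF.as[RegionTotal]

totalsDS.show()
```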

In general, I don't think you should migrate working DataFrame code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above, not all DataFrame features have Dataset equivalents.

1
votes

Limitations of Spark Datasets:

  1. Datasets used to be less performant (not sure if that's been fixed yet)
  2. You need to define a new case class whenever you change the Dataset schema, which is cumbersome
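  3. Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out:
import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._  // for toDS() and $"..."; already in scope in spark-shell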

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+
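To make points 2 and 3 concrete, here is a minimal sketch that reuses the `Birth` class from above. `BirthWithYear` is a hypothetical second case class added only to show what a schema change costs. The commented-out line shows the typed alternative, which the compiler rejects because `java.sql.Date` has no `reverse` method.

```scala
import java.sql.Date
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, year}

val spark = SparkSession.builder().master("local[*]").appName("schema-change").getOrCreate()
import spark.implicits._

case class Birth(hospitalName: String, birthDate: Date)
// Hypothetical: a second case class is needed once a column is added.
case class BirthWithYear(hospitalName: String, birthDate: Date, birthYear: Int)

val births: Dataset[Birth] = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

// withColumn changes the schema, so the result is an untyped DataFrame.
val withYearDF: DataFrame = births.withColumn("birthYear", year(col("birthDate")))

// Converting back to a typed Dataset requires the new case class.
val withYearDS: Dataset[BirthWithYear] = withYearDF.as[BirthWithYear]

// Type safety only applies to typed operations such as map. This line
// would not compile, unlike reverse($"birthDate") on a column:
// births.map(b => b.birthDate.reverse)

withYearDS.show()
```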