1
votes

From the DataSet and RDD documentation,

DataSet:

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each dataset also has an untyped view called a DataFrame, which is a Dataset of Row

RDD:

RDD represents an immutable,partitioned collection of elements that can be operated on in parallel

Also, it is said the difference between them:

The major difference is, dataset is collection of domain specific objects where as RDD is collection of any object. Domain object part of definition signifies the schema part of dataset. So dataset API is always strongly typed and optimized using schema where RDD is not.

I have two questions here;

  1. what does it mean dataset is collection of domain specific objects while RDD is collection of any object,Given a case class Person, I thought DataSet[Person] and RDD[Person] are both collection of domain specific objects

  2. dataset API is always strongly typed and optimized using schema where RDD is not Why is it said that dataset API always strongly typed while RDD not? I thought RDD[Person] is also strong typed

1
Who marks my question to be closed? Why should be it closed? - Tom

1 Answers

3
votes

Strongly typed Dataset (not DataFrame) is a collection of record types (Scala Products) which are mapped to internal storage format using so called Encoders, while RDD can store arbitrary serializable (Serializable or Kryo serializable object). Therefore as a container RDD is much more generic than Dataset.

Following:

. So dataset API is always strongly typed (...) where RDD is not.

is an utter absurd, showing that you shouldn't trust everything you can find on the Internet. In general Dataset API has significantly weaker type protections, than RDD. This is particularly obvious when working Dataset[Row], but applies to any Dataset.

Consider for example following:

case class FooBar(id: Int, foos: Seq[Int])

 Seq[(Integer, Integer)]((1, null))
   .toDF.select($"_1" as "id", array($"_2") as "foos")
   .as[FooBar]

which clearly breaks type safety.