From the Dataset and RDD documentation:
Dataset:
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
RDD:
RDD represents an immutable, partitioned collection of elements that can be operated on in parallel.
I have also read this description of the difference between them:
The major difference is, dataset is collection of domain specific objects whereas RDD is collection of any object. Domain object part of definition signifies the schema part of dataset. So dataset API is always strongly typed and optimized using schema where RDD is not.
I have two questions:
1. What does it mean that "dataset is collection of domain specific objects while RDD is collection of any object"? Given a case class Person, I thought Dataset[Person] and RDD[Person] are both collections of domain-specific objects.
2. "dataset API is always strongly typed and optimized using schema where RDD is not": why is it said that the Dataset API is always strongly typed while RDD is not? I thought RDD[Person] is also strongly typed.
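To make the comparison concrete, here is a minimal sketch of the two collections I have in mind. The fields of Person, the sample values, and the local SparkSession settings are assumptions made up for illustration; they are not taken from the documentation quoted above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain class; the fields are assumed for illustration only.
case class Person(name: String, age: Int)

object DatasetVsRdd {
  def main(args: Array[String]): Unit = {
    // Assumed local session, just enough to build both collections.
    val spark = SparkSession.builder()
      .appName("dataset-vs-rdd")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings Encoder[Person] and toDS() into scope

    val people = Seq(Person("Ann", 30), Person("Bob", 17))

    // Both declarations are parameterized by Person.
    val rdd: RDD[Person] = spark.sparkContext.parallelize(people)
    val ds: Dataset[Person] = people.toDS()

    // In both cases the compiler checks the lambda against Person's fields.
    val adultsRdd: RDD[Person] = rdd.filter(p => p.age >= 18)
    val adultsDs: Dataset[Person] = ds.filter(p => p.age >= 18)

    // The Dataset exposes a schema; RDD has no corresponding method.
    ds.printSchema()

    spark.stop()
  }
}
```

As far as I can tell, both filter calls are type-checked at compile time in the same way, which is why the "strongly typed" distinction confuses me.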