
We have a use case where we need to do some columnar transformations on Avro datasets. We have been running MR jobs until now and want to explore Spark. I am going through some tutorials and am not sure whether we should use RDDs or DataFrames/Datasets. Since DataFrames are stored in a columnar format, are they the right choice given that all my transformations are columnar in nature? Or does it not make much difference, since internally everything is based on RDDs anyway?


2 Answers


From a performance standpoint, the data format itself does not dictate which API you should use to describe your transformations.

I would advise going with the highest-level API available (DataFrames) and only switching to RDDs when an operation you need cannot be expressed any other way.
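As a minimal sketch of what that looks like: the snippet below reads an Avro dataset into a DataFrame and applies a couple of per-column transformations. It assumes the spark-avro module is on the classpath (built in as the `avro` format since Spark 2.4), and the paths and column names (`amount`, `currency`, `id`) are purely hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AvroColumnarJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-columnar-transform")
      .getOrCreate()

    // Read the Avro dataset; requires the spark-avro package on the classpath.
    val df = spark.read.format("avro").load("/data/input/events")

    // Columnar transformations are expressed per column; because the plan is
    // declarative, Catalyst/Tungsten can optimize it before execution.
    // "amount", "currency" and "id" are hypothetical column names.
    val transformed = df
      .withColumn("amount_usd", col("amount") * lit(1.1))
      .filter(col("currency") === "EUR")
      .select("id", "amount_usd")

    transformed.write.format("avro").save("/data/output/events_usd")

    spark.stop()
  }
}
```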


While trying to answer your question, I came across a comprehensive comparison of all three data structures (RDDs, DataFrames, and Datasets).

The answer in each particular case depends on the nature of your transformations rather than on the particular serialization format. In general, the higher-level API gives you more convenience, while the low-level API (RDDs) gives you more flexibility and control.
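To make the trade-off concrete, here is a rough sketch of the same aggregation written against both APIs. The column names `category` and `value` are hypothetical; the DataFrame version is optimized by Catalyst, while the RDD version hands you the raw shuffle but no automatic optimization.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// High-level: declarative, optimized by Catalyst.
def totalsWithDataFrame(df: DataFrame): DataFrame =
  df.groupBy("category").agg(sum("value").as("total"))

// Low-level: full control over the computation, but you manage the details yourself.
def totalsWithRdd(df: DataFrame): RDD[(String, Double)] =
  df.rdd
    .map(row => (row.getAs[String]("category"), row.getAs[Double]("value")))
    .reduceByKey(_ + _)
```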