0 votes

I have this code, and almost all of the transformations use the withColumn function, which returns a DataFrame. I convert the DataFrame returned from preProcessing to a Dataset using .as[Recipe], but since all the functions return a DataFrame, calling .as over and over doesn't make sense.

So my question is: what's the use case for Dataset[U] over Dataset[Row]/DataFrame? And is it worth using a Dataset in my case, given that the schema changes with each transformation (withColumn)?

case class Recipe(
    name: String,
    ingredients: String,
    url: String,
    image: String,
    cookTime: String,
    recipeYield: String,
    datePublished: java.sql.Date, // DateType is a Spark SQL schema type; a case class field should be java.sql.Date
    prepTime: String,
    description: String
)

private def preProcessing(spark: SparkSession, data: DataFrame): DataFrame = {
    data
      .transform(lowerCaseColumn("ingredients"))
      .transform(lowerCaseColumn("name"))
      .transform(covertStringToDate("datePublished"))
  }

private def transform(
      spark: SparkSession,
      data: Dataset[Recipe]
  ): DataFrame = {
    data
      .transform(filterRecipesWithBeef())
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
      .transform(covertRecipeTimeColToMinutes("cookTime"))
      .transform(covertRecipeTimeColToMinutes("prepTime"))
      .transform(calculateTotalCookingTime())
      .transform(calculateRecipeDifficulty())
      .transform(calculateAvgCookingtimeByDifficulty())
  }
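
To make the question concrete, here is a minimal sketch of how the two functions above could be wired together; `pipeline` is a hypothetical wrapper name, and it assumes the `Recipe` case class and the two functions shown above:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical glue code: convert to Dataset[Recipe] exactly once, at the
// boundary where the schema actually matches the case class. Any later
// withColumn call returns a DataFrame again, which is what prompts the question.
def pipeline(spark: SparkSession, raw: DataFrame): DataFrame = {
  import spark.implicits._ // brings the Encoder[Recipe] into scope for .as[Recipe]
  val recipes: Dataset[Recipe] = preProcessing(spark, raw).as[Recipe]
  transform(spark, recipes) // schema diverges from Recipe inside, so the result stays untyped
}
```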

2 Answers

2 votes

Consider DataFrame an alias for Dataset[Row]: a collection of generic objects, where a Row is a generic, untyped JVM object. A Dataset, by contrast, is a collection of strongly typed JVM objects, dictated by a case class you define in Scala. That means that with Dataset[T], both syntax errors and analysis errors (such as a misspelled column name) are caught at compile time, whereas with a DataFrame, analysis errors only surface at runtime.
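
A hedged illustration of that difference, reusing the Recipe case class from the question (the input path and the misspelled field `nme` are only for demonstration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.read.json("recipes.json") // DataFrame, i.e. Dataset[Row]
val ds = df.as[Recipe]                   // Dataset[Recipe]

df.select("nme")     // compiles fine; throws AnalysisException only at runtime
// ds.map(_.nme)     // does not compile: value nme is not a member of Recipe
ds.map(_.name.trim)  // typed field access, checked by the compiler
```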

1 vote

Datasets are, imho, a work in progress. You do get the type safety and compile-time errors stated in the first answer and elsewhere in blogs, but with 2.x (it may be different in 3.x) there are many issues to consider. E.g.:

  • There are untyped DataFrame functions that do not extend to Datasets.

  • What about grouping, aggregations, etc.? The case class field names are lost; the result is a Dataset of tuples with generic column names.

  • What about flexible or evolving JSON schemas, which don't map cleanly onto a fixed case class?
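
The "names are lost" point can be sketched as follows; `recipes` is assumed to be the Dataset[Recipe] from the question, and grouping by cookTime is purely illustrative:

```scala
import org.apache.spark.sql.functions.count
import spark.implicits._

// groupByKey keeps the typed API, but the aggregated result is a
// Dataset of tuples, not of Recipe, and the columns come back as
// generic names like "key" and "count(1)".
val countsByCookTime = recipes
  .groupByKey(_.cookTime)       // KeyValueGroupedDataset[String, Recipe]
  .agg(count("*").as[Long])     // Dataset[(String, Long)]
// countsByCookTime.columns no longer contains any Recipe field names
```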

I am an architect, but I noted during a stint as a Data Engineer that using Datasets was not as easy as touted. For the moment, DataFrames are still more practical. I had a quick scan of Spark 3 but could not spot any drastic changes; I looked at a couple of items but could not find anything to obviate my comments here (yet).