I want to cast the schema of a dataframe to change the type of some columns using Spark and Scala.

Specifically, I am trying to use the as[U] function, whose description reads: "Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U"

In principle this is exactly what I want, but I cannot get it to work.

Here is a simple example taken from https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala



    // definition of data
    val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")

As expected the schema of data is:


    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)
    

I would like to cast the column "b" to Double. So I try the following:



    import session.implicits._;

    println(" --------------------------- Casting using (String Double)")

    val data_TupleCast=data.as[(String, Double)]
    data_TupleCast.show()
    data_TupleCast.printSchema()

    println(" --------------------------- Casting using ClassData_Double")

    case class ClassData_Double(a: String, b: Double)

    val data_ClassCast= data.as[ClassData_Double]
    data_ClassCast.show()
    data_ClassCast.printSchema()

As I understand the definition of as[U], the new Datasets should have the following schema:


    root
     |-- a: string (nullable = true)
     |-- b: double (nullable = false)

But the output is:


     --------------------------- Casting using (String Double)
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  b|  2|
    +---+---+

    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)

     --------------------------- Casting using ClassData_Double
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  b|  2|
    +---+---+

    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)

which shows that column "b" has not been cast to double.

Any hints on what I am doing wrong?

BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?". I know I can change the type of columns one at a time, but I am looking for a more general solution that changes the schema of the whole data in one shot (and I am trying to understand Spark in the process).

I don't think you can. The as[U] API does not change the actual types; it only provides a typed API for handling the Dataset. U must match the actual types, and changing the actual types can only be done via transformations such as Column.cast, as explained in the question you linked. (comment by Tzach Zohar)

1 Answer

Well, since transformations are chained and Spark evaluates them lazily, it actually does change the schema of the whole data in one shot, even if you write it as changing one column at a time, like this:

    import spark.implicits._
    import org.apache.spark.sql.types._  // DoubleType, StringType

    df.withColumn("x", 'x.cast(DoubleType)).withColumn("y", 'y.cast(StringType))...
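
If the goal is to apply a whole target schema at once, the same Column.cast calls can be generated from a StructType and passed to a single select. The following is only a sketch: the helper name castTo, the target schema, and the local SparkSession are assumptions made for illustration, and the data is the two-column example from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")

// Hypothetical target schema: same column names, desired types.
val target = StructType(Seq(
  StructField("a", StringType),
  StructField("b", DoubleType)
))

// Hypothetical helper: cast every column listed in `schema` in one select.
def castTo(df: DataFrame, schema: StructType): DataFrame =
  df.select(schema.fields.map(f => col(f.name).cast(f.dataType)): _*)

val casted = castTo(data, target)
casted.printSchema() // column b is now reported as double

// The column types now match, so a typed view can be layered on top.
val typed = casted.as[(String, Double)]
```

Columns that are not listed in the target schema are dropped by this select, so the schema should name every column you want to keep.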

As an alternative, you could use map to do the cast in one go, for example:

    df.map{t => (t._1, t._2.asInstanceOf[Double], t._3.asInstanceOf[], ...)}
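
For the two-column data in the question, the map approach might look like the sketch below. It assumes the DataFrame is first given a typed view with as[(String, Int)], which matches the actual column types, so that the tuple fields are available inside map; the local SparkSession and the variable names are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

// Local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("map-cast-sketch").getOrCreate()
import spark.implicits._

val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")

// as[(String, Int)] matches the existing types, so it only adds a typed view.
val typed = data.as[(String, Int)]

// map builds new records; toDouble performs the numeric conversion.
val converted = typed.map { case (a, b) => (a, b.toDouble) }

// Tuple results get the default column names _1 and _2, so restore the originals.
val renamed = converted.toDF("a", "b")
renamed.printSchema() // column b is now reported as double
```

Unlike Column.cast, this route deserializes every row into JVM objects and back, so the cast-based version is usually the simpler choice when only the types need to change.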