I want to cast the schema of a dataframe to change the type of some columns using Spark and Scala.
Specifically I am trying to use as[U] function whose description reads: "Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U"
In principle this is exactly what I want, but I cannot get it to work.
Here is a simple example taken from https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
// definition of data
val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
As expected the schema of data is:
root |-- a: string (nullable = true) |-- b: integer (nullable = false)
I would like to cast the column "b" to Double. So I try the following:
import session.implicits._;
println(" --------------------------- Casting using (String Double)")
val data_TupleCast=data.as[(String, Double)]
data_TupleCast.show()
data_TupleCast.printSchema()
println(" --------------------------- Casting using ClassData_Double")
case class ClassData_Double(a: String, b: Double)
val data_ClassCast= data.as[ClassData_Double]
data_ClassCast.show()
data_ClassCast.printSchema()
As I understand the definition of as[u], the new DataFrames should have the following schema
root |-- a: string (nullable = true) |-- b: double (nullable = false)
But the output is
--------------------------- Casting using (String Double) +---+---+ | a| b| +---+---+ | a| 1| | b| 2| +---+---+ root |-- a: string (nullable = true) |-- b: integer (nullable = false) --------------------------- Casting using ClassData_Double +---+---+ | a| b| +---+---+ | a| 1| | b| 2| +---+---+ root |-- a: string (nullable = true) |-- b: integer (nullable = false)
which shows that column "b" has not been cast to double.
Any hints on what I am doing wrong?
BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?" (see How to change column types in Spark SQL's DataFrame?). I know I can change the type of columns one at a time, but I am looking for a more general solution that changes the schema of the whole data in one shot (and I am trying to understand Spark in the process).
as[U]
API does not change the actual types, it just provides a typed API for handling the dataset;U
must match the actual types, and changing the actual types can only be done via transformations such asColumn.cast
as explained in the question you linked. – Tzach Zohar