I have previously written a Dataset[T] to a csv file.
In this case T is a case class that contains a field x: Option[BigDecimal].
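A cut-down, illustrative version of T (the real class has more fields, and these names are made up) looks something like this:

case class Record(
  id: Option[Long],        // nullable numeric column
  x: Option[BigDecimal],   // nullable decimal column -- the one causing trouble
  name: Option[String]     // nullable string column
)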
When I attempt to load the file back into a Dataset[T] I see the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `x` from double to decimal(38,18) as it may truncate.
I guess the reason is that the inferred schema contains a double column rather than a decimal/BigDecimal column. Is there a way around this issue? I would like to avoid casting based on column name because the read code is part of a generic function. My read code is below:
val a = spark
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(file)
  .as[T]
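For what it's worth, inspecting the inferred schema directly should confirm this; the error suggests x is being inferred as double:

// Inspect what inferSchema produced for the file
spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(file)
  .printSchema()   // expect x to show up as double, not decimal(38,18)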
My case classes reflect tables read from JDBC, with Option[T] used to represent a nullable field; Option[BigDecimal] is used to receive a Decimal field from JDBC. I have added some extension ("pimped") methods for reading/writing these datasets from/to csv files on my local machine so I can easily inspect the contents.
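The helpers are roughly along these lines (simplified; the method names here are illustrative and the real versions take a few more options):

import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Simplified sketch of the local csv helpers
object CsvSyntax {

  implicit class CsvWriteOps[T](ds: Dataset[T]) {
    def writeLocalCsv(path: String): Unit =
      ds.write
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .mode("overwrite")
        .save(path)
  }

  implicit class CsvReadOps(spark: SparkSession) {
    def readLocalCsv[T: Encoder](path: String): Dataset[T] =
      spark.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(path)
        .as[T]
  }
}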
So my next attempt was this:
var df = spark
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(implicitly[Encoder[T]].schema)
  .load(file)

val schema = df.schema

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

schema.foreach { field =>
  field.dataType match {
    case _: DoubleType =>
      // cast any double column up to the decimal type the case class expects
      df = df.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => // do nothing
  }
}

df.as[T]
Unfortunately every field in my case class now comes back as None rather than the expected value. If I just load the csv as a DataFrame with inferred types, all of the column values are correctly populated.
It looks like I actually have two issues.
- Conversion from Double to BigDecimal.
- Nullable fields are not being wrapped in Options.
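One direction I have been experimenting with (a sketch only, not yet working end to end, and assuming spark and an implicit Encoder[T] are in scope as above) is to read with inferred types and then cast each column to the type from the encoder's schema before calling .as[T]:

import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.functions.col

// Sketch: cast each inferred column to the type the encoder for T expects,
// so the final .as[T] should no longer have to up-cast double to decimal(38,18)
def readCsvAs[T: Encoder](file: String): Dataset[T] = {
  val target = implicitly[Encoder[T]].schema
  val raw = spark.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(file)

  target.fields
    .foldLeft(raw)((df, f) => df.withColumn(f.name, col(f.name).cast(f.dataType)))
    .as[T]
}

I am not sure whether this would also address the Option/nullability issue, though.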
Any help/advice would be gratefully received. I'm happy to adjust my approach if writing/reading Options and BigDecimals to/from csv files is inherently problematic.