Spark SQL - Read csv into Dataset[T] where T is a case class of Option[BigDecimal] field

Question

I have previously written a Dataset[T] to a csv file.

In this case T is a case class that contains field x: Option[BigDecimal]

When I attempt to load the file back into a Dataset[T] I see the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `x` from double to decimal(38,18) as it may truncate.

I guess the reason is that the inferred schema contains a double rather than BigDecimal column. Is there a way around this issue? I wish to avoid casting based on column name because the read code is part of a generic function. My read code is below:

   val a = spark
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(file)
    .as[T]

My case classes reflect tables read from JDBC with Option[T] used to represent a nullable field. Option[BigDecimal] is used to receive a Decimal field from JDBC.

I have pimped on some code to read/write from/to csv files when reading/writing on my local machine so I can easily inspect the contents.

So my next attempt was this:

   var df = spark
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .schema(implicitly[Encoder[T]].schema)
    .load(file)

  val schema = df.schema

  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._

  schema.foreach{ field =>
    field.dataType match {
      case t: DoubleType =>
        df = df.withColumn(field.name, 
          col(field.name).cast(DecimalType(38,18)))
      case _ => // do nothing
    }
  }

  df.as[T]

Unfortunately my case class now contains all Nones rather than the values expected. If I just load the csv as a DF with inferred types all of the column values are correctly populated.

It looks like I actually have two issues.

Conversion from Double -> BigDecimal.
Nullable fields are not being wrapped in Options.

Any help/advice would be gratefully received. Happy to adjust my approach if easily writing/reading Options/BigDecimals from csv files is problematic.

hi @Terry two questions? Is there any specific reason you dont want to cast to BigDecimal? Can you post some sample data? — abiratsis
I have updated my original post to provide more information. — Terry Dactyl
The case class would look like MyCaseClass(a: String, b Option[BigDecimal]) where a id non nullable, b is nullable etc. There are no funky types involved and the standard Product Encoder seems to work for reading from JDBC into Datasets — Terry Dactyl
so if you change the class to MyCaseClass(a: String, b Option[Double]) it works? — abiratsis

abiratsis abiratsis · Accepted Answer · 2018-03-26T16:59:51

First I would fill null values with dfB.na.fill(0.0) then I would try the next solution:

case class MyCaseClass(id: String, cost: Option[BigDecimal])
var dfB = spark.createDataset(Seq(
  ("a", Option(12.45)),
  ("b", Option(null.asInstanceOf[Double])),
  ("c", Option(123.33)),
  ("d", Option(1.3444))
)).toDF("id", "cost")

dfB
  .na.fill(0.0)
  .withColumn("cost", col("cost").cast(DecimalType(38,18)))
  .as[MyCaseClass]
  .show()

First cast the column cost into DecimalType(38,18) explicitly then retrieve the dataset[MyCaseClass]. I believe the issue here was that the spark can't convert double to BigDecimal without specifying scale-precision explicitly therefore you need first to convert it into a specific decimal type and then use it as BigDecimal.

UPDATE: I slightly modified the previous code to make possible to handle members of type Option[BigDecimal] as well

Good luck

Spark SQL - Read csv into Dataset[T] where T is a case class of Option[BigDecimal] field

1 Answers