3 votes

I have previously written a Dataset[T] to a csv file.

In this case T is a case class that contains field x: Option[BigDecimal]

When I attempt to load the file back into a Dataset[T] I see the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `x` from double to decimal(38,18) as it may truncate.

I guess the reason is that the inferred schema contains a double rather than a BigDecimal column. Is there a way around this issue? I wish to avoid casting based on column name because the read code is part of a generic function. My read code is below:

   val a = spark
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(file)
    .as[T]

My case classes reflect tables read from JDBC with Option[T] used to represent a nullable field. Option[BigDecimal] is used to receive a Decimal field from JDBC.
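
For illustration, the case classes are of roughly this shape (field names are made up):

    // Hypothetical example: mirrors a JDBC table with a non-nullable column (a)
    // and a nullable DECIMAL column (b); BigDecimal maps to decimal(38,18) by default.
    case class MyCaseClass(a: String, b: Option[BigDecimal])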

I have added some "pimped" helper code to read from / write to csv files when running on my local machine, so that I can easily inspect the contents.

So my next attempt was this:

   var df = spark
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .schema(implicitly[Encoder[T]].schema)
    .load(file)

  val schema = df.schema

  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._

  schema.foreach{ field =>
    field.dataType match {
      case t: DoubleType =>
        df = df.withColumn(field.name, 
          col(field.name).cast(DecimalType(38,18)))
      case _ => // do nothing
    }
  }

  df.as[T]

Unfortunately the resulting case class instances now contain all Nones rather than the expected values. If I just load the csv as a DataFrame with inferred types, all of the column values are correctly populated.

It looks like I actually have two issues (illustrated concretely in the sketch after this list).

  1. Conversion from Double -> BigDecimal.
  2. Nullable fields are not being wrapped in Options.
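
Roughly the comparison I have in mind, to show both issues concretely (illustrative only; spark and file are as in the snippets above, MyCaseClass as sketched earlier):

    import org.apache.spark.sql.Encoders

    // Schema inferred from the CSV: the decimal column comes back as a plain double.
    val inferred = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(file)
      .schema

    // Schema the encoder expects: decimal(38,18), nullable, surfaced as Option[BigDecimal].
    val expected = Encoders.product[MyCaseClass].schema

    println(inferred.treeString)
    println(expected.treeString)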

Any help/advice would be gratefully received. Happy to adjust my approach if easily writing/reading Options/BigDecimals from csv files is problematic.

hi @Terry, two questions: Is there any specific reason you don't want to cast to BigDecimal? Can you post some sample data? – abiratsis
I have updated my original post to provide more information. – Terry Dactyl
hello @Terry, can you post that case class as well? – abiratsis
The case class would look like MyCaseClass(a: String, b: Option[BigDecimal]) where a is non-nullable, b is nullable, etc. There are no funky types involved and the standard Product encoder seems to work for reading from JDBC into Datasets. – Terry Dactyl
so if you change the class to MyCaseClass(a: String, b: Option[Double]) it works? – abiratsis

1 Answer

1 vote

First I would fill the null values with dfB.na.fill(0.0), then I would try the next solution:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DecimalType
    import spark.implicits._   // for createDataset / toDF / as

    case class MyCaseClass(id: String, cost: Option[BigDecimal])

    val dfB = spark.createDataset(Seq(
      ("a", Option(12.45)),
      ("b", Option.empty[Double]),   // a genuinely missing cost (null in the DataFrame)
      ("c", Option(123.33)),
      ("d", Option(1.3444))
    )).toDF("id", "cost")

    dfB
      .na.fill(0.0)                                               // replace nulls with 0.0
      .withColumn("cost", col("cost").cast(DecimalType(38,18)))   // double -> decimal(38,18)
      .as[MyCaseClass]
      .show()

First cast the column cost to DecimalType(38,18) explicitly, then retrieve the Dataset[MyCaseClass]. I believe the issue here is that Spark can't up-cast a double to a BigDecimal without the scale and precision being specified explicitly, so you first need to convert the column to a specific decimal type and can then use it as a BigDecimal.
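
Since the question mentions that the read code is part of a generic function, the same cast can be driven by the target encoder's schema instead of a hard-coded column name. A rough sketch of that idea (just a sketch, untested; it assumes the CSV header contains every field of T):

    import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
    import org.apache.spark.sql.functions.col

    // Cast every column to the type declared in T's encoder schema, so BigDecimal
    // fields end up as decimal(38,18) before calling .as[T].
    def readCsvAs[T: Encoder](spark: SparkSession, path: String): Dataset[T] = {
      val target = implicitly[Encoder[T]].schema
      val raw = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(path)
      target.fields
        .foldLeft(raw)((df, f) => df.withColumn(f.name, col(f.name).cast(f.dataType)))
        .as[T]
    }

    // usage: val ds = readCsvAs[MyCaseClass](spark, "/path/to/file.csv")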

UPDATE: I slightly modified the previous code to make it possible to handle members of type Option[BigDecimal] as well.
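
For instance, one way to keep the missing value as a None instead of filling it with 0.0 (a sketch along the same lines, reusing MyCaseClass and the imports above):

    // Skip na.fill: the null cost survives the cast and the encoder turns it into None.
    val dfC = spark.createDataset(Seq(
      ("a", Option(12.45)),
      ("b", Option.empty[Double]),   // null in the DataFrame
      ("c", Option(123.33))
    )).toDF("id", "cost")

    dfC
      .withColumn("cost", col("cost").cast(DecimalType(38,18)))
      .as[MyCaseClass]               // row "b" becomes MyCaseClass("b", None)
      .show()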

Good luck