13
votes

I'm extracting data from MySQL/MariaDB, and during the creation of a Dataset an error occurs with the data types:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast AMOUNT from decimal(30,6) to decimal(38,18) as it may truncate The type path of the target object is: - field (class: "org.apache.spark.sql.types.Decimal", name: "AMOUNT") - root class: "com.misp.spark.Deal" You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;

The case class is defined like this:

import org.apache.spark.sql.types.Decimal

case class Deal
(
  AMOUNT: Decimal
)
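
(For context, the read looks roughly like this; this is only a sketch, and the SparkSession setup, JDBC URL, table name and credentials are placeholders rather than the real ones.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("deals").getOrCreate()
import spark.implicits._

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")   // placeholder connection string
  .option("dbtable", "DEAL")                           // placeholder table name
  .option("user", "...")
  .option("password", "...")
  .load()

// This is the call that throws the AnalysisException above: AMOUNT comes back
// as decimal(30,6) from MariaDB, but the encoder for Deal expects decimal(38,18)
val deals = df.as[Deal]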

Does anyone know how to fix this without touching the database?


5 Answers

15
votes

That error says that Apache Spark can't automatically convert BigDecimal(30,6) from the database to the BigDecimal(38,18) wanted in the Dataset (I don't know why it needs the fixed parameters 38,18, and it is even stranger that Spark can't automatically convert a type with low precision to a type with higher precision).

A bug was reported for this: https://issues.apache.org/jira/browse/SPARK-20162 (maybe it was you). Anyway, I found a good workaround for reading the data: cast the columns to DecimalType(38,18) in the DataFrame, then cast the DataFrame to the Dataset.

// first read the data into a DataFrame in whatever way suits you
import org.apache.spark.sql.DataFrame
var df: DataFrame = ???
val dfSchema = df.schema

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType

// cast every decimal column that is not already (38,18) up to (38,18)
dfSchema.foreach { field =>
  field.dataType match {
    case t: DecimalType if t != DecimalType(38, 18) =>
      df = df.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => // leave non-decimal columns untouched (without this case the match throws a MatchError)
  }
}
df.as[YourCaseClassWithBigDecimal]

This should solve the problem for reading (but probably not for writing, I guess).
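
If you want to double-check the result before the .as call, printing the schema should now show every decimal column as (38,18); a quick illustrative check:

df.printSchema()
// e.g.
// root
//  |-- AMOUNT: decimal(38,18) (nullable = true)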

4
votes

As was previously stated, your DB uses DecimalType(30,6), which means you have 30 slots total and 6 slots past the decimal point, leaving 30-6=24 for the area in front of the decimal point. I like to call it a (24 left, 6 right) big-decimal. This of course does not fit into a (20 left, 18 right) decimal (i.e. DecimalType(38,18)), since the latter only has 20 left slots while 24 are needed to accommodate your DecimalType(30,6).

What we can do here is down-cast the (24 left, 6 right) into a (20 left, 6 right) (i.e. DecimalType(26,6)) so that when it is auto-cast to a (20 left, 18 right) (i.e. DecimalType(38,18)), both sides fit. Your DecimalType(26,6) has 20 left slots, allowing it to fit inside a DecimalType(38,18), and of course its 6 right slots fit into the 18.

The way you do that is, before converting anything to a Dataset, to run the following operation on the DataFrame:

import org.apache.spark.sql.types.DecimalType
import spark.implicits._   // "spark" being your SparkSession; needed for the $"..." column syntax

val downCastableData =
  originalData.withColumn("amount", $"amount".cast(DecimalType(26, 6)))

Then converting to a Dataset should work.
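
For example, a sketch building directly on the snippet above, assuming a SparkSession in scope as spark and a hypothetical case class whose field name matches the lower-case amount column:

import org.apache.spark.sql.types.Decimal
import spark.implicits._

case class Deal(amount: Decimal)

// (26,6) fits inside the (38,18) that the Deal encoder asks for, so the up-cast succeeds
val deals = downCastableData.as[Deal]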

(Actually, you can cast to anything that's (20 left, 6 right) or less, e.g. (19 left, 5 right), etc.)

3
votes

While I don't have a solution, here is my understanding of what is going on:

By default, Spark will infer the schema of a Decimal (or BigDecimal) field in a case class to be DecimalType(38, 18) (see org.apache.spark.sql.types.DecimalType.SYSTEM_DEFAULT). The 38 means the decimal can hold 38 digits in total (both left and right of the decimal point), while the 18 means 18 of those 38 digits are reserved for the right of the decimal point. That means a Decimal(38, 18) may have 20 digits to the left of the decimal point. Your MySQL schema is decimal(30, 6), which means it may contain values with 24 digits (30 - 6) to the left of the decimal point and 6 digits to the right. Since 24 digits is greater than 20 digits, there could be values that are truncated when converting from your MySQL schema to that Decimal type.
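
You can see this default without touching any data by inspecting the encoder Spark derives for a case class; a small sketch (the Deal class here is just an illustration):

import org.apache.spark.sql.Encoders

case class Deal(AMOUNT: BigDecimal)   // no way to say "decimal(30,6)" here

// the derived schema uses DecimalType.SYSTEM_DEFAULT, i.e. decimal(38,18)
println(Encoders.product[Deal].schema)
// prints something like: StructType(StructField(AMOUNT,DecimalType(38,18),true))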

Unfortunately, inferring the schema from a Scala case class is considered a convenience by the Spark developers, and they have chosen not to support letting the programmer specify the precision and scale for Decimal or BigDecimal fields within the case class (see https://issues.apache.org/jira/browse/SPARK-18484).

2
votes

Building on @user2737635's answer, you can use a foldLeft rather than a foreach to avoid defining your DataFrame as a var and redefining it:

// first read the data into a DataFrame in whatever way suits you
import org.apache.spark.sql.DataFrame
val df: DataFrame = ???
val dfSchema = df.schema

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DecimalType

dfSchema.foldLeft(df) { (dataframe, field) =>
  field.dataType match {
    case t: DecimalType if t != DecimalType(38, 18) =>
      dataframe.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => dataframe
  }
}.as[YourCaseClassWithBigDecimal]

-2
votes

According to the pyspark documentation, DecimalType(38,18) is the default when a schema is inferred:

When creating a DecimalType, the default precision and scale is (10, 0). When inferring a schema from decimal.Decimal objects, it will be DecimalType(38, 18).
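
The same defaults are visible on the Scala/JVM side as constants on DecimalType (illustrative):

import org.apache.spark.sql.types.DecimalType

// default precision/scale when none are specified: (10, 0)
println(DecimalType.USER_DEFAULT)     // DecimalType(10,0)
// default used when inferring a schema for BigDecimal / decimal.Decimal fields: (38, 18)
println(DecimalType.SYSTEM_DEFAULT)   // DecimalType(38,18)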