2 votes

How can I create a Spark Dataset with a BigDecimal at a given precision? See the following example in the spark shell. You will see that I can create a DataFrame with my desired BigDecimal precision, but cannot then convert it to a Dataset.

scala> import scala.collection.JavaConverters._
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
scala> case class BD(dec: BigDecimal)
scala> val schema = StructType(Seq(StructField("dec", DecimalType(38, 0))))
scala> val highPrecisionDf = spark.createDataFrame(List(Seq(BigDecimal("12345678901122334455667788990011122233"))).map(a => Row.fromSeq(a)).asJava, schema)
highPrecisionDf: org.apache.spark.sql.DataFrame = [dec: decimal(38,0)]
scala> highPrecisionDf.as[BD]
org.apache.spark.sql.AnalysisException: Cannot up cast `dec` from decimal(38,0) to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "scala.math.BigDecimal", name: "dec")
- root class: "BD"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;

Similarly, I am unable to create a Dataset from a case class where I've used a higher-precision BigDecimal.

scala> List(BD(BigDecimal("12345678901122334455667788990011122233"))).toDS.show()
+----+
| dec|
+----+
|null|
+----+

Is there any way to create a Dataset containing a BigDecimal field with precision different to the default decimal(38,18)?


2 Answers

3 votes

By default, Spark infers the schema for a Decimal type (or BigDecimal) field in a case class to be DecimalType(38, 18) (see org.apache.spark.sql.types.DecimalType.SYSTEM_DEFAULT).
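
For reference, you can confirm this default in the spark-shell (the exact output shape may vary by Spark version):

scala> org.apache.spark.sql.types.DecimalType.SYSTEM_DEFAULT
res0: org.apache.spark.sql.types.DecimalType = DecimalType(38,18)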

The workaround is to convert the Dataset to a DataFrame and cast the column, as shown below.

case class TestClass(id: String, money: BigDecimal)

import spark.implicits._  // needed for the case class encoder outside the shell
val testDs = spark.createDataset(Seq(
  TestClass("1", BigDecimal("22.50")),
  TestClass("2", BigDecimal("500.66"))
))

testDs.printSchema()

root
 |-- id: string (nullable = true)
 |-- money: decimal(38,18) (nullable = true)

Workaround

import org.apache.spark.sql.types.DecimalType
val testDf = testDs.toDF()

testDf
  .withColumn("money", testDf("money").cast(DecimalType(10,2)))
  .printSchema()

root
 |-- id: string (nullable = true)
 |-- money: decimal(10,2) (nullable = true)

You can check this link for finer details: https://issues.apache.org/jira/browse/SPARK-18484
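
If the end goal is a Dataset rather than a DataFrame, the following sketch (building on testDf above; not something the answer itself states) may also work, since decimal(10,2) is strictly narrower than the encoder's default decimal(38,18) and so should pass Spark's upcast check:

// Sketch: re-apply the case class encoder after casting to the narrower decimal
val castDs = testDf
  .withColumn("money", testDf("money").cast(DecimalType(10, 2)))
  .as[TestClass]

castDs.printSchema()  // money should now show as decimal(10,2)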

0 votes

One workaround I found is to use a String in the Dataset instead, to maintain precision. This works provided you don't need to use the values as numbers (e.g. for ordering or maths). If you do need that, you can turn the Dataset back into a DataFrame, cast to the appropriate high-precision type, and convert back to your Dataset afterwards (see the sketch after the example below).

// schema and imports are those defined in the question (DecimalType(38, 0))
val highPrecisionDf = spark.createDataFrame(
  List(Seq(BigDecimal("12345678901122334455667788990011122233"))).map(a => Row.fromSeq(a)).asJava,
  schema)
case class StringDecimal(dec: String)
highPrecisionDf.as[StringDecimal]
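
And a minimal sketch of the round trip mentioned above, assuming you want to order by the numeric value and then return to the String-backed Dataset (the col helper and the DecimalType(38, 0) cast are my choices for illustration, not part of the original answer; toDS relies on spark.implicits._ as in the shell):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// A Dataset built from the string-backed case class
val stringDs = List(StringDecimal("12345678901122334455667788990011122233")).toDS()

val ordered = stringDs.toDF()
  .withColumn("dec", col("dec").cast(DecimalType(38, 0)))  // high-precision numeric type for the ordering
  .orderBy(col("dec"))
  .withColumn("dec", col("dec").cast("string"))            // back to String for the Dataset
  .as[StringDecimal]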