1
votes

I have a decimal column "TOT_AMT" defined as type "bytes" with logical type "decimal" in my Avro schema.

After creating the DataFrame in Spark using databricks spark-avro, when I try to sum the TOT_AMT column using the sum function, it throws a "Function sum requires numeric types, not BinaryType" error.

The column is defined like this in the Avro schema:

name="TOT_AMT","type":["null",{ "type":"bytes","logicaltype":"decimal","precision":20,"scale":10}]

I am creating the DataFrame and summing it up like this:

val df = sqlContext.read.format("com.databricks.spark.avro").load("input dir")
df.agg(sum("TOT_AMT")).show()

It seems that the decimal value is read as BinaryType while creating the DataFrame. In that case, how can we perform numeric operations on such decimal columns? Would it be possible to convert this byte array to a BigDecimal and then perform calculations?

Can you provide schematic code or an overview of your data? Especially the state of your current RDD before the reduction could be important. Explicit typecasting will most likely do the trick. – dennlinger

1 Answer

0
votes

According to Supported types for Avro -> Spark SQL conversion, the Avro bytes type is converted to Spark SQL's BinaryType (see also the code).

According to the source code, you can define your own custom schema using the avroSchema option, e.g.

spark.read
  .format("com.databricks.spark.avro")
  .option("avroSchema", yourSchemaHere)
  .load("input dir")
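
Here, yourSchemaHere would be the full Avro reader schema passed as a JSON string. A minimal sketch of what that string might look like for the field in the question (the record name InputRecord is made up, and whether the reader actually honours the decimal logical type depends on your spark-avro version):

val yourSchemaHere =
  """{
    |  "type": "record",
    |  "name": "InputRecord",
    |  "fields": [
    |    {"name": "TOT_AMT",
    |     "type": ["null", {"type": "bytes", "logicalType": "decimal", "precision": 20, "scale": 10}]}
    |  ]
    |}""".stripMargin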

That gives you a way to specify the mapping from BinaryType to Decimal.

You can also use the cast function to cast a binary value to its decimal form.
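
If that cast is not accepted on a binary column in your Spark version, another option (the one hinted at in the question) is to rebuild the BigDecimal yourself: Avro's decimal logical type stores the unscaled value as a big-endian two's-complement byte array, so a small UDF can reconstruct it using the scale from the schema. A rough sketch, assuming df is the DataFrame loaded as in the question; binaryToDecimal and TOT_AMT_DEC are made-up names, and the scale of 10 comes from the question's schema:

import org.apache.spark.sql.functions.{col, sum, udf}

// Rebuild the decimal from Avro's unscaled, big-endian two's-complement bytes.
// Scale 10 is taken from the schema above; adjust it to match yours.
val binaryToDecimal = udf { bytes: Array[Byte] =>
  if (bytes == null) null else BigDecimal(BigInt(bytes), 10)
}

val withDecimal = df.withColumn("TOT_AMT_DEC", binaryToDecimal(col("TOT_AMT")))
withDecimal.agg(sum("TOT_AMT_DEC")).show()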

p.s. I don't know whether the reader supports logicalType hints defined in an Avro schema. It'd be nice to have such a feature if it's not available currently.