
I am getting the following error when trying to cast a column (read from a comma-separated CSV file with headers).

Here is the code I am using:

var df = spark.read.option("header","true").option("delimiter",",").csv("/user/sample/data")
df.withColumn("columnCast", expr("CAST(SaleAmount) AS LONG")).count

This causes the following exception to be thrown every time. I've tried different columns when casting and some throw while others do not. I've also tried the following which also throws the same exception.

df.withColumn("columnCast", expr("CAST(NULL) AS LONG")).count

java.lang.UnsupportedOperationException: empty.init
  at scala.collection.TraversableLike$class.init(TraversableLike.scala:451)
  at scala.collection.mutable.ArrayOps$ofInt.scala$collection$IndexedSeqOptimized$$super$init(ArrayOps.scala:234)
  at scala.collection.IndexedSeqOptimized$class.init(IndexedSeqOptimized.scala:135)
  at scala.collection.mutable.ArrayOps$ofInt.init(ArrayOps.scala:234)
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$7$$anonfun$11.apply(FunctionRegistry.scala:565)
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$7$$anonfun$11.apply(FunctionRegistry.scala:558)
  at scala.Option.getOrElse(Option.scala:121)

I have tried running this in both spark-shell and Zeppelin. The Spark version is 2.4.0.cloudera2, managed by Cloudera.

What is causing this behaviour? Is this intended? How do I handle this?

Comments:

Can you run spark-shell --version and add the output to your question? Also, can you run spark-shell and execute spark.catalog.listFunctions.count? What's the output? I think there's something wrong with the Spark environment and any query would simply fail. – Jacek Laskowski

That exception looks like SPARK-28521, which occurs in 2.4.3 and was resolved for 3.0.0 (not yet released). – Jacek Laskowski

1 Answer


You can use the Column.cast method to perform the cast instead:

import org.apache.spark.sql.functions.lit
import spark.implicits._

val df = spark.sparkContext.parallelize(1 to 10).toDF("col1")
val casted = df.withColumn("test", lit(null).cast("string"))
               .withColumn("testCast", $"test".cast("long"))
casted.show()
casted.printSchema()

Result:

+----+----+--------+
|col1|test|testCast|
+----+----+--------+
|   1|null|    null|
|   2|null|    null|
|   3|null|    null|
|   4|null|    null|
|   5|null|    null|
|   6|null|    null|
|   7|null|    null|
|   8|null|    null|
|   9|null|    null|
|  10|null|    null|
+----+----+--------+

root
 |-- col1: integer (nullable = false)
 |-- test: string (nullable = true)
 |-- testCast: long (nullable = true)
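As a side note, the expr-based version in the question is most likely failing because the SQL cast syntax is malformed: the target type belongs inside the parentheses, i.e. CAST(col AS LONG), not CAST(col) AS LONG. With the latter, Spark parses CAST(SaleAmount) as a call to the cast function with no target type, which is exactly the FunctionRegistry lookup that blows up in the stack trace (and is what SPARK-28521 improved the error message for). A sketch of the corrected expression, assuming the same SaleAmount column and df from the question:

```scala
import org.apache.spark.sql.functions.expr

// The target type goes inside the CAST(...) parentheses.
val fixed = df.withColumn("columnCast", expr("CAST(SaleAmount AS LONG)"))
fixed.printSchema()
```

This keeps the SQL-string style from the question; the Column.cast approach above is equivalent and avoids the parser entirely.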