0
votes

We maintain a Hive data warehouse and use Spark SQL to run queries against the Hive database and generate reports. We are using Spark 1.6 in an AWS EMR environment and that has been working fine. I wanted to upgrade our environments to Spark 2.0, but I am getting a very strange casting error with date fields. Any existing table that contains a column with a DATE type throws java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date when queried in Spark 2.0.
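For background (this is context, not from the original post): the Parquet format encodes a DATE logical type as an int32 count of days since the Unix epoch, which is consistent with a raw Integer surfacing where a java.sql.Date is expected. A minimal Python sketch of that encoding:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def days_since_epoch_to_date(days):
    """Decode Parquet's int32 DATE representation into a calendar date."""
    return EPOCH + timedelta(days=days)

def date_to_days_since_epoch(d):
    """Encode a calendar date the way Parquet stores a DATE column."""
    return (d - EPOCH).days

print(days_since_epoch_to_date(0))  # 1970-01-01
```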

Here is a simplified example of a table you would find in our Hive database:

CREATE TABLE IF NOT EXISTS test.test_table ( column_1 STRING, column_2 STRING ) PARTITIONED BY (column_3 DATE) STORED AS PARQUET;

The query SELECT * FROM test.test_table LIMIT 5 fails with the above error in Spark 2.0 but works fine in Spark 1.6.

These tables are populated with the Spark 1.6 HiveContext using INSERT INTO syntax.

Has anyone seen this issue? Is there a config value that I need to set to get Spark 2.0 to work with DATE fields in Parquet format?

1
There seems to be a bug filed for this already: issues.apache.org/jira/browse/SPARK-17354 – farazZ

1 Answer

1
votes

In Spark 2.0.0 this fails in the VectorizedParquetRecordReader class. As a workaround, you can execute the command below before reading the data.

spark.sql("set spark.sql.parquet.enableVectorizedReader=false")
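If you would rather bake the workaround into session creation than run a SET statement per session, the same flag can be passed when building the SparkSession. A sketch in PySpark, assuming Spark 2.0's SparkSession API (the app name is an arbitrary placeholder):

```python
from pyspark.sql import SparkSession

# Disable the vectorized Parquet reader at session-creation time so that
# queries against DATE-partitioned Parquet tables fall back to the
# non-vectorized read path.
spark = (SparkSession.builder
         .appName("date-partition-workaround")  # placeholder name
         .config("spark.sql.parquet.enableVectorizedReader", "false")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("SELECT * FROM test.test_table LIMIT 5")
df.show()
```

Note that disabling the vectorized reader trades away some scan performance, so treat this as a stopgap until the fix tracked in SPARK-17354 is available in your EMR release.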