
I need to write a timestamp into parquet, then read it with Hive and Impala.

To write it, I tried e.g.:

my.select(
    ...,
    unix_timestamp() as "myts")
  .write
  .parquet(dir)

Then, to read it, I created an external table in Hive:

CREATE EXTERNAL TABLE IF NOT EXISTS mytable (
  ...
  myts TIMESTAMP
)
STORED AS PARQUET
LOCATION '...';

Doing so, I get the error

HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

I also tried replacing the unix_timestamp() with

to_utc_timestamp(lit("2018-05-06 20:30:00"), "UTC")

and got the same problem. In Impala, it returns:

 Column type: TIMESTAMP, Parquet schema: optional int64 

Whereas timestamps are supposed to be int96. What is the correct way to write a timestamp into parquet?

I found a related issue: jira.pentaho.com/browse/PDI-17275. I'm looking for a workaround. - Rolintocour
You could create the table first, then use SparkSQL to INSERT into that table, automatically managing the compatibility issues... No offense meant, but I believe the Spark code base would do a better job than your hasty "workaround". - Samson Scharfrichter
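
A minimal sketch of that suggested alternative, assuming the Hive table mytable already exists as a Parquet table and, for illustration only, that myts is its sole column (this code is not from the original thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

// Hive support is required so Spark can resolve the metastore table.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Insert through Spark SQL and let Spark handle the Hive/Parquet type mapping;
// column order must match the table definition.
spark.range(1)
  .select(current_timestamp().as("myts"))
  .write
  .mode("append")
  .insertInto("mytable")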

1 Answer


Found a workaround: a UDF that returns a java.sql.Timestamp object, with no casting; Spark then saves it as int96.
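
A minimal sketch of that workaround, reusing my and dir from the question and assuming the source column is a string named ts_string in "yyyy-MM-dd HH:mm:ss" format (the column name and the parsing are illustrative):

import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, udf}

// The UDF returns java.sql.Timestamp, so Spark types the column as TimestampType
// and, with the default settings, writes it to parquet as int96, which Hive and
// Impala read as TIMESTAMP.
val toTimestamp = udf((s: String) => Timestamp.valueOf(s))

my.select(
    // ..., other columns
    toTimestamp(col("ts_string")).as("myts"))
  .write
  .parquet(dir)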