
I need to write a timestamp into parquet, then read it with Hive and Impala.

To write it, I tried e.g.:

my.select(
    ...,
    unix_timestamp() as "myts")
  .write
  .parquet(dir)

Then, to read it, I created an external table in Hive:

CREATE EXTERNAL TABLE IF NOT EXISTS mytable (
  ...
  myts TIMESTAMP
)
STORED AS PARQUET
LOCATION '...';

Doing so, I get the error

HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

I also tried replacing the unix_timestamp() with

to_utc_timestamp(lit("2018-05-06 20:30:00"), "UTC")

and got the same problem. In Impala, it returns:

 Column type: TIMESTAMP, Parquet schema: optional int64 

Whereas timestamps are supposed to be int96. What is the correct way to write a timestamp into parquet?

I found a related issue: jira.pentaho.com/browse/PDI-17275. I'm looking for a workaround. - Rolintocour
You could create the table first, then use SparkSQL to INSERT into that table, automatically managing the compatibility issues... No offense meant, but I believe the Spark code base would do a better job than your hasty "workaround". - Samson Scharfrichter
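
A minimal sketch of that suggested alternative, assuming the Hive table mytable already exists as a Parquet table and, for illustration only, that myts is its sole column (this code is not from the original thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

// Hive support is required so Spark can resolve the metastore table.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Insert through Spark SQL and let Spark handle the Hive/Parquet type mapping;
// column order must match the table definition.
spark.range(1)
  .select(current_timestamp().as("myts"))
  .write
  .mode("append")
  .insertInto("mytable")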

1 Answer


Found a workaround: a UDF that returns a java.sql.Timestamp object, with no casting; Spark then saves it as int96.
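
A minimal sketch of that workaround, reusing my and dir from the question and assuming the source column is a string named ts_string in "yyyy-MM-dd HH:mm:ss" format (the column name and the parsing are illustrative):

import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, udf}

// The UDF returns java.sql.Timestamp, so Spark types the column as TimestampType
// and, with the default settings, writes it to parquet as int96, which Hive and
// Impala read as TIMESTAMP.
val toTimestamp = udf((s: String) => Timestamp.valueOf(s))

my.select(
    // ..., other columns
    toTimestamp(col("ts_string")).as("myts"))
  .write
  .parquet(dir)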