We recently upgraded our server from CDH 5 to CDH 6. When we use Spark to insert data into TIMESTAMP columns of Parquet tables, the stored values now differ from what we saw before the upgrade.

CDH 5:

HIVE:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Hive, the value is '2019-01-30 00:00:00 0'

CDH 6:

HIVE:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Hive, the value is '2019-01-30 04:00:00'

IMPALA:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Impala, the value is '2019-01-30 04:00:00'

Please let me know if there are any Spark properties we can use. My primary goal is for the Hive value in CDH 6 to match the one in CDH 5 and, if possible, for a select from Impala to return '2019-01-30 00:00:00' as well.

Can you tell us which Spark version you were on before and which one you are on now? Also, as far as I know, Parquet stores timestamps in UTC, so this could be a presentation-layer adjustment. - mazaneicha
We are using Spark 2.3. - user1024962
Maybe you'll find this info useful: docs.cloudera.com/runtime/7.2.1/developing-spark-applications/…. By the way, I believe the default Spark version in CDH 6 is 2.4. - mazaneicha
Thanks for sending the link. That was for Impala. My main issue is the date mismatch in Hive; I'm not sure whether there is a setting that explains the difference between CDH 5 and CDH 6. - user1024962
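To illustrate the UTC point raised in the comments, here is a minimal sketch using only java.time (no Spark involved). It assumes the writing session runs at a fixed UTC-04:00 offset, which is a guess based on the 4-hour shift in the question and not something the question confirms; the object and method names are hypothetical. It shows how a midnight wall-clock value reads as 04:00 once the same instant is rendered in UTC.

```scala
import java.time.{LocalDateTime, ZoneId, ZoneOffset}

object TimestampShiftSketch {
  // Interpret a wall-clock value in the writer's zone,
  // then render the same instant as a UTC wall-clock value.
  def renderInUtc(wallClock: LocalDateTime, writerZone: ZoneId): LocalDateTime =
    wallClock.atZone(writerZone).withZoneSameInstant(ZoneOffset.UTC).toLocalDateTime

  def main(args: Array[String]): Unit = {
    // Assumption: the session that wrote the row was at UTC-04:00.
    val writerZone = ZoneOffset.ofHours(-4)
    val inserted = LocalDateTime.of(2019, 1, 30, 0, 0, 0)

    // Prints 2019-01-30T04:00
    println(renderInUtc(inserted, writerZone))
  }
}
```

If the shift you see matches your cluster's offset from UTC, that would suggest the stored instant is the same in both versions and only the conversion applied on read or write has changed.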

1 Answer

To avoid data type compatibility issues between Spark and Hive, the convention Spark uses to write Parquet data is configurable.

This is determined by the property spark.sql.parquet.writeLegacyFormat. The default value is false. If set to true, Spark will use the same convention as Hive for writing the Parquet data.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
    .builder()
    .appName("MyApp")
    .master("local[*]")
    // Change to a more reasonable default number of partitions for our data
    .config("spark.sql.shuffle.partitions", "200")
    // Write Parquet using the legacy (Hive-compatible) convention
    .config("spark.sql.parquet.writeLegacyFormat", true)
    .getOrCreate()