I have a parquet file with a timestamp column in the format 2020-07-07 18:30:14.500000+00:00, written from pandas. When I read the same parquet file in Spark, it comes back as 2020-07-08 00:00:14.5.
I want to convert this into an epoch timestamp in milliseconds, which would be 1594146614500.
I have tried using the Java date/time formatting classes:
val dtformat = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
dtformat.parse(r2.getAs[Long]("date_time").toString).getTime
It converts, but to the wrong value (1594146614005) instead of 1594146614500.
To make it correct I have to append two zeros: dtformat.parse(r2.getAs[Long]("date_time").toString + "00").getTime.
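To illustrate where the stray 005 comes from, here is a standalone check outside Spark (my own sketch): SSS treats the digits after the dot as a plain millisecond count, so .5 is read as 5 ms rather than 500 ms.

val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
fmt.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
fmt.parse("2020-07-07 18:30:14.5").getTime    // 1594146614005
fmt.parse("2020-07-07 18:30:14.500").getTime  // 1594146614500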
Is there any cleaner approach than this?
Is there any function available in Spark to read the timestamp directly as milliseconds?
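For example, something along these lines is what I had in mind (an untested sketch, assuming the parquet column comes back as a TimestampType named date_time):

import org.apache.spark.sql.functions.col

// Casting a TimestampType to double gives fractional epoch seconds,
// so scaling by 1000 should keep the .5 as 500 ms.
val withEpochMs = df.withColumn(
  "epoch_ms",
  (col("date_time").cast("double") * 1000).cast("long")
)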
Update 1:
After applying the suggested answer:
df.withColumn("timestamp", to_timestamp($"date_time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX")).withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint")).show()
+-------------+--------------------+-------------------+-------------+
|expected_time| original_time| timestamp| epoch|
+-------------+--------------------+-------------------+-------------+
|1597763904500|2020-08-18 20:48:...|2020-08-18 20:48:24|1597763904000|
|1597763905000| 2020-08-18 20:48:25|2020-08-18 20:48:25|1597763905000|
|1597763905500|2020-08-18 20:48:...|2020-08-18 20:48:25|1597763905000|
+-------------+--------------------+-------------------+-------------+
The drawback is that if the data is at 500 ms granularity, two different timestamps end up with the same epoch value, which is not what I expect.
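One direction that seems to keep the half second is to skip the string round-trip and take the milliseconds from the timestamp column itself (a sketch; unix_millis assumes Spark 3.1+ and that original_time is still a TimestampType):

import org.apache.spark.sql.functions.{col, unix_millis}

// unix_millis keeps millisecond precision, so 20:48:24.5 and 20:48:25.0
// map to distinct epoch values instead of collapsing to the same second.
val withDistinctEpoch = df.withColumn("epoch", unix_millis(col("original_time")))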
Don't use SimpleDateFormat. That class is notoriously troublesome and long outdated. Instead use LocalDateTime and DateTimeFormatter, both from java.time, the modern Java date and time API. Also, there is no way that SimpleDateFormat can parse 2020-07-08 00:00:14.5. It supports only milliseconds, exactly three decimals on the seconds. - Ole V.V.
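Following that suggestion, a java.time version that parses the original pandas string directly would look roughly like this (a standalone sketch, not tied to Spark):

import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

// Pattern for the pandas-style value: space separator, microseconds, explicit offset.
val fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSSSSSxxx")
val odt = OffsetDateTime.parse("2020-07-07 18:30:14.500000+00:00", fmt)
val epochMs = odt.toInstant.toEpochMilli   // 1594146614500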