I have a parquet file with a timestamp column in this format: 2020-07-07 18:30:14.500000+00:00, written from pandas. When I read the same parquet file in Spark, it is read as 2020-07-08 00:00:14.5.
I want to convert this into an epoch timestamp in milliseconds, which is 1594146614500.
I have tried using the Java date-time format:
val dtformat = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
dtformat.parse(r2.getAs[Long]("date_time").toString).getTime
It converts, but to the wrong value (1594146614005) instead of 1594146614500, because the "SSS" pattern reads the fraction ".5" as 5 milliseconds rather than 500.
To make it correct I have to pad the fraction: dtformat.parse(r2.getAs[Long]("date_time").toString + "00").getTime.
Is there any other, cleaner approach than this?
Is there any function available in Spark to read it as milliseconds?
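For reference, a minimal sketch of the kind of thing I am after, assuming date_time is already read from the parquet file as a TimestampType column (casting a timestamp to a decimal keeps the fractional seconds):

import org.apache.spark.sql.functions._

// Sketch: a timestamp cast to decimal gives seconds with the fraction preserved,
// so multiplying by 1000 yields epoch milliseconds (e.g. 1594146614500).
val withEpoch = df.withColumn(
  "epoch_ms",
  (col("date_time").cast("decimal(20, 3)") * 1000).cast("long")
)

// On Spark 3.1+ the built-in SQL function unix_millis should do this in one step:
// df.withColumn("epoch_ms", expr("unix_millis(date_time)"))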
Update 1:
After using the answer below:
df.withColumn("timestamp", to_timestamp($"date_time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX"))
  .withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint"))
  .show()
+-------------+--------------------+-------------------+-------------+
|expected_time|       original_time|          timestamp|        epoch|
+-------------+--------------------+-------------------+-------------+
|1597763904500|2020-08-18 20:48:...|2020-08-18 20:48:24|1597763904000|
|1597763905000| 2020-08-18 20:48:25|2020-08-18 20:48:25|1597763905000|
|1597763905500|2020-08-18 20:48:...|2020-08-18 20:48:25|1597763905000|
+-------------+--------------------+-------------------+-------------+
The drawback: if the data is at 500 ms granularity, two different timestamps end up with the same epoch value, which is not expected.
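If the fraction is being dropped by to_timestamp with a format string (which, as far as I can tell, truncates to whole seconds on Spark 2.x), a plain cast seems to preserve it. A sketch, assuming original_time is still a string such as "2020-08-18 20:48:24.500000+00:00":

// Sketch: casting the string directly to timestamp keeps the fractional seconds,
// so rows 500 ms apart no longer collapse onto the same epoch value.
df.withColumn("timestamp", $"original_time".cast("timestamp"))
  .withColumn("epoch", ($"timestamp".cast("decimal(20, 3)") * 1000).cast("bigint"))
  .show(false)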
Comment: Don't use SimpleDateFormat. That class is notoriously troublesome and long outdated. Instead use LocalDateTime and DateTimeFormatter, both from java.time, the modern Java date and time API. Also, there is no way that SimpleDateFormat can parse 2020-07-08 00:00:14.5: it supports only milliseconds, exactly three decimals on the seconds. – Ole V.V.
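Following that comment, a minimal java.time sketch (assuming the wall-clock value represents UTC, as the +00:00 offset in the original pandas string suggests); appendFraction accepts anywhere from 0 to 9 fractional digits, so both ".5" and ".500000" parse:

import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.ChronoField

// Formatter that accepts 0-9 fractional digits after the seconds.
val fmt = new DateTimeFormatterBuilder()
  .appendPattern("yyyy-MM-dd HH:mm:ss")
  .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
  .toFormatter()

val epochMs = LocalDateTime
  .parse("2020-07-07 18:30:14.5", fmt)
  .toInstant(ZoneOffset.UTC)   // assuming the value is in UTC
  .toEpochMilli                // 1594146614500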