
I have a parquet file with a TimeStamp column in this format, 2020-07-07 18:30:14.500000+00:00, written from pandas. When I read the same parquet file in Spark, it is read as 2020-07-08 00:00:14.5.

I want to convert this into an epoch timestamp in milliseconds, which is 1594146614500.

I have tried using the Java datetime format:

val dtformat = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
dtformat.parse(r2.getAs[Long]("date_time").toString).getTime

It converts, but to the wrong value (1594146614005) instead of 1594146614500.

To make it correct I have to pad the fraction: dtformat.parse(r2.getAs[Long]("date_time").toString + "00").getTime. Is there any other, cleaner approach than this?
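The misparse can be reproduced directly: SimpleDateFormat reads the digits after the dot as a plain millisecond count, so ".5" becomes 5 ms rather than 500 ms, which is why padding with "00" "fixes" it. A minimal sketch (the class name Demo is illustrative; the time zone is pinned to UTC so the numbers match the question):

```java
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class Demo {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        // ".5" is parsed as 5 milliseconds, not 500
        System.out.println(f.parse("2020-07-07 18:30:14.5").getTime());   // 1594146614005
        // padding the fraction to three digits gives the intended value
        System.out.println(f.parse("2020-07-07 18:30:14.500").getTime()); // 1594146614500
    }
}
```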

Is there any function available in Spark to read it as milliseconds?

Update 1:

After using the answer below:

df.withColumn("timestamp", to_timestamp($"date_time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX"))
  .withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint"))
  .show()

+-------------+--------------------+-------------------+-------------+
|expected_time|       original_time|          timestamp|        epoch|
+-------------+--------------------+-------------------+-------------+
|1597763904500|2020-08-18 20:48:...|2020-08-18 20:48:24|1597763904000|
|1597763905000| 2020-08-18 20:48:25|2020-08-18 20:48:25|1597763905000|
|1597763905500|2020-08-18 20:48:...|2020-08-18 20:48:25|1597763905000|
+-------------+--------------------+-------------------+-------------+

The drawback: if the data is at 500 ms granularity, the parsed timestamp column loses its fractional part, so two distinct timestamps end up with the same epoch value, which is not expected.
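One likely reason the fraction vanishes is that a fixed six-digit fraction pattern only matches rows that carry exactly six fractional digits; rows without them fall back to whole seconds. In plain java.time, one way to keep the fraction while tolerating values with and without it is an optional fraction field via DateTimeFormatterBuilder. This is a sketch, not the answer's exact method (the class name OptionalFraction is illustrative):

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

public class OptionalFraction {
    public static void main(String[] args) {
        // appendFraction(field, 0, 9, true): 0 to 9 fractional digits,
        // preceded by a decimal point when present, absent otherwise
        DateTimeFormatter f = new DateTimeFormatterBuilder()
                .appendPattern("uuuu-MM-dd HH:mm:ss")
                .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
                .appendPattern("XXX")
                .toFormatter();
        // value with a fraction keeps its 500 ms
        System.out.println(OffsetDateTime.parse("2020-07-07 18:30:14.500000+00:00", f)
                .toInstant().toEpochMilli()); // 1594146614500
        // value without a fraction still parses
        System.out.println(OffsetDateTime.parse("2020-07-07 18:30:14+00:00", f)
                .toInstant().toEpochMilli()); // 1594146614000
    }
}
```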

I recommend you don't use SimpleDateFormat. That class is notoriously troublesome and long outdated. Instead use LocalDateTime and DateTimeFormatter, both from java.time, the modern Java date and time API. Also, there is no way that SimpleDateFormat can parse 2020-07-08 00:00:14.5: it supports only milliseconds, exactly three decimals on the seconds. – Ole V.V.

2 Answers


I recommend you switch from the outdated, error-prone date/time API in java.util and the corresponding formatting API (java.text.SimpleDateFormat) to the modern date/time API in java.time and its formatting API (java.time.format). Learn more about the modern date-time API from Trail: Date Time.

import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

public class Main {
    public static void main(String[] args) {
        // SSSSSS parses the six-digit fraction; ZZZZZ parses the +00:00 offset
        OffsetDateTime odt = OffsetDateTime.parse("2020-07-07 18:30:14.500000+00:00",
                DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSSSSSZZZZZ"));
        System.out.println(odt.toInstant().toEpochMilli());
    }
}

Output:

1594146614500

With the Spark DataFrame functions:

df.withColumn("timestamp", to_timestamp($"time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX"))
  .withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint"))
  .show(false)

+--------------------------------+---------------------+-------------+
|time                            |timestamp            |epoch        |
+--------------------------------+---------------------+-------------+
|2020-07-07 18:30:14.500000+00:00|2020-07-07 18:30:14.5|1594146614500|
+--------------------------------+---------------------+-------------+

This is also a possible way to do it.