I want to run a simple SQL SELECT of timestamp fields from my data using Spark SQL (PySpark). However, all the timestamp fields come back as 1970-01-19 10:45:37.009, so it looks like I have some conversion incompatibility between the timestamp in Glue and the one in Spark.
I'm running with PySpark and have the Glue catalog configured, so I get my database schema from Glue. In both Glue and the Spark SQL DataFrame these columns appear as type timestamp.
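For reference, the session is set up roughly like this (simplified sketch; the app name is a placeholder, and the factory class is the standard one for using the Glue Data Catalog as the Hive metastore on EMR, the exact setup may differ in other environments):

```python
from pyspark.sql import SparkSession

# Sketch of the session setup: point the Hive metastore client at the
# AWS Glue Data Catalog so Spark SQL resolves tables from Glue.
spark = (
    SparkSession.builder
    .appName("glue-timestamp-test")  # placeholder app name
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)
```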
However, when I read the Parquet files directly from the S3 path, the event_time column (for example) is of type long, and its values are correct epochs in milliseconds, e.g. 1593938489000, so I can convert it myself and get the actual datetime.
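Concretely, this is the kind of manual conversion I mean (a sketch; the S3 path and the event_time_ts column name are just placeholders):

```python
from pyspark.sql import functions as F

# Placeholder S3 path; the real one is whatever the Glue table points to.
raw = spark.read.parquet("s3://my-bucket/my-table/")
raw.printSchema()  # event_time shows up as long here, not timestamp

# The long holds epoch milliseconds, so dividing by 1000 and casting to
# timestamp gives the expected datetime (e.g. 1593938489000 -> 2020-07-05).
converted = raw.withColumn(
    "event_time_ts", (F.col("event_time") / 1000).cast("timestamp")
)
converted.select("event_time", "event_time_ts").show(5, truncate=False)
```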
But when I run the query through spark.sql, the event_time column does get the timestamp type, yet the value is wrong and the precision is lost: every row comes back as 1970-01-19 10:45:37.009. When I run the same SQL query in Athena, the timestamp field looks fine, so my schema in Glue seems to be correct.
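The query itself is nothing special, roughly this (my_db and my_table are placeholders for the actual Glue database and table names):

```python
# This is where every row comes back as 1970-01-19 10:45:37.009 in Spark,
# while the same query in Athena returns the expected 2020 timestamps.
spark.sql("""
    SELECT event_time
    FROM my_db.my_table
    LIMIT 10
""").show(truncate=False)
```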
Is there a way to overcome this? I didn't manage to find any spark.sql configuration that solves it.