I want to run a simple SQL SELECT of timestamp fields from my data using Spark SQL (PySpark). However, all the timestamp fields come back as 1970-01-19 10:45:37.009, so it looks like I have some conversion incompatibility between the timestamp in Glue and the one in Spark.
I'm running with PySpark and have the Glue catalog configured, so I get my database schema from Glue. In both Glue and the Spark SQL DataFrame these columns appear as type timestamp.
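For reference, the session is set up roughly like this (simplified sketch; the app name is a placeholder, and the factory class is the standard one for using the Glue Data Catalog as the Hive metastore on EMR, the exact setup may differ in other environments):

```python
from pyspark.sql import SparkSession

# Sketch of the session setup: point the Hive metastore client at the
# AWS Glue Data Catalog so Spark SQL resolves tables from Glue.
spark = (
    SparkSession.builder
    .appName("glue-timestamp-test")  # placeholder app name
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)
```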
However, when I read the Parquet files directly from the S3 path, the event_time column (for example) is of type long, and its values are correct epochs in milliseconds, e.g. 1593938489000, so I can convert it myself and get the actual datetime.
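Concretely, this is the kind of manual conversion I mean (a sketch; the S3 path and the event_time_ts column name are just placeholders):

```python
from pyspark.sql import functions as F

# Placeholder S3 path; the real one is whatever the Glue table points to.
raw = spark.read.parquet("s3://my-bucket/my-table/")
raw.printSchema()  # event_time shows up as long here, not timestamp

# The long holds epoch milliseconds, so dividing by 1000 and casting to
# timestamp gives the expected datetime (e.g. 1593938489000 -> 2020-07-05).
converted = raw.withColumn(
    "event_time_ts", (F.col("event_time") / 1000).cast("timestamp")
)
converted.select("event_time", "event_time_ts").show(5, truncate=False)
```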
But when I run the query through spark.sql, the event_time column does get the timestamp type, yet the value is wrong and the precision is lost: every row comes back as 1970-01-19 10:45:37.009. When I run the same SQL query in Athena, the timestamp field looks fine, so my schema in Glue seems to be correct.
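The query itself is nothing special, roughly this (my_db and my_table are placeholders for the actual Glue database and table names):

```python
# This is where every row comes back as 1970-01-19 10:45:37.009 in Spark,
# while the same query in Athena returns the expected 2020 timestamps.
spark.sql("""
    SELECT event_time
    FROM my_db.my_table
    LIMIT 10
""").show(truncate=False)
```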
Is there a way to overcome this? I didn't manage to find any spark.sql configuration that solves it.