
How can I convert a timestamp in the format 2019-08-22T23:57:57-07:00 into unixtime using Spark SQL or PySpark?

The closest function I know of is unix_timestamp(), but it doesn't accept the above format with a UTC offset.

Any suggestions on how I could approach this, preferably using Spark SQL or PySpark?

Thanks


1 Answer


The Java SimpleDateFormat pattern for an ISO 8601 time zone is XXX in this case.

So you need to use yyyy-MM-dd'T'HH:mm:ssXXX as your format string.

Spark SQL

spark.sql(
    """select unix_timestamp("2019-08-22T23:57:57-07:00", "yyyy-MM-dd'T'HH:mm:ssXXX") 
    AS epoch"""
).show(truncate=False)
#+----------+
#|epoch     |
#+----------+
#|1566543477|
#+----------+
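
In practice the timestamps usually live in a column rather than a string literal. A minimal sketch of the same query run against a registered temp view (the one-row DataFrame, the ts column, and the events view name are just for illustration) should print the same epoch as above:

# illustrative one-row DataFrame; register it so Spark SQL can query it
events = spark.createDataFrame([("2019-08-22T23:57:57-07:00",)], ["ts"])
events.createOrReplaceTempView("events")

spark.sql(
    """select ts,
    unix_timestamp(ts, "yyyy-MM-dd'T'HH:mm:ssXXX") AS epoch
    from events"""
).show(truncate=False)
#+-------------------------+----------+
#|ts                       |epoch     |
#+-------------------------+----------+
#|2019-08-22T23:57:57-07:00|1566543477|
#+-------------------------+----------+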

Spark DataFrame

from pyspark.sql.functions import unix_timestamp

df = spark.createDataFrame([("2019-08-22T23:57:57-07:00",)], ["timestamp"])
df.withColumn(
    "unixtime", 
    unix_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ssXXX")
).show(truncate=False)
#+-------------------------+----------+
#|timestamp                |unixtime  |
#+-------------------------+----------+
#|2019-08-22T23:57:57-07:00|1566543477|
#+-------------------------+----------+
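
If you also want a proper timestamp column along the way, one alternative sketch (not the only way) is to parse with to_timestamp and then cast to long, which yields epoch seconds:

from pyspark.sql.functions import to_timestamp

# parse the string into a TimestampType, then cast to long (epoch seconds)
df.withColumn(
    "unixtime",
    to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ssXXX").cast("long")
).show(truncate=False)
#+-------------------------+----------+
#|timestamp                |unixtime  |
#+-------------------------+----------+
#|2019-08-22T23:57:57-07:00|1566543477|
#+-------------------------+----------+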

Note that PySpark is mostly a wrapper around the JVM implementation of Spark; I've generally found the Scala/Java docs more complete than the Python ones, so they may be helpful in the future.