
My data is in a PySpark DataFrame ('pyspark.sql.dataframe.DataFrame'). One of the columns has a date-time stored in Twitter's string format.

I found a couple of solutions for python but there is no specific solution for pyspark.

This is how the column looks:

+------------------------------+----+
|created_at(string format)     |date|
+------------------------------+----+
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
|Tue Mar 26 02:29:54 +0000 2019|null|
+------------------------------+----+

I tried the following solution, but it didn't work:

date_df = df.select('created_at', from_unixtime(unix_timestamp('created_at', '%a %b %d %H:%M:%S %z %Y')).alias('date'))

I need to convert the column into a Spark datetime/timestamp type so I can perform other datetime and spark.sql operations on top of it.
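For context (an editorial note, not part of the original post): Spark's unix_timestamp and to_timestamp expect Java SimpleDateFormat patterns, whereas %a-style directives are Python strptime syntax, which is why the attempt above yields null. A rough token mapping for this string:

```python
from datetime import datetime

# Python strptime directives vs. Java SimpleDateFormat tokens for
# Twitter's created_at string "Tue Mar 26 02:29:54 +0000 2019":
#   Python:  %a   %b   %d   %H:%M:%S   %z   %Y
#   Java:    EEE  MMM  dd   HH:mm:ss   Z    yyyy
# The Python pattern only works in plain Python, not in Spark SQL:
dt = datetime.strptime('Tue Mar 26 02:29:54 +0000 2019',
                       '%a %b %d %H:%M:%S %z %Y')
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2019-03-26 02:29:54
```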

Can't you just use the Python solution with a PySpark UDF? – sramalingam24
@sramalingam24 A Python function is likely to return a Python datetime object, which isn't a compatible datatype in Spark, so I'm not sure it will work. – Harvey

4 Answers


I couldn't get any of the solutions provided working. Apparently the to_timestamp() function in pyspark.sql expects its input in a particular format, and all attempts to provide that format yielded no results, so using a UDF (user-defined function) was unavoidable. Note that the function cannot return a Python datetime, so it has to be a two-step process:

  1. A UDF that parses the date string, converts the format, and returns a string compatible with to_timestamp()
  2. Using to_timestamp() to convert the datatype in the Spark DataFrame

from datetime import datetime
import pytz
from pyspark.sql.functions import udf, to_utc_timestamp
from pyspark.sql.types import StringType

## Converting date string format
def getDate(x):
    if x is not None:
        return str(datetime.strptime(x,'%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=pytz.UTC).strftime("%Y-%m-%d %H:%M:%S"))
    else:
        return None

## UDF declaration
date_fn = udf(getDate, StringType())

## Converting datatype in spark dataframe
df = df.withColumn("created_at", to_utc_timestamp(date_fn("created_at"),"UTC")) 

Use a SimpleDateFormat pattern to build the timestamp format. Try the solution below; it worked for me.

from pyspark.sql.functions import col, from_unixtime, unix_timestamp

date_df = df.select('created_at', from_unixtime(
    unix_timestamp(col("created_at"),
                   "EEE MMM dd HH:mm:ss ZZZZ yyyy")).alias('date'))

In response to pardeep's answer, a slight modification did the trick for me:

unix_timestamp("created_at", "EEE MMM dd HH:mm:ss Z yyyy")
data_df=spark.createDataFrame([(1,'Mon Oct 05 23:18:25 -0700 2020'),(2,'Tue Oct 06 23:18:25 -0700 2020')], ['srno','created_at'])

data_df.show(2,False)


data_df.printSchema()


date_df = data_df.select('created_at', from_unixtime(unix_timestamp('created_at', 'EEE MMM d HH:mm:ss z yyyy')).alias('date'))

date_df.show(2,False)


We can also use the to_timestamp function in place of unix_timestamp and from_unixtime.

data_df.select('created_at', to_timestamp('created_at', 'EEE MMM d HH:mm:ss z yyyy').alias('date')).show(2,False)

data_df.select('created_at', from_unixtime(unix_timestamp('created_at', 'EEE MMM d HH:mm:ss z yyyy')).alias('date')).show(2,False)

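One caveat worth adding (based on Spark 3.x behavior, not stated in the answers above): Spark 3's new datetime parser does not allow day-of-week tokens such as EEE when parsing, so these patterns can fail with a SparkUpgradeException. Reverting to the legacy parser restores the Spark 2.x behavior, assuming an active SparkSession named spark:

```python
# Fall back to the Spark 2.x (SimpleDateFormat-based) datetime parser,
# which accepts 'EEE' when parsing; 'spark' is an active SparkSession.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```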