Using Pyspark to convert column from string to timestamp

Question

I have pyspark dataframe with 2 columns (Violation_Time, Time_First_Observed) which are captured as strings. Sample of data is below, where there it is captured as HHmm with "A" or "P" representing am or pm. Also, the data has error where some entries exceed 24HH.

Violation_Time          Time_First_Observed
0830A                   1600P
1450P                   0720A
1630P                   2540P
0900A                   0100A

I would like to use pyspark to remove the "A" and "P" for both columns and subsequently convert the data (e.g., 0800, 1930 etc) into a timestamp for analysis purposes. I have tried to do this for the "Violation_Time" column and create a new column "timestamp" to store this (see code below). However, I can't seem to be able to do it. Any form of help is appreciate, thank you.

sparkdf3.withColumn('timestamp',F.to_timestamp("Violation_Time", "HH"))
sparkdf3.select(['Violation_Time','timestamp']).show()

rock321987 rock321987 · Accepted Answer · 2019-11-07T06:07:07

You can use the following

sparkdf3 = sparkdf3.withColumn('timestamp', func.split(func.to_timestamp('Violation_Time', 'HHmm'), ' ').getItem(1))
sparkdf3.select(['Violation_Time','timestamp']).show()

Explanation

sparkdf3.withColumn('timestamp', 
   func.split(
      func.to_timestamp('Violation_Time', 'HHmm') #Convert to timestamp. It will convert in datetime format
       , ' '
    ).getItem(1) #Split on space and get first item
)

Using Pyspark to convert column from string to timestamp

1 Answers