1 vote

I am using PySpark 3.0. I have a DataFrame with a 'time' column of StringType that I am trying to convert to a timestamp. The DataFrame looks like this:

+---------------+
|           time|
+---------------+
|10:59:46.000 AM|
| 6:26:36.000 PM|
|11:13:38.000 PM|
+---------------+
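For reference, here is a minimal sketch that reproduces this DataFrame (the SparkSession setup and the imports are assumptions for a standalone run):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.getOrCreate()

# three sample rows matching the strings shown above
df = spark.createDataFrame(
    [("10:59:46.000 AM",), ("6:26:36.000 PM",), ("11:13:38.000 PM",)],
    ["time"],
)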

I tried both to_timestamp() and unix_timestamp():

df.withColumn("new_time", F.to_timestamp(col("time"),"hh:mm:ss.SSS a")).show()

and

df.withColumn('new_time', F.unix_timestamp(df['time'], 'hh:mm:ss.SSS a').cast(TimestampType())).show()

The error I'm getting is this:

org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '6:26:36.000 PM' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

I want to know how it's done in Spark 3.0 without setting

spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

Any help would be much appreciated. Thanks.


2 Answers

3 votes

Try this:

df.withColumn("new_time", F.to_timestamp(F.lpad(col("time"), 15, "0"),"hh:mm:ss.SSS a")).show()

Some explanation:

lpad(column, length, pad_string)

This function checks the length of the given string and, if it is shorter than the specified length, left-pads it with pad_string until it reaches that length.

Example

The input row 6:26:36.000 PM has only 14 characters; since the specified length is 15, lpad left-pads it with 0 (the third argument) to make it 15 characters long. The output of lpad is 06:26:36.000 PM, which matches the hh:mm:ss.SSS a format we pass to to_timestamp.
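To see just the padding step in isolation, you can run something like this (assuming the df from the question); the first and third rows are already 15 characters long, so only the second row actually gets padded:

df.select(F.lpad(F.col("time"), 15, "0").alias("padded")).show()

which should print

+---------------+
|         padded|
+---------------+
|10:59:46.000 AM|
|06:26:36.000 PM|
|11:13:38.000 PM|
+---------------+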


3 votes

No need for padding.

You just need to change the conversion format string: delete one "h" from it and it works.

df.withColumn('new_time', F.unix_timestamp(df['time'], 'h:mm:ss.SSS a'))

Explanation of the format:

'hh:mm:ss.SSS a' expects a two-digit hour:
01:00:00.000 PM
11:00:00.000 AM

'h:mm:ss.SSS a' also accepts a single-digit hour:
1:00:00.000 PM
11:00:00.000 AM
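Note that unix_timestamp returns the number of seconds since the epoch as a long, not a timestamp column. If you want a TimestampType column as in the question, you can add the cast back; a sketch, assuming the same df and imports as above:

df.withColumn('new_time', F.unix_timestamp(df['time'], 'h:mm:ss.SSS a').cast(TimestampType())).show()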