0
votes

I have two timestamps columns in a pyspark dataframe like so:

+--------------------+--------------------+
|          TIME_STAMP|         TIME_STAMP2|
+--------------------+--------------------+
|2020-01-03 12:58:...|                null|
|2020-01-03 12:59:...|                null|
|2020-01-03 13:01:...|                null|
|2020-01-03 13:02:...|                null|
|2020-01-03 13:04:...|                null|
|2020-01-03 13:05:...|                null|
|2020-01-03 13:07:...|                null|
|2020-01-03 13:08:...|2020-01-03 12:58:...|
|2020-01-03 13:10:...|2020-01-03 12:59:...|
|2020-01-03 13:11:...|2020-01-03 13:01:...|
|2020-01-03 13:13:...|2020-01-03 13:02:...|
|2020-01-03 13:14:...|2020-01-03 13:04:...|
|2020-01-03 13:16:...|2020-01-03 13:05:...|
|2020-01-03 13:17:...|2020-01-03 13:07:...|
|2020-01-03 13:19:...|2020-01-03 13:08:...|
|2020-01-03 13:20:...|2020-01-03 13:10:...|
|2020-01-03 13:22:...|2020-01-03 13:11:...|
|2020-01-03 13:23:...|2020-01-03 13:13:...|
|2020-01-03 13:24:...|2020-01-03 13:14:...|
|2020-01-03 13:26:...|2020-01-03 13:16:...|
+--------------------+--------------------+

I would like to find the difference, however, if one of the values is null, I am getting an error. Is there a way to cirvumvent this? This is the error i am getting:

An error was encountered: "cannot resolve '(TIME_STAMP - TIME_STAMP2)' due to data type mismatch: '(TIME_STAMP - TIME_STAMP2)' requires (numeric or calendarinterval) type, not timestamp;;

1
Could you show the code that gives the error? - mck

1 Answers

1
votes

You can cast the timestamp values to long and subtract them. You'll get the difference in seconds:

    from pyspark.sql import functions as f

    df.withColumn('diff_in_seconds', f.col('TIME_STAMP').cast('long') - f.col('TIME_STAMP2').cast('long'))
    df.show(10, False)

Note that if any of the values is "null" the result will be "null" as well.