1 vote

I'm trying to round hours using pyspark and udf.

The function works properly in plain Python but fails when used with PySpark.

The input is:

from pandas import Timestamp
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

date = Timestamp('2016-11-18 01:45:55')  # type is pandas._libs.tslibs.timestamps.Timestamp

def time_feature_creation_spark(date):
    # round to the nearest hour and return the hour as an integer
    return date.round("H").hour

time_feature_creation_udf = udf(lambda x: time_feature_creation_spark(x), IntegerType())


Then I use it in the function that feeds Spark:

data = data.withColumn("hour", time_feature_creation_udf(data["date"]))

And the error is:

TypeError: 'Column' object is not callable

The expected output is just the hour closest to the time in the datetime (e.g. 20:45 is closest to 21:00, so it should return 21).


2 Answers

9 votes

A nicer approach than the /3600*3600 trick is to use the built-in function date_trunc:

import pyspark.sql.functions as F

df = df.withColumn("hourly_timestamp", F.date_trunc("hour", df.timestamp))

Other formats besides hour are:

'year', 'yyyy', 'yy', 'month', 'mon', 'mm', 'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'
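
For illustration, here is a minimal sketch of date_trunc in action (it assumes an existing SparkSession named spark; the column names are otherwise arbitrary):

import pyspark.sql.functions as F

# assumes an existing SparkSession named `spark`
df = spark.createDataFrame([("2016-11-18 01:45:55",)], ["timestamp"])
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))

df.withColumn("hourly_timestamp", F.date_trunc("hour", df.timestamp)).show(truncate=False)
# +-------------------+-------------------+
# |timestamp          |hourly_timestamp   |
# +-------------------+-------------------+
# |2016-11-18 01:45:55|2016-11-18 01:00:00|
# +-------------------+-------------------+

Note that date_trunc truncates to the start of the hour; unlike the round(unix_timestamp(...)/3600)*3600 approach in the other answer, it does not round 01:45 up to 02:00.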

5 votes

You can't just apply a PySpark UDF to a pandas DataFrame.

If you want to do this conversion in Spark, you need to convert the pandas DataFrame to a Spark DataFrame first:

import pandas as pd
from pandas import Timestamp

date1 = Timestamp('2016-11-18 01:45:55')
date2 = Timestamp('2016-12-18 01:45:55')
df = pd.DataFrame({"date": [date1, date2]})

data = sqlContext.createDataFrame(df)

Then, to calculate the rounded hour, you don't need a UDF. This line does the trick:

# note: round here is pyspark.sql.functions.round, not the Python built-in
from pyspark.sql.functions import hour, round, unix_timestamp

result = data.withColumn("hour", hour((round(unix_timestamp("date")/3600)*3600).cast("timestamp")))

What it does is:

  1. convert the timestamp to Unix time in seconds using unix_timestamp()
  2. divide by 3600 to get hours, round to the nearest hour, and multiply by 3600 again
  3. cast the Unix time back to a normal timestamp using cast()
  4. extract the hour using the hour() function
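
For the two sample timestamps above, a quick check of the result (the output below is reconstructed by hand, so treat it as illustrative):

result.select("date", "hour").show(truncate=False)
# +-------------------+----+
# |date               |hour|
# +-------------------+----+
# |2016-11-18 01:45:55|2   |
# |2016-12-18 01:45:55|2   |
# +-------------------+----+

Both sample times round from 01:45:55 up to 02:00, hence hour 2; a 20:45:55 timestamp would round to 21:00 and yield 21, matching the expected output in the question.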

Spark uses its own data types, so a pandas._libs.tslibs.timestamps.Timestamp is converted to a pyspark.sql.types.TimestampType when you convert the pandas DataFrame to a Spark DataFrame, which is why pandas functions no longer work on it.
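
You can see the conversion in the schema of the Spark DataFrame created above (a quick illustrative check):

data.printSchema()
# root
#  |-- date: timestamp (nullable = true)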