
I'm trying to convert a UTC date to a date in the local timezone (derived from the country) with PySpark. I have the country as a string and the date as a timestamp.

So the input is:

date = Timestamp('2016-11-18 01:45:55') # type is pandas._libs.tslibs.timestamps.Timestamp

country = "FR" # Type is string

import pytz
import pandas as pd

def convert_date_spark(date, country):
    timezone = pytz.country_timezones(country)[0]

    local_time = date.replace(tzinfo = pytz.utc).astimezone(timezone)
    date, time = local_time.date(), local_time.time()

    return pd.Timestamp.combine(date, time)

# Then I'm creating a UDF to give it to Spark

from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

convert_date_udf = udf(lambda x, y: convert_date_spark(x, y), TimestampType())

Then I use it in the function that feeds Spark:

data = data.withColumn("date", convert_date_udf(data["date"], data["country"]))

I get the following error:

TypeError: tzinfo argument must be None or of a tzinfo subclass, not type 'str'

The expected output is the date in the same format.

As tested in plain Python, the _convert_date_spark_ function works, but it does not work in PySpark.
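The discrepancy can be reproduced without Spark (a minimal sketch, assuming the country resolves to `Europe/Paris`): a pandas `Timestamp` accepts a timezone *name* in `astimezone`, but Spark hands the UDF a plain `datetime.datetime`, whose `astimezone` requires a `tzinfo` instance.

```python
import datetime
import pandas as pd
import pytz

# A pandas Timestamp happily accepts a timezone *name* (a string):
ts = pd.Timestamp('2016-11-18 01:45:55')
local = ts.replace(tzinfo=pytz.utc).astimezone('Europe/Paris')  # works

# Spark, however, passes the UDF a plain datetime.datetime, and its
# astimezone() only accepts a tzinfo instance, not a string:
dt = datetime.datetime(2016, 11, 18, 1, 45, 55, tzinfo=pytz.utc)
try:
    dt.astimezone('Europe/Paris')
    failed = False
except TypeError:
    # same error as reported in the question
    failed = True
```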


Could you please help me find a solution for this?

Thanks


1 Answer


Use a tzinfo instance, not a string, as the timezone. `pytz.country_timezones` returns timezone *names*; pass the name to `pytz.timezone` to get a usable tzinfo object:

>>> timezone_name = pytz.country_timezones(country)[0]
>>> timezone_name
'Europe/Paris'
>>> timezone = pytz.timezone(timezone_name)
>>> timezone
<DstTzInfo 'Europe/Paris' LMT+0:09:00 STD>
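Putting it together, a corrected version of the question's function (a sketch; it returns a naive local datetime via `replace(tzinfo=None)` instead of `Timestamp.combine`, which amounts to the same thing):

```python
import datetime
import pytz

def convert_date_spark(date, country):
    # e.g. "FR" -> "Europe/Paris" (first timezone listed for the country)
    tz_name = pytz.country_timezones(country)[0]
    # Build a tzinfo instance -- passing the bare string to astimezone()
    # is what raised the TypeError
    tz = pytz.timezone(tz_name)
    local_time = date.replace(tzinfo=pytz.utc).astimezone(tz)
    # Drop the tzinfo so Spark stores a naive local timestamp
    return local_time.replace(tzinfo=None)

# Registered the same way as in the question:
# convert_date_udf = udf(convert_date_spark, TimestampType())
# data = data.withColumn("date", convert_date_udf(data["date"], data["country"]))
```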