I'm trying to cast an RFC 2822 datetime column to a timestamp column. Parsing a single value outside a DataFrame works, but applying the same function to a DataFrame column raises an error.

My imports:

from pyspark.sql.types import *
from pyspark.sql.column import *
from pyspark.sql.functions import *
from email.utils import parsedate_to_datetime

Outside a DataFrame, this code works:

datestr = "Thu Sep 12 2019 15:58:30 GMT-0500 (hora estándar de Colombia)"
print(parsedate_to_datetime(datestr))

Output:

2019-09-12 15:58:30
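
Note that this output carries no UTC offset. The input is in the JavaScript Date string format rather than strict RFC 2822, and email.utils does not appear to recognise the "GMT-0500" token, so the offset is dropped and a naive datetime comes back. If the offset matters, a minimal sketch with datetime.strptime could look like the following. The name parse_js_date is a hypothetical helper, not part of any library, and the format string assumes English day and month abbreviations.

```python
from datetime import datetime, timedelta, timezone

def parse_js_date(text):
    """Parse 'Thu Sep 12 2019 15:58:30 GMT-0500 (...)' and keep the UTC offset.

    Hypothetical helper for illustration; not part of email.utils or PySpark.
    """
    # Drop the trailing "(zone name)" comment, if present.
    head = text.split(" (", 1)[0]
    # %a and %b match English abbreviations under the default C locale;
    # %z consumes the "-0500" that follows the literal "GMT".
    return datetime.strptime(head, "%a %b %d %Y %H:%M:%S GMT%z")

parsed = parse_js_date("Thu Sep 12 2019 15:58:30 GMT-0500 (hora estándar de Colombia)")
print(parsed)  # 2019-09-12 15:58:30-05:00
```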

But if I work with this DataFrame:

df = spark.createDataFrame(["Thu Sep 12 2019 15:58:30 GMT-0500 (hora estándar de Colombia)"], "string",).toDF("Date")

Then I try to create another column with the following code:

df2 = df.withColumn("timestamp", parsedate_to_datetime(col("Date")))

I receive this error message:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

1 Answer

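parsedate_to_datetime is a plain Python function that expects a str, but col("Date") is a Column expression, not a string. The function tests its argument for truthiness internally, and that is what raises "Cannot convert column into bool". Register parsedate_to_datetime as a UDF so that Spark calls it on each row's string value and converts the result to a timestamp: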

>>> from pyspark.sql.types import *
>>> from pyspark.sql.column import *
>>> from pyspark.sql.functions import *
>>> from email.utils import parsedate_to_datetime
>>> df = spark.createDataFrame(["Thu Sep 12 2019 15:58:30 GMT-0500 (hora estándar de Colombia)"], "string",).toDF("Date")
>>> parsedate_to_datetime_udf = udf(parsedate_to_datetime, TimestampType())
>>> df2 = df.withColumn("timestamp", parsedate_to_datetime_udf(col("Date")))
>>> df2.show()
+--------------------+-------------------+
|                Date|          timestamp|
+--------------------+-------------------+
|Thu Sep 12 2019 1...|2019-09-12 15:58:30|
+--------------------+-------------------+
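
A UDF runs once per row, so a null or malformed value in the column would make parsedate_to_datetime raise and fail the job. A minimal sketch of a more defensive wrapper follows. The name safe_parsedate is hypothetical, and the exception types caught are the ones I would expect from email.utils across Python versions, not an exhaustive list.

```python
from email.utils import parsedate_to_datetime

def safe_parsedate(value):
    """Return a datetime for a date string, or None if it cannot be parsed.

    Hypothetical wrapper for illustration; not part of PySpark or email.utils.
    """
    if value is None:  # a null cell reaches a Python UDF as None
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError on unparseable input,
        # newer ones raise ValueError.
        return None

print(safe_parsedate("Thu Sep 12 2019 15:58:30 GMT-0500 (hora estándar de Colombia)"))
print(safe_parsedate("not a date"))  # None
print(safe_parsedate(None))          # None
```

Registered the same way as above, with udf(safe_parsedate, TimestampType()), such rows should come out as null in the timestamp column instead of aborting the query.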