6 votes

I am currently trying to figure out how to pass the string format argument to the PySpark to_date function via a column parameter.

Specifically, I have the following setup:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()  # needed for rdd.toDF()
df = sc.parallelize([('a', '2018-01-01', 'yyyy-MM-dd'),
                     ('b', '2018-02-02', 'yyyy-MM-dd'),
                     ('c', '02-02-2018', 'dd-MM-yyyy')]).toDF(
    ["col_name", "value", "format"])

I am trying to add a new column in which each date from the column F.col("value"), which is a string, is parsed to a date.

Separately for each format, this can be done with

df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
        withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))

This, however, gives me two new columns. I want a single column containing both results, but passing a column as the format argument does not seem to be possible with the to_date function:

df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))

Here, a "Column object not callable" error is thrown.

Is it possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?


3 Answers

10 votes

You can use a column value as a parameter without a udf using the spark-sql syntax:

Spark version 2.2 and above

from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name|     value|    format|     test3|
#+--------+----------+----------+----------+
#|       a|2018-01-01|yyyy-MM-dd|2018-01-01|
#|       b|2018-02-02|yyyy-MM-dd|2018-02-02|
#|       c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+

Or equivalently using pyspark-sql:

df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show() 

Spark version 1.5 and above

Older versions of Spark do not support a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:

from pyspark.sql.functions import expr
df.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()

Or equivalently using pyspark-sql:

df.createOrReplaceTempView("df")
spark.sql(
    "select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show() 
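One behavioral note: with the lenient parser (the default in these older versions; Spark 3 can be configured to raise instead via spark.sql.legacy.timeParserPolicy), unix_timestamp returns null when a value does not match its format, so mismatched rows simply yield a null date rather than failing the job. A small sketch, reusing the spark session from above:

from pyspark.sql.functions import expr

bad = spark.createDataFrame([('not-a-date', 'yyyy-MM-dd')], ["value", "format"])
bad.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value, format))").cast("date")
).show()
#+----------+----------+-----+
#|     value|    format|test3|
#+----------+----------+-----+
#|not-a-date|yyyy-MM-dd| null|
#+----------+----------+-----+
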
3 votes

As far as I know, your problem requires a udf (user-defined function) to apply the correct format. But inside a udf you cannot directly use Spark functions like to_date, so I created a little workaround. First, the udf parses the value with the Python format string from the column and converts it to an ISO-format string. Then a second withColumn converts the ISO date to the correct date type in column test3. However, you have to adapt the formats in the original column to match the Python date-format strings, e.g. yyyy -> %Y, MM -> %m, ...

import datetime
from pyspark.sql.functions import udf, col, to_date

test_df = spark.createDataFrame([
    ('a', '2018-01-01', '%Y-%m-%d'),
    ('b', '2018-02-02', '%Y-%m-%d'),
    ('c', '02-02-2018', '%d-%m-%Y')
], ("col_name", "value", "format"))

def map_to_date(s, format):
    # parse with the Python format string and return an ISO-8601 string
    return datetime.datetime.strptime(s, format).isoformat()

myudf = udf(map_to_date)

test_df.withColumn("test3", myudf(col("value"), col("format"))) \
    .withColumn("test3", to_date("test3")).show(truncate=False)

Result:

+--------+----------+--------+----------+
|col_name|value     |format  |test3     |
+--------+----------+--------+----------+
|a       |2018-01-01|%Y-%m-%d|2018-01-01|
|b       |2018-02-02|%Y-%m-%d|2018-02-02|
|c       |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+
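If you'd rather keep the Java-style patterns (yyyy-MM-dd) in the format column instead of rewriting them by hand, you can do the translation inside the udf. This is a minimal sketch, assuming only the simple tokens yyyy, MM and dd appear in the data; JAVA_TO_PY and java_to_python_format are names made up for this example, not Spark APIs:

import datetime
from pyspark.sql.functions import udf, col, to_date

# assumed token mapping; extend it if your formats use more tokens
JAVA_TO_PY = {'yyyy': '%Y', 'MM': '%m', 'dd': '%d'}

def java_to_python_format(fmt):
    for java_token, py_token in JAVA_TO_PY.items():
        fmt = fmt.replace(java_token, py_token)
    return fmt

@udf
def map_to_date_java(s, fmt):
    return datetime.datetime.strptime(s, java_to_python_format(fmt)).isoformat()

# usage, where the format column now holds Java-style patterns:
# test_df.withColumn("test3", to_date(map_to_date_java(col("value"), col("format"))))
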
0 votes

You don't need the format column at all. You can use coalesce to try all the possible formats:

from pyspark.sql import functions as F

def get_right_date_format(date_string):
    # coalesce returns the first non-null result, i.e. the first format that parses
    return F.coalesce(
        F.to_date(date_string, 'yyyy-MM-dd'),
        F.to_date(date_string, 'dd-MM-yyyy'),
        F.to_date(date_string, 'yyyy-dd-MM')
    )

df = sc.parallelize([('a', '2018-01-01'),
                     ('b', '2018-02-02'),
                     ('c', '2018-21-02'),
                     ('d', '02-02-2018')]).toDF(["col_name", "value"])

df = df.withColumn("formatted_data", get_right_date_format(df.value))

The issue with this approach, though, is that a date like 2020-02-01 would be treated as 1 Feb 2020 because the first matching format wins, even though 2 Jan 2020 is also a possible reading.

Just an alternative approach!
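
If that ambiguity matters for your data, one option is to flag rows where more than one candidate format parses successfully. A minimal sketch, assuming Spark 2.2+ so that to_date accepts a format string; formats and parse_count are names made up for this example:

from pyspark.sql import functions as F

formats = ['yyyy-MM-dd', 'dd-MM-yyyy', 'yyyy-dd-MM']

# count how many candidate formats parse each value; more than 1 means ambiguous
parse_count = sum(
    F.to_date(df.value, fmt).isNotNull().cast("int") for fmt in formats
)
df = df.withColumn("ambiguous", parse_count > 1)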