
I have a dataframe in PySpark which I read as follows:

df = (spark.table('db.table')
           .select(F.col('key').alias('key_a'),
                   F.to_date('move_out_date', 'yyyyMMdd').alias('move_out_date')))

Now I want to compare the move_out_date column with a single date, 20151231, but the code below isn't working:

from pyspark.sql import functions as F

df.filter(F.datediff(F.col('move_out_date'), F.to_date('20151231', 'yyyyMMdd')) > 0)

How do you compare a to_date column with a single value?

From the code it looks like you need records greater than your date: df.filter(df.move_out_date > f.to_date(f.lit('20151231'), 'yyyyMMdd')) – SMaZ

1 Answer


It's because to_date() expects a column as its parameter and you are passing a plain string. To fix it, wrap the date you want to compare in lit() so that it is passed as a literal column.
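For illustration (a sketch; the exact error message varies by Spark version), passing the bare string makes Spark treat it as a column name, so the filter fails at analysis time because no column called 20151231 exists:

f.to_date('20151231', 'yyyyMMdd')           # resolved as a column named '20151231' -> AnalysisException
f.to_date(f.lit('20151231'), 'yyyyMMdd')    # literal column, parsed as a date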

import pyspark.sql.functions as f
from pyspark.sql import Row
from pyspark.shell import spark

# Sample data: dates stored as yyyyMMdd strings
df = spark.createDataFrame([
    Row(key=1, date='20151231'),
    Row(key=2, date='20160101'),
    Row(key=3, date='20160102')
])

# Parse the string column into a proper date column
df = df.select(f.col('key').alias('key_a'),
               f.to_date(f.col('date'), 'yyyyMMdd').alias('move_out_date'))

# lit() wraps the comparison date in a literal column so to_date() can parse it
df = df.filter(f.datediff(f.col('move_out_date'), f.to_date(f.lit('20151231'), 'yyyyMMdd')) > 0)

df.show()

Output:

+-----+-------------+
|key_a|move_out_date|
+-----+-------------+
|    2|   2016-01-01|
|    3|   2016-01-02|
+-----+-------------+
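As a side note (a minimal alternative sketch, not part of the original answer): since datediff(a, b) > 0 is just a > b for date columns, you can drop datediff() and compare the column directly, which reads more naturally:

import pyspark.sql.functions as f

# Equivalent filter: date columns compare naturally with >
df.filter(f.col('move_out_date') > f.to_date(f.lit('20151231'), 'yyyyMMdd')).show()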