11 votes

My dataframe contains one field which is a date, and it appears in string format, for example:

'2015-07-02T11:22:21.050Z'

I need to filter the DataFrame on the date to get only the records from the last week. So I tried a map approach, transforming the string dates to datetime objects with strptime:

from datetime import datetime, timedelta

def map_to_datetime(row):
    format_string = '%Y-%m-%dT%H:%M:%S.%fZ'
    row.date = datetime.strptime(row.date, format_string)

df = df.map(map_to_datetime)

and then I would apply a filter as

df.filter(lambda row:
    row.date >= (datetime.today() - timedelta(days=7)))

I managed to get the mapping working, but the filter fails with:

TypeError: condition should be string or Column

Is there a way to make this filtering work, or should I change the approach, and if so, how?

2 Answers

14 votes

I figured out a way to solve my problem: use the Spark SQL API with dates in string format.

Here is an example:

from datetime import datetime, timedelta

last_week = (datetime.today() - timedelta(days=7)).strftime(format='%Y-%m-%d')
new_df = df.where(df.date >= last_week)
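
This works because ISO 8601 strings compare lexicographically in the same order as the instants they represent, so no cast is needed. Here is a self-contained sketch with hypothetical sample data (it assumes a pyspark shell where sqlContext is defined, and uses the 'date' column name from the question):

from datetime import datetime, timedelta

# Hypothetical sample rows; 'date' matches the column name in the question.
df = sqlContext.createDataFrame(
    [('2015-07-02T11:22:21.050Z', ), ('2016-03-20T21:00:00.000Z', )],
    ['date'])

# ISO 8601 strings sort lexicographically in chronological order, so a
# plain string comparison against a 'YYYY-MM-DD' cutoff is enough.
last_week = (datetime.today() - timedelta(days=7)).strftime('%Y-%m-%d')
df.where(df.date >= last_week).show()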

10 votes

Spark >= 1.5

You can use INTERVAL:

from pyspark.sql.functions import col, expr, current_date

df_casted.where(col("dt") >= current_date() - expr("INTERVAL 7 days"))
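
Here df_casted is the DataFrame built in the Spark < 1.5 section below. For reference, a minimal self-contained sketch of the same filter (assuming the sc global of a pyspark shell and hypothetical sample data):

from pyspark.sql.functions import col, expr, current_date

# Hypothetical sample data with ISO 8601 strings, as in the question.
df = sc.parallelize([
    ('2015-07-02T11:22:21.050Z', ),
    ('2016-03-20T21:00:00.000Z', )
]).toDF(("d_str", ))

# Cast the string to a date once, then compare against current_date()
# shifted back by an interval literal.
df_casted = df.withColumn("dt", col("d_str").cast("date"))
df_casted.where(col("dt") >= current_date() - expr("INTERVAL 7 days")).show()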

Spark < 1.5

You can solve this without using worker-side Python code and without switching to RDDs. First of all, since you use ISO 8601 strings, your data can be cast directly to date or timestamp:

from pyspark.sql.functions import col

df = sc.parallelize([
    ('2015-07-02T11:22:21.050Z', ),
    ('2016-03-20T21:00:00.000Z', )
]).toDF(("d_str", ))

df_casted = df.select("*",
    col("d_str").cast("date").alias("dt"), 
    col("d_str").cast("timestamp").alias("ts"))

This will save one roundtrip between the JVM and Python. There are also a few ways you can approach the second part. Date only:

from pyspark.sql.functions import current_date, datediff, unix_timestamp

df_casted.where(datediff(current_date(), col("dt")) < 7)

Timestamps:

def days(i: int) -> int:
    return 60 * 60 * 24 * i

df_casted.where(unix_timestamp() - col("ts").cast("long") < days(7))

You can also take a look at current_timestamp and date_sub.
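
For example, a date_sub based variant (a sketch, roughly equivalent to the datediff version above; date_sub is available from 1.5):

from pyspark.sql.functions import col, current_date, date_sub

# Keep rows whose date falls on or after the day seven days ago.
df_casted.where(col("dt") >= date_sub(current_date(), 7))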

Note: I would avoid using DataFrame.map. It is better to use DataFrame.rdd.map instead; it will save you some work when switching to 2.0+.
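
If you do end up needing Python-side parsing, a sketch of the RDD variant could look like this (Row objects are immutable, so return new values instead of assigning to row attributes):

from datetime import datetime

# Map over the underlying RDD; DataFrame.map is gone in PySpark 2.0+,
# while df.rdd.map keeps working.
parsed = df.rdd.map(
    lambda row: datetime.strptime(row.d_str, '%Y-%m-%dT%H:%M:%S.%fZ'))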