0
votes

I have the input data comes in below format which is in dataframe df_date:

col1, col2, extraction_date, col3
010, DSL, 20191201235900, VDRAC
010, DSL, 20191203235900, VDRAC
010, DSL, 20191205235900, VDRAC
010, DSL, 20200120235900, VDRAC
010, DSL, 20200128235900, VDRAC
010, DSL, 20200129235900, VDRAC
010, DSL, 20200129235900, VDRAC
(string, string, bitint(longtype), string) # I have added the data type of each column for reference)

When I want to process old date, consider only the records before 29th. I need to filter it and need to apply business condition.

import datetime
var = '28-01-2020'
ref_date = datetime.datetime.strptime(var, '%d-%m-%Y').date()   #converting input ref date arg to date format
df_fil_date = df_data.filter(df_date.extraction_date.leq(ref_date))

Is showing me error, because the extraction_date from source is long_type (bitint) and the ref_date variable is in date format.

Could you please check and let me know how to filter the data based on the date variable passed?

1

1 Answers

1
votes

You will need to cast the column to date type to do any kinds of checks with other date columns.

clm_name = 'extraction_date'
input_format = 'yyyyMMddHHmmss'
outputformat = 'MM-dd-yyyy HH:mm:ss'
df_data = df_data.withColumn(clm_name, (unix_timestamp( df_data[extraction_date], input_format )))
df_data = df_data.withColumn(clm_name, from_unixtime( df_data[extraction_date], outputformat))
df_data = df_data.withColumn(clm_name, to_timestamp( df_data[clm_name], outputformat).cast('date'))

Instead of casting to 'date' you could use 'timestamp' instead so that the time information is not lost during casting, but then again you will need the other column to be of type 'timestamp'. To filter the date,

var = '2020-01-28'
df_fil_date = df_data.filter(df_date.extraction_date < lit(var).cast('date'))