
I am using PySpark version 3.0.1. I am reading a CSV file into a PySpark DataFrame that has two date columns, but when I print the schema, both columns are inferred as string type.

[Screenshots: the DataFrame contents and the printSchema() output, showing both date columns as string type.]

How can I convert the values in both date columns to timestamp type using PySpark?

I have tried several approaches, but all of them require knowing the current format in advance. How can I convert to a proper timestamp when I don't know which format the CSV file will contain?

I have also tried the code below, but it creates a new column containing only null values:

from pyspark.sql.functions import col

# Attempt: cast the string column directly to timestamp
df1 = df.withColumn('datetime', col('joining_date').cast('timestamp'))
df1.show()
df1.printSchema()



1 Answer


Since the column contains two different date formats, you need to try parsing with both format patterns and coalesce the results:

import pyspark.sql.functions as F

# to_timestamp returns null when the pattern does not match,
# so coalesce keeps the first successful parse.
result = df.withColumn(
    'datetime',
    F.coalesce(
        F.to_timestamp('joining_date', 'MM-dd-yy'),
        F.to_timestamp('joining_date', 'MM/dd/yy')
    )
)

result.show()
+------------+-------------------+
|joining_date|           datetime|
+------------+-------------------+
|    01-20-20|2020-01-20 00:00:00|
|    01/19/20|2020-01-19 00:00:00|
+------------+-------------------+
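
The question mentions two date columns; the same approach extends to any of them with a small helper. A minimal sketch, assuming the second column is named leaving_date (a hypothetical name, adjust to your schema):

import pyspark.sql.functions as F

# Hypothetical helper: tries each known pattern on a column and
# keeps the first parse that succeeds (non-null).
def parse_date(colname, patterns=('MM-dd-yy', 'MM/dd/yy')):
    return F.coalesce(*[F.to_timestamp(colname, p) for p in patterns])

result = (
    df
    .withColumn('joining_date', parse_date('joining_date'))
    .withColumn('leaving_date', parse_date('leaving_date'))  # hypothetical column name
)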

If you want to render all the dates in a single format (note that date_format returns a string, not a timestamp):

import pyspark.sql.functions as F

result = df.withColumn(
    'datetime',
    F.date_format(
        F.coalesce(
            F.to_timestamp('joining_date', 'MM-dd-yy'),
            F.to_timestamp('joining_date', 'MM/dd/yy')
        ),
        'MM-dd-yy'
    )
)

result.show()
+------------+--------+
|joining_date|datetime|
+------------+--------+
|    01-20-20|01-20-20|
|    01/19/20|01-19-20|
+------------+--------+
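
One caveat for Spark 3.0.x: it uses a stricter DateTimeFormatter-based parser than Spark 2.x, so some patterns that used to work can now raise errors instead of returning null. If you hit that, a commonly used escape hatch is the legacy parser policy; this is a sketch assuming spark is your active SparkSession:

# Restore the pre-3.0 datetime parsing behaviour
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")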