0 votes

I am trying to convert a string column containing dates in a given format (YYYY-mm-dd) into a date/timestamp column, but I am not able to do so using SQL's str_to_date function.

I have followed this: Convert date from String to Date format in Dataframes. But it works for a single value, whereas I want the date/timestamp conversion to be applied over all rows of a column, so I tried the below:

sparkDF = spark.createDataFrame([('a', '2018-01-20'), ('b', '2018-01-21'), ('c', '2018-01-22')], ['name','enroll_DATE'])
>>> sparkDF.show()
+----+-----------+
|name|enroll_DATE|
+----+-----------+
|   a| 2018-01-20|
|   b| 2018-01-21|
|   c| 2018-01-22|
+----+-----------+

>>> sparkDF.printSchema()
root
|-- name: string (nullable = true)
|-- enroll_DATE: string (nullable = true)

sparkDF.registerTempTable('data')
res = spark.sql("select name, enroll_DATE,  TO_DATE(CAST(UNIX_TIMESTAMP(enroll_DATE, '%Y-%m-%d') as TIMESTAMP)) as converted_ED from data ")  

But all values in the converted_ED column come back as null.

There are obviously better ways to handle this with DataFrame operations, but I have a limitation (or constraint, you could say): I cannot do this as a DataFrame transformation. Basically, spark.sql is mandatory for me.

As per the link, it is possible to do this for a single date string, so there might already be a way (that I am unaware of) to do the same over a column.

Please suggest. Thanks.

Edit

By changing the format string from '%Y-%m-%d' to 'YYYY-mm-dd', the conversion worked, but in a weird way (the two dates are different!):

 res = spark.sql("SELECT name, enroll_DATE, TO_DATE(CAST(UNIX_TIMESTAMP(enroll_DATE, 'YYYY-mm-dd') AS TIMESTAMP)) AS newdate from data")
>>> res.show()
+----+-----------+----------+
|name|enroll_DATE|   newdate|
+----+-----------+----------+
|   a| 2018-01-20|2017-12-31|   ### two dates are different
|   b| 2018-01-21|2017-12-31|
|   c| 2018-01-22|2017-12-31|
+----+-----------+----------+

>>> res.printSchema()
root
|-- name: string (nullable = true)
|-- enroll_DATE: string (nullable = true)
|-- newdate: date (nullable = true)

But I am still looking for a way to do the same for the '%Y-%d-%m' format too (if there is any solution).
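Note: Spark's UNIX_TIMESTAMP expects a Java SimpleDateFormat pattern rather than a strftime one, so the '%' characters in '%Y-%m-%d' are treated as literals that never match the input (hence the nulls), and in 'YYYY-mm-dd' the 'YYYY' is a week-based year and 'mm' is minutes, which would explain every row collapsing to 2017-12-31. A minimal sketch under that assumption, with the year/month/day pattern and the conversion kept inside spark.sql:

 # Sketch: same query as above, but with a Java SimpleDateFormat pattern ('yyyy-MM-dd')
 res = spark.sql("SELECT name, enroll_DATE, "
                 "TO_DATE(CAST(UNIX_TIMESTAMP(enroll_DATE, 'yyyy-MM-dd') AS TIMESTAMP)) AS converted_ED "
                 "FROM data")
 res.show()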

Can you please share the format of your enroll_DATE column? Sample data will help a lot. – Ankit Kumar Namdeo
Can you show your data? If your string is YYYY-MM-DD, you can directly convert it to date. – Steven
The format is '%Y-%m-%d' or YYYY-mm-dd, e.g. 2018-01-22. I had already mentioned the format inside the spark.sql query; for more clarity I added a few samples too. Thanks. – Satya
@Steven: As I mentioned, there is a constraint (from the design flow) that I cannot use a Spark DataFrame transformation over the enroll_DATE column to cast it to timestamp type in my code; the column has to be cast inside the spark.sql query expression. – Satya

2 Answers

0 votes

The below works only when the string is already in the 'YYYY-mm-dd' (ISO) format, but fails for other formats like dd/mm/YYYY:

 res = spark.sql("SELECT name, enroll_DATE, TO_DATE(CAST(enroll_DATE AS TIMESTAMP)) AS newdate from data")
>>> res.show()
+----+-----------+----------+
|name|enroll_DATE|   newdate|
+----+-----------+----------+
|   a| 2018-01-20|2018-01-20|
|   b| 2018-01-21|2018-01-21|
|   c| 2018-01-22|2018-01-22|
+----+-----------+----------+

>>> res.printSchema()
root
|-- name: string (nullable = true)
|-- enroll_DATE: string (nullable = true)
|-- newdate: date (nullable = true)
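For inputs in other formats such as dd/mm/YYYY, a pattern can be passed explicitly to UNIX_TIMESTAMP instead of relying on a bare CAST. A rough sketch (the data2 table and its values below are only illustrative, not from the question):

 # Illustrative data in dd/MM/yyyy form, parsed with an explicit pattern inside spark.sql
 sparkDF2 = spark.createDataFrame([('a', '20/01/2018'), ('b', '21/01/2018')], ['name', 'enroll_DATE'])
 sparkDF2.registerTempTable('data2')
 res2 = spark.sql("SELECT name, enroll_DATE, "
                  "TO_DATE(CAST(UNIX_TIMESTAMP(enroll_DATE, 'dd/MM/yyyy') AS TIMESTAMP)) AS newdate "
                  "FROM data2")
 res2.show()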
0 votes
spark.sql("SELECT name, enroll_date, TO_DATE(CAST(UNIX_TIMESTAMP(enroll_date, 'y-M-d') AS TIMESTAMP)) AS newdate from data")

unix_timestamp in spark

Pattern for unix_timestamp
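A quick usage sketch against the data table registered earlier (output not verified here); for this input, 'y-M-d' should parse the same as the more explicit 'yyyy-MM-dd':

 # Apply the answer's query to the sample data and inspect the result
 res = spark.sql("SELECT name, enroll_DATE, "
                 "TO_DATE(CAST(UNIX_TIMESTAMP(enroll_DATE, 'y-M-d') AS TIMESTAMP)) AS newdate FROM data")
 res.show()  # newdate is expected to match enroll_DATE, e.g. 2018-01-20 -> 2018-01-20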