6 votes

I am using PySpark with Python 2.7. I have a date column stored as a string (with milliseconds) and would like to convert it to a timestamp.

This is what I have tried so far:

df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')) )

printSchema() shows end_time: string (nullable = true)

whereas I expected the column type to be timestamp.

4
Please include a minimal reproducible example with some small sample inputs and the desired output. How to create good reproducible spark examples. - pault

4 Answers

6 votes

Try using from_utc_timestamp:

from pyspark.sql.functions import from_utc_timestamp

df = df.withColumn('end_time', from_utc_timestamp(df.end_time, 'PST')) 

You'll need to specify a timezone for the function; in this case I chose PST.

If this does not work, please give us an example of a few rows showing df.end_time.
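
For reference, one quick way to print a few raw values of the column (a sketch, assuming the column is named end_time as in the question):

# Print a handful of raw end_time strings without truncation
df.select('end_time').show(5, truncate=False)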

5 votes

Create a sample dataframe with the timestamp formatted as a string:

import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()

Output:

+----------------------------+
|TIME                        |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)

Convert the string time format (including milliseconds) to unix_timestamp (a double). Since the unix_timestamp() function drops milliseconds, we need to add them back with a simple hack: extract the milliseconds from the string with substring (start_position = -7, length_of_substring = 3), cast the substring to float, and add it to the unix_timestamp value separately.

df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)

Convert the unix_timestamp (double) to Spark's timestamp datatype:

df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)

This gives the following output:

+----------------------------+----------------+-----------------------+
|TIME                        |unix_timestamp  |TimestampType          |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+

Checking the Schema:

df2.printSchema()


root
 |-- TIME: string (nullable = true)
 |-- unix_timestamp: double (nullable = true)
 |-- TimestampType: timestamp (nullable = true)
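
If you prefer a single expression, the two steps above can be combined (a sketch, reusing the same TIME column, format string, and F alias from this answer; df3 is just an illustrative name):

# One-step version: parse to epoch seconds, add back the millisecond
# fraction extracted from the string, then cast the double to a timestamp
df3 = df.withColumn(
    "TimestampType",
    F.to_timestamp(
        F.unix_timestamp(df.TIME, 'dd-MMM-yyyy HH:mm:ss.SSS z')
        + F.substring(df.TIME, -7, 3).cast('float') / 1000
    )
)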
3 votes

In current versions of Spark, we do not have to do much for timestamp conversion.

Using the to_timestamp function works pretty well in this case. The only thing we need to take care of is passing a timestamp format that matches the original column; in my case it was yyyy-MM-dd HH:mm:ss. Other formats can be MM/dd/yyyy HH:mm:ss or a combination of such patterns.

from pyspark.sql.functions import to_timestamp
df = df.withColumn('date_time', to_timestamp('event_time', 'yyyy-MM-dd HH:mm:ss'))
df.show()
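
Since the question's strings include milliseconds, the same approach should accept a fractional-seconds pattern on recent Spark versions (a sketch, assuming an end_time column in yyyy-MM-dd HH:mm:ss.SSS form; on older Spark 2.x parsers the fractional part may be dropped, which is what the unix_timestamp workaround above addresses):

from pyspark.sql.functions import to_timestamp
# Parse a string such as '2019-01-01 12:34:56.789', keeping the milliseconds
df = df.withColumn('end_time_ts', to_timestamp('end_time', 'yyyy-MM-dd HH:mm:ss.SSS'))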
2 votes

The following might help:

from pyspark.sql import functions as F
df = df.withColumn("end_time", F.from_unixtime(F.col("end_time"), 'yyyy-MM-dd HH:mm:ss.SS').cast("timestamp"))
