0
votes

I have below data, I want to add ratingtimestamp column with formated timestamp. But I am getting null in it.

+------+-------+------+----------+---------------+
|userId|movieId|rating| ts       |ratingtimestamp|
+------+-------+------+----------+---------------+
|     1|    296|   5.0|1147880044|           null|
|     1|    306|   3.5|1147868817|           null|

code I am using for above change

val data2 = data.withColumn("ratingtimestamp", col("ts").cast(TimestampType))

2

2 Answers

1
votes

Use from_unixtime & date_format functions.

scala> val df = Seq(("1","296","5.0","1147880044","null"),("1","306","3.5","1147868817","null")).toDF("userId","movieId","rating","ts","ratingtimestamp")
df: org.apache.spark.sql.DataFrame = [userId: string, movieId: string ... 3 more fields]

scala> df.show(false)
+------+-------+------+----------+---------------+
|userId|movieId|rating|ts        |ratingtimestamp|
+------+-------+------+----------+---------------+
|1     |296    |5.0   |1147880044|null           |
|1     |306    |3.5   |1147868817|null           |
+------+-------+------+----------+---------------+
scala> df.withColumn("ratingtimestamp",date_format(from_unixtime($"ts"),"MM/dd/yyyy HH:mm:ss")).show(false)
+------+-------+------+----------+-------------------+
|userId|movieId|rating|ts        |ratingtimestamp    |
+------+-------+------+----------+-------------------+
|1     |296    |5.0   |1147880044|05/17/2006 21:04:04|
|1     |306    |3.5   |1147868817|05/17/2006 17:56:57|
+------+-------+------+----------+-------------------+
scala> df.withColumn("ratingtimestamp",from_unixtime($"ts","MM/dd/yyyy HH:mm:ss")).show(false)
+------+-------+------+----------+-------------------+
|userId|movieId|rating|ts        |ratingtimestamp    |
+------+-------+------+----------+-------------------+
|1     |296    |5.0   |1147880044|05/17/2006 21:04:04|
|1     |306    |3.5   |1147868817|05/17/2006 17:56:57|
+------+-------+------+----------+-------------------+
0
votes

I think, column ts is of type integer. try to convert it to long and then timestamp as below-

df.withColumn("ts_new", $"ts".cast("long").cast("timestamp"))

since the default format is - yyyy-MM-dd HH:mm:ss, you will see the output of show() in yyyy-MM-dd HH:mm:ss format.

You can convert the default format to new format like MM/dd/yyyy HH:mm:ss using -

from_unixtime($"ts_new".cast("long"),"MM/dd/yyyy HH:mm:ss")