0
votes

I am trying to convert a column containing date value in string format to timestamp format in Apache spark scala.

Below is the content of the dataframe(retailsNullRem):

+---------+---------+--------------+----------+
|InvoiceNo|StockCode|   InvoiceDate|customerID|
+---------+---------+--------------+----------+
|   536365|   85123A|12/1/2010 8:26|     17850|
|   536365|    71053|12/1/2010 8:26|     17850|
|   536365|   84406B|12/1/2010 8:26|     17850|
|   536365|   84029G|12/1/2010 8:26|     17850|
|   536365|   84029E|12/1/2010 8:26|     17850|
|   536365|    22752|12/1/2010 8:26|     17850|
|   536365|    21730|12/1/2010 8:26|     17850|
|   536366|    22633|12/1/2010 8:28|     17850|
|   536366|    22632|12/1/2010 8:28|     17850|
|   536367|    84879|12/1/2010 8:34|     13047|

"InvoiceDate" is the column that i am converting to timestamp. I tried the below code for the convertion.

val timeFmt = "MM/dd/yyyy HH:mm"
val retails = retailsNullRem
            .withColumn("InvoiceDateTS", to_timestamp(col("InvoiceDate"), timeFmt))

In the data source, it is mentioned that the date format is month/day/year hour:min. But the above code is returning 'Null' for InvoiceDateTS column. I even tried with format like ("%M/%d/%y %H:%m") as in some cases the month, day and hour didnot contain leading 0, but still getting null. Please guide me on what i am missing.

Below is the sample output:

+---------+---------+--------------+----------+-------------+
|InvoiceNo|StockCode|   InvoiceDate|customerID|InvoiceDateTS|
+---------+---------+--------------+----------+-------------+
|   536365|   85123A|12/1/2010 8:26|     17850|         null|
|   536365|    71053|12/1/2010 8:26|     17850|         null|
|   536365|   84406B|12/1/2010 8:26|     17850|         null|
|   536365|   84029G|12/1/2010 8:26|     17850|         null|
|   536365|   84029E|12/1/2010 8:26|     17850|         null|
|   536365|    22752|12/1/2010 8:26|     17850|         null|
|   536365|    21730|12/1/2010 8:26|     17850|         null|
|   536366|    22633|12/1/2010 8:28|     17850|         null|
|   536366|    22632|12/1/2010 8:28|     17850|         null|
|   536367|    84879|12/1/2010 8:34|     13047|         null|
2
are you sure that your input format matches the input data? I mean, MM/dd/yyyy HH:mm requires days to be two digits - the same for hours value - UninformedUser
@UninformedUser, Thanks for the response. I am a bit unsure here, on some row i have date with two digit days(like 10, 11, 12) also. The same case occurs with month where i can get single digit month (1, 2, 3 for jan, feb, mar) and also two digit hour ( like 21, 22, 23) . Not sure if this is the reason for the issue and how can i handle it. - vignesh

2 Answers

1
votes

I am not sure why it's not working i have tried below and it worked

import spark.implicits._

scala> val df=Seq("12/1/2010 8:26", "12/1/2010 8:29").toDF("t")
df: org.apache.spark.sql.DataFrame = [t: string]

scala> df.with
withColumn   withColumnRenamed   withWatermark

scala> df.withColumn
withColumn   withColumnRenamed

scala> df.withColumn("s",col("t").cast("timestamp")).show
+--------------+----+
|             t|   s|
+--------------+----+
|12/1/2010 8:26|null|
|12/1/2010 8:29|null|
+--------------+----+


scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> df.withColumn("s",to_timestamp(col("t"),"MM/dd/yyyy HH:mm")).show
+--------------+-------------------+
|             t|                  s|
+--------------+-------------------+
|12/1/2010 8:26|2010-12-01 08:26:00|
|12/1/2010 8:29|2010-12-01 08:29:00|
+--------------+-------------------+
0
votes

Maybe there is a problem with your file data. I tried the same with your own data and it works perfectly well, you can try with dataframe functions or sparkSQL.

your data file from: https://www.kaggle.com/carrie1/ecommerce-data/home#data.csv

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom
536366,22632,HAND WARMER RED POLKA DOT,6,12/1/2010 8:28,1.85,17850,United Kingdom
536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,12/1/2010 8:34,1.69,13047,United Kingdom
536367,22745,POPPY'S PLAYHOUSE BEDROOM ,6,12/1/2010 8:34,2.1,13047,United Kingdom
536367,22748,POPPY'S PLAYHOUSE KITCHEN,6,12/1/2010 8:34,2.1,13047,United Kingdom
536367,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,8,12/1/2010 8:34,3.75,13047,United Kingdom
536367,22310,IVORY KNITTED MUG COSY ,6,12/1/2010 8:34,1.65,13047,United Kingdom
536367,84969,BOX OF 6 ASSORTED COLOUR TEASPOONS,6,12/1/2010 8:34,4.25,13047,United Kingdom

code in IntelliJ

      val df = sqlContext
        .read
        .option("header", true)
        .option("inferSchema", true)
        .csv("/home/cloudera/files/tests/timestamp.csv")
        .cache()

      df.show(5, truncate = false)
      df.printSchema()

      import org.apache.spark.sql.functions._
      // You can try this with dataframe functions
      val retails = df
        .withColumn("InvoiceDateTS", to_timestamp(col("InvoiceDate"), "MM/dd/yyyy HH:mm"))

      retails.show(5, truncate = false)
      retails.printSchema()

      // or sparkSQL
      df.createOrReplaceTempView("df")
      val retailsSQL = sqlContext.sql(
        """
          |SELECT InvoiceNo,StockCode,InvoiceDate,customerID, TO_TIMESTAMP(InvoiceDate,"MM/dd/yyyy HH:mm") AS InvoiceDateTS
          |FROM df
          |""".stripMargin)

      retailsSQL.show(5,truncate = false)
      retailsSQL.printSchema()

output

+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                       |Quantity|InvoiceDate   |UnitPrice|CustomerID|Country       |
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER|6       |12/1/2010 8:26|2.55     |17850     |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN               |6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+
only showing top 2 rows

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+-------------------+
|InvoiceNo|StockCode|Description                       |Quantity|InvoiceDate   |UnitPrice|CustomerID|Country       |InvoiceDateTS      |
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+-------------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER|6       |12/1/2010 8:26|2.55     |17850     |United Kingdom|2010-12-01 08:26:00|
|536365   |71053    |WHITE METAL LANTERN               |6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|2010-12-01 08:26:00|
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+-------------------+
only showing top 2 rows

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- InvoiceDateTS: timestamp (nullable = true)

+---------+---------+--------------+----------+-------------------+
|InvoiceNo|StockCode|InvoiceDate   |customerID|InvoiceDateTS      |
+---------+---------+--------------+----------+-------------------+
|536365   |85123A   |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
|536365   |71053    |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
+---------+---------+--------------+----------+-------------------+
only showing top 2 rows

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- customerID: integer (nullable = true)
 |-- InvoiceDateTS: timestamp (nullable = true)