2
votes

Background is a simply pyspark programme that I developed on 1.6 using databricks csv read/writer, and all was happy. My dataframe had a timestamp column, which was written out in a standard YYYY-MM-DD HH24:MI:SS format.

foo,bar,2016-10-14 14:30:31.985 

Now I’m running it on EMR with Spark 2, and the timestamp column is being written as an epoch in microseconds. This causes a problem because the target (Redshift) can’t natively handle this (only seconds or milliseconds).

foo,bar,1476455559456000

Looking at the docs, it seems I should be able to specify the format used with timestampFormat, but I just get an error :

TypeError: csv() got an unexpected keyword argument 'timestampFormat'

Am I calling this wrong, or does the option not exist? Any other way to cleanly get my timestamp data out in a format that's not microseconds (milli would be fine, or any other standard time format really)


Simple code to reproduce:

df = sqlContext.createDataFrame([('foo','bar')]).withColumn('foo',pyspark.sql.functions.current_timestamp())
df.printSchema()
df.show()

# Use the new Spark 2 native method
df.write.csv(path='/tmp/foo',mode='overwrite')

# Use the databricks CSV method, pre Spark 2
df.write.save(path='/tmp/foo2',format='com.databricks.spark.csv',mode='overwrite')
1

1 Answers

0
votes

Turns out the docs I was look at were for 2.0.1, whereas I was running on 2.0.0 -- and timestampFormat is new in 2.0.1.