Background: this is a simple PySpark program that I developed on Spark 1.6 using the Databricks CSV reader/writer, and all was happy. My dataframe had a timestamp column, which was written out in a standard YYYY-MM-DD HH24:MI:SS format.
foo,bar,2016-10-14 14:30:31.985
Now I'm running it on EMR with Spark 2, and the timestamp column is being written as an epoch in microseconds. This is a problem because the target (Redshift) can't natively handle microsecond epochs (only seconds or milliseconds).
foo,bar,1476455559456000
Looking at the docs, it seems I should be able to specify the format used with timestampFormat, but I just get an error:
TypeError: csv() got an unexpected keyword argument 'timestampFormat'
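For reference, the write call that triggers the error looks roughly like this (the output path and format string here are just placeholders):
# Attempted call: pass timestampFormat straight to the csv() writer
df.write.csv(path='/tmp/foo', mode='overwrite', timestampFormat='yyyy-MM-dd HH:mm:ss.SSS')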
Am I calling this wrong, or does the option not exist? Is there any other way to cleanly get my timestamp data out in a format that isn't microseconds (milliseconds would be fine, or really any other standard time format)?
Simple code to reproduce:
import pyspark.sql.functions

df = sqlContext.createDataFrame([('foo', 'bar')]).withColumn('foo', pyspark.sql.functions.current_timestamp())
df.printSchema()
df.show()
# Use the new Spark 2 native method
df.write.csv(path='/tmp/foo', mode='overwrite')
# Use the databricks CSV method, pre Spark 2
df.write.save(path='/tmp/foo2', format='com.databricks.spark.csv', mode='overwrite')
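One workaround I'm considering (not sure if it's the cleanest approach) is to render the timestamp as a string before writing, using date_format; the column name, output path, and format string below are just examples:
import pyspark.sql.functions as F

# Possible workaround sketch: format the timestamp column as a string so the
# CSV writer no longer serialises it as a microsecond epoch.
# 'yyyy-MM-dd HH:mm:ss.SSS' keeps millisecond precision, which Redshift accepts.
df_out = df.withColumn('foo', F.date_format('foo', 'yyyy-MM-dd HH:mm:ss.SSS'))
df_out.write.csv(path='/tmp/foo3', mode='overwrite')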