3 votes

I have a csv of the form:

t,value
2012-01-12 12:30:00,4
2012-01-12 12:45:00,3
2012-01-12 12:00:00,12
2012-01-12 12:15:00,13
2012-01-12 13:00:00,7

I convert that into a DataFrame using spark-csv (so t is of String type and value is of Integer type). What is the appropriate Spark Scala way to sort the output by time?

I was thinking of converting t to a type that allows the DataFrame to be sorted, but I am not sure which timestamp type supports sorting by time.
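For context, here is a minimal sketch of how such a file might be loaded with spark-csv so that t comes in as String and value as Integer (Spark 1.x; the path data.csv is a placeholder):

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Explicit schema matching the question: t as String, value as Integer
val schema = StructType(Seq(
  StructField("t", StringType),
  StructField("value", IntegerType)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // the sample has a header row
  .schema(schema)
  .load("data.csv")           // placeholder path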


3 Answers

7 votes

Given the format, you can either cast to a timestamp:

import org.apache.spark.sql.types.TimestampType

df.select($"t".cast(TimestampType)) // or df.select($"t".cast("timestamp"))

to get a proper date-time, or use the unix_timestamp function (Spark 1.5+; in Spark < 1.5 you can use a Hive UDF of the same name):

import org.apache.spark.sql.functions.unix_timestamp

df.select(unix_timestamp($"t"))

to get a numerical representation (Unix timestamp in seconds).
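Either representation sorts chronologically. For instance, a minimal end-to-end sketch using the cast (assuming the df from the question):

// Cast the string column in place, then sort by the resulting timestamp
df.withColumn("t", $"t".cast("timestamp"))
  .orderBy($"t")
  .show()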

On a side note, there is no reason you couldn't orderBy($"t") directly: since the format yyyy-MM-dd HH:mm:ss is fixed-width and zero-padded, lexicographic order coincides with chronological order here.
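A one-line illustration, again on the df from the question:

df.orderBy($"t").show()   // sorts correctly even while t is still a String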

2 votes

In addition to @zero323's answer, if you are writing pure SQL you can use the CAST operator as follows:

df.registerTempTable("myTable")    
sqlContext.sql("SELECT CAST(t as timestamp) FROM myTable")
0 votes

If you cast using df.select, you get only the specified column. To change the type of a column while retaining the other columns, use df.withColumn and pass the original column name.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Replace the original String column "t" with its timestamp cast
val df1 = df.withColumn("t", col("t").cast(TimestampType))

df1.printSchema
root
 |-- t: timestamp (nullable = true)
 |-- value: integer (nullable = true)

Only the datatype of the column "t" is changed; the rest of the DataFrame is retained.
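With the type fixed in place, the sorting the question asks for is a one-liner (sketch):

df1.orderBy(col("t")).show()   // now sorts chronologically on the timestamp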