3 votes

I have a csv of the form:

t,value
2012-01-12 12:30:00,4
2012-01-12 12:45:00,3
2012-01-12 12:00:00,12
2012-01-12 12:15:00,13
2012-01-12 13:00:00,7

I convert that into a DataFrame using spark-csv (so t is of String type and value is of Integer type). What is the appropriate Spark Scala way to sort the output by time?

I was thinking of converting t to a type that allows the DataFrame to be sorted, but I am not sure which timestamp type supports sorting by time.
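For context, here is a minimal sketch of how such a file might be loaded with spark-csv so that t comes in as String and value as Integer (Spark 1.x; the path data.csv is a placeholder):

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Explicit schema matching the question: t as String, value as Integer
val schema = StructType(Seq(
  StructField("t", StringType),
  StructField("value", IntegerType)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // the sample has a header row
  .schema(schema)
  .load("data.csv")           // placeholder path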


3 Answers

7 votes

Given the format, you can either cast to a timestamp:

import org.apache.spark.sql.types.TimestampType

df.select($"t".cast(TimestampType)) // or df.select($"t".cast("timestamp"))

to get a proper date-time, or use the unix_timestamp function (Spark 1.5+; in Spark < 1.5 you can use a Hive UDF of the same name):

import org.apache.spark.sql.functions.unix_timestamp

df.select(unix_timestamp($"t"))

to get a numerical representation (Unix timestamp in seconds).
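Either representation sorts chronologically. For instance, a minimal end-to-end sketch using the cast (assuming the df from the question):

// Cast the string column in place, then sort by the resulting timestamp
df.withColumn("t", $"t".cast("timestamp"))
  .orderBy($"t")
  .show()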

On a side note, there is no reason you couldn't orderBy($"t") directly: since the format yyyy-MM-dd HH:mm:ss is fixed-width and zero-padded, lexicographic order coincides with chronological order here.
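A one-line illustration, again on the df from the question:

df.orderBy($"t").show()   // sorts correctly even while t is still a String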

2 votes

In addition to @zero323's answer, if you are writing pure SQL you can use the CAST operator as follows:

df.registerTempTable("myTable")    
sqlContext.sql("SELECT CAST(t as timestamp) FROM myTable")
0 votes

If you cast using df.select, you get only the specified column. To change the type of a column while retaining the other columns, use df.withColumn and pass the original column name.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Replace the original String column "t" with its timestamp cast
val df1 = df.withColumn("t", col("t").cast(TimestampType))

df1.printSchema
root
 |-- t: timestamp (nullable = true)
 |-- value: integer (nullable = true)

Only the datatype of the column "t" is changed; the rest of the DataFrame is retained.
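With the type fixed in place, the sorting the question asks for is a one-liner (sketch):

df1.orderBy(col("t")).show()   // now sorts chronologically on the timestamp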