1
votes

Is it possible to create Dataset from Dataframe column in Spark 2.0? I have a following issue: I want to read some data from parquet data partitioned by date and after that transform one of columns to Dataset. Exmaple:

val frame = spark.read.parquet(path).select($"date", $"object".as[MyObjectType]).filter($"date" > "2016-10-01")

Now, I need to transform second column to Dataset[MyObjectType] and don't understand how i can do this. MyObjectType is scala product type

1

1 Answers

2
votes

You can do cast:

val frame = spark.read.parquet(path)
    .select($"date", $"object".cast(MyObjectTypeUDT))
    .filter($"date" > "2016-10-01")

In this case, MyObjectTypeUDT is one of SQL types, i.e. StringType or IntegerType or custom UserDefinedType.

Or, if you have some class which represents content in Dataset:

case clas DateWithObject (date : Timestamp, object: MyObject)

Then you can write:

val frame = spark.read.parquet(path)
    .select($"date", $"object")
    .as[DateWithObject] 
    .filter($"date" > "2016-10-01")

I think it's the simplest way to do it