
I have the following scenarios:

case class attribute(key:String,value:String)
case class entity(id:String,attr:List[attribute])


// assumes a SparkSession named spark is in scope: import spark.implicits._
val entities = List(
  entity("1", List(attribute("name", "sasha"), attribute("home", "del"))),
  entity("2", List(attribute("home", "hyd"))))

val df = entities.toDF()

// df.show
+---+--------------------+
| id|                attr|
+---+--------------------+
|  1|[[name,sasha], [d...|
|  2|        [[home,hyd]]|
+---+--------------------+

// df.printSchema
root
 |-- id: string (nullable = true)
 |-- attr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

What I want to produce is:

+---+--------------------+-------+
| id|  name              |  home |
+---+--------------------+-------+
|  1| sasha              |del    |
|  2| null               |hyd    |
+---+--------------------+-------+

How do I go about this? I looked at quite a few similar questions on Stack Overflow but couldn't find anything useful.

My main motive is to do a groupBy on different attributes, so I want to bring the data into the above-mentioned format.

I looked into the explode functionality. It breaks a list down into separate rows, which I don't want. I want to create more columns from the array of attributes.
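To make the target shape concrete, here is a plain-Scala sketch (no Spark) of the same reshaping, assuming the case classes from the question: collect the distinct keys, then look each key up per entity, with null for missing ones.

```scala
case class Attribute(key: String, value: String)
case class Entity(id: String, attr: List[Attribute])

val entities = List(
  Entity("1", List(Attribute("name", "sasha"), Attribute("home", "del"))),
  Entity("2", List(Attribute("home", "hyd"))))

// every distinct attribute key, in first-seen order, becomes a column
val keys = entities.flatMap(_.attr.map(_.key)).distinct

// one row per entity: id followed by the value for each key (null when absent)
val rows = entities.map { e =>
  val m = e.attr.map(a => a.key -> a.value).toMap
  e.id +: keys.map(k => m.get(k).orNull)
}
// rows == List(List("1", "sasha", "del"), List("2", null, "hyd"))
```

This is exactly the shape the desired table has; the Spark question is how to express the same lookup over DataFrame columns.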

Similar things I found:

Spark - convert Map to a single-row DataFrame

Split 1 column into 3 columns in spark scala

Spark dataframe - Split struct column into 2 columns


1 Answer


This can easily be reduced to PySpark converting a column of type 'map' to multiple columns in a dataframe or How to get keys and values from MapType column in SparkSQL DataFrame. First, convert attr to map<string, string>:

// assumes a SparkSession named spark is in scope, for $"..." and .as[String]
import spark.implicits._
import org.apache.spark.sql.functions.{explode, first, map_from_entries, map_keys}

val dfMap = df.withColumn("attr", map_from_entries($"attr"))

then it's just a matter of finding the unique keys:

val keys = dfMap.select(explode(map_keys($"attr"))).as[String].distinct.collect

then selecting them from the map:

val result = dfMap.select($"id" +: keys.map(key => $"attr"(key) as key): _*)
result.show
+---+-----+----+
| id| name|home|
+---+-----+----+
|  1|sasha| del|
|  2| null| hyd|
+---+-----+----+
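Since the stated motive was to groupBy on attributes, the flattened result can now be grouped directly, e.g. result.groupBy($"home").count. A plain-Scala sketch of the idea, using hypothetical tuples mirroring the (id, name, home) rows above:

```scala
// once attributes are columns, grouping is an ordinary key lookup
val flat = List(("1", "sasha", "del"), ("2", null, "hyd"))
val countsByHome = flat.groupBy(_._3).map { case (home, rs) => home -> rs.size }
// countsByHome == Map("del" -> 1, "hyd" -> 1)
```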

A less efficient but more concise variant is to explode and pivot:

// exploding a map yields one (key, value) row per entry
val result = df
  .select($"id", explode(map_from_entries($"attr")))
  .groupBy($"id")
  .pivot($"key")
  .agg(first($"value"))

result.show
+---+----+-----+
| id|home| name|
+---+----+-----+
|  1| del|sasha|
|  2| hyd| null|
+---+----+-----+

In practice, though, I'd advise against it.