
I'm interested in taking a column in my dataframe called mapColumn

+-------------------+
|     mapColumn     |
+-------------------+
| Map(KEY -> VALUE) |
+-------------------+

and create a stringColumn that's just the key and value of the Map column where the value is "KEY,VALUE":

+-------------------+
|   stringColumn    |
+-------------------+
| KEY,VALUE         |
+-------------------+

I have tried creating a UDF and passing the column to it as follows:

var getStringColumn = udf(mapToString _)

df.withColumn("stringColumn",
              when(col("mapColumn").isNotNull,
                   getStringColumn(col("mapColumn")))
                .otherwise(lit(null: String)))

def mapToString(row: Row): String = {
    if (null == row || row.isNullAt(FirstItemIndex)) {
        return null
    }
    row.getValuesMap[Any](row.schema.fieldNames).mkString(",")
}

I keep getting the following error:

Failed to execute user defined function($anonfun$1: (map) => string) Cause: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to org.apache.spark.sql.Row

1 Answer


There is no need for a UDF here. One approach is to explode the Map column into flattened key and value columns, then concatenate each key-value pair into a String:

val df = Seq(
  (10, Map((1, "a"), (2, "b"))),
  (20, Map((3, "c")))
).toDF("id", "map")

df.
  select($"id", explode($"map")).
  withColumn("kv_string", concat($"key".cast("string"), lit(","), $"value")).
  show
// +---+---+-----+---------+
// | id|key|value|kv_string|
// +---+---+-----+---------+
// | 10|  1|    a|      1,a|
// | 10|  2|    b|      2,b|
// | 20|  3|    c|      3,c|
// +---+---+-----+---------+
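
Note that explode produces one row per map entry, so this changes the row count when a map has multiple entries. If you want to keep exactly one string per row, the UDF approach from the question also works once its root cause is fixed: Spark passes a MapType column to a Scala UDF as a scala.collection.immutable.Map, not as a Row, which is exactly the ClassCastException you saw. A minimal sketch of the corrected function, with the Spark wiring (using the question's column names) shown in comments:

```scala
// A MapType column reaches a Scala UDF as Map[K, V], not as a Row,
// so the parameter type must be a Map for the cast to succeed.
def mapToString(m: Map[String, String]): String =
  if (m == null) null
  else m.map { case (k, v) => s"$k,$v" }.mkString(",")

// Spark wiring, assuming the df and mapColumn from the question:
// val getStringColumn = udf(mapToString _)
// df.withColumn("stringColumn",
//   when($"mapColumn".isNotNull, getStringColumn($"mapColumn"))
//     .otherwise(lit(null: String)))
```

With this signature the null check and the when/otherwise guard are belt-and-braces; either alone is enough to keep null maps from throwing.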