
A column "colA" in a dataframe contains integer values:

+-----+
| colA|
+-----+
|    1|
|    2|
|    1|
|    3|
+-----+

These integer values can be mapped to strings through a Redis dictionary:

+----+------+
| key| value|
+----+------+
|   1|     a|
|   2|     b|
|   3|     c|
+----+------+

I need to create a new column "colB" that contains the mapping of "colA" to the string values, like so:

+-----+-----+
| colA| colB|
+-----+-----+
|    1|    a|
|    2|    b|
|    1|    a|
|    3|    c|
+-----+-----+

My goal is to make batch requests to Redis, in order to avoid the latency of a single Redis request per row.

With the Spark Core API (i.e. RDDs) I could do this by using the mapPartitions function (see the sketch after the notes below). Is there any way of achieving the same with the Spark SQL API?

Note that I want to avoid the overhead of:

  • converting the dataframe to an RDD and back;
  • the encoder that is required when calling mapPartitions directly on the dataframe.
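For reference, here is roughly what the RDD-based version looks like. This is a minimal sketch, assuming the Jedis client, a Redis instance on localhost and the usual spark session; note the round trip through RDD[Row] and the hand-built schema at the end, which is exactly the overhead listed above:

import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField}
import redis.clients.jedis.Jedis

// One batched MGET per partition instead of one Redis round trip per row.
val mappedRdd = df.rdd.mapPartitions { rows =>
  val buffered = rows.toSeq
  if (buffered.isEmpty) Iterator.empty
  else {
    val jedis = new Jedis("localhost", 6379)   // assumed Redis endpoint
    val keys = buffered.map(_.getInt(0).toString)
    val values = jedis.mget(keys: _*).asScala  // values come back in key order
    jedis.close()
    buffered.iterator.zip(values.iterator).map {
      case (row, value) => Row.fromSeq(row.toSeq :+ value)
    }
  }
}

// ...and back to a dataframe, with a hand-built schema.
val withColB = spark.createDataFrame(
  mappedRdd,
  df.schema.add(StructField("colB", StringType))
)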
Comment: there is a mapPartitions function on dataframes as well. - Ramesh Maharjan

1 Answer


Note that I want to avoid the overhead of:

  • ...
  • the encoder that is required when calling mapPartitions directly on the dataframe.

This actually makes it impossible. Any operation that doesn't use the SQL DSL requires decoding each value to an external type and encoding it back to the internal type. With primitive values this is a low-cost operation, particularly if a binary encoder is used, but it still requires an Encoder.
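To illustrate, here is a minimal sketch (assuming the Jedis client, a Redis instance on localhost, the usual spark session and the df from the question): mapPartitions is called directly on the Dataset, so the tuple Encoder is unavoidable, but Redis is still hit only once per partition through a batched MGET:

import scala.collection.JavaConverters._
import org.apache.spark.sql.DataFrame
import redis.clients.jedis.Jedis

import spark.implicits._  // supplies the Encoders for Int and (Int, String)

val result: DataFrame = df.as[Int].mapPartitions { ints =>
  val buffered = ints.toSeq
  if (buffered.isEmpty) Iterator.empty
  else {
    val jedis = new Jedis("localhost", 6379)  // assumed Redis endpoint
    // One MGET per partition; keys missing from Redis come back as null.
    val values = jedis.mget(buffered.map(_.toString): _*).asScala
    jedis.close()
    buffered.iterator.zip(values.iterator)
  }
}.toDF("colA", "colB")

The cost you pay here is only the decode to Int and the encode of the (Int, String) tuple, which is cheap for primitives, but it cannot be eliminated.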