A column "colA" in a dataframe contains integer values:
+-----+
| colA|
+-----+
| 1|
| 2|
| 1|
| 3|
+-----+
These integer values can be mapped to strings through a Redis dictionary:
+----+------+
| key| value|
+----+------+
| 1| a|
| 2| b|
| 3| c|
+----+------+
I need to create a new column "colB" containing the mapping of "colA" to string values, as follows:
+-----+-----+
| colA| colB|
+-----+-----+
| 1| a|
| 2| b|
| 1| a|
| 3| c|
+-----+-----+
My goal is to batch the requests to Redis, so as to avoid the latency of one Redis round trip per row.
With the Spark Core API (i.e. RDDs) I could do this using the mapPartitions function. Is there any way to achieve the same with the Spark SQL API?
Note that I want to avoid the overhead of:
- converting the dataframe to an RDD and back, and
- the encoder involved in calling mapPartitions directly on the dataframe.
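For reference, the RDD-based approach I would like to avoid looks roughly like this. It is only a sketch: it assumes a Jedis client, a Redis server on localhost:6379, and that the keys are stored as plain strings.

```scala
import org.apache.spark.sql.SparkSession
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 1, 3).toDF("colA")

// One Redis connection and one batched MGET per partition,
// instead of one request per row.
val mappedRdd = df.rdd.mapPartitions { rows =>
  val jedis = new Jedis("localhost", 6379) // assumed Redis endpoint
  val buffered = rows.toArray
  val keys = buffered.map(_.getInt(0).toString)
  // Single round trip for the whole partition.
  val values = jedis.mget(keys: _*).asScala
  jedis.close()
  buffered.iterator.zip(values.iterator).map {
    case (row, value) => (row.getInt(0), value)
  }
}

// Converting back to a dataframe -- the step I want to avoid.
val result = mappedRdd.toDF("colA", "colB")
```

This works, but it pays exactly the dataframe-to-RDD-and-back cost described above, which is why I am looking for an equivalent within the Spark SQL API itself.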