rdd = sc.parallelize( [(['a','b','c'], 'c'), \
(['h','j','s'], 'j'), \
(['w','x','a'], 'a'), \
(['o','b','e'], 'c')] )
df = spark.createDataFrame(rdd, ['seq','target'])
+---------+------+
| seq|target|
+---------+------+
|[a, b, c]| c|
|[h, j, s]| j|
|[w, x, a]| a|
|[o, b, e]| c|
+---------+------+
I want to write an UDF to remove the target from seq.
+---------+------+---------+
| seq|target| filtered|
+---------+------+---------+
|[a, b, c]| c| [a, b]|
|[h, j, s]| j| [h, s]|
|[w, x, a]| a| [w, x]|
|[o, b, e]| c|[o, b, e]|
+---------+------+---------+
Note that this is just a showcase. The practical case is more complex. I want to get the formal way to process one column, such as seq
by using another column, such as target
, as a parameter.
Any generalized solution?