0
votes

I have a Spark DataFrame where I am trying to create a new column based on the existing columns, but the tough part for me is that I have to compute the value of the column row-wise. For example:

col1 | col2 | col3
-----|------|-----
 1   |  2   |  3
 4   |  5   |  0
 3   |  1   |  1

So, I want a new column that holds the name of the column containing the row-wise maximum, i.e. max(col1, col2, col3) per row. Desired output:

col1 | col2 | col3 | col4
-----|------|------|-------
 1   |  2   |  3   | 'col3'
 4   |  5   |  0   | 'col2'
 3   |  1   |  1   | 'col1'

Is there any way to do this in PySpark?

Comments:

What would your desired output look like for the data you have posted? – Katya Handler

I have updated the question to reflect the desired output. – Hemant

What if there is a tie? What if two numbers are the same? – Katya Handler

The original problem statement takes care of this condition. No two columns per row have the same data. – Hemant

1 Answer

0
votes

This is not an ideal answer, because it forces you to go back to RDDs. If I find a better one that allows you to stay in the DataFrame universe, I will update my answer. But this should work, for now.

a = sc.parallelize([[1, 2, 3], [4, 5, 0], [3, 1, 1]])
headers = ["col1", "col2", "col3"]

# headers[x.index(max(x))] looks up the name of the column holding the row max
b = a.map(lambda x: (x[0], x[1], x[2], headers[x.index(max(x))]))

# Note: list.append returns None, so build the new header list with + instead
b.toDF(headers + ["max_col"]).show()

This basically allows you to use Python's max operation by iterating through your RDD row by row. It then finds the right column name by indexing into the list of headers.
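The index trick in the lambda can be checked in plain Python before going through Spark; the rows below are the sample data from the question:

```python
# headers[x.index(max(x))] returns the name of the column holding the row max.
headers = ["col1", "col2", "col3"]
rows = [[1, 2, 3], [4, 5, 0], [3, 1, 1]]

max_cols = [headers[x.index(max(x))] for x in rows]
print(max_cols)  # ['col3', 'col2', 'col1']
```

Note that `list.index` returns the first matching position, so on a tie the leftmost column would win; the question rules ties out.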

Again, I'm not sure this is the best way and I hope to find a better one.