pyspark dataframe modifying columns

Question

I have an input dataframe like the one below, where the number of input columns is dynamic, i.e. there can be n of them (input1, input2, and so on).

+----+----+-------+------+------+
|dim1|dim2|  byvar|input1|input2|
+----+----+-------+------+------+
| 101| 102|MTD0001|     1|    10|
| 101| 102|MTD0002|     2|    12|
| 101| 102|MTD0003|     3|    13|
+----+----+-------+------+------+

I want to reshape the columns as shown below. How can this be done?

+----+----+-------+----------+------+
|dim1|dim2|  byvar|TRAMS_NAME|values|
+----+----+-------+----------+------+
| 101| 102|MTD0001|    input1|     1|
| 101| 102|MTD0001|    input2|    10|
| 101| 102|MTD0002|    input1|     2|
| 101| 102|MTD0002|    input2|    12|
| 101| 102|MTD0003|    input1|     3|
| 101| 102|MTD0003|    input2|    13|
+----+----+-------+----------+------+

I have used the Spark create_map function, but that is a hard-coded way of doing it. Is there another way to achieve the same result?

Why are the first 2 rows of TRAMS_NAME input1 and input2 and the rest value1 and value2? Shouldn't they all be value1 or value2? — murtihash
Yes, that was my mistake. I have updated the question, please check. @Mohammad Murtaza Hashmi — Rocky1989

lalfab · Accepted Answer · 2020-04-11T18:56:25

Here is another solution to your problem, using the stack() function. It may be a little simpler, with the limitation that you must write out the column names explicitly.

Hope this helps!

# set your dataframe
df = spark.createDataFrame(
    [(101, 102, 'MTD0001', 1, 10),
     (101, 102, 'MTD0002', 2, 12),
     (101, 102, 'MTD0003', 3, 13)],
    ['dim1', 'dim2', 'byvar', 'v1', 'v2']
)

df.show()
+----+----+-------+---+---+
|dim1|dim2|  byvar| v1| v2|
+----+----+-------+---+---+
| 101| 102|MTD0001|  1| 10|
| 101| 102|MTD0002|  2| 12|
| 101| 102|MTD0003|  3| 13|
+----+----+-------+---+---+

result = df.selectExpr('dim1', 
                       'dim2', 
                       'byvar', 
                       "stack(2, 'v1', v1, 'v2', v2) as (names, values)")
result.show()
+----+----+-------+-----+------+
|dim1|dim2|  byvar|names|values|
+----+----+-------+-----+------+
| 101| 102|MTD0001|   v1|     1|
| 101| 102|MTD0001|   v2|    10|
| 101| 102|MTD0002|   v1|     2|
| 101| 102|MTD0002|   v2|    12|
| 101| 102|MTD0003|   v1|     3|
| 101| 102|MTD0003|   v2|    13|
+----+----+-------+-----+------+
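To make the behaviour of stack(2, 'v1', v1, 'v2', v2) easier to picture, here is a small plain-Python model of what it does to a single row. This is only an illustration, not how Spark implements it; the helper name stack_rows is made up for this sketch.

```python
def stack_rows(row, id_cols, value_cols):
    """Model of stack(n, 'c1', c1, ...): emit one output row per stacked column.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    ids = tuple(row[c] for c in id_cols)
    # Each stacked column contributes a (name, value) pair appended to the ids.
    return [ids + (c, row[c]) for c in value_cols]


row = {'dim1': 101, 'dim2': 102, 'byvar': 'MTD0001', 'v1': 1, 'v2': 10}
for out in stack_rows(row, ['dim1', 'dim2', 'byvar'], ['v1', 'v2']):
    print(out)
# (101, 102, 'MTD0001', 'v1', 1)
# (101, 102, 'MTD0001', 'v2', 10)
```

Each input row becomes as many output rows as there are stacked columns, which is why the 3-row dataframe above turns into 6 rows.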

To set the columns to stack dynamically, we only need to list the columns that stay unaltered (dim1, dim2 and byvar in your example) and build the stack expression with a loop over the remaining columns.

# columns that are kept as they are
unaltered_cols = ['dim1', 'dim2', 'byvar']
# every other column gets stacked
change_cols = [n for n in df.schema.names if n not in unaltered_cols]
# build the "'name',name" pairs that stack() expects
cols_exp = ",".join(["'" + n + "'," + n for n in change_cols])
# create the stack expression
stack_exp = "stack(" + str(len(change_cols)) + ',' + cols_exp + ") as (names, values)"
# print the final expression
print(stack_exp)
# --> stack(2,'v1',v1,'v2',v2) as (names, values)

# apply transformation
result = df.selectExpr('dim1', 
                       'dim2', 
                       'byvar', 
                       stack_exp)
result.show()
+----+----+-------+-----+------+
|dim1|dim2|  byvar|names|values|
+----+----+-------+-----+------+
| 101| 102|MTD0001|   v1|     1|
| 101| 102|MTD0001|   v2|    10|
| 101| 102|MTD0002|   v1|     2|
| 101| 102|MTD0002|   v2|    12|
| 101| 102|MTD0003|   v1|     3|
| 101| 102|MTD0003|   v2|    13|
+----+----+-------+-----+------+

If we run the same code on a different dataframe, here one with an extra column v3, we still get the desired result, as long as stack_exp is rebuilt first.

df = spark.createDataFrame(
    [(101, 102, 'MTD0001', 1, 10, 4),
     (101, 102, 'MTD0002', 2, 12, 5),
     (101, 102, 'MTD0003', 3, 13, 5)],
    ['dim1', 'dim2', 'byvar', 'v1', 'v2', 'v3']
)
# Re-run the code above that builds stack_exp before applying it!
result = df.selectExpr('dim1', 
                       'dim2', 
                       'byvar', 
                       stack_exp)
result.show()
+----+----+-------+-----+------+
|dim1|dim2|  byvar|names|values|
+----+----+-------+-----+------+
| 101| 102|MTD0001|   v1|     1|
| 101| 102|MTD0001|   v2|    10|
| 101| 102|MTD0001|   v3|     4|
| 101| 102|MTD0002|   v1|     2|
| 101| 102|MTD0002|   v2|    12|
| 101| 102|MTD0002|   v3|     5|
| 101| 102|MTD0003|   v1|     3|
| 101| 102|MTD0003|   v2|    13|
| 101| 102|MTD0003|   v3|     5|
+----+----+-------+-----+------+
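The string building above can be wrapped in a small function so it does not have to be re-run by hand for each dataframe. This is a minimal sketch: build_stack_expr and its parameters are names invented here, and it assumes column names contain no single quotes or backticks. The stacked columns are backtick-quoted so that names with spaces or dots should still parse. As far as I know, stack() also expects the stacked columns to have compatible types, so mixed-type columns may need a cast first.

```python
def build_stack_expr(all_cols, id_cols, name_col="names", value_col="values"):
    """Build a Spark SQL stack() expression for every column not in id_cols.

    Hypothetical helper; assumes no quotes or backticks inside column names.
    """
    value_cols = [c for c in all_cols if c not in id_cols]
    if not value_cols:
        raise ValueError("no columns left to stack")
    # One "'name', `name`" pair per stacked column: a label and a column reference.
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({name_col}, {value_col})"


id_cols = ['dim1', 'dim2', 'byvar']
expr = build_stack_expr(id_cols + ['input1', 'input2'], id_cols,
                        name_col='TRAMS_NAME')
print(expr)
# stack(2, 'input1', `input1`, 'input2', `input2`) as (TRAMS_NAME, values)

# With a live SparkSession and dataframe df, it would be applied like this:
# result = df.selectExpr(*id_cols, build_stack_expr(df.columns, id_cols))
```

Passing name_col='TRAMS_NAME' gives the column name asked for in the question instead of names.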
