
Disclaimer: I'm a beginner when it comes to PySpark.

For each cell in a row, I'd like to apply the following function:

new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)

At the very end, I'd like the range of values to go from 0.0 to 1.0.

Here are the details of my dataframe:

  • Dimensions: (6.5M, 2905)
  • Dtypes: Double

Initial DF:

+-----+-------+-------+-------+
|   id|  col_1|  col_2|  col_n|
+-----+-------+-------+-------+
|    1|    7.5|    0.1|    2.0|
|    2|    0.3|    3.5|   10.5|
+-----+-------+-------+-------+
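
Here is a small reproducible version of the sample data, in case it helps (assuming an active SparkSession named spark):

# Minimal reproducible sample, assuming an active SparkSession named `spark`.
df = spark.createDataFrame(
    [(1, 7.5, 0.1, 2.0),
     (2, 0.3, 3.5, 10.5)],
    ['id', 'col_1', 'col_2', 'col_n'],
)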

Updated DF:

+-----+-------+-------+-------+
|   id|  col_1|  col_2|  col_n|
+-----+-------+-------+-------+
|    1|    1.0|  0.013|   0.26|
|    2|  0.028|   0.33|    1.0|
+-----+-------+-------+-------+

Any help would be appreciated.


1 Answer


You can compute the row-wise maximum from an array of the value columns, then loop over those columns and replace each one with its normalized value.

from pyspark.sql.functions import array, array_max, col

# All value columns (everything except `id`).
cols = df.columns[1:]

# Row-wise maximum across the value columns.
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols])))

# Divide each column by the row-wise maximum.
for c in cols:
    df2 = df2.withColumn(c, col(c) / col('max'))

df2.show()

+---+-------------------+--------------------+-------------------+----+
| id|              col_1|               col_2|              col_n| max|
+---+-------------------+--------------------+-------------------+----+
|  1|                1.0|0.013333333333333334|0.26666666666666666| 7.5|
|  2|0.02857142857142857|  0.3333333333333333|                1.0|10.5|
+---+-------------------+--------------------+-------------------+----+
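
As a side note, for a dataframe this wide (2905 columns) an equivalent one-pass sketch using greatest() and a single select may keep the query plan smaller than thousands of withColumn calls; the names below reuse the df and cols from the answer above:

from pyspark.sql.functions import greatest, col

# Row-wise maximum via greatest(), then build every normalized column
# in one select instead of looping with withColumn.
cols = df.columns[1:]
row_max = greatest(*[col(c) for c in cols])
df2 = df.select('id', *[(col(c) / row_max).alias(c) for c in cols])

This avoids keeping the helper 'max' column around, and produces the same normalized values as the loop version.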