I am trying to bin a particular column of a DataFrame based on the ranges given in a dictionary.
Below is the DataFrame I used:
df
SNo,Name,CScore
1,x,700
2,y,850
3,z,560
4,a,578
5,b,456
6,c,678
I have created the function below; it works fine if I use it separately.
    def binning(column, dict):
        finalColumn = []
        lent = len(column)
        for i in range(lent):
            for j in range(len(list(dict))):
                if int(column[i]) in range(list(dict)[j][0], list(dict)[j][1]):
                    finalColumn.append(dict[list(dict)[j]])
        return finalColumn
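As a sanity check, the logic above can be exercised on plain Python lists outside Spark (a lightly renamed copy is used here, since `dict` shadows the built-in; note that `range(low, high)` excludes `high`, so a score of exactly 900 would fall in no bin):

```python
def binning(column, bins):
    # bins maps (low, high) tuples to labels; high is exclusive via range()
    finalColumn = []
    for value in column:
        for (low, high), label in bins.items():
            if int(value) in range(low, high):
                finalColumn.append(label)
    return finalColumn

labels = binning([700, 850, 200], {(1, 400): 'Low', (401, 900): 'High'})
# labels == ['High', 'High', 'Low']
```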
I then used the function in the statement below.
newDf = df.withColumn("binnedColumn",binning(df.select("CScore").rdd.flatMap(lambda x: x).collect(),{(1,400):'Low',(401,900):'High'}))
I am getting the below error:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\spark_2.4\python\pyspark\sql\dataframe.py", line 1988, in withColumn
        assert isinstance(col, Column), "col should be Column"
    AssertionError: col should be Column
Any help to solve this issue would be greatly appreciated. Thanks.
Turn `binning` into a user defined function (`udf`). Also don't name your variables `dict`. Also, calling `collect()` inside `withColumn` is going to give you terrible performance. You can probably achieve the same result using `when()` and `between()`. – pault