7
votes

I use PySpark.

Spark ML's Random Forest output DataFrame has a column "probability" which is a vector with two values. I just want to add two columns to the output DataFrame, "prob1" and "prob2", which correspond to the first and second values in the vector.

I've tried the following:

output2 = output.withColumn('prob1', output.map(lambda r: r['probability'][0]))

but I get the error that 'col should be Column'.

Any suggestions on how to transform a column of vectors into columns of its values?

4 Answers

3
votes

I figured out the problem with the suggestion above. In PySpark, "dense vectors are simply represented as NumPy array objects", so the issue is a mismatch between Python and NumPy types: you need to call .item() to cast a numpy.float64 to a Python float.

The following code works:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())

output2 = randomforestoutput.select(split1_udf('probability').alias('c1'),
                                    split2_udf('probability').alias('c2'))

Or to append these columns to the original dataframe:

randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
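The type mismatch that `.item()` works around can be reproduced with plain NumPy, outside of Spark (a minimal sketch; the array values are made up):

```python
import numpy as np

# Indexing a NumPy array yields a NumPy scalar, not a native Python float.
prob = np.array([0.3, 0.7])
print(type(prob[0]))         # <class 'numpy.float64'>

# .item() converts it to a plain Python float, which FloatType accepts.
print(type(prob[0].item()))  # <class 'float'>
```
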

3
votes

Got the same problem; below is the code adjusted for the situation where you have an n-length vector.

# i=i binds the current index at definition time; otherwise every
# lambda would capture the same variable and use its final value.
splits = [udf(lambda value, i=i: value[i].item(), FloatType()) for i in range(n)]
out = tstDF.select(*[s('features').alias('Column' + str(i)) for i, s in enumerate(splits)])
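One thing to watch when building a list of lambdas in a comprehension: Python closures bind the loop variable late, so each lambda should capture the index through a default argument (`i=i`), otherwise all of them end up reading the final value of `i`. A plain-Python sketch of the pitfall:

```python
# Late binding: all three lambdas share the same 'i', which ends up as 2.
late = [lambda seq: seq[i] for i in range(3)]
print([f([10, 20, 30]) for f in late])   # [30, 30, 30]

# Capturing 'i' as a default argument freezes its value per lambda.
bound = [lambda seq, i=i: seq[i] for i in range(3)]
print([f([10, 20, 30]) for f in bound])  # [10, 20, 30]
```
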

2
votes

You may want to use one UDF to extract the first value and another to extract the second. You can then apply the UDFs in a select call on the random forest output DataFrame. Example:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0], FloatType())
split2_udf = udf(lambda value: value[1], FloatType())
output2 = randomForrestOutput.select(split1_udf(col("probability")).alias("c1"),
                                     split2_udf(col("probability")).alias("c2"))

This should give you a DataFrame output2 with columns c1 and c2 corresponding to the first and second values in the probability vector.

0
votes

I tried @Rookie Boy's loop, but the splits UDF loop didn't work for me, so I modified it a bit.

out = df
for i in range(n):
    # i=i binds the current index so each UDF extracts its own element
    splits_i = udf(lambda x, i=i: x[i].item(), FloatType())
    out = out.withColumn('col_{}'.format(i), splits_i('probability'))
out.select(*['col_{}'.format(i) for i in range(n)]).show()