
I have a Spark DataFrame with a column 'X'. The column contains elements of the form:

u'[23,4,77,890,455,................]'

How can I convert this unicode string to a list? That is, my output should be

[23,4,77,890,455...................]

I have to apply this to each element in the 'X' column.

I have tried df.withColumn("X_new", ast.literal_eval(x)) and got the error

"Malformed String"

I also tried

df.withColumn("X_new", json.loads(x)) and got the error "Expected String or Buffer"

and

df.withColumn("X_new", json.dumps(x)) which says "JSON not serialisable".

and also

df_2 = df.rdd.map(lambda x: x.encode('utf-8')) which says the rdd has no attribute encode.

I don't want to use collect() and toPandas() because they are memory-consuming. (But if that's the only way, please do tell.) I am using PySpark.

Update: cph_sto gave an answer using a UDF. Though it works well, I find it slow. Can somebody suggest another method?

Is it 890,455 or 890.455? - cph_sto
How do we know whether 455 is part of a decimal or just another number? If comma is your delimiter, then Python, or for that matter any language, has no way of knowing whether the next number should be interpreted as a decimal or a separate number. You must specify some condition to differentiate a decimal comma (European format) from a delimiter comma. - cph_sto

3 Answers

2
votes
import ast
from pyspark.sql.functions import udf, col
values = [(u'[23,4,77,890.455]',10),(u'[11,2,50,1.11]',20),(u'[10.05,1,22.04]',30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
|             list|  A|
+-----------------+---+
|[23,4,77,890.455]| 10|
|   [11,2,50,1.11]| 20|
|  [10.05,1,22.04]| 30|
+-----------------+---+    

# Creating a UDF to convert the string representation to a proper list.
# Note: without an explicit returnType, the UDF's result column is a string.
string_list_to_list = udf(lambda row: ast.literal_eval(row))
df = df.withColumn('list',string_list_to_list(col('list')))
df.show()
+--------------------+---+
|                list|  A|
+--------------------+---+
|[23, 4, 77, 890.455]| 10|
|   [11, 2, 50, 1.11]| 20|
|   [10.05, 1, 22.04]| 30|
+--------------------+---+

Extension of the Q, as asked by OP -

# Creating a UDF to find length of resulting list.
length_list = udf(lambda row: len(row))
df = df.withColumn('length_list',length_list(col('list')))
df.show()
+--------------------+---+-----------+
|                list|  A|length_list|
+--------------------+---+-----------+
|[23, 4, 77, 890.455]| 10|          4|
|   [11, 2, 50, 1.11]| 20|          4|
|   [10.05, 1, 22.04]| 30|          3|
+--------------------+---+-----------+
1
votes

Since it's a string, you can remove the first and last characters, turning '[23,4,77,890,455]' into '23,4,77,890,455', and then apply the split() function with , as the delimiter to generate an array.

-2
votes

Please use the code below to strip non-ASCII characters:

df.rdd.map(lambda row: row['X'].encode("ascii", "ignore"))

Note that each element of df.rdd is a Row, so you must select the string field before calling encode; this also only re-encodes the string, it does not parse it into a list.