I have a Spark dataframe with a column 'X'. The column contains elements of the form:

    u'[23,4,77,890,455,................]'

How can I convert this unicode string to a list? That is, my output should be

    [23,4,77,890,455...................]

I have to apply this to each element in the 'X' column.
I have tried df.withColumn("X_new", ast.literal_eval(x)) and got the error "Malformed String".
I also tried df.withColumn("X_new", json.loads(x)) and got the error "Expected String or Buffer".
df.withColumn("X_new", json.dumps(x)) says JSON not serialisable,
and df_2 = df.rdd.map(lambda x: x.encode('utf-8')) says rdd has no attribute encode.
I don't want to use collect and toPandas() because they are memory-consuming (but if that's the only way, please do tell). I am using PySpark.
Update: cph_sto gave an answer using a UDF. Though it worked well, I find that it is slow. Can somebody suggest any other method?
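For reference, the UDF approach was roughly as follows (a sketch; the name parse_list is mine, and it assumes every value of 'X' is a well-formed list literal of integers). Presumably the direct withColumn calls above fail because x there is a Column object rather than the actual string, so the parsing has to happen per row inside a UDF:

    import ast

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    # ast.literal_eval runs per row inside the UDF, where it receives the
    # actual string value instead of a Column object.
    parse_list = F.udf(lambda s: ast.literal_eval(s), ArrayType(IntegerType()))

    df_2 = df.withColumn("X_new", parse_list(F.col("X")))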
890,455 or 890.455? – cph_sto

Is 455 part of a decimal or just another number? If comma is your delimiter, then Python, or for that matter any language, has no way of knowing whether the next number has to be interpreted as a decimal or a proper number. You must specify some condition to differentiate a decimal comma (European format) from a delimiter comma. – cph_sto
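Assuming, per the comments, that the comma is purely a delimiter (no European-style decimal commas) and every element is an integer, a UDF-free sketch using only native Spark functions would be:

    from pyspark.sql import functions as F

    # Strip the surrounding brackets, split on the comma delimiter,
    # and cast the resulting array of strings to array<int>.
    df_2 = df.withColumn(
        "X_new",
        F.split(F.regexp_replace("X", r"[\[\]]", ""), ",").cast("array<int>"),
    )

Because this compiles down to Spark SQL expressions, the data never leaves the JVM, which is typically much faster than a Python UDF.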