
I have a list of dictionaries that looks like the following; every dictionary is a list item.

my_list = [{"_id": 1, "name": "xxx"},
           {"_id": 2, "name": "yyy"},
           {"_id": 3, "_name": "zzz"}]

I am trying to convert the list into a PySpark DataFrame, with every dictionary being a row.

from pyspark.sql.types import StringType

df = spark.createDataFrame(my_list, StringType())

df.show()

My ideal result is the following:

+-----------------------+
|dic                    |
+-----------------------+
|{"_id":1,"name":"xxx"} |
|{"_id":2,"name":"yyy"} |
|{"_id":3,"_name":"zzz"}|
+-----------------------+

But I got the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 95, 10.0.16.11, executor 0): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):

What's wrong with my code?


2 Answers


Spark may have trouble casting the Python dictionaries to strings. You can convert the dictionaries to strings yourself before creating the DataFrame:

df = spark.createDataFrame([str(i) for i in my_list], StringType())

You need to convert the dicts into strings before creating the DataFrame. However, I'd suggest not storing the values as stringified dicts: they wouldn't be easy to parse for further transformations later. Use JSON strings instead:

import json

df = spark.createDataFrame([[json.dumps(d)] for d in my_list], ["dict"])

df.show(truncate=False)

#+--------------------------+
#|dict                      |
#+--------------------------+
#|{"_id": 1, "name": "xxx"} |
#|{"_id": 2, "name": "yyy"} |
#|{"_id": 3, "_name": "zzz"}|
#+--------------------------+
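To see why JSON strings are the better choice, here is a small plain-Python sketch (no Spark required) comparing the two approaches above: `str()` on a dict produces a Python repr with single quotes, which is not valid JSON and cannot be parsed back, while `json.dumps` produces a string that round-trips cleanly (and that Spark's `from_json` could parse later):

```python
import json

d = {"_id": 1, "name": "xxx"}

# str() produces a Python repr with single quotes -- not valid JSON,
# so it cannot be parsed back with json.loads.
s_repr = str(d)  # "{'_id': 1, 'name': 'xxx'}"
try:
    json.loads(s_repr)
except json.JSONDecodeError:
    print("str(d) is not parseable as JSON")

# json.dumps produces valid JSON that round-trips back to the same dict.
s_json = json.dumps(d)  # '{"_id": 1, "name": "xxx"}'
assert json.loads(s_json) == d
print("json.dumps(d) round-trips cleanly")
```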