
I have a list of dictionaries that looks like the following; every dictionary is a list item.

my_list = [{"_id": 1, "name": "xxx"},
           {"_id": 2, "name": "yyy"},
           {"_id": 3, "_name": "zzz"}]

I am trying to convert the list into a PySpark DataFrame, with every dictionary being a row.

from pyspark.sql.types import StringType

df = spark.createDataFrame(my_list, StringType())

df.show()

My ideal result is the following:

+-----------------------+
|dic                    |
+-----------------------+
|{"_id":1,"name":"xxx"} |
|{"_id":2,"name":"yyy"} |
|{"_id":3,"_name":"zzz"}|
+-----------------------+

But I got the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 95, 10.0.16.11, executor 0): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):

What's wrong with my code?


2 Answers


Spark may have trouble casting the Python dictionaries to strings. You can convert the dictionaries to strings yourself before creating the DataFrame:

df = spark.createDataFrame([str(i) for i in my_list], StringType())

You need to convert the dicts into strings before creating the DataFrame. However, I'd suggest not storing the values as stringified dicts: they wouldn't be easy to parse for further transformations later. Use JSON strings instead:

import json

df = spark.createDataFrame([[json.dumps(d)] for d in my_list], ["dict"])

df.show(truncate=False)

#+--------------------------+
#|dict                      |
#+--------------------------+
#|{"_id": 1, "name": "xxx"} |
#|{"_id": 2, "name": "yyy"} |
#|{"_id": 3, "_name": "zzz"}|
#+--------------------------+
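To see why JSON strings are the better choice, here is a small plain-Python sketch (no Spark required) comparing the two approaches above: `str()` on a dict produces a Python repr with single quotes, which is not valid JSON and cannot be parsed back, while `json.dumps` produces a string that round-trips cleanly (and that Spark's `from_json` could parse later):

```python
import json

d = {"_id": 1, "name": "xxx"}

# str() produces a Python repr with single quotes -- not valid JSON,
# so it cannot be parsed back with json.loads.
s_repr = str(d)  # "{'_id': 1, 'name': 'xxx'}"
try:
    json.loads(s_repr)
except json.JSONDecodeError:
    print("str(d) is not parseable as JSON")

# json.dumps produces valid JSON that round-trips back to the same dict.
s_json = json.dumps(d)  # '{"_id": 1, "name": "xxx"}'
assert json.loads(s_json) == d
print("json.dumps(d) round-trips cleanly")
```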