I would like to make a PySpark dataframe from an NxM numpy matrix. This dataframe should have N rows but only 1 column containing array data of size (1xM).

I have tried converting the NxM numpy matrix to a pandas dataframe first. However, the original matrix is large (1M x 2000) and there are further downstream operations, so this will only be workable if I can create a PySpark dataframe directly from the numpy matrix.

For example, I would like to convert the matrix


m = np.array([[1, 2], [11, 22], [111, 222]])

to a PySpark dataframe that looks like

+-----+----------+
|index|     array|
+-----+----------+
|    0|    [1, 2]|
|    1|  [11, 22]|
|    2|[111, 222]|
+-----+----------+
Pretty sure something like spark.createDataFrame(enumerate(m.tolist()), ["index", "array"]) should work. You might have to map the values in m to a different DataType first. – pault

1 Answer

As stated in my comment, you can achieve the desired result using enumerate:

m = np.array([[1, 2], [11, 22], [111, 222]])
df = spark.createDataFrame(enumerate(m.tolist()), ["index", "array"])
df.show()
#+-----+----------+
#|index|     array|
#+-----+----------+
#|    0|    [1, 2]|
#|    1|  [11, 22]|
#|    2|[111, 222]|
#+-----+----------+

And the corresponding schema:

df.printSchema()
#root
# |-- index: long (nullable = true)
# |-- array: array (nullable = true)
# |    |-- element: long (containsNull = true)