I would like to make a PySpark DataFrame from an NxM numpy matrix. The DataFrame should have N rows and a single column whose entries are arrays of length M.
I have tried converting the NxM numpy matrix to a pandas DataFrame. However, the original matrix is large (1M x 2000) and there are further downstream operations, so this will only be workable if I can build the PySpark DataFrame directly from the numpy matrix.
For example, I would like to convert the matrix below
m = np.array([[1, 2], [11, 22], [111, 222]])
to a PySpark DataFrame that looks like this:
+-----+----------+
|index|     array|
+-----+----------+
|    0|    [1, 2]|
|    1|  [11, 22]|
|    2|[111, 222]|
+-----+----------+
spark.createDataFrame(enumerate(m.tolist()), ["index", "array"])
should work. You might have to map the values in m to a different data type first. – pault
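A minimal, runnable sketch of that suggestion (assuming a SparkSession is already available as spark; the small matrix here stands in for the real 1M x 2000 one):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small example matrix standing in for the real 1M x 2000 matrix
m = np.array([[1, 2], [11, 22], [111, 222]])

# m.tolist() converts the numpy int64 values to plain Python ints,
# so Spark can infer the array column without an explicit schema.
# enumerate() supplies the row index as the first column.
df = spark.createDataFrame(enumerate(m.tolist()), ["index", "array"])
df.show()

Note that createDataFrame builds the DataFrame from data held locally on the driver, so for the full 1M x 2000 matrix, driver memory may become the limiting factor.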